
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 19:33:07 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Go and enhance your calm: demolishing an HTTP/2 interop problem]]></title>
            <link>https://blog.cloudflare.com/go-and-enhance-your-calm/</link>
            <pubDate>Fri, 31 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ HTTP/2 implementations often respond to suspected attacks by closing the connection with an ENHANCE_YOUR_CALM error code. Learn how a common pattern of using Go's HTTP/2 client can lead to unintended errors and the solution to avoiding them. ]]></description>
            <content:encoded><![CDATA[ <p>In September 2025, a thread popped up in our internal engineering chat room asking, "Which part of our stack would be responsible for sending <code>ErrCode=ENHANCE_YOUR_CALM</code> to an HTTP/2 client?" Two internal microservices were experiencing a critical error preventing their communication, and the team needed a timely answer.</p><p>In this blog post, we describe the background to well-known HTTP/2 attacks that trigger Cloudflare defenses, which close connections. We then document an easy-to-make mistake when using Go's standard library that can cause clients to send unintentional PING floods, and how you can avoid it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OL6xA151F9JNR4S0wE0DG/f6ae4b0a261da5d189d82ccfa401104e/image1.png" />
          </figure>
    <div>
      <h3>HTTP/2 is powerful – but it can be easy to misuse</h3>
      <a href="#http-2-is-powerful-but-it-can-be-easy-to-misuse">
        
      </a>
    </div>
    <p><a href="https://www.rfc-editor.org/rfc/rfc9113"><u>HTTP/2</u></a> defines a binary wire format for encoding <a href="https://www.rfc-editor.org/rfc/rfc9110.html"><u>HTTP semantics</u></a>. Request and response messages are encoded as a series of HEADERS and DATA frames, each associated with a logical stream, sent over a TCP connection using TLS. There are also control frames that relate to the management of streams or the connection as a whole. For example, SETTINGS frames advertise properties of an endpoint, WINDOW_UPDATE frames provide flow control credit to a peer so that it can send data, RST_STREAM can be used to cancel or reject a request or response, while GOAWAY can be used to signal graceful or immediate connection closure.</p><p>HTTP/2 provides many powerful features that have legitimate uses. However, with great power comes responsibility and opportunity for accidental or intentional misuse. The specification details a number of <a href="https://datatracker.ietf.org/doc/html/rfc9113#section-10.5"><u>denial-of-service considerations</u></a>. Implementations are advised to harden themselves: "An endpoint that doesn't monitor use of these features exposes itself to a risk of denial of service. Implementations SHOULD track the use of these features and set limits on their use."</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71AXf477FP8u9znjdqXIh9/d6d22049a3f5f5488b8b7e9ac7f832a9/image3.png" />
          </figure><p>Cloudflare implements many different HTTP/2 defenses, developed over years in order to protect our systems and our customers. Some notable examples include mitigations added in 2019 to address "<a href="https://blog.cloudflare.com/on-the-recent-http-2-dos-attacks/"><u>Netflix vulnerabilities</u></a>" and in 2023 to mitigate <a href="https://blog.cloudflare.com/technical-breakdown-http2-rapid-reset-ddos-attack/"><u>Rapid Reset</u></a> and <a href="https://blog.cloudflare.com/madeyoureset-an-http-2-vulnerability-thwarted-by-rapid-reset-mitigations/"><u>similar</u></a> style attacks.</p><p>When Cloudflare detects that HTTP/2 client behaviour is likely malicious, we close the connection using the GOAWAY frame and include the error code <a href="https://www.rfc-editor.org/rfc/rfc9113#ENHANCE_YOUR_CALM"><code><u>ENHANCE_YOUR_CALM</u></code></a>.</p><p>One of the well-known and common attacks is <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a>, aka PING flood: "The attacker sends continual pings to an HTTP/2 peer, causing the peer to build an internal queue of responses. Depending on how efficiently this data is queued, this can consume excess CPU, memory, or both." Sending a <a href="https://www.rfc-editor.org/rfc/rfc9113#section-6.7"><u>PING frame</u></a> causes the peer to respond with a PING acknowledgement (indicated by an ACK flag). This allows for checking the liveness of the HTTP connection, along with measuring the layer 7 round-trip time – both useful things. The requirement to acknowledge a PING, however, provides the potential attack vector since it generates work for the peer.</p><p>A client that PINGs the Cloudflare edge too frequently will trigger our <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a> mitigations, causing us to close the connection. 
Shortly after we <a href="https://blog.cloudflare.com/road-to-grpc/"><u>launched support for gRPC</u></a> in 2020, we encountered interoperability issues with some gRPC clients that sent many PINGs as part of a <a href="https://grpc.io/blog/grpc-go-perf-improvements/#bdp-estimation-and-dynamic-flow-control-window"><u>performance optimization for window tuning</u></a>. We also discovered that the Rust Hyper crate had a feature called Adaptive Window that emulated the design and triggered a similar <a href="https://github.com/hyperium/hyper/issues/2526"><u>problem</u></a> until Hyper made a <a href="https://github.com/hyperium/hyper/pull/2550"><u>fix</u></a>.</p>
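<p>To make the frame layout concrete, here is a minimal sketch (illustrative only, not Cloudflare code — the function names are ours) of how a PING frame is laid out on the wire per RFC 9113: a 9-octet frame header, then 8 opaque payload bytes, always on stream 0, with a single ACK flag to mark the acknowledgement.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeFrameHeader builds the 9-octet HTTP/2 frame header from RFC 9113
// §4.1: 24-bit payload length, 8-bit type, 8-bit flags, 31-bit stream ID.
func encodeFrameHeader(length uint32, frameType, flags byte, streamID uint32) []byte {
	hdr := make([]byte, 9)
	hdr[0] = byte(length >> 16)
	hdr[1] = byte(length >> 8)
	hdr[2] = byte(length)
	hdr[3] = frameType
	hdr[4] = flags
	binary.BigEndian.PutUint32(hdr[5:], streamID&0x7fffffff)
	return hdr
}

// encodePing builds a PING frame (type 0x6): 8 opaque payload bytes carried
// on stream 0; the ACK flag (0x1) marks the peer's acknowledgement.
func encodePing(payload [8]byte, ack bool) []byte {
	var flags byte
	if ack {
		flags = 0x1
	}
	return append(encodeFrameHeader(8, 0x6, flags, 0), payload[:]...)
}

func main() {
	frame := encodePing([8]byte{1, 2, 3, 4, 5, 6, 7, 8}, false)
	fmt.Printf("%d bytes: % x\n", len(frame), frame)
}
```

Every PING costs the peer a mandatory acknowledgement, which is why a flood of these tiny 17-byte frames can still generate meaningful work.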
    <div>
      <h3>Solving a microservice miscommunication mystery</h3>
      <a href="#solving-a-microservice-miscommunication-mystery">
        
      </a>
    </div>
    <p>When that thread popped up asking which part of our stack was responsible for sending the <code>ENHANCE_YOUR_CALM</code> error code, it was regarding a client communicating over HTTP/2 between two internal microservices.</p><p>We suspected that this was an HTTP/2 mitigation issue and confirmed it was a PING flood mitigation in our logs. But taking a step back, you may wonder why two internal microservices are communicating over the Cloudflare edge at all, and therefore hitting our mitigations. In this case, communicating over the edge provides us with several advantages:</p><ol><li><p>We get to dogfood our edge infrastructure and discover issues like this!</p></li><li><p>We can use Cloudflare Access for authentication. This allows our microservices to be accessed securely by both other services (using service tokens) and engineers (which is invaluable for debugging).</p></li><li><p>Internal services that are written with Cloudflare Workers can easily communicate with services that are accessible at the edge.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v9oBdn5spdqw1BjDW1bS3/a9b36ef9252d580a79c978eb366f7a7a/image2.png" />
          </figure><p>The question remained: Why was this client behaving this way? We traded some ideas as we attempted to get to the bottom of the issue.</p><p>The client had a configuration that would indicate that it didn't need to PING very frequently:</p>
            <pre><code>t2.PingTimeout = 2 * time.Second
t2.ReadIdleTimeout = 5 * time.Second</code></pre>
            <p>However, in situations like this it is generally a good idea to establish ground truth about what is really happening "on the wire." For instance, grabbing a packet capture that can be dissected and explored in Wireshark can provide unequivocal evidence of precisely what was sent over the network. The next best option is detailed/trace logging at the sender or receiver, although sometimes logging can be misleading, so caveat emptor.</p><p>In our particular case, it was simpler to use logging with <code>GODEBUG=http2debug=2</code>. We built a minimal reproduction of the client that triggered the error, helping to eliminate other potential variables. We did some group log analysis, combined with diving into some of the Go standard library code to understand what it was really doing. Isaac Asimov is commonly credited with the quote "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" and sure enough, within the hour someone declared:

<i>the funny part I see is this:</i></p>
            <pre><code>2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=9 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="j\xe7\xd6R\xdaw\xf8+"</code></pre>
            <p><i>every ping seems to be preceded by a RST_STREAM</i></p><p>Observant readers will recall the earlier mention of Rapid Reset. However, our logs clearly indicated ENHANCE_YOUR_CALM being triggered due to the PING flood. A bit of searching landed us on this <a href="https://groups.google.com/g/grpc-io/c/sWYYQJXHCAQ/m/SWFHxw9IAgAJ"><u>mailing list thread</u></a> and the comment "Sending a PING frame along with an RST_STREAM allows a client to distinguish between an unresponsive server and a slow response." That seemed quite relevant. We also found <a href="https://go-review.googlesource.com/c/net/+/632995"><u>a change that was committed</u></a> related to this topic. This partly answered why there were so many PINGs, but it also raised a new question: Why so many stream resets?

So we went back to the logs and built up a little more context about the interaction:</p>
            <pre><code>2025/09/02 17:33:18 http2: Transport received DATA flags=END_STREAM stream=47 len=0 data=""
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=47 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="\x97W\x02\xfa&gt;\xa8\xabi"</code></pre>
            <p>The interesting thing here is that the server had sent a DATA frame with the END_STREAM flag set. Per the HTTP/2 stream <a href="https://www.rfc-editor.org/rfc/rfc9113#section-5.1"><u>state machine</u></a>, the stream should have transitioned to <b>closed</b> when a frame with END_STREAM was processed. The client doesn't need to do anything in this state – sending a RST_STREAM is entirely unnecessary.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/R0Bbw1SFYwcyb280RdwjY/578628489c97f67a5ac877a55f4f3e3b/image6.png" />
          </figure><p>A little more digging and noodling and an engineer proclaimed:

<i>I noticed that the reset+ping only happens when you call </i><code>resp.Body.Close()</code></p><p><i>I believe Go's HTTP library doesn't actually read the response body automatically, but keeps the stream open for you to use until you call </i><code>resp.Body.Close()</code><i>, which you can do at any point you like.</i></p><p>The hilarious thing in our example was that there wasn't actually any HTTP body to read. From the earlier example: <code>received DATA flags=END_STREAM stream=47 len=0 data=""</code>.</p><p>Science and engineering are at times weird and counterintuitive. We decided to tweak our client to read the (absent) body via <code>io.Copy(io.Discard, resp.Body)</code> before closing it.</p><p>Sure enough, this immediately stopped the client sending both a useless RST_STREAM and, by association, a PING frame.</p><p>Mystery solved?</p><p>To prove we had fixed the root cause, the production client was updated with a similar fix. A few hours later, all the ENHANCE_YOUR_CALM closures were eliminated.</p>
    <div>
      <h3>Reading bodies in Go can be unintuitive</h3>
      <a href="#reading-bodies-in-go-can-be-unintuitive">
        
      </a>
    </div>
    <p>It’s worth noting that ensuring the response body is always fully read can be unintuitive in Go. For example, at first glance it appears that the response body will always be read in the following code:</p>
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
            <p>However, <code>json.Decoder</code> stops reading as soon as it finds a complete JSON document or errors. If the response body contains multiple JSON documents or invalid JSON, then the entire response body may still not be read.</p><p>Therefore, in our clients, we’ve started replacing <code>defer response.Body.Close()</code> with the following pattern to ensure that response bodies are always fully read:</p>
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer func() {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
            
    <div>
      <h2>Actions to take if you encounter ENHANCE_YOUR_CALM</h2>
      <a href="#actions-to-take-if-you-encounter-enhance_your_calm">
        
      </a>
    </div>
    <p>HTTP/2 is a protocol with many powerful features. Implementations commonly harden themselves against misuse of those features, and that hardening can cause a connection to be closed. The recommended error code for closing connections in such conditions is ENHANCE_YOUR_CALM. There are numerous HTTP/2 implementations and APIs, which may drive the use of HTTP/2 features in unexpected ways that can look like attacks.</p><p>If you have an HTTP/2 client that encounters closures with ENHANCE_YOUR_CALM, we recommend that you try to establish ground truth with packet captures (including TLS decryption keys via mechanisms like <a href="https://wiki.wireshark.org/TLS#using-the-pre-master-secret"><u>SSLKEYLOGFILE</u></a>) and/or detailed trace logging. Look for patterns of frequent or repeated frames that resemble malicious traffic. Adjusting your client can help it avoid being misclassified as an attacker.</p><p>If you use Go, we recommend always reading HTTP/2 response bodies (even if empty) in order to avoid sending unnecessary RST_STREAM and PING frames. This is especially important if you use a single connection for multiple requests, which can cause a high frequency of these frames.</p><p>This was also a great reminder of the advantages of dogfooding our own products within our internal services. When we run into issues like this one, our learnings can benefit our customers with similar setups.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6doWYOkW3zbafkANt31knv/b00387716b1971d61eb8b4915ee58783/image5.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">sucWiZHlaWXeLFddHtkk1</guid>
            <dc:creator>Lucas Pardue</dc:creator>
            <dc:creator>Zak Cutner</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building even faster interpreters in Rust]]></title>
            <link>https://blog.cloudflare.com/building-even-faster-interpreters-in-rust/</link>
            <pubDate>Thu, 24 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Firewall Rules lets customers filter the traffic hitting their site, powered by our Wirefilter engine. We’re excited to share some in-depth optimizations we have recently made to improve the performance of our edge. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>At Cloudflare, we’re <a href="/making-the-waf-40-faster/">constantly working on</a> improving the performance of our edge — and that was exactly what my internship this summer entailed. I’m excited to share some improvements we’ve made to our popular <a href="/how-we-made-firewall-rules/">Firewall Rules</a> product over the past few months.</p><p>Firewall Rules lets customers filter the traffic hitting their site. It’s built using our engine, Wirefilter, which takes powerful boolean expressions written by customers and matches incoming requests against them. Customers can then choose how to respond to traffic which matches these rules. We will discuss some in-depth optimizations we have recently made to Wirefilter, so you may wish to <a href="/building-fast-interpreters-in-rust/">get familiar</a> with how it works if you haven’t already.</p>
    <div>
      <h3>Minimizing CPU usage</h3>
      <a href="#minimizing-cpu-usage">
        
      </a>
    </div>
    <p>As a new member of the Firewall team, I quickly learned that performance is important — even in our security products. We look for opportunities to make our customers’ Internet properties faster where it’s safe to do so, maximizing both security and performance.</p><p>Our engine is already heavily used, powering all of Firewall Rules. But we have bigger plans. More and more products like our <a href="https://support.cloudflare.com/hc/en-us/articles/200172016-Understanding-the-Cloudflare-Web-Application-Firewall-WAF">Web Application Firewall</a> (WAF) will be running behind our Wirefilter-based engine, and it will become responsible for eating up a sizable chunk of our total CPU usage before long.</p>
    <div>
      <h3>How to measure performance?</h3>
      <a href="#how-to-measure-performance">
        
      </a>
    </div>
    <p>Measuring performance is a notoriously tricky task, and as you can probably imagine trying to do this in a highly distributed environment (aka <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-edge-computing/">Cloudflare’s edge</a>) does not help. We’ve been surprised in the past by optimizations that look good on paper, but, when tested out in production, just don’t seem to do much.</p><p>Our solution? Performance measurement as a service — an isolated and reproducible benchmark for our Firewall engine and a framework for engineers to easily request runs and view results. It’s worth noting that we took a lot of inspiration from the fantastic <a href="https://perf.rust-lang.org/">Rust Compiler benchmarks</a> to build this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2j6ou7jUofaSYjEjoMej25/a380fb67cdca84b7b95ff13eb8d98203/image3-10.png" />
            
            </figure><p>Our benchmarking framework, showing how performance during different stages of processing Wirefilter expressions has changed over time [1].</p>
    <div>
      <h3>What to measure?</h3>
      <a href="#what-to-measure">
        
      </a>
    </div>
    <p>Our next challenge was to find some meaningful performance metrics. Some experimentation quickly uncovered that time was far too volatile a measure for meaningful comparisons, so we turned to <a href="https://en.wikipedia.org/wiki/Hardware_performance_counter">hardware counters</a> [2]. It’s not hard to find tools to measure these (<a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf</a> and <a href="https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html">VTune</a> are two such examples), although they (<a href="https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/api-support/instrumentation-and-tracing-technology-apis/instrumentation-and-tracing-technology-api-reference/collection-control-api.html">mostly</a>) don’t allow control over which parts of the program are recorded. In our case, we wished to individually record measurements for different stages of filter processing — parsing, compilation, analysis, and execution.</p><p>Once again we took inspiration from the Rust compiler, and its <a href="https://blog.rust-lang.org/inside-rust/2020/02/25/intro-rustc-self-profile.html">self-profiling options</a>, using the <a href="https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html">perf_event_open</a> API to record counters from <i>inside</i> our binary. We then output something like the following, which our framework can easily ingest and store for later visualization.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rMHtBLVjfrHV0BMfTvaZV/a53037be7211be732df9ced5169b8fe7/Screen-Shot-2020-09-24-at-10.20.42.png" />
            
            </figure><p>Output of our benchmarks in <a href="https://jsonlines.org/">JSON Lines</a> format, showing a list of recordings for each combination of hardware counter and Wirefilter processing stage. We’ve used 10 repeats here for readability, but we use around 20, in addition to 5 warmup rounds, within our framework.</p><p>Whilst we mainly focussed on metrics relating to CPU usage, we also use a combination of <a href="https://man7.org/linux/man-pages/man2/getrusage.2.html"><code>getrusage</code></a> and <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=695f055936938c674473ea071ca7359a863551e7"><code>clear_refs</code></a> to find the maximum resident set size (RSS). This is useful to understand the memory impact of <a href="https://github.com/BurntSushi/aho-corasick">particular algorithms</a> in addition to CPU.</p><p>But the challenge was not over. Cloudflare’s standard CI agents use virtualization and sandboxing for security and convenience, but this makes accessing hardware counters virtually impossible. Running our benchmarks on a dedicated machine gave us access to these counters, and ensured more reproducible results.</p>
    <div>
      <h3>Speeding up the speed test</h3>
      <a href="#speeding-up-the-speed-test">
        
      </a>
    </div>
    <p>Our benchmarks were designed from the outset to take an important place in our development process. For instance, we now perform a full benchmark run before releasing each new version to detect performance regressions.</p><p>But with our benchmarks in place, it quickly became clear that we had a problem. Our benchmarks simply weren’t fast enough — at least if we wanted to complete them in less than a few hours! The problem was that we had a very large number of filters. Since our engine would never usually execute requests against this many filters at once, it was proving incredibly costly. We came up with a few tricks to cut this down…</p><ul><li><p><b>Deduplication.</b> It turns out that only around a third of filters are structurally unique (something that is easy to check as Wirefilter can helpfully serialize to JSON). We managed to cut down a great deal of time by ignoring duplicate filters in our benchmarks.</p></li><li><p><b>Sampling.</b> Still, we had too many filters and random sampling presented an easy solution. A more subtle challenge was to make sure that the random sample was always the same to maintain reproducibility.</p></li><li><p><b>Partitioning.</b> We worried that deduplication and sampling would cause us to miss important cases that are useful to optimize. By first partitioning filters by Wirefilter language feature, we can ensure we’re getting a good range of filters. It also helpfully gives us more detail about where specifically the impact of a performance change is.</p></li></ul><p>Most of these are trade-offs, but very necessary ones which allow us to run continual benchmarks without development speed grinding to a halt. At the time of writing, we’ve managed to get a benchmark run down to around 20 minutes using these ideas.</p>
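<p>The reproducible-sampling idea is language-agnostic. As a rough sketch (our benchmarking framework is not public, and these names are invented for illustration), seeding the random source with a fixed value guarantees that every run draws exactly the same sample:</p>

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleFilters returns a deterministic pseudo-random sample of n filters.
// Seeding the generator with a fixed value means every benchmark run sees
// the same sample, keeping results comparable across runs.
func sampleFilters(filters []string, n int, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	// Shuffle a copy so the caller's slice is untouched.
	shuffled := append([]string(nil), filters...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if n > len(shuffled) {
		n = len(shuffled)
	}
	return shuffled[:n]
}

func main() {
	filters := []string{"f1", "f2", "f3", "f4", "f5", "f6"}
	a := sampleFilters(filters, 3, 42)
	b := sampleFilters(filters, 3, 42)
	fmt.Println(a, b) // identical samples for identical seeds
}
```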
    <div>
      <h3>Optimizing our engine</h3>
      <a href="#optimizing-our-engine">
        
      </a>
    </div>
    <p>With a benchmarking framework in place, we were ready to begin testing optimizations. But how do you optimize an interpreter like Wirefilter? <a href="https://en.wikipedia.org/wiki/Just-in-time_compilation">Just-in-time (JIT) compilation</a>, <a href="https://dl.acm.org/doi/10.1145/277652.277743">selective inlining</a> and <a href="http://www.complang.tuwien.ac.at/andi/papers/dotnet_06.pdf">replication</a> were some ideas floating around in the world of interpreters that seemed attractive. After all, we <a href="/building-fast-interpreters-in-rust/">previously wrote</a> about the cost of dynamic dispatch in Wirefilter. All of these techniques aim to reduce that effect.</p><p>However, running some real filters through a profiler tells a different story. Most execution time, around 65%, is spent not resolving dynamic dispatch calls but instead performing <a href="https://developers.cloudflare.com/firewall/cf-firewall-language/operators/#comparison-operators">operations</a> like comparison and searches. Filters currently in production tend to be pretty light on functions, but throw in a few more of these and even less time would be spent on dynamic dispatch. We suspect that even a fair chunk of the remaining 35% is actually spent reading the memory of request fields.</p>
<table>
<thead>
  <tr>
    <th>Function</th>
    <th>CPU time</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>`matches`</span> operator</td>
    <td>0.6%</td>
  </tr>
  <tr>
    <td><span>`in`</span> operator</td>
    <td>1.1%</td>
  </tr>
  <tr>
    <td><span>`eq`</span> operator</td>
    <td>11.8%</td>
  </tr>
  <tr>
    <td><span>`contains`</span> operator</td>
    <td>51.5%</td>
  </tr>
  <tr>
    <td>Everything else</td>
    <td>35.0%</td>
  </tr>
</tbody>
</table>
<figcaption>Breakdown of CPU time while executing a typical production filter.</figcaption>
    <div>
      <h3>An adventure in substring searching</h3>
      <a href="#an-adventure-in-substring-searching">
        
      </a>
    </div>
    <p>By now, you shouldn’t be surprised that the <code>contains</code> operator was one of the first in line for optimization. If you’ve ever written a Firewall Rule, you’re probably already familiar with what it does — it checks whether a substring is present in the field you are matching against. For example, the following expression would match when the host is “example.com” or “<a href="http://www.example.net">www.example.net</a>”, but not when it is “cloudflare.com”. In <a href="https://en.wikipedia.org/wiki/String-searching_algorithm">string searching algorithms</a>, this is commonly referred to as finding a ‘needle’ (“example”) within a ‘haystack’ (“example.com”).</p><p><code>http.host contains “example”</code></p><p>How does this work under the hood? Ordinarily, we might have used Rust’s <a href="https://doc.rust-lang.org/std/string/struct.String.html#method.contains">`String::contains` function</a> but Wirefilter also allows raw byte expressions that don’t necessarily conform to UTF-8.</p><p><code>http.host contains 65:78:61:6d:70:6c:65</code></p><p>We therefore used the <a href="https://crates.io/crates/memmem">memmem crate</a> which performs a <a href="https://en.wikipedia.org/wiki/Two-way_string-matching_algorithm">two-way substring search algorithm</a> on raw bytes.</p><p>Sounds good, right? It was, and it was working pretty well, although we’d noticed that rewriting `contains` filters using regular expressions could bizarrely often make them faster.</p><p><code>http.host matches “example”</code></p><p>Regular expressions are great, but since they’re far more powerful than the `contains` operator, they shouldn’t be faster than a specialized algorithm in simple cases like this one.</p><p>Something was definitely up. 
It turns out that <a href="https://crates.io/crates/regex">Rust’s regex library</a> comes equipped with <a href="https://github.com/rust-lang/regex/blob/691606773f525be32a59a0c28eae203a79663706/src/literal/imp.rs#L24">a whole host</a> of specialized matchers for what it deems to be simple expressions like this. The obvious question was whether we could therefore simply use the regex library. Interestingly, you may not have realized that the popular ripgrep tool does <a href="https://github.com/BurntSushi/ripgrep/blob/1b2c1dc67583d70d1d16fc93c90db80bead4fb09/crates/core/args.rs#L1479">just that</a> when searching for fixed-string patterns.</p><p>However, our use case is a little different. Since we’re building an interpreter (and we’re using dynamic dispatch in any case), we would prefer to dispatch to a specialized case for `contains` expressions, rather than matching on some enum deep within the regex crate when the filter is executed. What’s more, there are some <a href="http://0x80.pl/articles/simd-strfind.html">pretty</a> <a href="https://github.com/intel/hyperscan">cool</a> <a href="https://github.com/jneem/teddy">things</a> being done to perform substring searching that leverage SIMD instruction sets. So we wired up our engine to some <a href="https://github.com/WojciechMula/sse4-strstr">previous work</a> by Wojciech Muła and the results were fantastic.</p>
<table>
<thead>
  <tr>
    <th>Benchmark</th>
    <th>Improvement</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Expressions using <span>`contains`</span> operator</td>
    <td>72.3%</td>
  </tr>
  <tr>
    <td>‘Simple’ expressions</td>
    <td>0.0%</td>
  </tr>
  <tr>
    <td>All expressions</td>
    <td>31.6%</td>
  </tr>
</tbody>
</table>
<figcaption>Improvements in instruction count using Wojciech Muła’s sse4-strstr library over the memmem crate with Wirefilter.</figcaption><p>I encourage you to <a href="http://0x80.pl/articles/simd-strfind.html">read more</a> on “Algorithm 1”, which we used, but it works something like this (I’ve changed the order a little to help make it clearer). It’s worth <a href="https://en.wikipedia.org/wiki/SIMD">reading up</a> on SIMD instructions if you’re unfamiliar with them — they’re the essence behind what makes this algorithm fast.</p><ol><li><p>We fill one SIMD register with the first byte of the needle being searched for, simply repeated over and over.</p></li><li><p>We load as much of our haystack as we can into another SIMD register and perform a bitwise equality operation with our previous register.</p></li><li><p>Now, any position in the resultant register that is 0 cannot be the start of the match since it doesn’t start with the same byte of the needle.</p></li><li><p>We now repeat this process with the last byte of the needle, offsetting the haystack, to rule out any positions that don’t end with the same byte as the needle.</p></li><li><p>Bitwise ANDing these two results together, we (hopefully) have now drastically reduced our potential matches.</p></li><li><p>Each of the remaining potential matches can be checked manually using a memcmp operation. If we find a match, then we’re done.</p></li><li><p>If not, we continue with the next part of our haystack and repeat until we’ve checked the entire thing.</p></li></ol>
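<p>The steps above can be condensed into a scalar sketch (ours, written in Go for illustration — not the library’s code; the real implementation performs the first- and last-byte comparisons across a whole SIMD register at a time rather than one position per loop iteration):</p>

```go
package main

import (
	"bytes"
	"fmt"
)

// containsFirstLast is a scalar sketch of the first/last-byte filtering
// idea: a position is only a candidate match if it starts with the needle's
// first byte and ends with its last byte, and only surviving candidates are
// verified with a full comparison (the memcmp step).
func containsFirstLast(haystack, needle []byte) bool {
	if len(needle) == 0 {
		return true
	}
	if len(needle) > len(haystack) {
		return false
	}
	first, last := needle[0], needle[len(needle)-1]
	for i := 0; i+len(needle) <= len(haystack); i++ {
		// Steps 1-5 of the algorithm, collapsed to two byte compares.
		if haystack[i] == first && haystack[i+len(needle)-1] == last {
			// Step 6: verify the candidate with a full comparison.
			if bytes.Equal(haystack[i:i+len(needle)], needle) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(containsFirstLast([]byte("example.com"), []byte("example")))    // true
	fmt.Println(containsFirstLast([]byte("cloudflare.com"), []byte("example"))) // false
}
```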
    <div>
      <h3>When it goes wrong</h3>
      <a href="#when-it-goes-wrong">
        
      </a>
    </div>
    <p>You may be wondering what happens if our haystack doesn’t fit neatly into registers. In the original algorithm, nothing. It simply continues reading into oblivion past the end of the haystack until the last register is full, and uses a bitmask to ignore potential false-positives from this additional region of memory.</p><p>As we mentioned, security is our priority when it comes to optimizations, so we could never deploy something with this kind of behaviour. We ended up porting Muła’s library to Rust (we’ve also <a href="https://crates.io/crates/sliceslice">open-sourced the crate</a>!) and applied an <a href="https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-2-dealing-with-leftovers">overlapping registers modification</a> described in ARM’s blog.</p><p>It’s best illustrated by example — notice the difference between how we would fill registers on an imaginary SIMD instruction set with 4-byte registers.</p><p><b>Before modification</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3OUJ2AojfLLoz1mZepPjg0/1b98003bf5fb2a44fed2366f6840fcae/image1-16.png" />
            
            </figure><p>How registers are filled in the original implementation for the haystack “abcdefghij”; red squares indicate out-of-bounds memory.</p><p><b>After modification</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2j5TTWwSXcV1Rtu7Uaoqsf/138c36a096cc6c201777028933a3f49c/image4-10.png" />
            
            </figure><p>How registers are filled with the overlapping modification for the same haystack; notice how ‘g’ and ‘h’ each appear in two registers.</p><p>In our case, repeating some bytes within two different registers will never change the final outcome, so this modification is allowed as-is. However, in practice, we found it was better to use a bitmask to exclude repeated parts of the final register and minimize the number of memcmp calls.</p><p>What if the haystack is too small to even fill a single register? In this case, we can’t use our overlapping trick since there’s nothing to overlap with. Our solution is straightforward: while we were primarily targeting AVX2, which uses 32-byte registers, we can easily move down to another instruction set with smaller registers that the haystack can fit into. In practice, we don’t currently go any smaller than SSE2. Beyond this, we instead use an implementation of the <a href="https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm">Rabin-Karp searching algorithm</a>, which appears to perform well.</p>
<table>
<thead>
  <tr>
    <th>Instruction set</th>
    <th>Register size</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>AVX2</td>
    <td>32 bytes</td>
  </tr>
  <tr>
    <td>SSE2</td>
    <td>16 bytes</td>
  </tr>
  <tr>
    <td>SWAR (u64)</td>
    <td>8 bytes</td>
  </tr>
  <tr>
    <td>SWAR (u32)</td>
    <td>4 bytes</td>
  </tr>
  <tr>
    <td>...</td>
    <td>...</td>
  </tr>
</tbody>
</table>
<figcaption>Register sizes in different SIMD instruction sets [3]. We did not consider AVX512 since support for this is not widespread enough.</figcaption>
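<p>Mechanically, the overlapping trick just shifts the final register load backwards so it ends flush with the last byte of the haystack. A minimal sketch of the offset calculation, using our own hypothetical names (<code>REG</code>, <code>chunk_offsets</code>) and the 4-byte registers from the figures, not the actual sliceslice internals:</p>

```rust
/// Imaginary 4-byte registers, as in the figures above.
const REG: usize = 4;

/// Compute the start offset of each register load so that no load
/// reads past the end of the haystack.
fn chunk_offsets(len: usize) -> Vec<usize> {
    assert!(len >= REG, "haystack must fill at least one register");
    // Full-stride loads at 0, REG, 2*REG, ... while a whole register fits.
    let mut offsets: Vec<usize> = (0..=len - REG).step_by(REG).collect();
    // If the strided loads stop short of the end, add one overlapping
    // load that finishes exactly at the final byte. Re-checking a few
    // duplicated bytes never changes the result of the search.
    if offsets.last().copied().unwrap() + REG < len {
        offsets.push(len - REG);
    }
    offsets
}

fn main() {
    // "abcdefghij" (10 bytes): loads cover [0..4), [4..8), then the
    // overlapping [6..10), so 'g' and 'h' appear in two registers.
    assert_eq!(chunk_offsets(10), vec![0, 4, 6]);
    // 8 bytes divide evenly, so no overlapping load is needed.
    assert_eq!(chunk_offsets(8), vec![0, 4]);
}
```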
    <div>
      <h3>Is it always fast?</h3>
      <a href="#is-it-always-fast">
        
      </a>
    </div>
    <p>Choosing the first and last bytes of the needle to rule out potential matches is a great idea. It means that when it does come to performing a memcmp, we can ignore these bytes, as we know they already match. Unfortunately, as Muła points out, this also makes the algorithm susceptible to a worst-case attack in some instances.</p><p>Let’s take an expression that a customer might write to illustrate this.</p><p><code>http.request.uri.path contains "/wp-admin/"</code></p><p>If we try to search for this needle within a very long sequence of ‘/’s, we will find a potential match at every position and make lots of calls to memcmp — essentially performing a slow <a href="https://en.wikipedia.org/wiki/String-searching_algorithm#Na%C3%AFve_string_search">brute-force substring search</a>.</p><p>Clearly, we need to choose different bytes from the needle. But which ones should we choose? For each fixed choice, an adversary can always find a slightly different, but equally troublesome, worst case. We instead use <a href="https://en.wikipedia.org/wiki/Randomized_algorithm">randomness</a> to throw off our would-be adversary, picking the first byte of the needle as before, but then choosing another <i>random</i> byte to use.</p><p>Our new version is unsurprisingly slower than Muła’s, yet it still exhibits a great improvement over both the memmem and regex crates. Performance, but without sacrificing safety.</p>
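<p>A minimal sketch of this randomized pruning, again with hypothetical names (<code>pick_filter_offsets</code>, <code>find</code>) rather than the real crate’s API; sliceslice’s actual randomness source and SIMD filtering differ.</p>

```rust
/// Keep the needle's first byte as one filter, but pick the second
/// filter byte at a random offset so an adversary cannot pre-compute
/// a pathological haystack in advance.
fn pick_filter_offsets(needle: &[u8], seed: u64) -> (usize, usize) {
    // xorshift64: a toy PRNG to keep the sketch dependency-free.
    let mut x = seed | 1;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    (0, 1 + (x as usize) % (needle.len() - 1)) // any index except 0
}

fn find(haystack: &[u8], needle: &[u8], seed: u64) -> Option<usize> {
    if needle.len() < 2 || haystack.len() < needle.len() {
        return None; // the sketch assumes a needle of at least 2 bytes
    }
    let (a, b) = pick_filter_offsets(needle, seed);
    // The filter bytes only prune candidates; the full comparison
    // keeps the answer correct no matter which offsets were chosen.
    (0..=haystack.len() - needle.len()).find(|&i| {
        haystack[i + a] == needle[a]
            && haystack[i + b] == needle[b]
            && &haystack[i..i + needle.len()] == needle
    })
}

fn main() {
    // The adversarial all-'/' prefix no longer lines up with both
    // filter bytes at every position.
    let hay = b"///////wp-admin//////";
    assert_eq!(find(hay, b"/wp-admin/", 42), Some(6));
}
```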
<table>
<thead>
  <tr>
    <th rowspan="2">Benchmark</th>
    <th colspan="2">Improvement</th>
  </tr>
  <tr>
    <th>sse4-strstr (original)</th>
    <th>sliceslice (our version)</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Expressions using the <code>contains</code> operator</td>
    <td>72.3%</td>
    <td>49.1%</td>
  </tr>
  <tr>
    <td>‘Simple’ expressions</td>
    <td>0.0%</td>
    <td>0.1%</td>
  </tr>
  <tr>
    <td>All expressions</td>
    <td>31.6%</td>
    <td>24.0%</td>
  </tr>
</tbody>
</table>
<figcaption>Improvements in instruction count of using sse4-strstr and sliceslice over the memmem crate with Wirefilter.</figcaption>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>This is only a small taste of the performance work we’ve been doing, and we have much more yet to come. Nevertheless, none of this would have been possible without the support of my manager Richard and my mentor Elie, who contributed a lot of these ideas. I’ve learned so much over the past few months, but most of all that Cloudflare is an amazing place to be an intern!</p><p>[1] Since our benchmarks are not run within a production environment, results in this post do not represent traffic on our edge.</p><p>[2] We found instruction counts to be a particularly stable measure, and CPU cycles a particularly unstable one.</p><p>[3] Note that <a href="https://en.wikipedia.org/wiki/SWAR">SWAR</a> is technically not an instruction set, but instead uses regular registers like vector registers.</p> ]]></content:encoded>
            <category><![CDATA[Firewall]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[WAF]]></category>
            <guid isPermaLink="false">519hc6hT0fkJoLpme6IKrP</guid>
            <dc:creator>Zak Cutner</dc:creator>
        </item>
    </channel>
</rss>