
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how Cloudflare products are built and the technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 17:46:08 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Why we're rethinking cache for the AI era]]></title>
            <link>https://blog.cloudflare.com/rethinking-cache-ai-humans/</link>
            <pubDate>Thu, 02 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ The explosion of AI-bot traffic, representing over 10 billion requests per week, has opened up new challenges and opportunities for cache design. We look at some of the ways AI bot traffic differs from humans, how this impacts CDN cache, and some early ideas for how Cloudflare is designing systems to improve the AI and human experience. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare data shows that 32% of the traffic across our network is <a href="https://radar.cloudflare.com/traffic"><u>automated</u></a>. This includes search engine crawlers, uptime checkers, ad networks — and more recently, AI assistants looking to the web to add relevant data to their knowledge bases as they generate responses with <a href="https://developers.cloudflare.com/reference-architecture/diagrams/ai/ai-rag/"><u>retrieval-augmented generation</u></a> (RAG). Unlike typical human visitors, <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>AI agents</u></a>, crawlers, and scrapers behave in automated ways that can appear aggressive to the server responding to their requests. </p><p>For instance, AI bots frequently issue high-volume requests, often in parallel. Rather than focusing on popular pages, they may access rarely visited or loosely related content across a site, often in sequential, complete scans of the site. For example, an AI assistant generating a response may fetch images, documentation, and knowledge articles across dozens of unrelated sources.</p><p>Although Cloudflare already makes it easy to <a href="https://blog.cloudflare.com/introducing-ai-crawl-control/"><u>control and limit</u></a> automated access to your content, many sites may <i>want</i> to serve AI traffic. For instance, an application developer may want to guarantee that their developer documentation is up-to-date in foundational AI models, an e-commerce site may want to ensure that product descriptions are part of LLM search results, or publishers may want to get paid for their content through mechanisms such as <a href="https://blog.cloudflare.com/introducing-pay-per-crawl/"><u>pay per crawl</u></a>.</p><p>Website operators therefore face a dilemma: tune for AI crawlers, or for human traffic. 
Because the two exhibit widely different traffic patterns, current cache architectures force operators to optimize for one at the expense of the other.</p><p>In this post, we’ll explore how AI traffic impacts storage cache, describe some challenges associated with mitigating this impact, and propose directions for the community to consider as we adapt CDN cache to the AI era.</p><p>This work is a collaborative effort with a team of researchers at <a href="https://ethz.ch/en.html"><u>ETH Zurich</u></a>. The full version of this work was published at the 2025 <a href="https://acmsocc.org/2025/index.html"><u>Symposium on Cloud Computing</u></a> as “<a href="https://dl.acm.org/doi/10.1145/3772052.3772255"><u>Rethinking Web Cache Design for the AI Era</u></a>” by Zhang et al.</p>
    <div>
      <h3>Caching </h3>
      <a href="#caching">
        
      </a>
    </div>
    <p>Let's start with a quick refresher on <a href="https://www.cloudflare.com/learning/cdn/what-is-caching/"><u>caching</u></a>. When a user initiates a request for content on their device, it’s usually sent to the Cloudflare data center closest to them. When the request arrives, we check to see if we have a valid cached copy. If we do, we can serve the content immediately, resulting in a fast response and a happy user. If the content isn't in our cache (a "cache miss"), our data centers reach out to the <a href="https://www.cloudflare.com/learning/cdn/glossary/origin-server/"><u>origin server</u></a> to get a fresh copy, which then stays in our cache until it expires or other data pushes it out. </p><p>Keeping the right elements in our cache is critical for reducing our cache misses and providing a great user experience — but what’s “right” for human traffic may be very different from what’s right for AI crawlers!</p>
    <div>
      <h3>AI traffic at Cloudflare</h3>
      <a href="#ai-traffic-at-cloudflare">
        
      </a>
    </div>
    <p>Here, we’ll focus on AI crawler traffic, which has emerged as the most active AI bot type <a href="https://blog.cloudflare.com/crawlers-click-ai-bots-training/"><u>in recent analyses</u></a>, accounting for 80% of the self-identified AI bot traffic we see. AI crawlers fetch content to support real-time AI services, such as answering questions or summarizing pages, as well as to harvest data to build large training datasets for models like <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/"><u>LLMs</u></a>.</p><p>From <a href="https://radar.cloudflare.com/ai-insights"><u>Cloudflare Radar</u></a>, we see that the vast majority of single-purpose AI bot traffic is for training, with search as a distant second. (See <a href="https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/"><u>this blog post</u></a> for a deep discussion of the AI crawler traffic we see at Cloudflare).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3WQUiQ36rvMb8rNKruwdLd/1e9003057720b68829c6df3337a840ec/image2.png" />
          </figure><p>While both search and training crawls impact cache through numerous sequential, long-tail accesses, training traffic has properties such as high unique URL ratio, content diversity, and crawling inefficiency that make it even more impactful on cache.</p>
    <div>
      <h3>How does AI traffic differ from other traffic for a CDN?</h3>
      <a href="#how-does-ai-traffic-differ-from-other-traffic-for-a-cdn">
        
      </a>
    </div>
    <p>AI crawler traffic has three main differentiating characteristics: high unique URL ratio, content diversity, and crawling inefficiency.</p><p><a href="https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize"><u>Public crawl statistics</u></a> from <a href="https://commoncrawl.org/"><u>Common Crawl</u></a>, which performs large-scale web crawls on a monthly basis, show that over 90% of pages are unique by content. Different AI crawlers also target <a href="https://blog.cloudflare.com/ai-bots/"><u>distinct content types</u></a>: e.g., some specialize in technical documentation, while others focus on source code, media, or blog posts. Finally, AI crawlers do not necessarily follow optimal crawling paths. A substantial fraction of fetches from popular AI crawlers result in 404 errors or redirects, <a href="https://dl.acm.org/doi/abs/10.1145/3772052.3772255"><u>often due to poor URL handling</u></a>. The rate of these ineffective requests varies depending on how well the crawler is tuned to target live, meaningful content. AI crawlers also typically do not employ browser-side caching or session management in the same way human users do. AI crawlers can launch multiple independent instances, and because they don’t share sessions, each may appear as a new visitor to the CDN, even if all instances request the same content.</p><p>Even a single AI crawler is likely to dig deeper into websites and <a href="https://dl.acm.org/doi/epdf/10.1145/3772052.3772255"><u>explore a broader range of content than a typical human user.</u></a> Usage data from Wikipedia shows that <b>pages once considered "long-tail" or rarely accessed are now being frequently requested, shifting the distribution of content popularity within a CDN's cache.</b> In fact, AI agents may iteratively loop to refine search results, scraping the same content repeatedly. We model this to show that this iterative looping leads to low content reuse and broad coverage. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yH1QLIGCU3mJGXID27Cik/3ba56ff02865b7b141743815d0909be0/image1.png" />
          </figure><p>Our modeling of AI agent behavior shows that as they iteratively loop to refine search results (a common pattern for retrieval-augmented generation), they maintain a consistently high <b>unique access ratio </b>(the red columns above) — typically between 70% and 100%. This means that each loop, while generally increasing <b>accuracy</b> for the agent (represented here by the blue line), fetches mostly new, unique content rather than revisiting previously seen pages. </p><p><b>This repeated access to long-tail assets churns the cache that human traffic relies on. That could make existing pre-fetching and traditional cache invalidation strategies less effective as the amount of crawler traffic increases. </b></p>
    <div>
      <h3>How does AI traffic impact cache?</h3>
      <a href="#how-does-ai-traffic-impact-cache">
        
      </a>
    </div>
    <p>For a <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/"><u>CDN</u></a>, a cache miss means having to go to the origin server to fetch the requested content. Think of a cache miss like your local library not having a book on hand, so you have to wait to get the book through inter-library loan. You’ll get your book eventually, but it will take longer than you wanted. It will also inform your library that stocking that book locally could be a good idea. </p><p>As a result of their broad, unpredictable access patterns with long-tail reuse, AI crawlers significantly raise the cache miss rate. And many of our typical methods to improve our cache hit rate, such as <a href="https://blog.cloudflare.com/introducing-speed-brain/"><u>cache speculation</u></a> or prefetching, are significantly less effective. </p><p>The first chart below shows the difference in cache hit rates for a single node in Cloudflare’s CDN with and without our <a href="https://radar.cloudflare.com/bots/directory?category=AI_CRAWLER&amp;kind=all"><u>identified AI crawlers</u></a>. While the impact of crawlers is still relatively limited, there is a clear drop in hit rate with the addition of AI crawler traffic. We manage our cache with an algorithm called “least recently used”, or LRU: when storage is full, the content that has gone longest without being requested is evicted first to make room for more popular content. The drop in hit rate implies that LRU is struggling under the repeated scan behavior of AI crawlers.</p><p>The bottom figure shows AI cache misses during this time. Each of those cache misses represents a request to the origin, slowing response times as well as increasing egress costs and load on the origin. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6rsbyos9tv8wzbbXJTrAYh/522b3fed76ce69bb96eb9aaff51ea1b1/image3.png" />
          </figure><p>This surge in AI bot traffic has had real-world impact. The following table from our paper shows the effects on several large websites. Each example links to its source report.</p><table><tr><td><p><b>System</b></p></td><td><p><b>Reported AI Traffic Behavior</b></p></td><td><p><b>Reported Impact</b></p></td><td><p><b>Reported Mitigations</b></p></td></tr><tr><td><p><a href="https://www.wikipedia.org/"><u>Wikipedia</u></a></p></td><td><p>Bulk image scraping for model training<a href="https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/"><u><sup>1</sup></u></a></p></td><td><p>50% surge in multimedia bandwidth usage<a href="https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/"><u><sup>1</sup></u></a></p></td><td><p>Blocked crawler traffic<a href="https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/"><u><sup>1</sup></u></a></p></td></tr><tr><td><p><a href="https://sourcehut.org/"><u>SourceHut</u></a></p></td><td><p>LLM crawlers scraping code repositories<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/"><u><sup>3</sup></u></a> </p></td><td><p>Service instability and slowdowns<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/"><u><sup>3</sup></u></a> </p></td><td><p>Blocked crawler traffic<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/"><u><sup>3</sup></u></a> </p></td></tr><tr><td><p><a href="https://about.readthedocs.com/"><u>Read the Docs</u></a></p></td><td><p>AI crawlers download large files hundreds of times daily<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a 
href="https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/"><u><sup>4</sup></u></a></p></td><td><p>Significant bandwidth increase<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/"><u><sup>4</sup></u></a></p></td><td><p>Temporarily blocked crawler traffic, performed IP-based rate limiting, reconfigured CDN to improve caching<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/"><u><sup>4</sup></u></a></p></td></tr><tr><td><p><a href="https://www.fedoraproject.org/"><u>Fedora</u></a></p></td><td><p>AI scrapers recursively crawl package mirrors<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://cryptodamus.io/en/articles/news/ai-web-scrapers-attacking-open-source-here-s-how-to-fight-back"><u><sup>5</sup></u></a><sup>,</sup><a href="https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/"><u><sup>6</sup></u></a></p></td><td><p>Slow response for human users<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://cryptodamus.io/en/articles/news/ai-web-scrapers-attacking-open-source-here-s-how-to-fight-back"><u><sup>5</sup></u></a><sup>,</sup><a href="https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/"><u><sup>6</sup></u></a></p></td><td><p>Geo-blocked traffic from known bot sources along with blocking several subnets and even countries<a href="https://incidentdatabase.ai/cite/1001/"><u><sup>2</sup></u></a><sup>,</sup><a href="https://cryptodamus.io/en/articles/news/ai-web-scrapers-attacking-open-source-here-s-how-to-fight-back"><u><sup>5</sup></u></a><sup>,</sup><a href="https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/"><u><sup>6</sup></u></a></p></td></tr><tr><td><p><a 
href="https://diasporafoundation.org/"><u>Diaspora</u></a></p></td><td><p>Aggressive scraping without respecting robots.txt<a href="https://diaspo.it/posts/2594"><u><sup>7</sup></u></a></p></td><td><p>Slow response and downtime for human users<a href="https://diaspo.it/posts/2594"><u><sup>7</sup></u></a></p></td><td><p>Blocked crawler traffic and added rate limits<a href="https://diaspo.it/posts/2594"><u><sup>7</sup></u></a></p></td></tr></table><p>The impact is severe: Wikimedia experienced a 50% surge in multimedia bandwidth usage due to bulk image scraping. Fedora, which hosts large software packages, and the Diaspora social network suffered from heavy load and poor performance for human users. Many others have noted bandwidth increases or slowdowns from AI bots repeatedly downloading large files. While blocking crawler traffic mitigates some of the impact, a smarter cache architecture would let site operators serve AI crawlers while maintaining response times for their human users.</p>
    <div>
      <h3>AI-aware caching</h3>
      <a href="#ai-aware-caching">
        
      </a>
    </div>
    <p>AI crawlers power live applications such as <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>retrieval-augmented generation (RAG)</u></a> or real-time summarization, so latency matters. That’s why these requests should be routed to caches that can balance larger capacity with moderate response times. These caches should still preserve freshness, but can tolerate slightly higher access latency than human-facing caches. </p><p>AI crawlers are also used for building training sets and running large-scale content collection jobs. These workloads can tolerate significantly higher latency and are not time-sensitive. As such, their requests can be served from deep cache tiers that take longer to reach (e.g., origin-side SSD caches), or even delayed using queue-based admission or rate-limiters to prevent backend overload. This also opens the opportunity to defer bulk scraping when infrastructure is under load, without affecting interactive human or AI use cases.</p><p>Existing projects like Cloudflare’s <a href="https://blog.cloudflare.com/an-ai-index-for-all-our-customers/"><u>AI Index</u></a> and <a href="https://blog.cloudflare.com/markdown-for-agents/"><u>Markdown for Agents</u></a> allow website operators to present a simplified or reduced version of websites to known AI agents and bots. We're making plans to do much more to mitigate the impact of AI traffic on CDN cache, leading to better cache performance for everyone. With our collaborators at ETH Zurich, we’re experimenting with two complementary approaches: first, traffic filtering with AI-aware caching algorithms; and second, exploring the addition of an entirely new cache layer to siphon AI crawler traffic to a cache that will improve performance for both AI crawlers and human traffic. 
</p><p>There are several different types of cache replacement algorithms, such as LRU (“Least Recently Used”), LFU (“Least Frequently Used”), or FIFO (“First-In, First-Out”), that govern how a storage cache chooses to evict elements from the cache when a new element needs to be added and the cache is full. LRU is often the best balance of simplicity, low overhead, and effectiveness for generic situations, and is widely used. For mixed human and AI bot traffic, however, our initial experiments indicate that a different choice of cache replacement algorithm, particularly using <a href="https://cachemon.github.io/SIEVE-website/"><u>SIEVE</u></a> or <a href="https://s3fifo.com/"><u>S3FIFO</u></a>, could allow human traffic to achieve the same hit rate with or without AI interference. We are also experimenting with workload-aware, machine-learning-based caching algorithms that adapt cache behavior in real time for a faster and cheaper cache. </p><p>Long term, we expect that a separate cache layer for AI traffic will be the best way forward. Imagine a cache architecture that routes human and AI traffic to distinct tiers deployed at different layers of the network. Human traffic would continue to be served from edge caches located at CDN PoPs, which prioritize responsiveness and <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cache-hit-ratio/"><u>cache hit rates</u></a>. For AI traffic, cache handling could vary by task type. </p>
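As a concrete illustration of why these algorithms resist scans, here is a minimal Go sketch of SIEVE based on the published algorithm (a FIFO queue with a per-entry visited bit and a hand that sweeps from tail to head) — not Cloudflare's production code. Because a hit only sets a bit and never reorders the queue, a one-pass crawler scan flows through without displacing content that humans keep re-requesting:

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached object plus its lazy-promotion bit.
type entry struct {
	key     string
	visited bool
}

// Sieve keeps a FIFO queue (front = newest). On a hit it sets the
// visited bit; on eviction a "hand" sweeps from tail toward head,
// clearing visited bits and evicting the first unvisited entry.
type Sieve struct {
	capacity int
	queue    *list.List
	items    map[string]*list.Element
	hand     *list.Element // survives across evictions; nil means start at tail
}

func NewSieve(capacity int) *Sieve {
	return &Sieve{capacity: capacity, queue: list.New(), items: map[string]*list.Element{}}
}

// Access touches key and reports whether it was a cache hit.
func (s *Sieve) Access(key string) bool {
	if el, ok := s.items[key]; ok {
		el.Value.(*entry).visited = true // hit: mark only, never move
		return true
	}
	if s.queue.Len() >= s.capacity {
		s.evict()
	}
	s.items[key] = s.queue.PushFront(&entry{key: key})
	return false
}

func (s *Sieve) evict() {
	el := s.hand
	if el == nil {
		el = s.queue.Back()
	}
	for el.Value.(*entry).visited { // second-chance sweep
		el.Value.(*entry).visited = false
		if el = el.Prev(); el == nil {
			el = s.queue.Back() // wrap the hand back to the tail
		}
	}
	s.hand = el.Prev() // remember position for the next eviction
	delete(s.items, el.Value.(*entry).key)
	s.queue.Remove(el)
}

func main() {
	s := NewSieve(2)
	s.Access("/pricing")
	s.Access("/blog")
	s.Access("/pricing")              // hit: visited bit set, no reordering
	s.Access("/archive/1")            // evicts /blog; the bit spares /pricing
	fmt.Println(s.Access("/pricing")) // true: the hot page survived
}
```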
    <div>
      <h3>This is just the beginning</h3>
      <a href="#this-is-just-the-beginning">
        
      </a>
    </div>
    <p>The impact of AI bot traffic on cloud infrastructure is only going to grow over the next few years. We need better characterization of the effects on CDNs across the globe, along with bold new cache policies and architectures to address this novel workload and help make a better Internet. </p><p>Cloudflare is already solving the problems we’ve laid out here. Cloudflare reduces bandwidth costs for customers who experience high bot traffic with our AI-aware caching, and with our <a href="https://www.cloudflare.com/ai-crawl-control/"><u>AI Crawl Control</u></a> and <a href="https://www.cloudflare.com/paypercrawl-signup/"><u>Pay Per Crawl</u></a> tools, we give customers better control over who programmatically accesses their content.</p><p>We’re just getting started exploring this space. If you're interested in building new ML-based caching algorithms or designing these new cache architectures, please apply for an internship! We have <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Early+Talent"><u>open internship positions</u></a> in Summer and Fall 2026 to work on this and other exciting problems at the intersection of AI and Systems.  </p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cache]]></category>
            <guid isPermaLink="false">635WBzM8GMiVZhyzKFeWMf</guid>
            <dc:creator>Avani Wildani</dc:creator>
            <dc:creator>Suleman Ahmad</dc:creator>
        </item>
        <item>
            <title><![CDATA[Bringing more transparency to post-quantum usage, encrypted messaging, and routing security]]></title>
            <link>https://blog.cloudflare.com/radar-origin-pq-key-transparency-aspa/</link>
            <pubDate>Fri, 27 Feb 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Radar has added new tools for monitoring PQ adoption, KT logs for messaging, and ASPA routing records to track the Internet's migration toward more secure encryption and routing standards.  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare Radar already offers a wide array of <a href="https://radar.cloudflare.com/security/"><u>security insights</u></a> — from application and network layer attacks, to malicious email messages, to digital certificates and Internet routing.</p><p>And today we’re introducing even more. We are launching several new security-related data sets and tools on Radar: </p><ul><li><p>We are extending our post-quantum (PQ) monitoring beyond the client side to now include origin-facing connections. We have also released a new tool to help you check any website's post-quantum encryption compatibility. </p></li><li><p>A new Key Transparency section on Radar provides a public dashboard showing the real-time verification status of Key Transparency Logs for end-to-end encrypted messaging services like WhatsApp, showing when each log was last signed and verified by Cloudflare's Auditor. The page serves as a transparent interface where anyone can monitor the integrity of public key distribution and access the API to independently validate our Auditor’s proofs. </p></li><li><p>Routing Security insights continue to expand with the addition of global, country, and network-level information about the deployment of ASPA, an emerging standard that can help detect and prevent BGP route leaks. </p></li></ul>
    <div>
      <h2>Measuring origin post-quantum support</h2>
      <a href="#measuring-origin-post-quantum-support">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gs0x3zMZTxios168jT9xW/179d8959b5e0939835cf6facef797457/1.png" />
          </figure><p>Since <a href="https://x.com/CloudflareRadar/status/1788277817362329983"><u>April 2024</u></a>, we have tracked the aggregate growth of client support for post-quantum encryption on Cloudflare Radar, chronicling its global growth from <a href="https://radar.cloudflare.com/adoption-and-usage?dateStart=2024-01-01&amp;dateEnd=2024-01-31#post-quantum-encryption-adoption"><u>under 3% at the start of 2024</u></a>, to <a href="https://radar.cloudflare.com/adoption-and-usage?dateStart=2026-02-01&amp;dateEnd=2026-02-28#post-quantum-encryption-adoption"><u>over 60% in February 2026</u></a>. And in October 2025, <a href="https://blog.cloudflare.com/pq-2025/#what-you-can-do-today-to-stay-safe-against-quantum-attacks"><u>we added the ability</u></a> for users to <a href="https://radar.cloudflare.com/adoption-and-usage#browser-support"><u>check</u></a> whether their browser supports <a href="https://developers.cloudflare.com/ssl/post-quantum-cryptography/pqc-support/#x25519mlkem768"><code><u>X25519MLKEM768</u></code></a> — a hybrid key exchange algorithm combining classical <a href="https://www.rfc-editor.org/rfc/rfc8410"><code><u>X25519</u></code></a> with <a href="https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"><u>ML-KEM</u></a>, a lattice-based post-quantum scheme standardized by NIST. This provides security against both classical and quantum attacks. </p><p>However, post-quantum encryption support on user-to-Cloudflare connections is only part of the story.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67cvSmOaISIHjrKKRHKPzg/e0ccf032658904fd6beaa7de7340b561/2.png" />
          </figure><p>For content not in our CDN cache, or for uncacheable content, Cloudflare’s edge servers establish a separate connection with a customer’s origin servers to retrieve it. To accelerate the transition to quantum-resistant security for these origin-facing fetches, we <a href="https://blog.cloudflare.com/post-quantum-to-origins/"><u>previously introduced an API</u></a> allowing customers to opt in to preferring post-quantum connections. Today, we’re making post-quantum compatibility of origin servers visible on Radar.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6KvV2meYLEPbNIQyHP6yji/9477a134c8f5f6a7aaecd6257cd59981/3.png" />
          </figure><p>The new origin post-quantum support graph on Radar illustrates the share of customer origins supporting <code>X25519MLKEM768</code>. This data is derived from <a href="https://blog.cloudflare.com/automatically-secure/"><u>our automated TLS scanner</u></a>, which probes TLS 1.3-compatible origins and aggregates the results daily. It is important to note that our scanner tests for support rather than the origin server's preference: even if an origin supports a post-quantum key exchange algorithm, its local TLS key exchange preference ordering ultimately determines which algorithm is negotiated.</p><p>While the headline graph focuses on post-quantum readiness, the scanner also evaluates support for classical key exchange algorithms. Within the Radar <a href="https://radar.cloudflare.com/explorer?dataSet=post_quantum.origin&amp;groupBy=key_agreement#result"><u>Data Explorer view</u></a>, you can also see the full distribution of these supported TLS key exchange methods.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5PBOoQSCcIAQrYezKp1pJU/d4218aba59deef6c21df53856a93040a/4.png" />
          </figure><p>As shown in the graphs above, approximately 10% of origins could benefit from a post-quantum-preferred key agreement today. This represents a significant jump from less than 1% at the start of 2025 — <a href="https://radar.cloudflare.com/explorer?dataSet=post_quantum.origin&amp;groupBy=key_agreement&amp;dt=2025-01-01_2025-12-31"><u>a 10x increase in just over a year</u></a>. We expect this number to grow steadily as the industry continues its migration. This upward trend likely accelerated in 2025 as many server-side TLS libraries, such as <a href="https://openssl-library.org/post/2025-04-08-openssl-35-final-release/"><u>OpenSSL 3.5.0+</u></a>,<a href="https://www.gnutls.org/"><u> GnuTLS 3.8.9+</u></a>, and <a href="https://go.dev/doc/go1.24#cryptotlspkgcryptotls"><u>Go 1.24+</u></a>, enabled hybrid post-quantum key exchange by default, allowing platforms and services to support post-quantum connections simply by upgrading their cryptographic library dependencies.</p><p>In addition to the Radar and Data Explorer graphs, the <a href="https://developers.cloudflare.com/api/resources/radar/subresources/post_quantum/subresources/origin/"><u>origin readiness data is available through the Radar API</u></a> as well.</p><p>As an additional part of our efforts to help the Internet transition to post-quantum cryptography, we are also launching <a href="https://radar.cloudflare.com/post-quantum#website-support"><u>a tool to test whether a specific hostname supports post-quantum encryption</u></a>. These tests can be run against any publicly accessible website, as long as they allow connections from Cloudflare’s <a href="https://www.cloudflare.com/ips/"><u>egress IP address ranges</u></a>. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dgwK3i7IeLLSUt5xnk4lf/276e25dda3389f6e0ad83a26acd08fec/5.png" />
          </figure><p><sub><i>A screenshot of the tool in Radar to test whether a hostname supports post-quantum encryption.</i></sub></p><p>The tool presents a simple form where users can enter a hostname (such as <a href="https://radar.cloudflare.com/post-quantum?host=cloudflare.com%3A443"><code><u>cloudflare.com</u></code></a> or <a href="https://radar.cloudflare.com/post-quantum?host=www.wikipedia.org%3A443"><code><u>www.wikipedia.org</u></code></a>) and optionally specify a custom port (the default is <a href="https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=443"><u>443, the standard HTTPS port</u></a>). After clicking "Test", the result displays a tag indicating PQ support status alongside the negotiated TLS key exchange algorithm. If the server prefers PQ secure connections, a green "PQ" tag appears with a message confirming the connection is "post-quantum secure." Otherwise, a red tag indicates the connection is "not post-quantum secure", showing the classical algorithm that was negotiated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3rfEG4dMlwR4FJkaKXTRWF/8cab135242057ce57f3b0e4a92be4cec/6.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PXu3kjzwhVkb29kIFREOn/41785c06297e0667ff9e2b261ae9b819/7.png" />
          </figure><p>Under the hood, this tool uses <a href="https://developers.cloudflare.com/containers/"><u>Cloudflare Containers</u></a> — a new capability that allows running container workloads alongside Workers. Since the Workers runtime does not expose details of the underlying TLS handshake, Workers cannot perform these TLS scans directly. Therefore, we created a Go container that leverages the <a href="https://pkg.go.dev/crypto/tls"><code><u>crypto/tls</u></code></a> package's built-in post-quantum key exchange support. The container runs on-demand and performs the actual handshake to determine the negotiated TLS key exchange algorithm, returning results through the <a href="https://developers.cloudflare.com/api/resources/radar/subresources/post_quantum/subresources/tls/methods/support/"><u>Radar API</u></a>.</p><p>With the addition of these origin-facing insights, complementing the existing client-facing insights, we have moved all the post-quantum content to <a href="https://radar.cloudflare.com/post-quantum"><u>its own section on Radar</u></a>. </p>
    <div>
      <h2>Securing E2EE messaging systems with Key Transparency</h2>
      <a href="#securing-e2ee-messaging-systems-with-key-transparency">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71b8HJK1iT0udJscvkqqI4/778efb329047fca017ff2cf4153330ad/8.png" />
          </figure><p><a href="https://www.cloudflare.com/learning/privacy/what-is-end-to-end-encryption/"><u>End-to-end encrypted (E2EE)</u></a> messaging apps like WhatsApp and Signal have become essential tools for private communication, relied upon by billions of people worldwide. These apps use <a href="https://www.cloudflare.com/learning/ssl/how-does-public-key-encryption-work/"><u>public-key cryptography</u></a> to ensure that only the sender and recipient can read the contents of their messages — not even the messaging service itself. However, there's an often-overlooked vulnerability in this model: users must trust that the messaging app is distributing the correct public keys for each contact.</p><p>If an attacker were able to substitute an incorrect public key in the messaging app's database, they could intercept messages intended for someone else — all without the sender knowing.</p><p>Key Transparency addresses this challenge by creating an auditable, append-only log of public keys — similar in concept to <a href="https://radar.cloudflare.com/certificate-transparency"><u>Certificate Transparency</u></a> for TLS certificates. Messaging apps publish their users' public keys to a transparency log, and independent third parties can verify and vouch that the log has been constructed correctly and consistently over time. In September 2024, Cloudflare <a href="https://blog.cloudflare.com/key-transparency/"><u>announced</u></a> such a Key Transparency auditor for WhatsApp, providing an independent verification layer that helps ensure the integrity of public key distribution for the messaging app's billions of users.</p><p>Today, we're publishing Key Transparency audit data in a new <a href="https://radar.cloudflare.com/key-transparency"><u>Key Transparency section</u></a> on Cloudflare Radar. 
This section showcases the Key Transparency logs that Cloudflare audits, giving researchers, security professionals, and curious users a window into the health and activity of these critical systems.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1LZ1DUzv0SCgBa0XqDURKP/26ccd8b0741073895cbb52aa7f1d5643/image11.png" />
          </figure><p>The new page launches with two monitored logs: WhatsApp and Facebook Messenger Transport. Each monitored log is displayed as a card containing the following information:</p><ul><li><p><b>Status:</b> Indicates whether the log is online, in initialization, or disabled. An "online" status means the log is actively publishing key updates into epochs that Cloudflare audits. (An epoch represents a set of updates applied to the key directory at a specific time.)</p></li><li><p><b>Last signed epoch:</b> The most recent epoch that has been published by the messaging service's log and acknowledged by Cloudflare. By clicking on the eye icon, users can view the full epoch data in JSON format, including the epoch number, timestamp, cryptographic digest, and signature.</p></li><li><p><b>Last verified epoch:</b> The most recent epoch that Cloudflare has verified. Verification involves checking that the transition of the transparency log data structure from the previous epoch to the current one represents a valid tree transformation — ensuring the log has been constructed correctly. The verification timestamp indicates when Cloudflare completed its audit.</p></li><li><p><b>Root:</b> The current root hash of the <a href="https://github.com/facebook/akd"><u>Auditable Key Directory (AKD)</u></a> tree. This hash cryptographically represents the entire state of the key directory at the current epoch. 
Like the epoch fields, users can click to view the complete JSON response from the auditor.</p></li></ul><p>The data shown on the page is also available via the Key Transparency Auditor API, with endpoints for <a href="https://developers.cloudflare.com/key-transparency/api/auditor-information/"><u>auditor information</u></a> and <a href="https://developers.cloudflare.com/key-transparency/api/namespaces/"><u>namespaces</u></a>.</p><p>If you would like to perform audit proof verification yourself, you can follow the instructions in our <a href="https://blog.cloudflare.com/key-transparency/"><u>Auditing Key Transparency blog post</u></a>. We hope that these use cases are the first of many that we publish in this Key Transparency section in Radar — if your company or organization is interested in auditing for your public key or related infrastructure, you can <a href="https://www.cloudflare.com/lp/privacy-edge/"><u>reach out to us here</u></a>.</p>
    <div>
      <h2>Tracking RPKI ASPA adoption</h2>
      <a href="#tracking-rpki-aspa-adoption">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LAbrwY9ziVbe1BzfUyl7K/821a40f86c62dd9b44f7bcaee018dd28/10.png" />
          </figure><p>While the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>Border Gateway Protocol (BGP)</u></a> is the backbone of Internet routing, it was designed without built-in mechanisms to verify the validity of the paths it propagates. This inherent trust has long left the global network vulnerable to route leaks and hijacks, where traffic is accidentally or maliciously detoured through unauthorized networks.</p><p>Although <a href="https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure"><u>RPKI</u></a> and <a href="https://www.arin.net/resources/manage/rpki/roas/"><u>Route Origin Authorizations (ROAs)</u></a> have successfully hardened the origin of routes, they cannot verify the path traffic takes between networks. This is where <a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/"><u>ASPA (Autonomous System Provider Authorization)</u></a><b> </b>comes in. ASPA extends RPKI protection by allowing an <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><u>Autonomous System (AS)</u></a> to cryptographically sign a record listing the networks authorized to propagate its routes upstream. By validating these Customer-to-Provider relationships, ASPA allows systems to detect invalid path announcements with confidence and react accordingly.</p><p>While the specific IETF standard remains <a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/"><u>in draft</u></a>, the operational community is moving fast. 
Support for creating ASPA objects has already landed in the portals of Regional Internet Registries (RIRs) like <a href="https://www.arin.net/announcements/20260120/"><u>ARIN</u></a> and <a href="https://labs.ripe.net/author/tim_bruijnzeels/aspa-in-the-rpki-dashboard-a-new-layer-of-routing-security/"><u>RIPE NCC</u></a>, and validation logic is available in major software routing stacks like <a href="https://www.undeadly.org/cgi?action=article;sid=20231002135058"><u>OpenBGPD</u></a> and <a href="https://bird.network.cz/?get_doc&amp;v=20&amp;f=bird-5.html"><u>BIRD</u></a>.</p><p>To provide better visibility into the adoption of this emerging standard, we have added comprehensive RPKI ASPA support to the <a href="https://radar.cloudflare.com/routing"><u>Routing section</u></a> of Cloudflare Radar. Tracking these records globally allows us to understand how quickly the industry is moving toward better path validation.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6SI6A5vd2bAp3QnBAsJFmZ/24e11445eb0309252d759e88dbf2ba62/11.png" />
          </figure><p>Our new ASPA deployment view allows users to examine the growth of ASPA adoption over time, with the ability to visualize trends across the five <a href="https://en.wikipedia.org/wiki/Regional_Internet_registry"><u>Regional Internet Registries</u></a> (RIRs) based on AS registration. You can view the entire history of ASPA entries, dating back to October 1, 2023, or zoom into specific date ranges to correlate spikes in adoption with industry events, such as the introduction of ASPA features on ARIN and RIPE NCC online dashboards.</p><p>Beyond aggregate trends, we have also introduced a granular, searchable explorer for real-time ASPA content. This table view allows you to inspect the current state of ASPA records, searchable by AS number, AS name, or by filtering for only providers or customer ASNs. This allows network operators to verify that their records are published correctly and to view other networks’ configurations.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/K97G5TC7O1MYwkvFbrdTl/85b27f807401f85d2bbe140f1611a034/12.png" />
          </figure><p>We have also integrated ASPA data directly into the country/region routing pages. Users can now track how different locations are progressing in securing their infrastructure, based on the associated ASPA records from the customer ASNs registered locally.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mhZyfrHexdo1GDAoKZEd7/44b63675595a01939fa4748210d8c482/13.png" />
          </figure><p>On individual AS pages, we have updated the Connectivity section. Now, when viewing the connections of a network, you may see a visual indicator for "ASPA Verified Provider." This annotation confirms that an ASPA record exists authorizing that specific upstream connection, providing an immediate signal of routing hygiene and trust.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lVJY4fZWv3KaFdKwLHfAV/aeb2bc27bdccb6a9025345dbaed5b762/14.png" />
          </figure><p>For ASes that have deployed ASPA, we now display a complete list of authorized provider ASNs along with their details. Beyond the current state, Radar also provides a detailed timeline of ASPA activity involving the AS. This history distinguishes between changes initiated by the AS itself ("As customer") and records created by others designating it as a provider ("As provider"), allowing users to immediately identify when specific routing authorizations were established or modified.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZIlAn2l0sDTLCyEMMcBI9/871b8d7abffe17b3aee060502eaa4c1c/15.png" />
          </figure><p>Visibility is an essential first step toward broader adoption of emerging routing security protocols like ASPA. By surfacing this data, we aim to help operators deploy protections and assist researchers in tracking the Internet's progress toward a more secure routing path. For those who need to integrate this data into their own workflows or perform deeper analysis, we are also exposing these metrics programmatically. Users can now access ASPA content snapshots, historical timeseries, and detailed changes data using the newly introduced endpoints in the<a href="https://developers.cloudflare.com/api/resources/radar/subresources/bgp/subresources/rpki/subresources/aspa/"> <u>Cloudflare Radar API</u></a>.</p>
    <div>
      <h2>As security evolves, so does our data</h2>
      <a href="#as-security-evolves-so-does-our-data">
        
      </a>
    </div>
    <p>Internet security continues to evolve, with new approaches, protocols, and standards being developed to ensure that information, applications, and networks remain secure. The security data and insights available on Cloudflare Radar will continue to evolve as well. The new sections highlighted above serve to expand existing routing security, transparency, and post-quantum insights already available on Cloudflare Radar. </p><p>If you share any of these new charts and graphs on social media, be sure to tag us: <a href="https://x.com/CloudflareRadar"><u>@CloudflareRadar</u></a> (X), <a href="https://noc.social/@cloudflareradar"><u>noc.social/@cloudflareradar</u></a> (Mastodon), and <a href="https://bsky.app/profile/radar.cloudflare.com"><u>radar.cloudflare.com</u></a> (Bluesky). If you have questions or comments, or suggestions for data that you’d like to see us add to Radar, you can reach out to us on social media, or contact us via <a href="#"><u>email</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jAzDXss7PvszWkwGC0q2g/df14de40bf268052fac11239952fc1ed/16.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <category><![CDATA[Routing]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">1Iy1Qvw9TsOhRwgjUYqFxO</guid>
            <dc:creator>David Belson</dc:creator>
            <dc:creator>Mingwei Zhang</dc:creator>
            <dc:creator>André Jesus</dc:creator>
            <dc:creator>Suleman Ahmad</dc:creator>
            <dc:creator>Sabina Zejnilovic</dc:creator>
            <dc:creator>Thibault Meunier</dc:creator>
            <dc:creator>Mari Galicer</dc:creator>
        </item>
        <item>
            <title><![CDATA[React2Shell and related RSC vulnerabilities threat brief: early exploitation activity and threat actor techniques]]></title>
            <link>https://blog.cloudflare.com/react2shell-rsc-vulnerabilities-exploitation-threat-brief/</link>
            <pubDate>Thu, 11 Dec 2025 16:20:00 GMT</pubDate>
            <description><![CDATA[ Early activity indicates that threat actors quickly integrated this vulnerability into their scanning and reconnaissance routines and targeted critical infrastructure including nuclear fuel, uranium and rare earth elements. We outline the tactics they appear to be using and how Cloudflare is protecting customers.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On December 3, 2025, immediately following the public disclosure of the critical, maximum-severity React2Shell vulnerability (CVE-2025-55182), the <a href="https://www.cloudflare.com/cloudforce-one/services/threat-intelligence/"><u>Cloudforce One</u></a> Threat Intelligence team began monitoring for early signs of exploitation. Within hours, we observed scanning and active exploitation attempts, including traffic originating from infrastructure associated with Asian-nexus threat groups.</p><p>Early activity indicates that threat actors quickly integrated this vulnerability into their scanning and reconnaissance routines. We observed systematic probing of exposed systems, testing for the flaw at scale, and incorporating it into broader sweeps of Internet‑facing assets. The identified behavior reveals the actors relied on a combination of tools, such as standard vulnerability scanners and publicly accessible Internet asset discovery platforms, to find potentially vulnerable React Server Components (RSC) deployments exposed to the Internet.</p><p>Patterns in observed threat activity also suggest that the actors focused on identifying specific application metadata — such as icon hashes, <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificate</a> details, or geographic region identifiers — to refine their candidate target lists before attempting exploitation. </p><p>In addition to React2Shell, two additional vulnerabilities affecting specific RSC implementations were disclosed: CVE-2025-55183 and CVE-2025-55184. Both vulnerabilities, while distinct from React2Shell, also relate to RSC payload handling and Server Function semantics, and are described in more detail below.</p>
    <div>
      <h2>Background: React2Shell vulnerability (CVE-2025-55182)</h2>
      <a href="#background-react2shell-vulnerability-cve-2025-55182">
        
      </a>
    </div>
    <p>On December 3, 2025, the React Team <a href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components"><u>disclosed</u></a> a Remote Code Execution (RCE) vulnerability affecting servers using the React Server Components (RSC) Flight protocol. The vulnerability, <a href="https://nvd.nist.gov/vuln/detail/CVE-2025-55182"><u>CVE-2025-55182</u></a>, received a CVSS score of 10.0 and has been informally referred to as React2Shell.</p><p>The underlying cause of the vulnerability is an unsafe deserialization flaw in the RSC Flight data-handling logic. When a server processes attacker-controlled payloads without proper validation, it becomes possible to influence server-side execution flow. In this case, crafted input allows an attacker to inject logic that the server interprets in a privileged context.</p><p>Exploitation is straightforward. A single, specially crafted HTTP request is sufficient; there is no authentication requirement, user interaction, or elevated permissions involved. Once successful, the attacker can execute arbitrary, privileged JavaScript on the affected server.</p><p>This combination of unauthenticated access, trivial exploitation, and full code execution is what places CVE-2025-55182 at the highest severity level and makes it significant for organizations relying on vulnerable versions of React Server Components. </p><p>In response, Cloudflare has deployed new rules across its network, with the default action set to Block. These new protections are included in both the Cloudflare Free Managed Ruleset (available to all Free customers) and the standard Cloudflare Managed Ruleset (available to all paying customers), as detailed below. More information about the different rulesets can be found in our <a href="https://developers.cloudflare.com/waf/managed-rules/#available-managed-rulesets"><u>documentation</u></a>.
</p><table><tr><th><p><b>CVE</b></p></th><th><p><b>Description</b></p></th><th><p><b>Cloudflare WAF Rule ID</b></p></th></tr><tr><td><p><b>CVE-2025-55182</b></p><p>React - RCE</p></td><td><p>Rules to mitigate React2Shell Exploit</p></td><td><p><b>Paid:</b> 33aa8a8a948b48b28d40450c5fb92fba</p><p><b>Free:</b> 2b5d06e34a814a889bee9a0699702280</p></td></tr><tr><td><p><b>CVE-2025-55182 - 2</b></p><p>React - RCE Bypass</p></td><td><p>Additional rules to mitigate exploit bypass</p></td><td><p><b>Paid:</b> bc1aee59731c488ca8b5314615fce168</p><p><b>Free:</b> cbdd3f48396e4b7389d6efd174746aff</p></td></tr><tr><td><p><b>CVE-2025-55182</b></p><p>Scanner Detection</p></td><td><p>Additional paid WAF rule to catch React2Shell scanning attempts</p></td><td><p><b>Paid:</b> 1d54691cb822465183cb49e2f562cf5c</p></td></tr></table><p>
</p>
    <div>
      <h2>Recently disclosed RSC vulnerabilities</h2>
      <a href="#recently-disclosed-rsc-vulnerabilities">
        
      </a>
    </div>
    <p>In addition to React2Shell, two additional vulnerabilities affecting specific RSC implementations were disclosed. The two vulnerabilities, while distinct from React2Shell, also relate to RSC payload handling and Server Function semantics, with corresponding Cloudflare protections noted below:</p><p></p><table><tr><th><p><b>CVE</b></p></th><th><p><b>Description</b></p></th><th><p><b>Cloudflare WAF Rule ID</b></p></th></tr><tr><td><p><b>CVE-2025-55183</b></p><p>Leaking Server Functions</p></td><td><p>In deployments where Server Function identifiers are insufficiently validated, an attacker may force the server into returning the source body of a referenced function</p></td><td><p><b>Paid:</b> 17c5123f1ac049818765ebf2fefb4e9b

<b>Free:</b> 3114709a3c3b4e3685052c7b251e86aa</p></td></tr><tr><td><p><b>CVE-2025-55184</b></p><p>React Function DoS</p></td><td><p>A crafted RSC Flight Payload containing cyclical Promise references can trigger unbounded recursion or event-loop lockups under certain server configurations, resulting in denial-of-service conditions</p></td><td><p><b>Paid:</b> 2694f1610c0b471393b21aef102ec699</p></td></tr><tr><td><p><b>CVE-2025-67779</b></p></td><td><p>Rule for incomplete fix addressing CVE-2025-55184 in React Server Components </p></td><td><p><b>Paid: </b>2694f1610c0b471393b21aef102ec699</p></td></tr></table><p>
</p>
    <div>
      <h3>Investigation of early scanning and exploitation</h3>
      <a href="#investigation-of-early-scanning-and-exploitation">
        
      </a>
    </div>
    <p>The following analysis details the initial wave of activity observed by Cloudforce One, focusing on threat actor attempts to scan for and exploit the React2Shell vulnerability. While these findings represent activity immediately following the vulnerability's release, and were focused on known threat actors, it is critical to note that the volume and scope of related threat activity have expanded dramatically since these first observations.</p>
    <div>
      <h3>Tactics</h3>
      <a href="#tactics">
        
      </a>
    </div>
    <p>Unsurprisingly, the threat actors relied heavily on a mix of publicly available, commercial, and other tools to identify vulnerable servers:</p><ul><li><p><b>Vulnerability intelligence</b>: The actors leveraged vulnerability intelligence databases that aggregated CVEs, advisories, and exploits for tracking and prioritization.</p></li><li><p><b>Vulnerability reconnaissance</b>: The actors conducted searches using large-scale reconnaissance services, indicating reliance on Internet-wide scanning and asset discovery platforms to find exposed systems running React App or RSC components. They also made use of tools that identify the software stack and technologies used by websites.</p></li><li><p><b>Vulnerability scanning</b>: Activity included use of Nuclei (User-Agent: <i>Nuclei - CVE-2025-55182</i>), a popular rapid scanning tool used to deploy YAML-based templates to check for vulnerabilities. The actors were also observed using what is highly likely a dedicated React2Shell scanner, associated with the User-Agent "<i>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 React2ShellScanner/1.0.0</i>".</p></li><li><p><b>Vulnerability exploitation</b>: The actors made use of Burp Suite, a web application security testing platform for identifying and exploiting vulnerabilities in HTTP/S traffic.</p></li></ul>
    <div>
      <h3>Techniques </h3>
      <a href="#techniques">
        
      </a>
    </div>
    <p>
<strong>Recon via Internet-wide scanning and asset discovery platform</strong> <br />
To enumerate potential React2Shell targets, the actors leveraged an Internet-wide scanning and asset-discovery platform commonly used to fingerprint web technologies at scale. Their queries demonstrated a targeted effort to isolate React and Next.js applications — two frameworks directly relevant to the vulnerability — by searching for React-specific icon hashes, framework-associated metadata, and page titles containing React-related keywords. This approach likely allowed them to rapidly build an inventory of exploitable hosts before initiating more direct probing.
</p>
<p>
<strong>Targeting enumeration and filtering </strong><br />
During their reconnaissance phase, the operators applied additional filtering logic to refine their target set and minimize noise. Notably, they excluded Chinese IP space from their searches, indicating that their enumeration workflow intentionally avoided collecting data on possibly domestic infrastructure. They also constrained scanning to specific geographic regions and national networks to identify likely high-value hosts. Beyond basic fingerprinting, the actors leveraged SSL certificate attributes — including issuer details, subject fields, and top-level domains — to surface entities of interest, such as government or critical-infrastructure systems using .gov or other restricted TLDs. This combination of geographic filtering and certificate-based pivoting enabled a more precise enumeration process that prioritized strategically relevant and potentially vulnerable high-value targets. 
</p>
<p>
<strong>Preliminary target analysis</strong><br />
Observed activity reflected a clear focus on strategically significant organizations across multiple regions. Their highest-density probing occurred against networks in Taiwan, Xinjiang Uygur, Vietnam, Japan, and New Zealand — regions frequently associated with geopolitical intelligence collection priorities. Other selective targeting was also observed against entities across the globe, including government (.gov) websites, academic research institutions, and critical‑infrastructure operators. These infrastructure operators specifically included a national authority responsible for the import and export of uranium, rare metals, and nuclear fuel.
</p>
<p>
The actors also prioritized high‑sensitivity technology targets such as enterprise password managers and secure‑vault services, likely due to their potential to provide downstream access to broader organizational credentials and secrets. 
</p>
<p>
Additionally, the campaign targeted edge‑facing SSL VPN appliances whose administrative interfaces may incorporate React-based components, suggesting the actor sought to exploit React2Shell against both traditional web applications and embedded web management frameworks in order to maximize access opportunities.
</p>
<p>
<strong>Early threat actor observations</strong><br />
Cloudforce One analysis confirms that early scanning and exploitation attempts originated from IP addresses previously associated with multiple Asia-affiliated threat actor clusters.  While not all observed IP addresses belong to a single operator, the simultaneous activity suggests shared tooling, infrastructure, or experimentation in parallel among groups with a common purpose and shared targeting objectives. Observed targeting enumeration and filtering (e.g. a focus on Taiwan and Xinjiang Uygur, but exclusion of China), as well as heavy use of certain scanning and asset discovery platforms, suggest general attribution to Asia-linked threat actors.
</p>
    <div>
      <h2>Overall trends</h2>
      <a href="#overall-trends">
        
      </a>
    </div>
    <p>Cloudflare’s Managed Rulesets for React2Shell began detecting significant activity within hours of the vulnerability’s disclosure. The graph below shows the daily hit count across the two exploit-related React2Shell WAF rules. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZPNWf2mq7JFWbJapwsasg/61fc8669da21d8fc8b690386b8ba0915/BLOG-3096_2.png" />
          </figure><p><sup>Aggregate rule hit volume over time</sup></p><p>The React2Shell disclosure triggered a surge of opportunistic scanning and exploit behavior. In total, from 2025-12-03 00:00 UTC to 2025-12-11 17:00 UTC, we received 582.10M hits. That equates to an average of 3.49M hits per hour, with a maximum number of hits in a single hour reaching 12.72M. The average unique IP count per hour was 3,598, with the maximum number of IPs in an hour being 16,585.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/37fQ8Y7Iq1rKsGiqdzS3oo/7027ce50c100bd46fcb93d3a9a88048d/BLOG-3096_3.png" />
          </figure><p><sup>Hourly count of unique IPs sending React2Shell-related probes </sup></p><p>Our data also shows distinct peaks reaching as many as 6,387 unique User-Agents per hour, indicating a heterogeneous mix of tools and frameworks in use; the average number of unique User-Agents per hour was 2,255. The graph below shows unique User-Agent counts for requests triggering the WAF rules (Free and Managed) on matching payloads:  </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6FLgmrryaXpy59O8fy5ncm/b6308ead7ad544b5e2524c97449850d6/image2.png" />
          </figure><p><sup>Unique User-Agent strings used in React2Shell-related requests</sup></p><p>To better understand the types of automated tools probing for React2Shell exposure, Cloudflare analyzed the User-Agent strings associated with React2Shell-related requests since December 3, 2025. The data shows a wide variety of scanning tools suggesting broad Internet-wide reconnaissance: </p><table><tr><th><p><b>Top 10 User Agent strings by exploit attempts</b></p></th></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 Assetnote/1.0.0</p></td></tr><tr><td><p>Block Security Team/Assetnote-HjJacErLyq2xFe01qaCM1yyzs</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 (GIS - AppSec Team - Project Vision)</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36</p></td></tr><tr><td><p>python-requests/2.32.5</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 Assetnote/1.0.0 (ExposureScan)</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36</p></td></tr><tr><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</p></td></tr><tr><td><p>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.1 Safari/605.1.1</p></td></tr></table>
    <div>
      <h3>Payload variation and experimentation</h3>
      <a href="#payload-variation-and-experimentation">
        
      </a>
    </div>
    <p>Cloudflare analyzed the payload sizes associated with requests triggering React2Shell-related detection rules. The long-tailed distribution — dominated by sub-kilobyte probes, but punctuated by extremely large outliers — suggests actors are testing a wide range of payload sizes:</p><table><tr><th><p><b>Metric</b></p></th><th><p><b>Value</b></p></th></tr><tr><td><p>Maximum payload size</p></td><td><p>375 MB</p></td></tr><tr><td><p>Average payload size</p></td><td><p>3.2 KB</p></td></tr><tr><td><p>p25 (25th Percentile)</p></td><td><p>703 B</p></td></tr><tr><td><p>p75 (75th Percentile)</p></td><td><p>818 B</p></td></tr><tr><td><p>p90 (90th Percentile)</p></td><td><p>2.7 KB</p></td></tr><tr><td><p>p99 (99th Percentile)</p></td><td><p>66.5 KB</p></td></tr><tr><td><p>Standard deviation</p></td><td><p>330 KB</p></td></tr></table>
    <div>
      <h2>Additional React vulnerabilities identified </h2>
      <a href="#additional-react-vulnerabilities-identified">
        
      </a>
    </div>
    <p>In parallel with our ongoing analysis of the React2Shell vulnerability, two additional vulnerabilities affecting React Server Components (RSC) implementations have been identified:</p>
    <div>
      <h3>1. React function DoS</h3>
      <a href="#1-react-function-dos">
        
      </a>
    </div>
    <p>The vulnerability <b>CVE-2025-55184</b> was recently disclosed, revealing that React Server Component frameworks can be forced into a Node.js state where the runtime unwraps an infinite recursion of nested Promises.</p><p>This behavior:</p><ul><li><p>Freezes the server indefinitely</p></li><li><p>Prevents yielding back to the event loop</p></li><li><p>Effectively takes the server offline</p></li><li><p>Does not require any specific Server Action usage — merely the presence of a server capable of processing an RSC Server Action payload </p></li></ul><p>The trigger condition is a cyclic promise reference inside the RSC payload.</p>
    <div>
      <h3>2. Leaking server functions </h3>
      <a href="#2-leaking-server-functions">
        
      </a>
    </div>
    <p>Another vulnerability, <b>CVE-2025-55183</b>, was also recently disclosed, revealing that certain React Server Component frameworks can leak server-only source code under specific conditions.</p><p>If an attacker gains access to a Server Function that:</p><ul><li><p>Accepts an argument that undergoes string coercion, and</p></li><li><p>Does not validate that the argument is of an expected primitive type</p></li></ul><p>then the attacker can coerce that argument into a reference to a different Server Function. The coerced value’s toString() output causes the server to return the source code of the referenced Server Function.</p>
    <div>
      <h2>How Cloudflare is protecting customers</h2>
      <a href="#how-cloudflare-is-protecting-customers">
        
      </a>
    </div>
    <p>Cloudflare’s protection strategy is multi-layered, relying on both the inherent security model of its platform and immediate, proactive updates to its Web Application Firewall (WAF). </p><ul><li><p>Cloudflare Workers: React-based applications and frameworks deployed on Cloudflare Workers are inherently immune. The Workers security model prevents exploits from succeeding at the runtime layer, regardless of the malicious payload.</p></li><li><p>Proactive WAF deployment: Cloudflare urgently deployed WAF rules to detect and block traffic proxied through its network related to React2Shell and the recently disclosed RSC vulnerabilities.   </p></li></ul><p>The Cloudflare security team continues to monitor for additional attack variations and will update protections as necessary to maintain continuous security for all proxied traffic. </p>
    <div>
      <h2>Continuous monitoring </h2>
      <a href="#continuous-monitoring">
        
      </a>
    </div>
    <p>While Cloudflare's emergency actions — the WAF limit increase and immediate rule deployment — have successfully mitigated the current wave of exploitation attempts, this vulnerability represents a persistent and evolving threat. The immediate weaponization of CVE-2025-55182 by sophisticated threat actors underscores the need for continuous defense.</p><p>Cloudflare remains committed to continuous surveillance for emerging exploit variants and refinement of WAF rules to detect evasive techniques. However, network-level protection is not a substitute for remediation at the source. Organizations must prioritize immediate patching of all affected React and Next.js assets. This combination of platform-level WAF defense and immediate application patching remains the only reliable strategy against this critical threat.</p>
    <div>
      <h2>Indicators of Compromise</h2>
      <a href="#indicators-of-compromise">
        
      </a>
    </div>
    <table><tr><th><p><b>Tool/Scanner</b></p></th><th><p><b>User Agent String</b></p></th><th><p><b>Observation/Purpose</b></p></th></tr><tr><td><p><b>Nuclei</b></p></td><td><p>Nuclei - CVE-2025-55182</p></td><td><p>User-Agent for rapid, template-based scanning for React2Shell vulnerability</p></td></tr><tr><td><p><b>React2ShellScanner</b></p></td><td><p>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 React2ShellScanner/1.0.0</p></td><td><p>User-Agent for a likely custom React2Shell vulnerability scanner</p></td></tr></table><p></p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">6hIbIpaov6tE7iKLlTL1gp</guid>
            <dc:creator>Cloudforce One</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fresh insights from old data: corroborating reports of Turkmenistan IP unblocking and firewall testing]]></title>
            <link>https://blog.cloudflare.com/fresh-insights-from-old-data-corroborating-reports-of-turkmenistan-ip/</link>
            <pubDate>Mon, 03 Nov 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare used historical data to investigate reports of potential new firewall tests in Turkmenistan. Shifts in TCP resets/timeouts across ASNs corroborate large-scale network control system changes.
 ]]></description>
            <content:encoded><![CDATA[ <p>Here at Cloudflare, we frequently use and write about data in the present. But sometimes understanding the present begins with digging into the past.  </p><p>We recently learned of a 2024 <a href="https://turkmen.news/internet-amnistiya-v-turkmenistane-razblokirovany-3-milliarda-ip-adresov-hostingi-i-cdn/"><u>turkmen.news article</u></a> (available in Russian) that reports <a href="https://radar.cloudflare.com/tm"><u>Turkmenistan</u></a> experienced “an unprecedented easing in blocking,” causing over 3 billion previously-blocked IP addresses to become reachable. The same article reports that one of the reasons for unblocking IP addresses was that Turkmenistan may have been testing a new firewall. (The Turkmen government’s tight control over the country’s Internet access <a href="https://www.bbc.com/news/world-asia-16095369"><u>is well-documented</u></a>.) </p><p>Indeed, <a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a> shows a surge of requests coming from Turkmenistan around the same time, as we’ll show below. But we had an additional question: Does the firewall activity show up on Radar, as well? Two years ago, we launched the <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>dashboard on Radar</u></a> to give a window into the TCP connections to Cloudflare that close due to resets and timeouts. These stand out because they are considered ungraceful mechanisms to close TCP connections, according to the TCP specification. </p><p>In this blog post, we go back in time to share what Cloudflare saw in connection resets and timeouts. We must remind our readers that, as passive observers, there are <a href="https://blog.cloudflare.com/connection-tampering/#limitations-of-our-data"><u>limitations on what we can glean from the data</u></a>. For example, our data can’t reveal attribution. 
Even so, the ability to observe our environment <a href="https://blog.cloudflare.com/tricky-internet-measurement/"><u>can be insightful</u></a>. In a recent example, our visibility into resets and timeouts helped corroborate reports of large-scale <a href="https://blog.cloudflare.com/russian-internet-users-are-unable-to-access-the-open-internet/"><u>blocking and traffic tampering by Russia</u></a>.</p>
    <div>
      <h3>Turkmenistan requests where there were none before</h3>
      <a href="#turkmenistan-requests-where-there-were-none-before">
        
      </a>
    </div>
    <p>Let’s look first at the number of requests, since those should increase if IP addresses are unblocked. In mid-June 2024, Cloudflare began to see a noticeable increase in HTTP requests, consistent with <a href="https://turkmen.news/internet-amnistiya-v-turkmenistane-razblokirovany-3-milliarda-ip-adresov-hostingi-i-cdn/"><u>reports</u></a> of Turkmenistan unblocking IPs.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Kqaxxjv9g52RVMWg92AYu/e57468cf523702cadd634c34775be033/BLOG_3069_2.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/traffic/tm?dateStart=2024-06-01&amp;dateEnd=2024-06-30"><sup>Cloudflare Radar</sup></a></p>
    <div>
      <h3>Overall TCP resets and timeouts</h3>
      <a href="#overall-tcp-resets-and-timeouts">
        
      </a>
    </div>
    <p>The Transmission Control Protocol (TCP) is a lower-layer mechanism used to create a connection between clients and servers, and also carries <a href="https://radar.cloudflare.com/adoption-and-usage#http1x-vs-http2-vs-http3"><u>70% of HTTP traffic</u></a> to Cloudflare. A TCP connection works <a href="https://blog.cloudflare.com/connection-tampering/#explaining-tampering-with-telephone-calls"><u>much like a telephone call</u></a> between humans, who follow graceful conventions to end a call—and who are acutely aware when conventions are broken if a call ends abruptly.  </p><p>TCP also defines conventions to end the connection gracefully, and we developed <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>mechanisms to detect</u></a> when they don’t. An ungraceful end is triggered by a reset instruction or a timeout. Some are due to benign artifacts of software design or human user behaviours. However, sometimes they are exploited by <a href="https://blog.cloudflare.com/connection-tampering"><u>third parties to close connections</u></a> in everything from school and enterprise firewalls or software, to zero-rating on mobile plans, to nation-state filtering.</p><p>When we look at connections from Turkmenistan, we see that on June 13, 2024, the combined proportion of the four coloured regions increases; each coloured region represents ungraceful ends at a distinct stage of the connection lifetime. In addition to the combined increase, the relative proportions between stages (or colours) changes as well.</p>
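As a rough mental model (a deliberate simplification of the published methodology, not the production heuristics), the stage of an ungraceful close can be thought of as a function of what the server observed before the reset or timeout:

```python
def classify_stage(handshake_done, data_packets):
    """Toy classifier: where in its lifetime did the connection end?"""
    if not handshake_done:
        return "Post-SYN"   # closed before the TCP handshake completed
    if data_packets == 0:
        return "Post-ACK"   # handshake done, but no data ever arrived
    if data_packets == 1:
        return "Post-PSH"   # closed right after the first data packet
    return "Later"          # closed deeper into the connection

print(classify_stage(False, 0))  # Post-SYN
print(classify_stage(True, 0))   # Post-ACK
print(classify_stage(True, 1))   # Post-PSH
print(classify_stage(True, 5))   # Later
```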
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1hNDpdNS9lDPKg3jFHigiL/ff3de33af7974c5d32ba421cbbc3c42e/BLOG_3069_3.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p><p>Further changes appeared in the weeks that followed. Among them are an increase in Post-PSH (orange) anomalies starting around July 4; a reduction in Post-ACK (light blue) anomalies around July 13; and an increase in anomalies later in connections (green) starting July 22.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6IavKOkF7tB02MtNqJPqqD/f08c78f65894e751b7c9fce9820dee85/BLOG_3069_4.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2024-07-01&amp;dateEnd=2024-07-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p><p>The shifts above <i>could</i> be explained by a large firewall system. It’s important to keep in mind that data in each of the connection stages (captured by the four coloured regions in the graphs) can be explained by browser implementations or user actions. However, shifts at this scale would require a great number of browsers or users doing the same thing at once. Similarly, individual changes in behaviour would be lost unless they occur in large numbers at the same time.</p>
    <div>
      <h3>Digging down to individual networks</h3>
      <a href="#digging-down-to-individual-networks">
        
      </a>
    </div>
    <p>We’ve learned that it can be helpful to look at the data for individual networks to reveal common patterns between different networks in different regions <a href="https://blog.cloudflare.com/tcp-resets-timeouts/#zero-rating-in-mobile-networks"><u>operated by single entities</u></a>. </p><p>Looking at individual networks within Turkmenistan, trends and timelines appear more pronounced. July 22 in particular sees greater proportions of anomalies associated with the <a href="https://www.cloudflare.com/learning/ssl/what-is-sni/"><u>Server Name Indication</u></a>, or domain name, rather than the IP address (dark blue), although the connection stage where the anomalies appear varies by individual network.</p><p>The general Turkmenistan trends are largely mirrored in connections from <a href="https://radar.cloudflare.com/as20661"><u>AS20661 (TurkmenTelecom)</u></a>, indicating that this <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><u>autonomous system</u></a> (AS) accounts for <a href="https://radar.cloudflare.com/tm#autonomous-systems"><u>a large proportion of Turkmenistan’s traffic</u></a> to Cloudflare’s network. There is a notable reduction in Post-ACK (light blue) anomalies starting around July 26.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ukNOB1CYUAPW2s7ofdqMK/7d1dca367374db90627413e2c40a6ee3/BLOG_3069_5.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p><p>A different picture emerges from <a href="https://radar.cloudflare.com/as51495"><u>AS51495 (Ashgabat City Telephone Network)</u></a>. Post-ACK anomalies almost completely disappear on July 12, corresponding with an increase in anomalies during the Post-PSH stage. An increase of anomalies in the Later (green) connection stage on July 22 is apparent for this AS as well.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7btBYWx2VVVg0MH10yY9ot/17e87bf94f97b1cd43139e432f189770/BLOG_3069_6.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p><p>Finally, for <a href="https://radar.cloudflare.com/as59974"><u>AS59974 (Altyn Asyr)</u></a>, you can see below that there is a clear spike in Post-ACK anomalies starting July 22. This is the stage of the connection where a firewall could have seen the SNI and chosen to drop the packets immediately, so that they never reached Cloudflare’s servers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4pxUHjzkRwnbmaSsgkhiKd/b56fbc84e2fdcd8b889b6e8b3a68dc40/BLOG_3069_7.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p>
    <div>
      <h3>Timeouts and resets in context, never isolation</h3>
      <a href="#timeouts-and-resets-in-context-never-isolation">
        
      </a>
    </div>
    <p>We’ve previously discussed <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>how to use the resets and timeouts</u></a> data because, while useful, it can also be misinterpreted. Radar’s data on resets and timeouts is unique among operators, but in isolation it’s incomplete and subject to human bias. </p><p>Take the figure above for AS59974 where Post-ACK (light blue) anomalies markedly increased on July 22. The Radar view is proportional, meaning that the increase in proportion could be explained by greater numbers of anomalies – but could also be explained, for example, by a smaller number of valid requests. Indeed, looking at the HTTP request levels for the same AS, there was a similarly pronounced drop starting on the same day, as shown below. </p>
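The arithmetic behind that caveat is simple enough to sketch with made-up numbers: in a proportional view, the anomaly share rises either because anomalies grew or because valid traffic shrank.

```python
def anomaly_share(anomalies, valid):
    """Fraction of connections that ended ungracefully."""
    return anomalies / (anomalies + valid)

# Same absolute number of anomalies in both windows...
before = anomaly_share(100, 900)  # 100 of 1000 connections -> 10%
after = anomaly_share(100, 150)   # ...but far fewer valid requests -> 40%
print(f"{before:.0%} -> {after:.0%}")  # 10% -> 40%
```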
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PAYPpcFeInis6zo4lWrSx/f28a1f84fbe5b1c21659911b11331c30/BLOG_3069_8.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p><p>If we look at the same two graphs before July 22, however, rates of reset and timeout anomalies do not appear to mirror the very large shifts up and down in HTTP requests.</p>
    <div>
      <h3>Looking ahead can also mean looking behind</h3>
      <a href="#looking-ahead-can-also-mean-looking-behind">
        
      </a>
    </div>
    <p>These charts from Radar above offer a way to analyze news events from a different angle, by looking at requests and TCP connection resets and timeouts. Does this data tell us definitively that new firewalls were being tested in Turkmenistan? No. But the trends in the data are consistent with what we could expect to see if that were the case.</p><p>When thinking about ways to use the resets and timeouts data going forward, we’d encourage also looking at the data in retrospect—or even further into the past—to improve context.</p><p>A natural question might be, for example, “If Turkmenistan stopped blocking IPs in mid-2024, what did the data say beforehand?” The figure below captures October and November 2023. (The red-shaded region contains missing data due to the <a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage"><u>Nov. 2 Cloudflare control plane and metrics outage</u></a>.) Signals about the Internet in Turkmenistan were evolving well before the <a href="https://turkmen.news/internet-amnistiya-v-turkmenistane-razblokirovany-3-milliarda-ip-adresov-hostingi-i-cdn/"><u>news article</u></a> that prompted us to look.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W4MfieKNV24PmvynAAIfO/af42a2328059eb15fba0619372973887/BLOG_3069_9.png" />
          </figure><p><sup>Source: </sup><a href="https://radar.cloudflare.com/security/network-layer/tm?dateStart=2023-10-01&amp;dateEnd=2023-11-30#tcp-resets-and-timeouts"><sup>Cloudflare Radar</sup></a></p>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>To learn more, see our guide about <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>how to use the resets and timeouts data available on Radar</u></a>, as well as the technical details about our <a href="https://blog.cloudflare.com/connection-tampering/"><u>third-party tampering measurement </u></a>and some perspectives by a former <a href="https://blog.cloudflare.com/experience-of-data-at-scale/"><u>intern who helped drive</u></a> the study. </p><p>We’re proud to offer a unique view of TCP connection anomalies on Radar. It’s a testament to the long-lived benefits that emerge when approaching <a href="https://blog.cloudflare.com/tricky-internet-measurement/"><u>Internet measurement as a science</u></a>. In keeping with the open spirit of science, we’ve also shared how we<a href="https://blog.cloudflare.com/tricky-internet-measurement/"><u> detect and log resets and timeouts</u></a> so that others can reproduce the observability on their servers, whether by hobbyists or other large operators.</p> ]]></content:encoded>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Internet Shutdown]]></category>
            <category><![CDATA[Internet Trends]]></category>
            <category><![CDATA[Trends]]></category>
            <category><![CDATA[Consumer Services]]></category>
            <guid isPermaLink="false">404c64k0KinGRYZkfe0xum</guid>
            <dc:creator>Luke Valenta</dc:creator>
            <dc:creator>Marwan Fayed</dc:creator>
        </item>
        <item>
            <title><![CDATA[Beyond IP lists: a registry format for bots and agents]]></title>
            <link>https://blog.cloudflare.com/agent-registry/</link>
            <pubDate>Thu, 30 Oct 2025 22:00:00 GMT</pubDate>
            <description><![CDATA[ We propose an open registry format for Web Bot Auth to move beyond IP-based identity. This allows any origin to discover and verify cryptographic keys for bots, fostering a decentralized and more trustworthy ecosystem. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>As bots and agents start <a href="https://blog.cloudflare.com/web-bot-auth/"><u>cryptographically signing their requests</u></a>, there is a growing need for website operators to learn public keys as they are setting up their service. I might be able to find the public key material for well-known fetchers and crawlers, but what about the next 1,000 or next 1,000,000? And how do I find their public key material in order to verify that they are who they say they are? This problem is called <i>discovery.</i></p><p>We share this problem with <a href="https://aws.amazon.com/bedrock/agentcore/"><u>Amazon Bedrock AgentCore</u></a>, a comprehensive agentic platform to build, deploy and operate highly capable agents at scale, and their <a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-tool.html"><u>AgentCore Browser</u></a>, a fast, secure, cloud-based browser runtime to enable AI agents to interact with websites at scale. The AgentCore team wants to make it easy for each of their customers to sign <i>their own requests</i>, so that Cloudflare and other operators of CDN infrastructure see agent signatures from individual agents rather than AgentCore as a monolith. (Note: this method does not identify individual users.) In order to do this, Cloudflare needed a way to ingest and register the public keys of AgentCore’s customers at scale. </p><p>In this blog post, we propose a registry of bots and agents as a way to easily discover them on the Internet. We also outline how <a href="https://blog.cloudflare.com/web-bot-auth/"><u>Web Bot Auth</u></a> can be expanded with a registry format. 
Similar to IP lists, which anyone can author and easily import, the <a href="https://datatracker.ietf.org/doc/draft-meunier-webbotauth-registry/"><u>registry format</u></a> is a list of URLs at which to retrieve agent keys.</p><p>We believe such registries should foster and strengthen an open ecosystem of curators that website operators can trust.</p>
    <div>
      <h2>A need for more trustworthy authentication</h2>
      <a href="#a-need-for-more-trustworthy-authentication">
        
      </a>
    </div>
    <p>In May, we introduced a protocol proposal called <a href="https://blog.cloudflare.com/web-bot-auth/"><u>Web Bot Auth</u></a>, which describes how bot and agent developers can cryptographically sign requests coming from their infrastructure. </p><p>There have now been multiple implementations of the proposed protocol, from <a href="https://vercel.com/changelog/vercels-bot-verification-now-supports-web-bot-auth"><u>Vercel</u></a> to <a href="https://changelog.shopify.com/posts/authorize-custom-crawlers-and-tools-with-new-crawler-access-keys"><u>Shopify</u></a> to <a href="https://www.cloudflare.com/press/press-releases/2025/cloudflare-collaborates-with-leading-payments-companies-to-secure-and-enable-agentic-commerce/"><u>Visa</u></a>. It has been actively <a href="https://mailarchive.ietf.org/arch/browse/web-bot-auth/"><u>discussed</u></a> and contributions have been made. Web Bot Auth marks a first step towards moving from brittle identification, like IPs and user agents, to more trustworthy cryptographic authentication. However, like IP addresses, cryptographic keys are a pseudonymous form of identity. If you operate a website without the scale and reach of large CDNs, how do you discover the public key of known crawlers?</p><p>The first protocol proposal suggested one approach: bot operators would provide a newly-defined HTTP header, <code>Signature-Agent</code>, that refers to an HTTP endpoint hosting their keys. Similar to IP addresses, the default is to allow all, but if a particular operator is making too many requests, you can start taking action: tighten their rate limit, contact the operator, etc.</p><p>Here’s an example from <a href="https://help.shopify.com/en/manual/promoting-marketing/seo/crawling-your-store"><u>Shopify's online store</u></a>:</p>
            <pre><code>Signature-Agent: "https://shopify.com"</code></pre>
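Given such a header, a verifier first needs to find the agent's keys. A minimal sketch, assuming keys live under the /.well-known/http-message-signatures-directory path used by the directory URLs elsewhere in this proposal (the function name is ours):

```python
from urllib.parse import urlsplit

# Well-known path where an agent's key directory is expected to live,
# per the directory URLs used in this proposal (still a draft).
WELL_KNOWN = "/.well-known/http-message-signatures-directory"

def key_directory_url(signature_agent):
    """Turn a Signature-Agent header value into a key-directory URL."""
    # Header values arrive quoted, e.g. "https://shopify.com"
    origin = signature_agent.strip('"')
    parts = urlsplit(origin)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("Signature-Agent must be an https origin")
    return f"https://{parts.netloc}{WELL_KNOWN}"

print(key_directory_url('"https://shopify.com"'))
```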
            
    <div>
      <h2>A registry format</h2>
      <a href="#a-registry-format">
        
      </a>
    </div>
    <p>With all that in mind, we come to the following problem. How can Cloudflare ensure customers have control over the traffic they want to allow, with sensible defaults, while fostering an open curation ecosystem that doesn’t lock in customers or small origins?</p><p>Such an ecosystem exists for lists of IP addresses (e.g.<a href="https://github.com/antoinevastel/avastel-bot-ips-lists/blob/master/avastel-proxy-bot-ips-1day.txt"><u> avastel-bot-ips-lists</u></a>) and robots.txt (e.g.<a href="https://github.com/ai-robots-txt/ai.robots.txt"><u> ai-robots-txt</u></a>). For both, you can find canonical lists on the Internet to easily configure your website to allow or disallow traffic from those IPs. They provide direct configuration for your nginx or haproxy, and you can use them to configure your Cloudflare account. For instance, I could import the robots.txt below:</p>
            <pre><code>User-agent: MyBadBot
Disallow: /</code></pre>
            <p>This is where the registry format comes in, providing a list of URLs pointing to Signature Agent keys:</p>
            <pre><code># AI Crawler
https://chatgpt.com/.well-known/http-message-signatures-directory
https://autorag.ai.cloudflare.com/.well-known/http-message-signatures-directory
 
# Test signature agent card
https://http-message-signatures-example.research.cloudflare.com/.well-known/http-message-signatures-directory</code></pre>
            <p>And that's it. A registry could contain a list of all known signature agents, a curated list for academic research agents, for search agents, etc.</p><p>Anyone can maintain and host these lists. Similar to IP or robots.txt lists, you can host such a registry on any public file system. This means you can keep the file in a repository on GitHub, put it on Cloudflare R2, or send it as an email attachment. Cloudflare intends to provide one of the first instances of this registry, so that others can contribute to it or reference it when building their own. </p>
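Consuming such a registry is deliberately trivial. Here is a sketch of a parser, assuming the informal conventions visible in the example above (one URL per line, "#" starting a comment, blank lines ignored); check the draft for the normative grammar:

```python
def parse_registry(text):
    """Extract directory URLs from a registry file."""
    urls = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            urls.append(line)
    return urls

registry = """\
# AI Crawler
https://chatgpt.com/.well-known/http-message-signatures-directory

# Test signature agent card
https://http-message-signatures-example.research.cloudflare.com/.well-known/http-message-signatures-directory
"""
print(parse_registry(registry))
```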
    <div>
      <h2>Learn more about an incoming request</h2>
      <a href="#learn-more-about-an-incoming-request">
        
      </a>
    </div>
    <p>Knowing the Signature-Agent is great, but not sufficient. For instance, to be a verified bot, Cloudflare requires a contact method, in case requests from that infrastructure suddenly fail or change format in a way that causes unexpected errors upstream. In fact, there is a lot of information an origin might want to know: a name for the operator, a contact method, a logo, the expected crawl rate, etc.</p><p>Therefore, to complement the registry format, we have proposed a <a href="https://thibmeu.github.io/http-message-signatures-directory/draft-meunier-webbotauth-registry.html#name-signature-agent-card"><u>signature-agent card format</u></a> that extends the JWKS directory (<a href="https://www.rfc-editor.org/rfc/rfc7517"><u>RFC 7517</u></a>) with additional metadata. Similar to an old-fashioned contact card, it includes all the important information someone might want to know about your agent or crawler. </p><p>We provide an example below for illustration. Note that the fields may change: a jwks-uri could be introduced, the logo made more descriptive, etc.</p>
            <pre><code>{
  "client_name": "Example Bot",
  "client_uri": "https://example.com/bot/about.html",
  "logo_uri": "https://example.com/",
  "contacts": ["mailto:bot-support@example.com"],
  "expected-user-agent": "Mozilla/5.0 ExampleBot",
  "rfc9309-product-token": "ExampleBot",
  "rfc9309-compliance": ["User-Agent", "Allow", "Disallow", "Content-Usage"],
  "trigger": "fetcher",
  "purpose": "tdm",
  "targeted-content": "Cat pictures",
  "rate-control": "429",
  "rate-expectation": "avg=10rps;max=100rps",
  "known-urls": ["/", "/robots.txt", "*.png"],
  "keys": [{
    "kty": "OKP",
    "crv": "Ed25519",
    "kid": "NFcWBst6DXG-N35nHdzMrioWntdzNZghQSkjHNMMSjw",
    "x": "JrQLj5P_89iXES9-vFgrIy29clF9CC_oPPsw3c5D0bs",
    "use": "sig",
    "nbf": 1712793600,
    "exp": 1715385600
  }]
}</code></pre>
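One thing a consumer of such a card can do immediately is check a key's validity window. A sketch using the nbf (not before) and exp (expiry) timestamps from the example above (the field semantics follow JWK conventions; the card format itself may still change):

```python
import time

def key_is_valid(key, now=None):
    """Check whether a card key is inside its nbf/exp validity window."""
    now = int(time.time()) if now is None else now
    if "nbf" in key and now < key["nbf"]:
        return False  # not yet valid
    if "exp" in key and now >= key["exp"]:
        return False  # expired
    return True

# Timestamps taken from the example card above.
key = {"kty": "OKP", "crv": "Ed25519", "nbf": 1712793600, "exp": 1715385600}
print(key_is_valid(key, now=1713000000))  # True: inside the window
print(key_is_valid(key, now=1716000000))  # False: past exp
```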
            
    <div>
      <h2>Operating a registry</h2>
      <a href="#operating-a-registry">
        
      </a>
    </div>
    <p>Amazon Bedrock AgentCore, an agentic platform for building and deploying AI agents at scale, adopted Web Bot Auth for its AgentCore Browser service (learn more in <a href="https://aws.amazon.com/blogs/machine-learning/reduce-captchas-for-ai-agents-browsing-the-web-with-web-bot-auth-preview-in-amazon-bedrock-agentcore-browser/">their post</a>). AgentCore Browser intends to transition from a service signing key that is currently available in their public preview, to customer-specific keys, once the protocol matures. Cloudflare and other operators of origin protection services will be able to see and validate signatures from individual AgentCore customers rather than AgentCore as a whole.</p><p>Cloudflare also offers a registry for bots and agents it trusts, provided through Radar. It uses the <a href="https://assets.radar.cloudflare.com/bots/signature-agent-registry.txt"><u>registry format</u></a>, so you can consume the list of bots trusted by Cloudflare on your own server.</p><p>You can use these registries today – we’ve provided a demo in Go for <a href="https://caddyserver.com/"><u>Caddy server</u></a> that allows importing keys from multiple registries. It’s on <a href="https://github.com/cloudflare/web-bot-auth/pull/52"><u>cloudflare/web-bot-auth</u></a>. The configuration looks like this:</p>
            <pre><code>:8080 {
    route {
        # httpsig middleware is used here
        httpsig {
            registry "http://localhost:8787/test-registry.txt"
        # You can specify multiple registries. All tags will be checked independently
            registry "http://example.test/another-registry.txt"
        }

        # Responds if signature is valid
        handle {
            respond "Signature verification succeeded!" 200
        }
    }
}</code></pre>
            <p>There are several reasons why you might want to operate and curate a registry leveraging the <a href="https://www.ietf.org/archive/id/draft-meunier-webbotauth-registry-01.html#name-signature-agent-card"><u>Signature Agent Card format</u></a>:</p><ol><li><p><b>Monitor incoming </b><code><b>Signature-Agent</b></code><b>s.</b> This should allow you to collect signature-agent cards of agents reaching out to your domain.</p></li><li><p><b>Import them from existing registries, and categorize them yourself.</b> There could be a general registry constructed from the monitoring step above, but registries might be more useful with more categories.</p></li><li><p><b>Establish direct relationships with agents.</b> Cloudflare does this for its<a href="https://radar.cloudflare.com/bots#verified-bots"> <u>bot registry</u></a> for instance, or you might use a public GitHub repository where people can open issues.</p></li><li><p><b>Learn from your users.</b> If you offer a security service, allowing your customers to specify the registries/signature-agents they want to let through allows you to gain valuable insight.</p></li></ol>
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>As cryptographic authentication for bots and agents grows, the need for discovery increases.</p><p>With the introduction of a lightweight format and specification to attach metadata to Signature-Agent, and curate them in the form of registries, we begin to address this need. The HTTP Message Signature directory format is being expanded to include some self-certified metadata, and the registry format supports an open curation ecosystem.</p><p>Down the line, we predict that clients and origins will choose the signature-agents they trust, use a common format to migrate their configuration between CDN providers, and rely on a third-party registry for curation. We are working towards integrating these capabilities into our bot management and rule engines.</p><p>If you’d like to experiment, our demo is on <a href="https://github.com/cloudflare/web-bot-auth/pull/52"><u>GitHub</u></a>. If you’d like to help us, <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"><u>we’re hiring 1,111 interns</u></a> over the course of next year, and have <a href="https://www.cloudflare.com/careers/"><u>open positions</u></a>.</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Bots]]></category>
            <guid isPermaLink="false">3VeTsp2f9v3B1QZF0oglUV</guid>
            <dc:creator>Thibault Meunier</dc:creator>
            <dc:creator>Maxime Guerreiro</dc:creator>
        </item>
        <item>
            <title><![CDATA[Anonymous credentials: rate-limiting bots and agents without compromising privacy]]></title>
            <link>https://blog.cloudflare.com/private-rate-limiting/</link>
            <pubDate>Thu, 30 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ As AI agents change how the Internet is used, they create a challenge for security. We explore how Anonymous Credentials can rate limit agent traffic and block abuse without tracking users or compromising their privacy. ]]></description>
<content:encoded><![CDATA[ <p>The way we interact with the Internet is changing. Not long ago, ordering a pizza meant visiting a website, clicking through menus, and entering your payment details. Soon, you might just <a href="https://www.cnet.com/tech/services-and-software/i-had-chatgpt-order-me-a-pizza-this-could-change-everything/"><u>ask your phone</u></a> to order a pizza that matches your preferences. A program on your device or on a remote server, which we call an <a href="https://developers.cloudflare.com/agents/concepts/what-are-agents/"><u>AI agent</u></a>, would visit the website and orchestrate the necessary steps on your behalf.</p><p>Of course, agents can do much more than order pizza. Soon we might use them to buy concert tickets, plan vacations, or even write, review, and merge pull requests. While some of these tasks will eventually run locally, for now, most are powered by massive AI models running in the biggest datacenters in the world. As agentic AI increases in popularity, we expect to see a large increase in traffic from these AI platforms and a corresponding drop in traffic from more conventional sources (like your phone).</p><p>This shift in traffic patterns has prompted us to assess how to keep our customers online and secure in the AI era. On one hand, the nature of requests is changing: websites optimized for human visitors will have to cope with faster, and potentially greedier, agents. On the other hand, AI platforms may soon become a significant source of attacks, originating from malicious users of the platforms themselves.</p><p>Unfortunately, existing tools for managing such (mis)behavior are likely too coarse-grained to manage this transition. For example, <a href="https://blog.cloudflare.com/per-customer-bot-defenses/"><u>when Cloudflare detects that a request is part of a known attack pattern</u></a>, the best course of action is often to block all subsequent requests from the same source. 
When the source is an AI agent platform, this could mean inadvertently blocking all users of the same platform, even honest ones who just want to order pizza. We started addressing this problem <a href="https://blog.cloudflare.com/web-bot-auth/"><u>earlier this year</u></a>. But as agentic AI grows in popularity, we think the Internet will need more fine-grained mechanisms for managing agents without impacting honest users.</p><p>At the same time, we firmly believe that any such security mechanism must be designed with user privacy at its core. In this post, we'll describe how to use <b>anonymous credentials (AC)</b> to build these tools. Anonymous credentials help website operators enforce a wide range of security policies, like rate-limiting users or blocking a specific malicious user, without ever having to identify any user or track them across requests.</p><p>Anonymous credentials are <a href="https://mailarchive.ietf.org/arch/msg/privacy-pass/--JXbGvkHnLq1iHQKJAnfn5eH9A/"><u>under development at IETF</u></a> in order to provide a standard that can work across websites, browsers, and platforms. The effort is still in its early stages, but we believe this work will play a critical role in keeping the Internet secure and private in the AI era. We will be contributing to this process as we work towards real-world deployment. These are still early days: if you work in this space, we hope you will follow along and contribute as well.</p>
    <div>
      <h2>Let’s build a small agent</h2>
      <a href="#lets-build-a-small-agent">
        
      </a>
    </div>
    <p>To help us discuss how AI agents are affecting web servers, let’s build an agent ourselves. Our goal is to have an agent that can order a pizza from a nearby pizzeria. Without an agent, you would open your browser, figure out which pizzeria is nearby, view the menu and make selections, add any extras (double pepperoni), and proceed to checkout with your credit card. With an agent, it’s the same flow, except that the agent opens and orchestrates the browser on your behalf.</p><p>In the traditional flow, there’s a human all along the way, and each step has a clear intent: list all pizzerias within 3 km of my current location; pick a pizza from the menu; enter my credit card; and so on. An agent, on the other hand, has to infer each of these actions from the prompt "order me a pizza."</p><p>In this section, we’ll build a simple program that takes a prompt and can make outgoing requests. Here’s an example of a simple <a href="https://workers.cloudflare.com/"><u>Worker</u></a> that takes a specific prompt and generates an answer accordingly. You can find the code on <a href="https://github.com/cloudflareresearch/mini-ai-agent-demo"><u>GitHub</u></a>:</p>
            <pre><code>export default {
   async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise&lt;Response&gt; {
       const out = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fp8", {
           prompt: `I'd like to order a pepperoni pizza with extra cheese.
                    Please deliver it to Cloudflare Austin office.
                    Price should not be more than $20.`,
       });


       return new Response(out.response);
   },
} satisfies ExportedHandler&lt;Env&gt;;</code></pre>
            <p>In this context, the LLM provides its best answer. It gives us a plan and instruction, but does not perform the action on our behalf. You and I are able to take a list of instructions and act upon it because we have agency and can affect the world. To allow our agent to interact with more of the world, we’re going to give it control over a web browser.</p><p>Cloudflare offers a <a href="https://developers.cloudflare.com/browser-rendering"><u>Browser Rendering</u></a> service that can bind directly into our Worker. Let’s do that. The following code uses <a href="https://www.stagehand.dev/"><u>Stagehand</u></a>, an automation framework that makes it simple to control the browser. We pass it an instance of Cloudflare remote browser, as well as a client for <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a>.</p>
            <pre><code>import { Stagehand } from "@browserbasehq/stagehand";
import { endpointURLString } from "@cloudflare/playwright";
import { WorkersAIClient } from "./workersAIClient"; // wrapper to convert cloudflare AI


export default {
   async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise&lt;Response&gt; {
       const stagehand = new Stagehand({
           env: "LOCAL",
           localBrowserLaunchOptions: { cdpUrl: endpointURLString(env.BROWSER) },
           llmClient: new WorkersAIClient(env.AI),
           verbose: 1,
       });
       await stagehand.init();


       const page = stagehand.page;
       await page.goto("https://mini-ai-agent.cloudflareresearch.com/llm");


       const { extraction } = await page.extract("what are the pizza available on the menu?");
       return new Response(extraction);
   },
} satisfies ExportedHandler&lt;Env&gt;;</code></pre>
            <p>You can access that code for yourself on <a href="https://mini-ai-agent.cloudflareresearch.com/llm"><i><u>https://mini-ai-agent.cloudflareresearch.com/llm</u></i></a>. Here’s the response we got on October 10, 2025:</p>
            <pre><code>Margherita Classic: $12.99
Pepperoni Supreme: $14.99
Veggie Garden: $13.99
Meat Lovers: $16.99
Hawaiian Paradise: $15.49</code></pre>
            <p>Using the screenshot API of Browser Rendering, we can also inspect what the agent is doing. Here's how the browser renders the page in the example above:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lXTePCTUORCyyOWNNcwZ8/5978abd1878f78107a2c9606c3a1ef51/image4.png" />
          </figure><p>Stagehand allows us to act on components of the page, for example <code>page.act("Click on pepperoni pizza")</code> and <code>page.act("Click on Pay now")</code>. This eases interaction between the developer and the browser.</p><p>To go further and instruct the agent to perform the whole flow autonomously, we have to use the appropriately named <a href="https://docs.stagehand.dev/basics/agent"><u>agent</u></a> mode of Stagehand. This feature is not yet supported by Cloudflare Workers, but is provided below for completeness.</p>
            <pre><code>import { Stagehand } from "@browserbasehq/stagehand";
import { endpointURLString } from "@cloudflare/playwright";
import { WorkersAIClient } from "./workersAIClient";


export default {
   async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise&lt;Response&gt; {
       const stagehand = new Stagehand({
           env: "LOCAL",
           localBrowserLaunchOptions: { cdpUrl: endpointURLString(env.BROWSER) },
           llmClient: new WorkersAIClient(env.AI),
           verbose: 1,
       });
       await stagehand.init();
       
       const agent = stagehand.agent();
       const result = await agent.execute(`I'd like to order a pepperoni pizza with extra cheese.
                                           Please deliver it to Cloudflare Austin office.
                                           Price should not be more than $20.`);


       return new Response(result.message);
   },
} satisfies ExportedHandler&lt;Env&gt;;</code></pre>
            <p>We can see that instead of adding step-by-step instructions, the agent is given control. To actually pay, it would need access to a payment method such as a <a href="https://en.wikipedia.org/wiki/Controlled_payment_number"><u>virtual credit card</u></a>.</p><p>The prompt has one subtlety: we scoped the delivery location to Cloudflare’s Austin office. While the agent responds to us, it needs to understand our context. In this case, the agent operates out of Cloudflare’s edge, a location remote from us, meaning we would be unlikely to pick up a pizza delivered to that <a href="https://www.cloudflare.com/learning/cdn/glossary/data-center/"><u>data center</u></a>.</p><p>The more capabilities we give the agent, the more disruption it can cause. Instead of someone making 5 clicks at a leisurely rate of 1 request per 10 seconds, a program running in a data center could make all 5 requests in a second.</p><p>This agent is simple, but now imagine many thousands of these — some benign, some not — running at datacenter speeds. This is the challenge origins will face.</p>
    <div>
      <h2>Protecting origins</h2>
      <a href="#protecting-origins">
        
      </a>
    </div>
    <p>For humans to interact with the online world, they need a web browser and some peripherals with which to direct the behavior of that browser. Agents are another way of directing a browser, so it may be tempting to think that not much is actually changing from the origin's point of view. Indeed, the most obvious change from the origin's point of view is merely where traffic comes from:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/304j2MNDUNwAaipqmH2Jbt/35beb792bda327a6cf0db3b642bbc4d6/unnamed-1.png" />
          </figure><p>The reason this change is significant has to do with the tools the server has to manage traffic. Websites generally try to be as permissive as possible, but they also need to manage finite resources (bandwidth, CPU, memory, storage, and so on). There are a few basic ways to do this:</p><ol><li><p><b>Global security policy</b>: A server may opt to slow down, CAPTCHA, or even temporarily block requests from all users. This policy may be applied to an entire site, a specific resource, or to requests classified as being part of a known or likely attack pattern. Such mechanisms may be deployed in reaction to an observed spike in traffic, as in a DDoS attack, or in anticipation of a spike in legitimate traffic, as in <a href="https://developers.cloudflare.com/waiting-room/"><u>Waiting Room</u></a>.</p></li><li><p><b>Incentives</b>: Servers sometimes try to incentivize users to use the site when more resources are available. For instance, prices may be lower depending on the client's location or the time of the request. This could be implemented with a <a href="https://developers.cloudflare.com/rules/snippets/when-to-use/"><u>Cloudflare Snippet</u></a>.</p></li></ol><p>While both tools can be effective, they also sometimes cause significant collateral damage. For example, while rate limiting a website's login endpoint <a href="https://developers.cloudflare.com/waf/rate-limiting-rules/best-practices/#protecting-against-credential-stuffing"><u>can help prevent credential stuffing attacks</u></a>, it also degrades the user experience for non-attackers. Before resorting to such measures, servers will first try to apply the security policy (whether a rate limit, a CAPTCHA, or an outright block) to individual users or groups of users.</p><p>However, in order to apply a security policy to individuals, the server needs some way of identifying them. 
Historically, this has been done via some combination of IP addresses, <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent"><u>User-Agent</u></a>, an account tied to the user identity (if available), and other fingerprints. Like most cloud service providers, Cloudflare has a <a href="https://developers.cloudflare.com/waf/rate-limiting-rules/best-practices/"><u>dedicated offering</u></a> for per-user rate limits based on such heuristics.</p><p>Fingerprinting works for the most part. However, its failure modes are inequitably distributed. On mobile, users <a href="https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/#captchas-dont-work-in-mobile-environments-pats-remove-the-need-for-them"><u>have an especially difficult time solving CAPTCHA</u></a>s; when using a VPN, they’re <a href="https://arstechnica.com/tech-policy/2023/07/meta-blocking-vpn-access-to-threads-in-eu/"><u>more</u></a> <a href="https://help.netflix.com/en/node/277"><u>likely</u></a> <a href="https://www.theregister.com/2015/10/19/bbc_cuts_off_vpn_to_iplayer/"><u>to</u></a> <a href="https://torrentfreak.com/hulu-blocks-vpn-users-over-piracy-concerns-140425/"><u>be</u></a> <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/troubleshooting/common-issues/"><u>blocked</u></a>; and when using <a href="https://www.peteresnyder.com/static/papers/speedreader-www2019.pdf"><u>reading mode</u></a>, they can distort their fingerprint and prevent the page from rendering.</p><p>Likewise, agentic AI only exacerbates the limitations of fingerprinting. 
Not only will more traffic be concentrated on a smaller range of source IPs, but the agents themselves will also run on the same software and hardware platforms, making it harder to distinguish honest users from malicious ones.</p><p>Something that could help is <a href="https://blog.cloudflare.com/web-bot-auth/"><u>Web Bot Auth</u></a>, which would allow agents to identify to the origin which platform operates them. However, we wouldn't want to extend this mechanism — intended for identifying the platform itself — to identifying individual users of the platforms, as this would create an unacceptable privacy risk for those users.</p><p>We need some way of implementing security controls for individual users without identifying them. But how? The Privacy Pass protocol provides <a href="https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/#captchas-dont-work-in-mobile-environments-pats-remove-the-need-for-them"><u>a partial solution</u></a>.</p>
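<p>To make the fingerprint-based approach concrete, here is a minimal, hypothetical sketch (not Cloudflare's implementation) of a fixed-window rate limiter keyed by whatever fingerprint a server can derive. Its weakness is exactly the one described above: every client that maps to the same fingerprint shares a single budget.</p>

```typescript
// Hypothetical sketch: a fixed-window rate limiter keyed by a client
// fingerprint (e.g. IP address + User-Agent). Not a real implementation.

class FingerprintRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed, false if it should be throttled.
  allow(fingerprint: string, now: number): boolean {
    const w = this.windows.get(fingerprint);
    if (!w || now - w.start >= this.windowMs) {
      // First request in a new window: reset the counter for this key.
      this.windows.set(fingerprint, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false; // budget exhausted for this key
    w.count++;
    return true;
  }
}
```

<p>Two honest users behind the same VPN or agent platform present the same key to <code>allow</code>, so one user's burst exhausts the budget for both. That collateral damage is what the rest of this post works to avoid.</p>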
    <div>
      <h2>Privacy Pass and its limitations</h2>
      <a href="#privacy-pass-and-its-limitations">
        
      </a>
    </div>
    <p>Today, one of the most prominent use cases for Privacy Pass is to <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/"><u>rate limit</u></a> requests from a user to an origin, as we have <a href="https://blog.cloudflare.com/privacy-pass-standard/#privacy-pass-protocol"><u>discussed before</u></a>. The protocol works roughly as follows. The client is <b>issued</b> a number of <b>tokens</b>. Each time it wants to make a request, it <b>redeems</b> one of its tokens to the origin; the origin allows the request through only if the token is <b>fresh</b>, i.e., has never been observed before by the origin.</p><p>In order to use Privacy Pass for per-user rate-limiting, it's necessary to limit the number of tokens issued to each user (e.g., 100 tokens per user per hour). To rate limit an AI agent, the AI platform would take on this role. To obtain tokens, the user would log in to the platform, which would then allow the user to get tokens from the issuer. The AI platform fulfills the <b>attester</b> role in Privacy Pass <a href="https://datatracker.ietf.org/doc/html/rfc9576"><u>parlance</u></a>. The attester is the party guaranteeing the per-user property of the rate limit. The AI platform, as an attester, is incentivized to enforce this token distribution because it stakes its reputation: should it allow too many tokens to be issued, the issuer could stop trusting it.</p>
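<p>The attester's bookkeeping can be sketched as follows. This is a hypothetical illustration: the class and method names are ours, and the cryptographic issuance itself (the blind signatures) is out of scope.</p>

```typescript
// Hypothetical sketch of the attester-side token budget: each logged-in
// user may obtain at most `budget` tokens per window (e.g. 100 per hour).
// The blind-signature issuance the tokens go through is omitted.

class Attester {
  private grants = new Map<string, { start: number; count: number }>();

  constructor(private budget: number, private windowMs: number) {}

  // Returns how many of the n requested tokens the attester will let the
  // user obtain from the issuer in the current window.
  approve(userId: string, n: number, now: number): number {
    let g = this.grants.get(userId);
    if (!g || now - g.start >= this.windowMs) {
      g = { start: now, count: 0 };
      this.grants.set(userId, g);
    }
    const granted = Math.min(n, this.budget - g.count);
    g.count += granted;
    return granted;
  }
}
```

<p>Note that the attester sees user identities at issuance time; the unlinkability properties below guarantee it still cannot tell which tokens were later redeemed where.</p>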
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/62Rz5eS1UMm2pKorpowEGg/4949220bdf2fa3c39ccfa17d4df70fff/token__1_.png" />
          </figure><p>The issuance and redemption protocols are designed to have two properties:</p><ul><li><p>Tokens are <b>unforgeable</b>: only the issuer can issue valid tokens.</p></li><li><p>Tokens are <b>unlinkable</b>: no party, including the issuer, attester, or origin, can tell which user a token was issued to.</p></li></ul><p>These properties can be achieved using a <a href="https://csrc.nist.gov/glossary/term/cryptographic_primitive"><u>cryptographic primitive</u></a> called a <a href="https://blog.cloudflare.com/privacy-pass-the-math/"><u>blind signature</u></a> scheme. In a conventional signature scheme, the signer uses its <b>private key</b> to produce a signature for a message. Later on, a verifier can use the signer’s <b>public key</b> to verify the signature. Blind signature schemes work in the same way, except that the message to be signed is blinded such that the signer doesn't know the message it's signing. The client “blinds” the message to be signed and sends it to the server, which then computes a blinded signature over the blinded message. The client obtains the final signature by unblinding the blinded signature.</p><p>This is exactly how the standardized Privacy Pass issuance protocols are defined by <a href="https://www.rfc-editor.org/rfc/rfc9578"><u>RFC 9578</u></a>:</p><ul>
  <li>
    <strong>Issuance:</strong> The user generates a random message 
    <strong>$k$</strong> 
    which we call the 
    <strong>nullifier</strong>. Concretely, this is just a random, 32-byte string. It then blinds the nullifier and sends it to the issuer. The issuer replies with a blind signature. Finally, the user unblinds the signature to get 
    <strong>$\sigma$</strong>, 
    a signature for the nullifier 
    <strong>$k$</strong>. The token is the pair 
    <strong>$(k, \sigma)$</strong>.
  </li>
  <li>
    <strong>Redemption:</strong> When the user presents 
    <strong>$(k, \sigma)$</strong>, 
    the origin checks that 
    <strong>$\sigma$</strong> 
    is a valid signature for the nullifier 
    <strong>$k$</strong> 
    and that 
    <strong>$k$</strong> 
    is fresh. If both conditions hold, then it accepts and lets the request through.
  </li>
</ul><p>Blind signatures are simple, cheap, and perfectly suited for many applications. However, they have some limitations that make them unsuitable for our use case.</p><p>First, the communication cost of the issuance protocol is too high. For each token issued, the user sends a 256-byte, blinded nullifier and the issuer replies with a 256-byte blind signature (assuming RSA-2048 is used). That's 0.5 KB of additional communication per token, or 500 KB for every 1,000 tokens. This is manageable, as we’ve seen in a <a href="https://eprint.iacr.org/2023/414.pdf"><u>previous experiment</u></a> for Privacy Pass, but not ideal. Ideally, the bandwidth would be sublinear in the rate limit we want to enforce. Oblivious Pseudorandom Functions (<a href="https://datatracker.ietf.org/doc/rfc9497/"><u>VOPRF</u></a>) offer an alternative to blind signatures with lower compute time, but their bandwidth is still asymptotically linear. We’ve <a href="https://blog.cloudflare.com/privacy-pass-the-math/"><u>discussed them in the past</u></a>, as they served as the basis for early deployments of Privacy Pass.</p><p>Second, blind signatures can't be used to rate-limit on a per-origin basis. Ideally, when issuing $N$ tokens to the client, the client would be able to redeem at most $N$ tokens at any origin server that can verify the token's validity. However, the client can't safely redeem the same token at more than one server because it would be possible for the servers to link those redemptions to the same client. What's needed is some mechanism for what we'll call <b>late origin-binding</b>: transforming a token for redemption at a particular origin in a way that's unlinkable to other redemptions of the same token.</p><p>Third, once a token is issued, it can't be revoked: it remains valid as long as the issuer's public key is valid. This makes it impossible for an origin to block a specific user if it detects an attack, or if its tokens are compromised. 
The origin can block the offending request, but the user can continue to make requests using its remaining token budget.</p>
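<p>The origin-side redemption check described above can be sketched in a few lines. The signature verification is stubbed out here (a real deployment would verify a Blind RSA or VOPRF token per RFC 9578), and the set of seen nullifiers is kept in memory:</p>

```typescript
// Sketch of origin-side Privacy Pass redemption: accept a token (k, sigma)
// only if the signature verifies and the nullifier k is fresh. The
// cryptographic check is a stub passed in by the caller.

type Token = { nullifier: string; signature: string };

class RedemptionChecker {
  private seen = new Set<string>();

  constructor(private verify: (t: Token) => boolean) {}

  redeem(t: Token): boolean {
    if (!this.verify(t)) return false;            // unforgeability: signature must be valid
    if (this.seen.has(t.nullifier)) return false; // freshness: k must never have been seen
    this.seen.add(t.nullifier);                   // burn the nullifier
    return true;
  }
}
```

<p>Notice that freshness forces the origin to remember every nullifier it has ever accepted, and nothing in the token ties it to this particular origin or allows revocation, which is the set of limitations the credential schemes below address.</p>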
    <div>
      <h2>Anonymous credentials and the future of Privacy Pass</h2>
      <a href="#anonymous-credentials-and-the-future-of-privacy-pass">
        
      </a>
    </div>
    <p>As noted by <a href="https://dl.acm.org/doi/pdf/10.1145/4372.4373"><u>Chaum</u></a> in 1985, an <b>anonymous credential</b> system allows users to obtain a credential from an issuer, and later prove possession of this credential, in an unlinkable way, without revealing any additional information. It is also possible to demonstrate that certain attributes are attached to the credential.</p><p>One way to think of an anonymous credential is as a kind of blind signature with some additional capabilities: late-binding (link a token to an origin after issuance), multi-show (generate multiple tokens from a single issuer response), and expiration distinct from key rotation (token validity decoupled from the validity of the issuer's cryptographic key). In the redemption flow for Privacy Pass, the client presents the unblinded message and signature to the server. To accept the redemption, the server needs to verify the signature. In an AC system, the client only presents a <b>part of the message</b>. In order for the server to accept the request, the client needs to prove to the server that it knows a valid signature for the entire message without revealing the whole thing.</p><p>The flow we described above would therefore include this additional <b>presentation</b> step.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pb3ZDoAHbDLEt0mxtf67T/b6be11710a7df7a4df7d1c89788285a7/credentials__2_.png" />
          </figure><p>Note that the tokens generated through blind signatures or VOPRFs can only be used once, so they can be regarded as <i>single-use tokens</i>. However, there exists a type of anonymous credentials that allows tokens to be used multiple times. For this to work, the issuer grants a <i>credential</i> to the user, who can later derive at most <i>N</i> single-use tokens for redemption. Therefore, the user can send multiple requests at the expense of a single issuance session.</p><p>The table below describes how blind signatures and anonymous credentials provide features of interest to rate limiting.</p><table><tr><td><p><b>Feature</b></p></td><td><p><b>Blind Signature</b></p></td><td><p><b>Anonymous Credential</b></p></td></tr><tr><td><p><b>Issuing Cost</b></p></td><td><p>Linear complexity: issuing 10 signatures is 10x as expensive as issuing one signature</p></td><td><p>Sublinear complexity: signing 10 attributes is cheaper than 10 individual signatures</p></td></tr><tr><td><p><b>Proof Capability</b></p></td><td><p>Only prove that a message has been signed</p></td><td><p>Allow efficient proving of partial statements (i.e., attributes)</p></td></tr><tr><td><p><b>State Management</b></p></td><td><p>Stateless</p></td><td><p>Stateful</p></td></tr><tr><td><p><b>Attributes</b></p></td><td><p>No attributes</p></td><td><p>Public (e.g. expiry time) and private state</p></td></tr></table><p>
  Let's see how a simple anonymous credential scheme works. The client's message consists of the pair 
  <strong>$(k, C)$</strong>, 
  where 
  <strong>$k$</strong> 
  is a 
  <strong>nullifier</strong> and 
  <strong>$C$</strong> 
  is a 
    <strong>counter</strong> representing the remaining number of times the client can access a resource. The value of the counter is controlled by the server: when the client redeems its credential, it presents both the nullifier and the counter. In response, the server checks that the signature of the message is valid and that the nullifier is fresh, as before. Additionally, the server also
</p><ol><li><p>checks that the counter is greater than zero; and</p></li><li><p>decrements the counter, issuing a new credential with the updated counter and a fresh nullifier.</p></li></ol><p>A blind signature could be used to provide this functionality. However, whereas the nullifier can be blinded as before, it would be necessary to handle the counter in plaintext so that the server can check that the counter is valid (Step 1) and update it (Step 2). This creates an obvious privacy risk since the server, which is in control of the counter, can use it to link multiple presentations by the same client. For example, when you reach out to buy a pepperoni pizza, the origin could assign you a special counter value, which eases fingerprinting when you present it a second time. Fortunately, there exist anonymous credentials designed to close this kind of privacy gap.</p><p>The scheme above is a simplified version of Anonymous Credit Tokens (<a href="https://datatracker.ietf.org/doc/draft-schlesinger-cfrg-act/"><u>ACT</u></a>), one of the anonymous credential schemes being considered for adoption by the <a href="https://datatracker.ietf.org/wg/privacypass/about/"><u>Privacy Pass working group</u></a> at IETF. The key feature of ACT is its <b>statefulness</b>: upon successful redemption, the server re-issues a new credential with updated nullifier and counter values. This creates a feedback loop between the client and server that can be used to express a variety of security policies.</p><p>By design, it's not possible to present ACT credentials multiple times simultaneously: the first presentation must be completed so that the re-issued credential can be presented in the next request. <b>Parallelism</b> is the key feature of Anonymous Rate-limited Credential (<a href="https://datatracker.ietf.org/doc/html/draft-yun-cfrg-arc-00"><u>ARC</u></a>), another scheme under discussion at the Privacy Pass working group. 
ARCs can be presented across multiple requests in parallel up to the presentation limit determined during issuance.</p><p>Another important feature of ARC is its support for late origin-binding: when a client is issued an ARC with presentation limit $N$, it can safely use its credential to present up to $N$ times to any origin that can verify the credential.</p><p>These are just examples of relevant features of some anonymous credentials. Some applications may benefit from a subset of them; others may need additional features. Fortunately, both ACT and ARC can be constructed from a small set of cryptographic primitives that can be easily adapted for other purposes.</p>
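<p>To make the redeem-then-reissue loop concrete, here is a toy state machine for the counter scheme above. To keep the sketch readable, the nullifier and counter are handled in the clear, which is precisely the privacy gap that ACT's blinding and zero-knowledge proofs close; none of the cryptography is shown.</p>

```typescript
// Toy state machine for a counter-based credential (simplified ACT flow).
// In the real scheme the server never sees the counter and cannot link
// nullifiers to users; here everything is in the clear purely to show
// the feedback loop between client and server.

type Credential = { nullifier: string; counter: number };

class CreditServer {
  private spent = new Set<string>();
  private nextId = 0;

  issue(credits: number): Credential {
    return { nullifier: "n" + this.nextId++, counter: credits };
  }

  // On success, returns a re-issued credential with a fresh nullifier and
  // a decremented counter; on failure (replay or empty balance), null.
  redeem(c: Credential): Credential | null {
    if (this.spent.has(c.nullifier)) return null; // nullifier must be fresh
    if (c.counter <= 0) return null;              // counter must be positive
    this.spent.add(c.nullifier);
    return this.issue(c.counter - 1);
  }
}
```

<p>The sequential nature of the scheme is visible in the code: each <code>redeem</code> consumes the current credential, so the next request must wait for the re-issued one, which is the lack of parallelism that ARC addresses.</p>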
    <div>
      <h2>Building blocks for anonymous credentials</h2>
      <a href="#building-blocks-for-anonymous-credentials">
        
      </a>
    </div>
    <p>ARC and ACT share two primitives in common: <a href="https://eprint.iacr.org/2013/516.pdf"><b><u>algebraic MACs</u></b></a>, which provide for limited computations on the blinded message; and <a href="https://en.wikipedia.org/wiki/Zero-knowledge_proof"><b><u>zero-knowledge proofs (ZKP)</u></b></a> for proving validity of the part of the message not revealed to the server. Let's take a closer look at each.</p>
    <div>
      <h3>Algebraic MACs</h3>
      <a href="#algebraic-macs">
        
      </a>
    </div>
    <p>A Message Authentication Code (MAC) is a cryptographic tag used to verify a message's authenticity (that it comes from the claimed sender) and integrity (that it has not been altered). Algebraic MACs are built from mathematical structures like <a href="https://en.wikipedia.org/wiki/Group_action"><u>group actions</u></a>. The algebraic structure gives them some additional functionality, one of which is a <i>homomorphism</i> that makes blinding easy: adding a random value to an algebraic MAC conceals its actual value.</p><p>Unlike blind signatures, both ACT and ARC are only <i>privately</i> verifiable, meaning both the issuer and the origin must hold the issuer's private key. Taking Cloudflare as an example, this means that a credential issued by Cloudflare can only be redeemed by an origin behind Cloudflare. Publicly verifiable variants of both are possible, but at an additional cost.</p>
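<p>As a toy illustration of why the algebraic structure matters, consider integers modulo a small prime standing in for the real group (this is not a secure MAC). Linearity means the tag of a sum is the sum of the tags, and adding a random mask perfectly hides a tag:</p>

```typescript
// Toy algebraic "MAC" over integers mod a prime: tag = key * m mod P.
// Not secure; it only illustrates two properties used by ARC/ACT:
//   linearity: mac(m1 + m2) = mac(m1) + mac(m2)  (mod P)
//   blinding:  (tag + r) mod P reveals nothing about tag when r is random.

const P = 2147483647n; // 2^31 - 1, a prime (toy-sized)

const mod = (a: bigint): bigint => ((a % P) + P) % P;

const mac = (key: bigint, m: bigint): bigint => mod(key * m);

// Blinding adds a random mask; unblinding subtracts it again.
const blind = (tag: bigint, r: bigint): bigint => mod(tag + r);
const unblind = (blinded: bigint, r: bigint): bigint => mod(blinded - r);
```

<p>Real deployments replace this modular arithmetic with a prime-order group such as ristretto255, but the homomorphic pattern, masking by adding a random group element, is the same.</p>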
    <div>
      <h3>Zero-Knowledge Proofs for linear relations</h3>
      <a href="#zero-knowledge-proofs-for-linear-relations">
        
      </a>
    </div>
    <p>Zero-knowledge proofs (ZKPs) allow us to prove a statement is true without revealing the exact value that makes the statement true. The ZKP is constructed by a prover in such a way that it can only be generated by someone who actually possesses the secret. The verifier can then run a quick mathematical check on this proof. If the check passes, the verifier is convinced that the prover's initial statement is valid. The crucial property is that the proof itself is just data that confirms the statement; it contains no other information that could be used to reconstruct the original secret.</p><p>For ARC and ACT, we want to prove <i>linear relations</i> of secrets. In ARC, a user needs to prove that different tokens are linked to the same original secret credential. For example, a user can generate a proof showing that a <i>request token</i> was derived from a valid <i>issued credential</i>. The system can verify this proof to confirm the tokens are legitimately connected, all without ever learning the underlying secret credential that ties them together. This allows the system to validate user actions while guaranteeing their privacy.</p><p>Proving simple linear relations can be extended to prove a number of powerful statements, for example that a number lies in a given range. This is useful for proving that you have a positive balance on your account. To prove your balance is in range, you prove that you can encode your balance in binary. Let’s say you can have at most 1024 credits in your account. To prove your balance is valid when it is, for example, 12, you prove two things simultaneously: first, that you have a set of binary bits, in this case 12=(1100)<sub>2</sub>, and second, that a linear equation using these bits (8*1 + 4*1 + 2*0 + 1*0) correctly adds up to your total committed balance. This convinces the verifier that the number is validly constructed without them learning the exact value. 
This is how it works for ranges that are powers of two, but it can <a href="https://github.com/chris-wood/draft-arc/pull/38"><u>easily be extended to arbitrary ranges</u></a>.</p><p>The mathematical structure of algebraic MACs allows easy blinding and evaluation. The structure also allows for an easy proof that a MAC has been evaluated with the private key without revealing the MAC. In addition, ARC could use ZKPs to prove that a nonce has not been spent before. In contrast, ACT uses ZKPs to prove we have enough balance left on our token. The balance is subtracted homomorphically using more group structure.</p>
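<p>The binary-decomposition argument can be checked in the clear with a few lines of code. A real range proof would verify these same two conditions over hidden commitments; this sketch only shows the relations being proved (bit-validity and the linear recombination):</p>

```typescript
// The relations behind the range proof: v is in [0, 2^n) iff there exist
// bits b_0..b_{n-1} with b_i * (b_i - 1) = 0 (each bit is 0 or 1) and
// sum(b_i * 2^i) = v. A ZKP proves these equations over commitments;
// here we simply check them directly on plaintext values.

function toBits(v: number, n: number): number[] {
  const bits: number[] = [];
  for (let i = 0; i < n; i++) bits.push((v >> i) & 1); // little-endian
  return bits;
}

function checkRangeRelation(v: number, bits: number[]): boolean {
  // Each b_i must be 0 or 1, i.e. satisfy b_i * (b_i - 1) = 0.
  if (!bits.every((b) => b * (b - 1) === 0)) return false;
  // The linear equation sum(b_i * 2^i) must recombine to v.
  const sum = bits.reduce((acc, b, i) => acc + b * 2 ** i, 0);
  return sum === v;
}
```

<p>For v = 12 with 10 bits the decomposition is (1100)<sub>2</sub>, matching the 8*1 + 4*1 + 2*0 + 1*0 equation in the text; a "bit" of 2, or bits that sum to the wrong value, fail the check.</p>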
    <div>
      <h2>How much does this all cost?</h2>
      <a href="#how-much-does-this-all-cost">
        
      </a>
    </div>
    <p>Anonymous credentials allow for more flexibility and, in certain applications, have the potential to reduce communication cost compared to blind signatures. To identify such applications, we need to measure the concrete communication cost of these new protocols. In addition, we need to understand how their CPU usage compares to blind signatures and oblivious pseudorandom functions.</p><p>We measure the time each participant spends at each stage of several anonymous credential schemes, and report the size of the messages transmitted across the network. For ARC, ACT, and VOPRF, we use <a href="https://doi.org/10.17487/RFC9496"><u>ristretto255</u></a> as the prime-order group and SHAKE128 for hashing. For Blind RSA, we use a 2048-bit modulus and SHA-384 for hashing.</p><p>Each algorithm was implemented in Go, on top of the <a href="https://github.com/cloudflare/circl"><u>CIRCL</u></a> library. We plan to open source the code once the specifications of ARC and ACT begin to stabilize.</p><p>Let’s take a look at the most widely used deployment in Privacy Pass: Blind RSA. Redemption time is low, and most of the cost lies with the server at issuance time. Communication cost is mostly constant, on the order of 256 bytes.</p>
<div><table><thead>
  <tr>
    <th rowspan="2" colspan="2"><span>Blind RSA</span><br /><a href="https://doi.org/10.17487/RFC9474"><span>RFC9474</span></a><span> (RSA-2048+SHA384)</span></th>
    <th colspan="2"><span>1 Token</span></th>
  </tr>
  <tr>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td rowspan="3"><span>Issuance</span></td>
    <td><span>Client (Blind)</span></td>
    <td><span>63 µs</span></td>
    <td><span>256 B</span></td>
  </tr>
  <tr>
    <td><span>Server (Evaluate)</span></td>
    <td><span>2.69 ms</span></td>
    <td><span>256 B</span></td>
  </tr>
  <tr>
    <td><span>Client (Finalize)</span></td>
    <td><span>37 µs</span></td>
    <td><span>256 B</span></td>
  </tr>
  <tr>
    <td rowspan="2"><span>Redemption</span></td>
    <td><span>Client</span></td>
    <td><span>–</span></td>
    <td><span>300 B</span></td>
  </tr>
  <tr>
    <td><span>Server</span></td>
    <td><span>37 µs</span></td>
    <td><span>–</span></td>
  </tr>
</tbody></table></div><p>Looking at VOPRF, verification time on the server is slightly higher than for Blind RSA, but issuance is much faster and messages are far smaller. Evaluation time on the server is 10x faster for a single token, and more than 25x faster when using <a href="https://datatracker.ietf.org/doc/draft-ietf-privacypass-batched-tokens/"><u>amortized token issuance</u></a>. Communication cost per token is also more appealing, with message sizes at least 3x smaller.</p>
<div><table><thead>
  <tr>
    <th rowspan="2" colspan="2"><span>VOPRF</span><br /><a href="https://doi.org/10.17487/RFC9497"><span>RFC9497</span></a><span> (Ristretto255+SHA512)</span></th>
    <th colspan="2"><span>1 Token</span></th>
    <th colspan="2"><span>1000 Amortized issuances</span></th>
  </tr>
  <tr>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
    <th><span>Time</span><br /><span>(per token)</span></th>
    <th><span>Message Size</span><br /><span>(per token)</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td rowspan="3"><span>Issuance</span></td>
    <td><span>Client (Blind)</span></td>
    <td><span>54 µs</span></td>
    <td><span>32 B</span></td>
    <td><span>54 µs</span></td>
    <td><span>32 B</span></td>
  </tr>
  <tr>
    <td><span>Server (Evaluate)</span></td>
    <td><span>260 µs</span></td>
    <td><span>96 µs</span></td>
    <td><span>99 µs</span></td>
    <td><span>32.064 B</span></td>
  </tr>
  <tr>
    <td><span>Client (Finalize)</span></td>
    <td><span>376 µs</span></td>
    <td><span>64 B</span></td>
    <td><span>173 µs</span></td>
    <td><span>64 B</span></td>
  </tr>
  <tr>
    <td rowspan="2"><span>Redemption</span></td>
    <td><span>Client</span></td>
    <td><span>–</span></td>
    <td><span>96 B</span></td>
    <td><span>–</span></td>
    <td><span>–</span></td>
  </tr>
  <tr>
    <td><span>Server</span></td>
    <td><span>57 µs</span></td>
    <td><span>–</span></td>
    <td><span>–</span></td>
    <td><span>–</span></td>
  </tr>
</tbody></table></div><p>This makes VOPRF tokens appealing for applications that require many tokens, can accept a slightly higher redemption cost, and don’t need public verifiability.</p><p>Now, let’s take a look at the figures for the ARC and ACT anonymous credential schemes. For both schemes we measure the time to issue a credential that can be presented at most $N=1000$ times.</p>
<div><table><thead>
  <tr>
    <th rowspan="2"><span>Issuance</span><br /><span>Credential Generation</span></th>
    <th colspan="2"><span>ARC</span></th>
    <th colspan="2"><span>ACT</span></th>
  </tr>
  <tr>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Client (Request)</span></td>
    <td><span>323 µs</span></td>
    <td><span>224 B</span></td>
    <td><span>64 µs</span></td>
    <td><span>141 B</span></td>
  </tr>
  <tr>
    <td><span>Server (Response)</span></td>
    <td><span>1349 µs</span></td>
    <td><span>448 B</span></td>
    <td><span>251 µs</span></td>
    <td><span>176 B</span></td>
  </tr>
  <tr>
    <td><span>Client (Finalize)</span></td>
    <td><span>1293 µs</span></td>
    <td><span>128 B</span></td>
    <td><span>204 µs</span></td>
    <td><span>176 B</span></td>
  </tr>
</tbody></table></div>
<div><table><thead>
  <tr>
    <th rowspan="2"><span>Redemption</span><br /><span>Credential Presentation</span></th>
    <th colspan="2"><span>ARC</span></th>
    <th colspan="2"><span>ACT</span></th>
  </tr>
  <tr>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
    <th><span>Time</span></th>
    <th><span>Message Size</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Client (Present)</span></td>
    <td><span>735 µs</span></td>
    <td><span>288 B</span></td>
    <td><span>1740 µs</span></td>
    <td><span>1867 B</span></td>
  </tr>
  <tr>
    <td><span>Server (Verify/Refund)</span></td>
    <td><span>740 µs</span></td>
    <td><span>–</span></td>
    <td><span>1785 µs</span></td>
    <td><span>141 B</span></td>
  </tr>
  <tr>
    <td><span>Client (Update)</span></td>
    <td><span>–</span></td>
    <td><span>–</span></td>
    <td><span>508 µs</span></td>
    <td><span>176 B</span></td>
  </tr>
</tbody></table></div><p>As we would hope, the communication cost and the server’s runtime are much lower than a batched issuance with either Blind RSA or VOPRF. For example, a VOPRF issuance of 1000 tokens takes 99 ms (99 µs per token) <i>vs</i> 1.35 ms for issuing one ARC credential that allows 1000 presentations. This is about 70x faster. The trade-off is that presentation is more expensive, both for the client and the server.</p><p>How about ACT? Like ARC, we would expect the communication cost of issuance to grow much more slowly with the number of credits issued, and our implementation bears this out. However, there are some interesting performance differences between ARC and ACT: issuance is much cheaper for ACT than it is for ARC, but for redemption it is the opposite.</p><p>What's going on? The answer largely comes down to what each party needs to prove with ZKPs at each step. For example, during ACT redemption, the client proves to the server (in zero knowledge) that its counter $C$ is in the desired range, i.e., $0 \leq C \leq N$. The proof size is on the order of $\log_{2} N$, which accounts for the larger message size. In the current version, ARC redemption does not involve range proofs, though one may be added in a <a href="http://mailarchive.ietf.org/arch/msg/privacy-pass/A3VHUdHqhslwBzYEQjcaXzYQAxQ/"><u>future version</u></a>. Meanwhile, the statements the client and server need to prove during ARC issuance are a bit more complicated than those for ARC presentation, which accounts for the difference in runtime there.</p><p>The advantage of anonymous credentials, as discussed in the previous sections, is that issuance only has to be performed once. When a server evaluates its cost, it must account for the cost of all issuances and all verifications. At present, accounting only for the cryptographic operations, it is cheaper for a server to issue and verify single-use tokens than to verify an anonymous credential presentation.</p><p>The advantage of multiple-use anonymous credentials is that, instead of the issuer generating $N$ tokens, the bulk of the computation is offloaded to the clients. They also offer features single-use tokens lack: late origin binding lets one credential serve multiple origins or namespaces, range proofs decorrelate expiration from key rotation, and refunds provide a dynamic rate limit. Their current applications are motivated more by these limitations of single-use token schemes than by raw efficiency. Closing the performance gap is an exciting area to explore.</p>
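<p>A rough back-of-envelope illustration of this trade-off, using the measured server-side figures from the tables above (in µs). The linear cost model and function names are ours, for illustration only, not part of either specification:</p>

```go
package main

import "fmt"

// arcServerCost models total server time (µs) to support n presentations
// of one ARC credential: one issuance plus n verifications.
func arcServerCost(n int) float64 {
	const issue = 1349.0 // Server (Response), once per credential
	const verify = 740.0 // Server (Verify), per presentation
	return issue + float64(n)*verify
}

// voprfServerCost models total server time (µs) for n single-use VOPRF
// tokens, using the amortized per-token issuance figure.
func voprfServerCost(n int) float64 {
	const issuePerToken = 99.0 // amortized Server (Evaluate), per token
	const redeem = 57.0        // Server redemption, per token
	return float64(n) * (issuePerToken + redeem)
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("n=%4d  ARC %.0f µs  VOPRF %.0f µs\n",
			n, arcServerCost(n), voprfServerCost(n))
	}
}
```

<p>The model makes the text's point concrete: ARC's fixed issuance cost is tiny, but its per-presentation verification dominates, so purely on server CPU the single-use tokens stay cheaper.</p>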
    <div>
      <h2>Managing agents with anonymous credentials</h2>
      <a href="#managing-agents-with-anonymous-credentials">
        
      </a>
    </div>
    <p>Managing agents will likely require features from both ARC and ACT.</p><p>ARC already has much of the functionality we need: it supports rate limiting and late origin binding, and it is communication-efficient. Its main downside is that, once an ARC credential is issued, it can't be revoked. A malicious user can always make up to <i>N</i> requests to any origin it wants.</p><p>We can allow for a limited form of revocation by pairing ARC with blind signatures (or a VOPRF). Each presentation of the ARC credential is accompanied by a Privacy Pass token: upon successful presentation, the client is issued another Privacy Pass token it can use during the next presentation. To revoke a credential, the server would simply not re-issue the token:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EiHmkbLef6kXsQU473fcX/d1d4018eaf2abd42b9690ae5d01494dc/image1.png" />
          </figure><p>This scheme is already quite useful. However, it has some important limitations:</p><ul><li><p>Parallel presentation across origins is not possible: the client must wait for the request to one origin to succeed before it can initiate a request to a second origin.</p></li><li><p>Revocation is <i>global</i> rather than per-origin, meaning the credential is not only revoked for the origin to whom it was presented, but for every origin it can be presented to. We suspect this will be undesirable in some cases. For example, an origin may want to revoke if a request violates its <code>robots.txt</code> policy; but the same request may have been accepted by other origins.  </p></li></ul><p>A more fundamental limitation of this design is that the decision to revoke can only be made on the basis of a single request — the one in which the credential was presented. It may be risky to decide to block a user on the basis of a single request; in practice, attack patterns may only emerge across many requests. ACT's statefulness enables at least a rudimentary form of this kind of defense. Consider the following scheme:</p><ul><li><p><b>Issuance: </b>The client is issued an ARC with presentation limit $N=1$.</p></li><li><p><b>Presentation:</b></p><ul><li><p>When the client presents its ARC credential to an origin for the first time, the server issues an ACT credential with a valid initial state.</p></li><li><p>When the client presents an ACT with valid state (e.g., credit counter greater than 0), the origin either:</p><ul><li><p>refuses to issue a new ACT, thereby revoking the credential. 
It would only do so if it had high confidence that the request was part of an attack; or</p></li><li><p>issues a new ACT with state updated to reduce the ACT credit by the amount of resources consumed while processing the request.</p></li></ul></li></ul></li></ul><p>Benign requests wouldn't change the state by much (if at all), but suspicious requests might impact the state in a way that gets the user closer to their rate limit much faster.</p>
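<p>The issuance/presentation loop above can be mocked as a plain state machine. The Go sketch below tracks the credit counter in the clear purely for illustration; in the real ACT scheme the server never sees the counter, it only verifies a range proof that $0 \leq C \leq N$ and subtracts the charge homomorphically. All names here are ours.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// A mock of the ACT refund flow: the credential carries a credit
// counter that the server decrements per request.
type credential struct{ credits int }

const initialCredits = 1000

func issue() credential { return credential{credits: initialCredits} }

// present charges a request against the credential. Benign requests
// cost little; suspicious ones burn credits faster; and the server can
// refuse re-issuance entirely, which acts as revocation.
func present(c credential, charge int, revoke bool) (credential, error) {
	if revoke {
		return credential{}, errors.New("credential revoked")
	}
	if c.credits < charge {
		return credential{}, errors.New("rate limit exhausted")
	}
	return credential{credits: c.credits - charge}, nil
}

func main() {
	c := issue()
	c, _ = present(c, 1, false)   // benign request: small charge
	c, _ = present(c, 250, false) // suspicious request: heavy charge
	fmt.Println("remaining credits:", c.credits)
}
```

<p>The variable charge is what gives the origin a gradual defense: it can drain suspicious clients toward their limit without making a hard block decision on a single request.</p>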
    <div>
      <h2>Demo</h2>
      <a href="#demo">
        
      </a>
    </div>
    <p>To see how this idea works in practice, let's look at a working example that uses the <a href="https://developers.cloudflare.com/agents/model-context-protocol/"><u>Model Context Protocol</u></a>. The demo below is built using <a href="https://developers.cloudflare.com/agents/model-context-protocol/tools/"><u>MCP Tools</u></a>. <a href="https://modelcontextprotocol.info/tools/"><u>Tools</u></a> are extensions the AI agent can call to extend its capabilities, and they don't need to be integrated into the MCP client at release time. This makes them a nice, easy avenue for prototyping anonymous credentials.</p><p>Tools are offered by the server via an MCP-compatible interface. You can see details on how to build such MCP servers in a <a href="https://blog.cloudflare.com/remote-model-context-protocol-servers-mcp/"><u>previous blog</u></a>.</p><p>In our pizza context, this could look like a pizzeria that offers you a voucher, where each voucher gets you 3 pizza slices. As a mock-up, an integration within a chat application could look as follows:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5WD5MYoSMYGyRW2biwe6j4/bde101967276a72d48d9e494a23db5fa/image5.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5SEqaVpwFxS1D21oyjjbN8/80dde2484f43c15e206ecfda991c286a/image9.png" />
</figure><p>The first panel presents all tools exposed by the MCP server. The second showcases an interaction in which the agent calls these tools.</p><p>To see how such a flow would be implemented, let’s write the MCP tools, offer them in an MCP server, and manually orchestrate the calls with the <a href="https://modelcontextprotocol.io/docs/tools/inspector"><u>MCP Inspector</u></a>.</p><p>The MCP server should provide two tools:</p><ul><li><p><code>act-issue</code>, which issues an ACT credential valid for 3 requests. The code used here implements an earlier version of the IETF draft, which has some limitations.</p></li><li><p><code>act-redeem</code>, which presents the local credential and fetches our pizza menu.</p></li></ul><p>First, we run <code>act-issue</code>. At this stage, we could ask the agent to run an <a href="https://modelcontextprotocol.info/specification/draft/basic/authorization/"><u>OAuth flow</u></a>, fetch an internal authentication endpoint, or compute a proof of work.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sLS7jMfTHPjVW5vMvsTWX/2d2b10fdb12c64f0e33fee89e09eab85/image10.png" />
</figure><p>This gives us 3 credits to spend against an origin. Then, we run <code>act-redeem</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YTc0Wohrsqw3hizAOmJjU/4534cccbc490ad0aa09522a3875693af/image8.png" />
          </figure><p>Et voilà. If we run <code>act-redeem</code> once more, we see we have one fewer credit.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/a0zmBfl46hX33hWoXyGyX/86649d9f435562c95a85ec72fbf33022/image3.png" />
</figure><p>You can test it yourself: the <a href="https://github.com/cloudflareresearch/anonymous-credentials-agent-demo"><u>source code</u></a> is available. The MCP server is written in <a href="https://github.com/modelcontextprotocol/rust-sdk/"><u>Rust</u></a> to integrate with the <a href="https://github.com/SamuelSchlesinger/anonymous-credit-tokens/stargazers"><u>ACT Rust</u></a> library. The <a href="https://act-client-demo.cloudflareresearch.com/"><u>browser-based client</u></a> works similarly; check it out.</p>
    <div>
      <h2>Moving further</h2>
      <a href="#moving-further">
        
      </a>
    </div>
    <p>In this post, we’ve presented a concrete approach to rate limiting agent traffic. It is fully under the client's control, built to protect the user's privacy, uses emerging standards for anonymous credentials, integrates with MCP, and can be readily deployed on Cloudflare Workers.</p><p>We're on the right track, but questions remain. As we touched on before, a notable limitation of both ARC and ACT is that they are only <i>privately verifiable</i>: the issuer and origin need to share a private key for issuing and verifying the credential, respectively. There are likely to be deployment scenarios for which this isn't possible. Fortunately, there may be a path forward for these cases using <i>pairing</i>-based cryptography, as in the <a href="https://datatracker.ietf.org/doc/draft-irtf-cfrg-bbs-signatures/"><u>BBS signature specification</u></a> making its way through the IETF. We’re also exploring post-quantum implications in a <a href="https://blog.cloudflare.com/pq-anonymous-credentials/"><u>concurrent post</u></a>.</p><p>If you are an agent platform, an agent developer, or a browser vendor, all our code is available on <a href="https://github.com/cloudflareresearch/anonymous-credentials-agent-demo"><u>GitHub</u></a> for you to experiment with. Cloudflare is actively working on vetting this approach for real-world use cases.</p><p>The specification and discussion are happening within the IETF and W3C. This ensures the protocols are built in the open and receive input from experts. Improvements are still to be made to settle the right performance-to-privacy tradeoff, and to work out the deployment story for the open web.</p><p>If you’d like to help us, <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"><u>we’re hiring 1,111 interns</u></a> over the course of next year, and have <a href="https://www.cloudflare.com/careers/early-talent/"><u>open positions</u></a>.</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Rate Limiting]]></category>
            <guid isPermaLink="false">1znqOjDHsm8kxWujPMhsgA</guid>
            <dc:creator>Thibault Meunier</dc:creator>
            <dc:creator>Christopher Patton</dc:creator>
            <dc:creator>Lena Heimberger</dc:creator>
            <dc:creator>Armando Faz-Hernández</dc:creator>
        </item>
        <item>
            <title><![CDATA[Measuring characteristics of TCP connections at Internet scale]]></title>
            <link>https://blog.cloudflare.com/measuring-network-connections-at-scale/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Researchers and practitioners have been studying connections almost as long as the Internet that supports them. Today, Cloudflare’s global network receives millions of connections per second. We explore various characteristics of TCP connections, including lifetimes, sizes, and more. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Every interaction on the Internet—including loading a web page, streaming a video, or making an API call—starts with a connection. These fundamental logical connections consist of a stream of packets flowing back and forth between devices.</p><p>Various aspects of these network connections have captured the attention of researchers and practitioners for as long as the Internet has existed. The interest in connections even predates the label, as can be seen in the seminal 1991 paper, “<a href="https://dl.acm.org/doi/10.1145/115994.116003"><u>Characteristics of wide-area TCP/IP conversations</u></a>.” By any name, the Internet measurement community has been steeped in characterizations of Internet communication for <i>decades</i>, asking everything from “how long?” and “how big?” to “how often?” – and those are just to start.</p><p>Surprisingly, connection characteristics on the wider Internet are largely unavailable. While anyone can  use tools (e.g., <a href="https://www.wireshark.org/"><u>Wireshark</u></a>) to capture data locally, it’s virtually impossible to measure connections globally because of access and scale. Moreover, network operators generally do not share the characteristics they observe — assuming that non-trivial time and energy is taken to observe them.</p><p>In this blog post, we move in another direction by sharing aggregate insights about connections established through our global CDN. We present characteristics of <a href="https://developers.cloudflare.com/fundamentals/reference/tcp-connections/"><u>TCP</u></a> connections—which account for about <a href="https://radar.cloudflare.com/adoption-and-usage"><u>70% of HTTP requests</u></a> to Cloudflare—providing empirical insights that are difficult to obtain from client-side measurements alone.</p>
    <div>
      <h2>Why connection characteristics matter</h2>
      <a href="#why-connection-characteristics-matter">
        
      </a>
    </div>
    <p>Characterizing system behavior helps us predict the impact of changes. In the context of networks, consider a new routing algorithm or transport protocol: how can you measure its effects? One option is to deploy the change directly on live networks, but this is risky. Unexpected consequences could disrupt users or other parts of the network, making a “deploy-first” approach potentially unsafe or ethically questionable.</p><p>A safer first step is simulation. Using simulation, a designer can get important insights about their scheme without having to build a full version. But simulating the whole Internet is challenging, as described by another seminal work, “<a href="https://dl.acm.org/doi/10.1145/268437.268737"><u>Why we don't know how to simulate the Internet</u></a>”.</p><p>To run a useful simulation, we need it to behave like the real system we’re studying. That means generating synthetic data that mimics real-world behavior. Often, we do this by using statistical distributions — mathematical descriptions of how the real data behaves. But before we can create those distributions, we first need to characterize the data: to measure and understand its key properties. Only then can our simulation produce realistic results.</p>
    <div>
      <h2>Unpacking the dataset</h2>
      <a href="#unpacking-the-dataset">
        
      </a>
    </div>
    <p>The value of any data depends on its collection mechanism. Every dataset has blind spots, biases, and limitations, and ignoring these can lead to misleading conclusions. By examining the finer details — how the data was gathered, what it represents, and what it excludes — we can better understand its reliability and make informed decisions about how to use it. Let’s take a closer look at our collected telemetry.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ksUQ7xlzXPWp2hH7eX4dG/124456d20c6fd5e7e185d68865aee6fa/image5.png" />
</figure><p><b>Dataset Overview</b>. The data describes TCP connections, labeled <i>Visitor to Cloudflare</i> in the above diagram, which serve requests via HTTP/1.0, HTTP/1.1, and HTTP/2. These make up <a href="https://radar.cloudflare.com/adoption-and-usage">about 70%</a> of the 84 million HTTP requests per second received, on average, at our global CDN servers.</p><p><b>Sampling.</b> The passively collected snapshot is drawn from a uniformly sampled 1% of all TCP connections to Cloudflare between October 7 and October 15, 2025. Sampling takes place at each individual client-facing server to mitigate biases that could arise from sampling at the data-center level.</p><p><b>Diversity.</b> Unlike many large operators, whose traffic is primarily their own and dominated by a few services such as search, social media, or streaming video, the vast majority of Cloudflare’s workload comes from our customers, who choose to put Cloudflare in front of their websites to help protect them, improve performance, and reduce costs. This diversity of customers brings a wide variety of web applications, services, and users from around the world. As a result, the connections we observe are shaped by a broad range of client devices and application-specific behaviors that are constantly evolving.</p><p><b>What we log.</b> Each entry in the log consists of socket-level metadata captured via the Linux kernel’s <a href="https://man7.org/linux/man-pages/man7/tcp.7.html"><u>TCP_INFO</u></a> struct, alongside the TLS Server Name Indication (SNI) and the number of requests made during the connection. The logs exclude individual HTTP requests, transactions, and their details. We restrict our use of the logs to connection metadata statistics such as duration and number of packets transmitted, as well as the number of HTTP requests processed.</p><p><b>Data capture.</b> We represent only ‘useful’, fully processed connections in our dataset by characterizing those that close gracefully with <a href="https://blog.cloudflare.com/tcp-resets-timeouts/#tcp-connections-from-establishment-to-close"><u>a FIN packet</u></a>. This excludes connections intercepted by attack mitigations, connections that time out, and connections that abort with a RST packet.</p><p>Since a graceful close does not in itself indicate a ‘useful’ connection, <b>we additionally require at least one successful HTTP request</b> during the connection to filter out idle or non-HTTP connections from this analysis — interestingly, these make up 11% of all TCP connections to Cloudflare that close with a FIN packet.</p><p>If you’re curious, we’ve also previously blogged about the details of Cloudflare’s <a href="https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/"><u>overall logging mechanism</u></a> and <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>post-processing pipeline</u></a>.</p>
    <div>
      <h2>Visualizing connection characteristics</h2>
      <a href="#visualizing-connection-characteristics">
        
      </a>
    </div>
    <p>Although networks are inherently dynamic and trends can shift, the large-scale patterns we observe across our global infrastructure remain remarkably consistent over time. While our data offers a global view of connection characteristics, distributions can still vary according to regional traffic patterns.</p><p>In our visualizations we represent characteristics with <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function"><u>cumulative distribution function (CDF)</u></a> graphs, specifically their <a href="https://en.wikipedia.org/wiki/Empirical_distribution_function"><u>empirical equivalents</u></a>. CDFs are particularly useful for gaining a macroscopic view of a distribution, giving a clear picture of both common and extreme cases in a single view. We use them in the illustrations below to make sense of large-scale patterns. To better interpret the distributions, we also employ log-scaled axes to account for the extreme values common in networking data.</p><p>A long-standing question about Internet connections relates to “<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>Elephants and Mice</u></a>”; practitioners and researchers are well aware that most flows are small and some are huge, yet little data exists to inform the lines that divide them. This is where our presentation begins.</p>
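<p>An empirical CDF is simple to compute: sort the samples and, for each value x, report the fraction of samples at or below x. A minimal Go sketch (the packet counts here are made-up illustrative values, not our data):</p>

```go
package main

import (
	"fmt"
	"sort"
)

// ecdf returns a function mapping x to the fraction of samples <= x,
// i.e. the empirical CDF plotted in the figures below.
func ecdf(samples []float64) func(x float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return func(x float64) float64 {
		// Binary search for the count of samples <= x.
		i := sort.Search(len(s), func(i int) bool { return s[i] > x })
		return float64(i) / float64(len(s))
	}
}

func main() {
	packets := []float64{3, 5, 12, 12, 40, 107, 5000}
	f := ecdf(packets)
	fmt.Println(f(12)) // fraction of connections with <= 12 packets
}
```

<p>Plotting f over a log-scaled x-axis is exactly what compresses the huge "elephant" tail into the same view as the "mice" that make up most connections.</p>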
    <div>
      <h3>Packet Counts</h3>
      <a href="#packet-counts">
        
      </a>
    </div>
    <p>Let’s start by taking a look at the distribution of the number of <i>response</i> packets sent in connections by Cloudflare servers back to the clients.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qaPCul0l7bdOQfaxL1Wbn/d0ef9cc108ba35d49593029baed7cb86/image12.png" />
</figure><p>On the graph, the x-axis represents the number of response packets sent, in log scale, while the y-axis shows the cumulative fraction of connections at or below each packet count. The average response consists of roughly 240 packets, but the distribution is highly skewed. The median is 12 packets, which indicates that 50% of Internet connections consist of <i>very few packets</i>. Even at the 90th percentile, connections carry only 107 packets.</p><p>This stark contrast highlights the heavy-tailed nature of Internet traffic: while a few connections transport massive amounts of data—like video streams or large file transfers—most interactions are tiny, delivering small web objects, microservice traffic, or API responses.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Mf6VwD2Xq8aBwQP1V9aX5/1a20d6fa2caab5c719591db8b232f6a1/image11.png" />
          </figure><p>The above plot breaks down the packet count distribution by HTTP protocol version. For HTTP/1.X (both HTTP 1.0 and 1.1 combined) connections, the median response consists of just 10 packets, and 90% of connections carry fewer than 63 response packets. In contrast, HTTP/2 connections show larger responses, with a median of 16 packets and a 90th percentile of 170 packets. This difference likely reflects how HTTP/2 multiplexes multiple streams over a single connection, often consolidating more requests and responses into fewer connections, which increases the total number of packets exchanged per connection. HTTP/2 connections also have additional control-plane frames and flow-control messages that increase response packet counts.</p><p>Despite these differences, the combined view displays the same heavy-tailed pattern: a small fraction of connections carry enormous volumes of data (<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>elephant flows</u></a>), extending to millions of packets, while most remain lightweight (<a href="https://en.wikipedia.org/wiki/Mouse_flow"><u>mice flows</u></a>).</p><p>So far, we’ve focused on the total number of packets sent from our servers to clients, but another important dimension of connection behavior is the balance between packets sent and received, illustrated below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VZeU0d2EYLxPl3SaTPJBb/6b46a793d6eea178838c4f5b2572caf1/image2.png" />
</figure><p>The x-axis shows the ratio of packets sent by our servers to packets received from clients, visualized as a CDF. Across all connections, the median ratio is 0.91, meaning that in half of connections, clients send slightly more packets than the server responds with. This excess of client-side packets primarily reflects <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/"><u>TLS</u></a> handshake initiation (ClientHello), HTTP request headers, and data acknowledgements (ACKs), causing the client to transmit more packets than the server returns with the content payload — particularly for the low-volume connections that dominate the distribution.</p><p>The mean ratio is higher, at 1.28, due to a long tail of server-heavy connections, such as the large downloads typical of CDN workloads. Most connections fall within a relatively narrow range: 10% of connections have a ratio below 0.67, and 90% are below 1.85. However, the long-tailed behavior highlights the diversity of Internet traffic: extreme values arise from both upload-heavy and download-heavy connections. The variance of 3.71 reflects these asymmetric flows, while the bulk of connections maintain a roughly balanced exchange.</p>
    <div>
      <h3>Bytes sent</h3>
      <a href="#bytes-sent">
        
      </a>
    </div>
    <p>Another way to look at the data is by bytes sent from our servers to clients, which captures the actual volume of data delivered over each connection. This metric is derived from <code>tcpi_bytes_sent</code>, which also covers (re)transmitted segment payloads while excluding TCP headers, as defined in <a href="https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/tcp.h#L222-L312"><u>linux/tcp.h</u></a> and aligned with <a href="https://www.rfc-editor.org/rfc/rfc4898.html"><u>RFC 4898</u></a> (TCP Extended Statistics MIB).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VZs6F65RQjyyEUUxZSP2L/b0edd986738e9128c16dcbecb7d83761/image3.png" />
          </figure><p>The plots above break down bytes sent by HTTP protocol version. The x-axis represents the total bytes sent by our servers over each connection. The patterns are generally consistent with what we observed in the packet count distributions.</p><p>For HTTP/1.X, the median response delivers 4.8 KB, and 90% of connections send fewer than 51 KB. In contrast, HTTP/2 connections show slightly larger responses, with a median of 6 KB and a 90th percentile of 146 KB. The mean is much higher—224 KB for HTTP/1.x and 390 KB for HTTP/2—reflecting a small number of very large transfers. These long-tailed extreme flows can reach tens of gigabytes per connection, while some very lightweight connections carry minimal payloads: the minimum for HTTP/1.X is 115 bytes and for HTTP/2 it is 202 bytes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xRYaXYQbte6MszIT92uky/837ebdc842c9784a9c413ad886f7a5d6/image6.png" />
          </figure><p>By making use of the tcpi_bytes_received metric, we can now look at the ratio of bytes sent to bytes received per connection to better understand the balance of data exchange. This ratio captures how asymmetric each connection is — essentially, how much data our servers send compared to what they receive from clients. Across all connections, the median ratio is 3.78, meaning that in half of all cases, servers send nearly four times more data than they receive. The average is far higher at 81.06, showing a strong long tail driven by download-heavy flows. Again we see a heavy long-tailed distribution: a small fraction of extreme cases push the ratio into the millions, reflecting connections dominated by data transfers towards clients.</p>
    <div>
      <h3>Connection duration</h3>
      <a href="#connection-duration">
        
      </a>
    </div>
    <p>While packet and byte counts capture how much data is exchanged, connection duration provides insight into how that exchange unfolds over time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5noP7Acqu2Ky4hCGtETH1F/92c7bd220d57232fb40440624d227a78/image8.png" />
          </figure><p>The CDF above shows the distribution of connection durations (lifetimes) in seconds. A reminder that the x-axis is log-scale. Across all connections, the median duration is just 4.7 seconds, meaning half of connections complete in under five seconds. The mean is much higher at 96 seconds, reflecting a small number of long-lived connections that skew the average. Most connections fall within a window of 0.1 seconds (10th percentile) to 300 seconds (90th percentile). We also observe some extremely long-lived connections lasting multiple days, possibly maintained via <a href="https://developers.cloudflare.com/fundamentals/reference/tcp-connections/#tcp-connections-and-keep-alives"><u>keep-alives</u></a> for connection reuse without hitting <a href="https://developers.cloudflare.com/fundamentals/reference/connection-limits/"><u>our default idle timeout limits</u></a>. These long-lived connections typically represent persistent sessions or multimedia traffic, while the majority of web traffic remains short, bursty, and transient.</p>
    <div>
      <h3>Request counts</h3>
      <a href="#request-counts">
        
      </a>
    </div>
    <p>For web traffic, a single connection can carry multiple HTTP requests. Counting requests per connection reveals patterns in connection reuse and multiplexing.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hsoigL4rFtIyRJpdSUXwh/5ef82b3c0cf5b25b8dc13ed38761f895/image7.png" />
          </figure><p>The plot above shows the number of HTTP requests (in log-scale) that we see on a single connection, broken down by HTTP protocol version. Right away, we can see that for both HTTP/1.X (mean 3 requests) and HTTP/2 (mean 8 requests) connections, the median number of requests is just 1, underscoring how limited connection reuse is in practice. However, because HTTP/2 supports multiplexing multiple streams over a single connection, the 90th percentile rises to 10 requests, with occasional extreme cases carrying thousands of requests, which can be amplified due to <a href="https://blog.cloudflare.com/connection-coalescing-experiments/"><u>connection coalescing</u></a>. In contrast, HTTP/1.X connections have much lower request counts. This aligns with protocol design: HTTP/1.0 followed a “one request per connection” philosophy, while HTTP/1.1 introduced persistent connections — yet even combining both versions, it’s rare to see HTTP/1.X connections carrying more than two requests at the 90th percentile.</p><p>The prevalence of short-lived connections can be partly explained by automated clients or scripts that tend to open new connections rather than maintaining long-lived sessions. To explore this intuition, we split the data between traffic originating from data centers (likely automated) and typical user traffic (user-driven), using client ASNs as a proxy.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1DhUbNv8cjQVGOqKUai7KU/fecc8eaa488ec216bfb14084a518501b/image9.png" />
          </figure><p>The plot above shows that non-DC (user-driven) traffic has slightly higher request counts per connection, consistent with browsers or apps fetching multiple resources over a single persistent connection, with a mean of 5 requests and a 90th percentile of 5 requests per connection. In contrast, DC-originated traffic has a mean of roughly 3 requests and a 90th percentile of 2, validating our expectation. Despite these differences, the median number of requests remains 1 for both groups, highlighting that, regardless of where connections originate, most are genuinely brief.</p>
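The DC versus non-DC split boils down to a set-membership check on the client ASN. A toy sketch of that grouping, where the ASN values and the data-center set are entirely hypothetical:

```python
from collections import defaultdict

# Hypothetical set of ASNs belonging to cloud/hosting providers.
DATACENTER_ASNS = {64500, 64501}

def origin_class(client_asn: int) -> str:
    """Label a connection's origin using its client ASN as a proxy."""
    return "dc" if client_asn in DATACENTER_ASNS else "non-dc"

# (client_asn, requests_on_connection) -- made-up sample connections.
connections = [(64500, 2), (64501, 1), (65010, 6), (65020, 4)]

requests_by_class = defaultdict(list)
for asn, requests in connections:
    requests_by_class[origin_class(asn)].append(requests)

mean_requests = {k: sum(v) / len(v) for k, v in requests_by_class.items()}
print(mean_requests)  # {'dc': 1.5, 'non-dc': 5.0}
```

In production this lookup would be driven by a full ASN-to-organization dataset rather than a hand-written set.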
    <div>
      <h2>Inferring path characteristics from connection-level data</h2>
      <a href="#inferring-path-characteristics-from-connection-level-data">
        
      </a>
    </div>
    <p>Connection-level measurements can also provide insights into underlying path characteristics. Let’s examine this in more detail.</p>
    <div>
      <h3>Path MTU</h3>
      <a href="#path-mtu">
        
      </a>
    </div>
    <p>The maximum transmission unit (<a href="https://www.cloudflare.com/learning/network-layer/what-is-mtu/"><u>MTU</u></a>) along the network path is often referred to as the Path MTU (PMTU). PMTU determines the largest packet size that can traverse a connection without fragmentation or packet drop, affecting throughput, efficiency, and latency. The Linux TCP stack on our servers tracks the largest segment size that can be sent without fragmentation along the path for a connection, as part of <a href="https://blog.cloudflare.com/path-mtu-discovery-in-practice/"><u>Path MTU discovery.</u></a></p><p>From that data we saw that the median (and the 90th percentile!) PMTU was 1,500 bytes, which aligns with the typical Ethernet MTU and is <a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit"><u>considered standard</u></a> for most Internet paths. Interestingly, the 10th percentile sits at 1,420 bytes, reflecting cases where paths include network links with slightly smaller MTUs—common in some <a href="https://blog.cloudflare.com/migrating-from-vpn-to-access/"><u>VPNs</u></a>, <a href="https://blog.cloudflare.com/increasing-ipv6-mtu/"><u>IPv6-over-IPv4 tunnels</u></a>, or older networking equipment that imposes stricter limits to avoid fragmentation. At the extreme, we have seen PMTUs as small as 552 bytes for IPv4 connections, which corresponds to the minimum PMTU value allowed <a href="https://www.kernel.org/doc/html/v6.5/networking/ip-sysctl.html#:~:text=Default%3A%20FALSE-,min_pmtu,-%2D%20INTEGER"><u>by the Linux kernel</u></a>.</p>
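The PMTU directly bounds how much TCP payload each packet can carry. A quick sketch of the arithmetic, assuming the minimal 20-byte IPv4 and 20-byte TCP headers with no options:

```python
IPV4_HEADER = 20  # bytes, without options
TCP_HEADER = 20   # bytes, without options

def max_segment_size(pmtu: int) -> int:
    """Largest TCP payload per packet for a given path MTU."""
    return pmtu - IPV4_HEADER - TCP_HEADER

# PMTU values observed in the distribution above.
for pmtu in (1500, 1420, 552):
    print(pmtu, "->", max_segment_size(pmtu), "payload bytes per packet")
```

A 552-byte path thus carries roughly a third of the payload per packet that a standard 1,500-byte path does, tripling the packet count for the same transfer.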
    <div>
      <h3>Initial congestion window</h3>
      <a href="#initial-congestion-window">
        
      </a>
    </div>
    <p>A key parameter in transport protocols is the congestion window (CWND): the number of packets that can be transmitted without waiting for an acknowledgement from the receiver. We call these packets or bytes “in-flight.” The congestion window evolves dynamically throughout a connection.</p><p>However, the initial congestion window (ICWND) at the start of a data transfer can have an outsized impact, especially for short-lived connections, which dominate Internet traffic as we’ve seen above. If the ICWND is set too low, small and medium transfers take additional round-trip times to reach bottleneck bandwidth, slowing delivery. Conversely, if it’s too high, the sender risks overwhelming the network, causing unnecessary packet loss and retransmissions — potentially for all connections that share the bottleneck link.</p><p>A reasonable estimate of the ICWND can be taken as the congestion window size at the instant the TCP sender transitions out of <a href="https://www.rfc-editor.org/rfc/rfc5681#section-3.1"><u>slow start</u></a>. This transition marks the point at which the sender shifts from exponential growth to congestion avoidance, having inferred that further growth may risk congestion. The figure below shows the distribution of congestion window sizes at the moment slow start exits — as calculated by <a href="https://blog.cloudflare.com/http-2-prioritization-with-nginx/#bbr-congestion-control"><u>BBR</u></a>. The median is roughly 464 KB, which corresponds to about 310 packets per connection with a typical 1,500-byte MTU, while extreme flows carry tens of megabytes in flight. This variance reflects the diversity of TCP connections and the dynamically evolving nature of the networks carrying traffic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BzqE6HSQgkriWisqS3Yx3/de4dc12a453d162884e9a015ccb40348/image4.png" />
          </figure><p>It’s important to emphasize that these values reflect a mix of network paths, including not only paths between Cloudflare and end users, but also between Cloudflare and neighboring datacenters, which are typically well provisioned and offer higher bandwidth.</p><p>Our initial inspection of the above distribution left us doubtful, because the values seem very high. We then realized the numbers are an artifact of behaviour specific to BBR, in which it sets the congestion window higher than its estimate of the path’s available capacity, <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product"><u>bandwidth delay product (BDP)</u></a>. The inflated value is <a href="https://www.ietf.org/archive/id/draft-cardwell-iccrg-bbr-congestion-control-01.html#name-state-machine-operation"><u>by design</u></a>. To prove the hypothesis, we re-plot the distribution from above in the figure below alongside BBR’s estimate of BDP. The difference is clear between BBR’s congestion window of unacknowledged packets and its BDP estimate.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34YFSv4Zdp82qszNM79XsH/3c147dfd5c5006fe55abb53dab47bef1/image10.png" />
          </figure><p>The above plot adds the computed BDP values in context with connection telemetry. The median BDP comes out to be roughly 77 KB, which is roughly 50 packets. If we compare this to the congestion window distribution from above, we see that BDP estimations from recently closed connections are much more stable.</p><p>We are using these insights to help identify reasonable initial congestion window sizes and the circumstances in which they apply. Our own internal experiments make clear that ICWND sizes can affect performance by as much as 30-40% for smaller connections. Such insights will potentially help to revisit efforts to find better initial congestion window values, whose default has been <a href="https://datatracker.ietf.org/doc/html/rfc6928"><u>10 packets</u></a> for more than a decade.</p>
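BDP is simply available bandwidth multiplied by round-trip time. A small sketch of the arithmetic, with illustrative (not measured) path parameters chosen to land near the medians above:

```python
import math

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps / 8 * rtt_seconds

def bdp_packets(bandwidth_bps: float, rtt_seconds: float, mtu: int = 1500) -> int:
    """BDP expressed in full-size packets for a given MTU."""
    return math.ceil(bdp_bytes(bandwidth_bps, rtt_seconds) / mtu)

# Illustrative path: 100 Mbit/s of available bandwidth, 6 ms RTT.
# 100e6 / 8 * 0.006 = 75,000 bytes, i.e. 50 full-size 1,500-byte packets --
# the same order of magnitude as the ~77 KB / ~50 packet median BDP above.
print(bdp_bytes(100e6, 0.006), bdp_packets(100e6, 0.006))
```

The comparison makes concrete why a BBR congestion window several times larger than the BDP estimate looked suspicious at first glance.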
    <div>
      <h3>Deeper understanding, better performance</h3>
      <a href="#deeper-understanding-better-performance">
        
      </a>
    </div>
    <p>We observed that Internet connections are highly heterogeneous, confirming decades of observations of strong heavy-tail characteristics consistent with the “<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>elephants and mice</u></a>” phenomenon. Ratios of upload to download bytes are unsurprising for larger flows, but surprisingly small for short flows, highlighting the asymmetric nature of Internet traffic. Understanding these connection characteristics continues to inform ways to improve connection performance, reliability, and user experience.</p><p>We will continue to build on this work, and plan to publish connection-level statistics on <a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a> so that others can similarly benefit.</p><p>Our work on improving our network is ongoing, and we welcome researchers, academics, <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"><u>interns</u></a>, and anyone interested in this space to reach out at <a href="#"><u>ask-research@cloudflare.com</u></a>. By sharing knowledge and working together, we can all continue to make the Internet faster, safer, and more reliable for everyone.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5jyi6dhHiLQu3BVMVGKrVG</guid>
            <dc:creator>Suleman Ahmad</dc:creator>
            <dc:creator>Peter Wu</dc:creator>
        </item>
        <item>
            <title><![CDATA[One IP address, many users: detecting CGNAT to reduce collateral effects]]></title>
            <link>https://blog.cloudflare.com/detecting-cgn-to-reduce-collateral-damage/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ IPv4 scarcity drives widespread use of Carrier-Grade Network Address Translation, a practice in ISPs and mobile networks that places many users behind each IP address, along with their collected activity and volumes of traffic. We introduce the method we’ve developed to detect large-scale IP sharing globally and mitigate the issues that result.  ]]></description>
            <content:encoded><![CDATA[ <p>IP addresses have historically been treated as stable identifiers for non-routing purposes such as for geolocation and security operations. Many operational and security mechanisms, such as blocklists, rate-limiting, and anomaly detection, rely on the assumption that a single IP address represents a cohesive, accountable entity or even, possibly, a specific user or device.</p><p>But the structure of the Internet has changed, and those assumptions can no longer be made. Today, a single IPv4 address may represent hundreds or even thousands of users due to widespread use of <a href="https://en.wikipedia.org/wiki/Carrier-grade_NAT"><u>Carrier-Grade Network Address Translation (CGNAT)</u></a>, VPNs, and proxy middleboxes. This concentration of traffic can result in <a href="https://blog.cloudflare.com/consequences-of-ip-blocking/"><u>significant collateral damage</u></a> – especially to users in developing regions of the world – when security mechanisms are applied without taking into account the multi-user nature of IPs.</p><p>This blog post presents our approach to detecting large-scale IP sharing globally. We describe how we <a href="https://www.cloudflare.com/learning/ai/how-to-secure-training-data-against-ai-data-leaks/">build reliable training data</a>, and how detection can help avoid unintentional bias affecting users in regions where IP sharing is most prevalent. Arguably it's those regional variations that motivate our efforts more than any other. </p>
    <div>
      <h2>Why this matters: Potential socioeconomic bias</h2>
      <a href="#why-this-matters-potential-socioeconomic-bias">
        
      </a>
    </div>
    <p>Our work was initially motivated by a simple observation: CGNAT is a likely unseen source of bias on the Internet. Those biases would be more pronounced wherever there are more users and fewer addresses, such as in developing regions. And these biases can have profound implications for user experience, network operations, and digital equity.</p><p>The reasons are understandable, not least because of necessity. Countries in the developing world often have significantly fewer available IPs, and more users. The disparity is a historical artifact of how the Internet grew: the largest blocks of IPv4 addresses were allocated decades ago, primarily to organizations in North America and Europe, leaving a much smaller pool for regions where Internet adoption expanded later. </p><p>To visualize the IPv4 allocation gap, we plot country-level ratios of users to IP addresses in the figure below. We take online user estimates from the <a href="https://data.worldbank.org/indicator/IT.NET.USER.ZS"><u>World Bank Group</u></a> and the number of IP addresses in a country from Regional Internet Registry (RIR) records. The colour-coded map that emerges shows that usage of each IP address is more concentrated in regions that generally have poor Internet penetration. For example, large portions of Africa and South Asia appear with the highest user-to-IP ratios. Conversely, the lowest user-to-IP ratios appear in Australia, Canada, Europe, and the USA — the very countries that otherwise have the highest Internet user penetration numbers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YBdqPx0ALt7pY7rmQZyLQ/049922bae657a715728700c764c4af16/BLOG-3046_2.png" />
          </figure><p>The scarcity of IPv4 address space means that regional differences can only worsen as Internet penetration rates increase. A natural consequence of increased demand in developing regions is that ISPs would rely even more heavily on CGNAT, a reliance compounded by the fact that CGNAT is common in the mobile networks that users in developing regions so heavily depend on. All of this means that <a href="https://datatracker.ietf.org/doc/html/rfc7021"><u>actions known to be based</u></a> on IP reputation or behaviour would disproportionately affect developing economies. </p><p>Cloudflare is a global network in a global Internet. We are sharing our methodology so that others might benefit from our experience and help to mitigate unintended effects. First, let’s better understand CGNAT.</p>
    <div>
      <h3>When one IP address serves multiple users</h3>
      <a href="#when-one-ip-address-serves-multiple-users">
        
      </a>
    </div>
    <p>Large-scale IP address sharing is primarily achieved through two distinct methods. The first, and more familiar, involves services like VPNs and proxies. These tools emerge from a need to secure corporate networks or improve users' privacy, but can be used to circumvent censorship or even improve performance. Their deployment also tends to concentrate traffic from many users onto a small set of exit IPs. Typically, individuals are aware they are using such a service, whether for personal use or as part of a corporate network.</p><p>Separately, another form of large-scale IP sharing often goes unnoticed by users: <a href="https://en.wikipedia.org/wiki/Carrier-grade_NAT"><u>Carrier-Grade NAT (CGNAT)</u></a>. One way to explain CGNAT is to start with a much smaller version of network address translation (NAT) that very likely exists in your home broadband router, formally called Customer Premises Equipment (CPE), which translates unseen private addresses in the home to visible and routable addresses in the ISP. Once traffic leaves the home, an ISP may add an additional layer of address translation that causes many households or unrelated devices to appear behind a single IP address.</p><p>The crucial difference between these forms of large-scale IP sharing is user choice: carrier-grade address sharing is not a user choice, but is configured directly by Internet Service Providers (ISPs) within their access networks. Users are not aware that CGNATs are in use. </p><p>The primary driver for this technology, understandably, is the exhaustion of the IPv4 address space. IPv4's 32-bit architecture supports only 4.3 billion unique addresses — a capacity that, while once seemingly vast, has been completely outpaced by the Internet's explosive growth. By the early 2010s, Regional Internet Registries (RIRs) had depleted their pools of unallocated IPv4 addresses. 
This left ISPs unable to easily acquire new address blocks, forcing them to maximize the use of their existing allocations.</p><p>While the long-term solution is the transition to IPv6, CGNAT emerged as the immediate, practical workaround. Instead of assigning a unique public IP address to each customer, ISPs use CGNAT to place multiple subscribers behind a single, shared IP address. This practice solves the problem of IP address scarcity. Since translated addresses are not publicly routable, CGNATs have also had the positive side effect of protecting many home devices that might be vulnerable to compromise. </p><p>CGNATs also create significant operational fallout stemming from the fact that hundreds or even thousands of clients can appear to originate from a single IP address. <b>This means an IP-based security system may inadvertently block or throttle large groups of users as a result of a single user behind the CGNAT engaging in malicious activity.</b></p><p>This isn't a new or niche issue. It has been recognized for years by the Internet Engineering Task Force (IETF), the organization that develops the core technical standards for the Internet. These standards, known as Requests for Comments (RFCs), act as the official blueprints for how the Internet should operate. <a href="https://www.rfc-editor.org/rfc/rfc6269.html"><u>RFC 6269</u></a>, for example, discusses the challenges of IP address sharing, while <a href="https://datatracker.ietf.org/doc/html/rfc7021"><u>RFC 7021</u></a> examines the impact of CGNAT on network applications. 
Both explain that traditional abuse-mitigation techniques, such as blocklisting or rate-limiting, assume a one-to-one relationship between IP addresses and users: when malicious activity is detected, the offending IP address can be blocked to prevent further abuse.</p><p>In shared IPv4 environments, such as those using CGNAT or other address-sharing techniques, this assumption breaks down because multiple subscribers can appear under the same public IP. Blocking the shared IP therefore penalizes many innocent users along with the abuser. In 2015, Ofcom, the UK's telecommunications regulator, reiterated these concerns in a <a href="https://oxil.uk/research/mc159-report-on-the-implications-of-carrier-grade-network-address-translators-final-report"><u>report</u></a> on the implications of CGNAT where they noted that, “In the event that an IPv4 address is blocked or blacklisted as a source of spam, the impact on a CGNAT would be greater, potentially affecting an entire subscriber base.” </p><p>While the hope was that CGNAT was only a temporary solution until the eventual switch to IPv6, as the old proverb says, nothing is more permanent than a temporary solution. While IPv6 deployment continues to lag, <a href="https://blog.apnic.net/2022/01/19/ip-addressing-in-2021/"><u>CGNAT deployments have become increasingly common</u></a>, and so have the related problems. </p>
    <div>
      <h2>CGNAT detection at Cloudflare</h2>
      <a href="#cgnat-detection-at-cloudflare">
        
      </a>
    </div>
    <p>To enable a fairer treatment of users behind CGNAT IPs by security techniques that rely on IP reputation, our goal is to identify large-scale IP sharing. This allows traffic filtering to be better calibrated and collateral damage minimized. Additionally, we want to distinguish CGNAT IPs from other large-scale sharing (LSS) IP technologies, such as VPNs and proxies, because we may need to take different approaches to different kinds of IP-sharing technologies.</p><p>To do this, we decided to take advantage of Cloudflare’s extensive view of the active IP clients, and build a supervised learning classifier that would distinguish CGNAT and VPN/proxy IPs from IPs that are allocated to a single subscriber (non-LSS IPs), based on behavioural characteristics. The figure below shows an overview of our supervised classifier: </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7tFXZByKRCYxVaAFDG0Xda/d81e7f09b5d12e03e39c266696df9cc3/BLOG-3046_3.png" />
          </figure><p>While our classification approach is straightforward, a significant challenge is the lack of a reliable, comprehensive, and labeled dataset of CGNAT IPs for our training dataset.</p>
    <div>
      <h3>Detecting CGNAT using public data sources </h3>
      <a href="#detecting-cgnat-using-public-data-sources">
        
      </a>
    </div>
    <p>Detection begins by building an initial dataset of IPs believed to be associated with CGNAT. Cloudflare has vast HTTP and traffic logs. Unfortunately, there is no signal or label in any request to indicate whether its source IP is a CGNAT. </p><p>To build an extensive labelled dataset to train our ML classifier, we employ a combination of network measurement techniques, as described below. We rely on public data sources to help disambiguate an initial set of large-scale shared IP addresses from others in Cloudflare’s logs.   </p>
    <div>
      <h4>Distributed Traceroutes</h4>
      <a href="#distributed-traceroutes">
        
      </a>
    </div>
    <p>The presence of a client behind CGNAT can often be inferred through traceroute analysis. CGNAT requires ISPs to insert a NAT step that typically uses the Shared Address Space (<a href="https://datatracker.ietf.org/doc/html/rfc6598"><u>RFC 6598</u></a>) after the customer premises equipment (CPE). If we run a traceroute from the client to its own public IP and examine the hop sequence, the appearance of an address within 100.64.0.0/10 between the first private hop (e.g., 192.168.1.1) and the public IP is a strong indicator of CGNAT.</p><p>Traceroute can also reveal the multi-level NAT that CGNAT requires, as shown in the diagram below. If the ISP assigns the CPE a private <a href="https://datatracker.ietf.org/doc/html/rfc1918"><u>RFC 1918</u></a> address that appears right after the local hop, this indicates at least two NAT layers. While ISPs sometimes use private addresses internally without CGNAT, observing private or shared ranges immediately downstream combined with multiple hops before the public IP strongly suggests CGNAT or equivalent multi-layer NAT.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57k4gwGCHcPggIWtSy36HU/6cf8173c1a4c568caa25a1344a516e9e/BLOG-3046_4.png" />
          </figure><p>Although traceroute accuracy depends on router configurations, detecting private and shared IP ranges is a reliable way to identify large-scale IP sharing. We apply this method to distributed traceroutes from over 9,000 RIPE Atlas probes to classify hosts as behind CGNAT, single-layer NAT, or no NAT.</p>
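The hop-sequence check can be sketched with Python's ipaddress module. The classification labels below are an illustrative simplification of the heuristic described above, not the production logic:

```python
from ipaddress import ip_address, ip_network

SHARED = ip_network("100.64.0.0/10")  # RFC 6598 shared address space

def classify_hops(hops: list[str]) -> str:
    """Classify a client-to-own-public-IP traceroute by its NAT layering."""
    addrs = [ip_address(h) for h in hops]
    has_private = any(a.is_private and a not in SHARED for a in addrs)
    has_shared = any(a in SHARED for a in addrs)
    if has_shared:
        # RFC 6598 space between the CPE and the public IP strongly
        # suggests a carrier-grade NAT layer.
        return "cgnat"
    if has_private:
        return "single-nat"
    return "no-nat"

# Private CPE hop, then an RFC 6598 hop, then the public IP: CGNAT.
print(classify_hops(["192.168.1.1", "100.64.12.1", "1.1.1.1"]))  # cgnat
```

Real traceroutes also contain unresponsive hops and per-hop timing, which a production classifier would have to tolerate.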
    <div>
      <h4>Scraping WHOIS and PTR records</h4>
      <a href="#scraping-whois-and-ptr-records">
        
      </a>
    </div>
    <p>Many operators encode metadata about their IPs in the corresponding reverse DNS pointer (PTR) record that can signal administrative attributes and geographic information. We first query the DNS for PTR records for the full IPv4 space and then filter for a set of known keywords from the responses that indicate a CGNAT deployment. For example, each of the following three records matches a keyword (<code>cgnat</code>, <code>cgn</code> or <code>lsn</code>) used to detect CGNAT address space:</p><p><code>node-lsn.pool-1-0.dynamic.totinternet.net
103-246-52-9.gw1-cgnat.mobile.ufone.nz
cgn.gsw2.as64098.net</code></p><p>WHOIS and Internet Routing Registry (IRR) records may also contain organizational names, remarks, or allocation details that reveal whether a block is used for CGNAT pools or residential assignments. </p><p>Given that both PTR and WHOIS records may be manually maintained and therefore may be stale, we try to sanitize the extracted data by validating the fact that the corresponding ISPs indeed use CGNAT based on customer and market reports. </p>
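The keyword filter over PTR responses can be sketched as a word-boundary regex. The keyword list here is just the three examples named above; the production filter is broader:

```python
import re

# Keywords observed in CGNAT-related PTR records (from the examples above).
CGNAT_KEYWORDS = re.compile(r"\b(?:cgnat|cgn|lsn)\b", re.IGNORECASE)

def looks_like_cgnat(ptr: str) -> bool:
    """True if a reverse-DNS name contains a CGNAT-indicating keyword."""
    return CGNAT_KEYWORDS.search(ptr) is not None

records = [
    "node-lsn.pool-1-0.dynamic.totinternet.net",  # matches 'lsn'
    "103-246-52-9.gw1-cgnat.mobile.ufone.nz",     # matches 'cgnat'
    "cgn.gsw2.as64098.net",                       # matches 'cgn'
    "dsl-203-0-113-7.example.net",                # no keyword
]
for record in records:
    print(record, looks_like_cgnat(record))
```

Word boundaries matter: hostnames separate labels with `-` and `.`, both non-word characters, so `\b` matches a keyword embedded in a longer name without also firing on unrelated substrings.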
    <div>
      <h4>Collecting VPN and proxy IPs </h4>
      <a href="#collecting-vpn-and-proxy-ips">
        
      </a>
    </div>
    <p>Compiling a list of VPN and proxy IPs is more straightforward, as we can directly find such IPs in public service directories for anonymizers. We also subscribe to multiple VPN providers, and we collect the IPs allocated to our clients by connecting to a unique HTTP endpoint under our control. </p>
    <div>
      <h2>Modeling CGNAT with machine learning</h2>
      <a href="#modeling-cgnat-with-machine-learning">
        
      </a>
    </div>
    <p>By combining the above techniques, we accumulated a labeled dataset of more than 200K CGNAT IPs, 180K VPN &amp; proxy IPs, and close to 900K non-LSS IPs. These were the entry points to modeling with machine learning.</p>
    <div>
      <h3>Feature selection</h3>
      <a href="#feature-selection">
        
      </a>
    </div>
    <p>Our hypothesis was that aggregated activity from CGNAT IPs is distinguishable from activity generated from other non-CGNAT IP addresses. Our feature extraction is an evaluation of that hypothesis — since networks do not disclose CGNAT and other uses of IPs, the quality of our inference is strictly dependent on our confidence in the training data. We claim the key discriminator is diversity, not just volume. For example, VM-hosted scanners may generate high numbers of requests, but with low information diversity. Similarly, globally routable CPEs may have individually unique characteristics, but with volumes that are less likely to be caught at lower sampling rates.</p><p>In our feature extraction, we parse a 1% sample of HTTP request logs for distinguishing features of IPs compiled in our reference set, and the same features for the corresponding /24 prefix (namely, IPs with the same first 24 bits in common). We analyse the features for each VPN, proxy, CGNAT, or non-LSS IP. We find that features from the following broad categories are key discriminators for the different types of IPs in our training dataset:</p><ul><li><p><b>Client-side signals:</b> We analyze the aggregate properties of clients connecting from an IP. A large, diverse user base (like on a CGNAT) naturally presents a much wider statistical variety of client behaviors and connection parameters than a single-tenant server or a small business proxy.</p></li><li><p><b>Network and transport-level behaviors:</b> We examine traffic at the network and transport layers. The way a large-scale network appliance (like a CGNAT) manages and routes connections often leaves subtle, measurable artifacts in its traffic patterns, such as in port allocation and observed network timing.</p></li><li><p><b>Traffic volume and destination diversity:</b> We also model the volume and "shape" of the traffic. An IP representing thousands of independent users will, on average, generate a higher volume of requests and target a much wider, less correlated set of destinations than an IP representing a single user.</p></li></ul><p>Crucially, to distinguish CGNAT from VPNs and proxies (which is absolutely necessary for calibrated security filtering), we had to aggregate these features at two different scopes: per IP and per /24 prefix. CGNAT deployments are typically allocated large blocks of IPs, whereas VPN IPs are more scattered across different IP prefixes. </p>
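The two-scope aggregation can be sketched as follows; the record fields and the destination-diversity metric are illustrative stand-ins, not the production feature set:

```python
from collections import defaultdict
from ipaddress import ip_network

# Illustrative sampled request records: (client_ip, destination_host).
records = [
    ("198.51.100.10", "shop.example"),
    ("198.51.100.10", "news.example"),
    ("198.51.100.23", "api.example"),
    ("203.0.113.5", "shop.example"),
]

per_ip = defaultdict(set)      # distinct destinations per client IP
per_prefix = defaultdict(set)  # distinct destinations per /24 prefix

for ip, dest in records:
    prefix = ip_network(f"{ip}/24", strict=False)
    per_ip[ip].add(dest)
    per_prefix[prefix].add(dest)

# A simple diversity feature: number of distinct destinations per scope.
ip_diversity = {ip: len(d) for ip, d in per_ip.items()}
prefix_diversity = {str(p): len(d) for p, d in per_prefix.items()}
print(ip_diversity)
print(prefix_diversity)
```

Computing the same feature at both scopes is what lets the model see that a CGNAT /24 accumulates far more diversity than any single IP within it.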
    <div>
      <h3>Classification results</h3>
      <a href="#classification-results">
        
      </a>
    </div>
    <p>We compute the above features from HTTP logs over 24-hour intervals to increase data volume and reduce noise due to DHCP IP reallocation. The dataset is split into 70% training and 30% testing sets with disjoint /24 prefixes, and VPN and proxy labels are merged due to their similarity and lower operational importance compared to CGNAT detection.</p><p>Then we train a multi-class <a href="https://xgboost.readthedocs.io/en/stable/"><u>XGBoost</u></a> model with class weighting to address imbalance, assigning each IP to the class with the highest predicted probability. XGBoost is well-suited for this task because it efficiently handles large feature sets, offers strong regularization to prevent overfitting, and delivers high accuracy with limited parameter tuning. The classifier achieves 0.98 accuracy, 0.97 weighted F1, and 0.04 log loss. The figure below shows the confusion matrix of the classification.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26i81Pe0yjlftHfIDrjB5X/45d001447fc52001a25176c8036a92cb/BLOG-3046_5.png" />
          </figure><p>Our model is accurate for all three labels. The errors observed are mainly misclassifications of VPN/proxy IPs as CGNATs, mostly for VPN/proxy IPs that are within a /24 prefix that is also shared by broadband users outside of the proxy service. We also evaluate the prediction accuracy using <a href="https://scikit-learn.org/stable/modules/cross_validation.html"><u>k-fold cross validation</u></a>, which provides a more reliable estimate of performance by training and validating on multiple data splits, reducing variance and overfitting compared to a single train–test split. We use 10 folds and evaluate the <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"><u>Area Under the ROC Curve</u></a> (AUC) and the multi-class log loss. We achieve a macro-average AUC of 0.9946 (σ=0.0069) and log loss of 0.0429 (σ=0.0115). Prefix-level features are the most important contributors to classification performance.</p>
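<p>To illustrate the evaluation setup (a standard-library sketch under stated assumptions, not our actual pipeline, which trains an XGBoost model): the split keeps /24 prefixes disjoint between training and testing, class weighting counters the label imbalance, and multi-class log loss is the metric reported above.</p>

```python
import math
import random
from collections import Counter

def split_by_prefix(rows, test_frac=0.3, seed=7):
    """Split labeled IPs so that no /24 prefix spans both train and test,
    preventing the model from memorizing prefixes seen in training."""
    prefixes = sorted({r["prefix"] for r in rows})
    random.Random(seed).shuffle(prefixes)
    test_prefixes = set(prefixes[: int(len(prefixes) * test_frac)])
    train = [r for r in rows if r["prefix"] not in test_prefixes]
    test = [r for r in rows if r["prefix"] in test_prefixes]
    return train, test

def class_weights(labels):
    """Inverse-frequency weights, one common way to address class imbalance."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class index."""
    return -sum(math.log(max(p[c], eps)) for c, p in zip(y_true, probs)) / len(y_true)
```

<p>With helpers like these, any multi-class model can be trained on the prefix-disjoint split and scored consistently across folds.</p>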
    <div>
      <h3>Users behind CGNAT are more likely to be rate limited</h3>
      <a href="#users-behind-cgnat-are-more-likely-to-be-rate-limited">
        
      </a>
    </div>
    <p>The figure below shows the daily number of CGNAT IP inferences generated by our CDN-deployed detection service between December 17, 2024 and January 9, 2025. The number of inferences remains largely stable, with noticeable dips during weekends and holidays such as Christmas and New Year’s Day. This pattern reflects expected seasonal variations, as lower traffic volumes during these periods lead to fewer active IP ranges and reduced request activity.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hiYstptHAK6tFQrM2kEsf/7f8192051156fc6eaecdf26a829ef11c/BLOG-3046_6.png" />
          </figure><p>Next, recall that actions that rely on IP reputation or behavior may be unduly influenced by CGNATs. One such example is bot detection. In an evaluation of our systems, we find that bot detection is resilient to those biases. However, we also learned that customers are more likely to rate limit IPs that we find are CGNATs.</p><p>We evaluate bot labels by measuring how often requests from CGNAT and non-CGNAT IPs are labeled as bots. <a href="https://www.cloudflare.com/resources/assets/slt3lc6tev37/JYknFdAeCVBBWWgQUtNZr/61844a850c5bba6b647d65e962c31c9c/BDES-863_Bot_Management_re_edit-_How_it_Works_r3.pdf"><u>Cloudflare assigns a bot score</u></a> to each HTTP request using CatBoost models trained on various request features, and these scores are then exposed through the Web Application Firewall (WAF), allowing customers to apply filtering rules. The median bot rate is nearly identical for CGNAT (4.8%) and non-CGNAT (4.7%) IPs. However, the mean bot rate is notably lower for CGNATs (7%) than for non-CGNATs (13.1%), indicating different underlying distributions. Non-CGNAT IPs show a much wider spread, with some reaching 100% bot rates, while CGNAT IPs cluster mostly below 15%. This suggests that non-CGNAT IPs tend to be dominated by either human or bot activity, whereas CGNAT IPs reflect mixed behavior from many end users, with human traffic prevailing.</p><p>Interestingly, despite bot scores that indicate traffic is more likely to be from human users, CGNAT IPs are subject to rate limiting three times more often than non-CGNAT IPs. 
This is likely because multiple users share the same public IP, increasing the chances that legitimate traffic gets caught by customers’ bot mitigation and firewall rules.</p><p>This tells us that users behind CGNAT IPs are indeed susceptible to collateral effects, and identifying those IPs allows us to tune mitigation strategies to disrupt malicious traffic quickly while reducing collateral impact on benign users behind the same address.</p>
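<p>The divergence between a stable median and an inflated mean is the signature of a skewed distribution. A toy example (synthetic numbers chosen for illustration, not Cloudflare measurements) shows how a few all-bot IPs drag the non-CGNAT mean upward while leaving its median untouched:</p>

```python
import statistics

# Hypothetical per-IP bot rates (fraction of requests labeled as bot).
# Non-CGNAT pool: mostly human-dominated IPs plus a few 100%-bot IPs.
non_cgnat = [0.04] * 90 + [1.0] * 10
# CGNAT pool: mixed traffic from many users, clustered at low rates.
cgnat = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.12]

# Medians are similar, but the bot-only outliers inflate the first mean.
assert statistics.median(non_cgnat) == 0.04
assert statistics.mean(non_cgnat) > 2 * statistics.mean(cgnat)
```

<p>The same arithmetic explains the reported figures: a handful of IPs at a 100% bot rate moves the mean far more than the median.</p>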
    <div>
      <h2>A global view of the CGNAT ecosystem</h2>
      <a href="#a-global-view-of-the-cgnat-ecosystem">
        
      </a>
    </div>
    <p>One of the early motivations of this work was to understand if our knowledge about IP addresses might hide a bias along socio-economic boundaries—and in particular if an action on an IP address may disproportionately affect populations in developing nations, often referred to as the Global South. Identifying where different IPs exist is a necessary first step.</p><p>The map below shows the fraction of a country’s inferred CGNAT IPs over all IPs observed in the country. Regions with a greater reliance on CGNAT appear darker on the map. This view highlights how reliance on CGNAT varies geographically; for example, much of Africa and Central and Southeast Asia rely on CGNATs. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4P2XcuEebKfcYdCgykMWuP/4a0aa86bd619ba24533de6862175e919/BLOG-3046_7.png" />
          </figure><p>As further evidence of continental differences, the boxplot below shows the distribution of distinct user agents per IP across /24 prefixes inferred to be part of a CGNAT deployment in each continent. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7bqJSHexFuXFs4A8am1ibQ/591be6880e8f58c9d61b147aaf0487f5/BLOG-3046_8.png" />
          </figure><p>Notably, Africa has a much higher ratio of user agents to IP addresses than other regions, suggesting more clients share the same IP in African <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><u>ASNs</u></a>. So, not only do African ISPs rely more extensively on CGNAT, but the number of clients behind each CGNAT IP is also higher. </p><p>While the per-country CGNAT deployment rate is consistent with the per-country users-per-IP ratio, that ratio alone is not sufficient to confirm deployment. The scatterplot below shows the number of users (according to <a href="https://stats.labs.apnic.net/aspop/"><u>APNIC user estimates</u></a>) and the number of IPs per ASN for ASNs where we detect CGNAT. ASNs that have fewer available IP addresses than their user base appear below the diagonal. Interestingly, the scatterplot indicates that many ASNs with more addresses than users still choose to deploy CGNAT. Presumably, these ASNs provide additional services beyond broadband, preventing them from dedicating their entire address pool to subscribers. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34GKPlJWvkwudU5MbOtots/c883760a7c448b12995997e3e6e51979/BLOG-3046_9.png" />
          </figure>
    <div>
      <h3>What this means for everyday Internet users</h3>
      <a href="#what-this-means-for-everyday-internet-users">
        
      </a>
    </div>
    <p>Accurate detection of CGNAT IPs is crucial for minimizing collateral effects in network operations and for ensuring fair and effective application of security measures. Our findings underscore the potential socio-economic and geographical variations in the use of CGNATs, revealing significant disparities in how IP addresses are shared across different regions. </p><p>At Cloudflare we are going beyond just using these insights to evaluate policies and practices. We are using these detections to improve systems across our application security suite of features, and working with customers to understand how they might use these insights to improve the protections they configure.</p><p>Our work is ongoing and we’ll share details as we go. In the meantime, if you’re an ISP or network operator running CGNAT and you want to help, get in touch at <a href="mailto:ask-research@cloudflare.com"><u>ask-research@cloudflare.com</u></a>. Sharing knowledge and working together helps create a better and more equitable user experience for subscribers, while preserving web service safety and security.</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Web Application Firewall]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Bots]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">9cTCNUkDdgVjdBN6M6JLv</guid>
            <dc:creator>Vasilis Giotsas</dc:creator>
            <dc:creator>Marwan Fayed</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to build your own VPN, or: the history of WARP]]></title>
            <link>https://blog.cloudflare.com/how-to-build-your-own-vpn-or-the-history-of-warp/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ WARP’s initial implementation resembled a VPN that allows Internet access through it. Here’s how we built it – and how you can, too.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Linux’s networking capabilities are a crucial part of how Cloudflare serves billions of requests in the face of DDoS attacks. The tools it provides us are <a href="https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/"><u>invaluable and useful</u></a>, and a constant stream of contributions from developers worldwide ensures it <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>continually gets more capable and performant</u></a>.</p><p>When we developed <a href="https://blog.cloudflare.com/1111-warp-better-vpn/"><u>WARP, our mobile-first performance and security app</u></a>, we faced a new challenge: how to securely and efficiently egress arbitrary user packets for millions of mobile clients from our edge machines. This post explores our first solution, which was essentially building our own high-performance VPN with the Linux networking stack. We needed to integrate it into our existing network; not just directly linking it into our CDN service, but providing a way to securely egress arbitrary user packets from Cloudflare machines. The lessons we learned here helped us develop new <a href="https://www.cloudflare.com/en-gb/zero-trust/products/gateway/"><u>products</u></a> and <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>capabilities</u></a> and discover more strange things besides. But first, how did we get started?</p>
    <div>
      <h2>A bridge between two worlds</h2>
      <a href="#a-bridge-between-two-worlds">
        
      </a>
    </div>
    <p>WARP’s initial implementation resembled a virtual private network (VPN) that allows Internet access through it. Specifically, a Layer 3 VPN – a tunnel for IP packets.</p><p>IP packets are the building blocks of the Internet. When you send data over the Internet, it is split into small chunks and sent separately in packets, each one labeled with a destination address (who the packet goes to) and a source address (who to send a reply to). If you are connected to the Internet, you have an IP address.</p><p>You may not have a <i>unique</i> IP address, though. This is certainly true for IPv4 which, despite our and many others’ long-standing efforts to move everyone to IPv6, is still in widespread use. IPv4 has only 4 billion possible addresses and they have all been assigned – you’re gonna have to share.</p><p>When you use WiFi at home, work or the coffee shop, you’re connected to a local network. Your device is assigned a local IP address to talk to the access point and any other devices in your network. However, that address has no meaning outside of the local network. You can’t use that address in IP packets sent over the Internet, because every local IPv4 network uses <a href="https://en.wikipedia.org/wiki/Private_network"><u>the same few sets of addresses</u></a>.</p><p>So how does Internet access work? Local IPv4 networks generally employ a <i>router</i>, a device to perform network-address translation (NAT). NAT is used to convert the private IPv4 network addresses allocated to devices on the local-area network to a small set of publicly-routable addresses given by your Internet service provider. The router keeps track of the conversions it applies between the two networks in a translation table. When a packet is received on either network, the router consults the translation table and applies the appropriate conversion before sending the packet to the opposite network.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uT2VOMUn2fJ9NleofEfVB/b871de07a16714f1d05b2b3d0d547aa7/image6.png" />
          </figure><p><sup>Diagram of a router using NAT to bridge connections from devices on a private network to the public Internet</sup></p><p>A VPN that provides Internet access is no different in this respect to a LAN – the only unusual aspect is that the user of the VPN communicates with the VPN server over the public Internet. The model is simple: private network IP packets are tunnelled, or encapsulated, in public IP packets addressed to the VPN server.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/613OhwoQSh2JHzIsBLzo8U/876446bed57eb8b70ba9ecac0d8f0c75/image5.png" />
          </figure><p><sup>Schematic of HTTPS packets being encapsulated between a VPN client and server</sup></p><p>Most times, VPN software only handles the encapsulation and decapsulation of packets, and gives you a virtual network device to send and receive packets on the VPN. This gives you the freedom to configure the VPN however you like. For WARP, we need our servers to act as a router between the VPN client and the Internet.</p>
    <div>
      <h2>NAT’s how you do it</h2>
      <a href="#nats-how-you-do-it">
        
      </a>
    </div>
    <p>Linux – the operating system powering our servers – can be configured to perform routing with NAT in its <a href="https://en.wikipedia.org/wiki/Netfilter"><u>Netfilter</u></a> subsystem. Netfilter is frequently configured through nftables or iptables rules. Configuring a “source NAT” to rewrite the source IP of outgoing packets is achieved with a single rule:</p><p><code>nft add rule ip nat postrouting oifname "eth0" ip saddr 10.0.0.0/8 snat to 198.51.100.42</code></p><p>This rule configures Netfilter’s NAT feature to perform source address translation for any packet matching the following criteria:</p><ol><li><p>The source address is the 10.0.0.0/8 private network subnet - in this example, let’s say VPN clients have addresses from this subnet.</p></li><li><p>The packet shall be sent on the “eth0” interface - in this example, it’s the server’s only physical network interface, and thus the route to the public Internet.</p></li></ol><p>Where these two conditions are true, we apply the “snat” action to rewrite the source IP packet, from whichever address the VPN client is using, to our example server’s public IP address 198.51.100.42. We keep track of the original and rewritten addresses in the rewrite table.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4sUznAhNxIXRCdhjILq6fe/539a2ee09eb149ae9856172043a7d527/image1.png" />
          </figure><p><sup>Schematic of an encapsulated packet being decapsulated and rewritten by a VPN server</sup></p><p><sup></sup><a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10/html/configuring_firewalls_and_packet_filters/configuring-nat-using-nftables"><u>You may require additional configuration</u></a> depending on how your distribution ships nftables – nftables is more flexible than the deprecated iptables, but has fewer “implicit” tables ready to use.</p><p>You also might need to <a href="https://linux-audit.com/kernel/sysctl/net/net.ipv4.ip_forward/"><u>enable IP forwarding in general</u></a>, as by default you don’t want a machine connected to two different networks to forward between them without realising it.</p>
    <div>
      <h2>A conntrack is a conntrack is a conntrack</h2>
      <a href="#a-conntrack-is-a-conntrack-is-a-conntrack">
        
      </a>
    </div>
    <p>We said before that a router keeps track of the conversions between addresses in the two networks. In the diagram above, that state is held in the rewrite table.</p><p>In practice, any device may only implement NAT usefully if it understands the TCP and UDP protocols, in particular how they use port numbers to support multiple independent flows of data on a single IP address. The NAT device – in our case Linux – ensures that a unique source port and address is used for each connection, and reassigns the port if required. It also needs to understand the lifecycle of a TCP connection, so that it knows when it is safe to reuse a port number: with only 65,536 possible ports, port reuse is essential.</p><p>Linux Netfilter has the <i>conntrack</i> module, widely used to implement a stateful firewall that protects servers against spoofed or unexpected packets, preventing them from interfering with legitimate connections. This protection is possible because it understands TCP and the valid state of a connection. This capability means it’s perfectly positioned to implement NAT, too. In fact, all packet rewriting is implemented by conntrack.</p>
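<p>The core of such a translation table fits in a few lines. Here is a toy model in Python (reusing the example public address 198.51.100.42 from above; real conntrack also tracks protocol state and recycles ports as connections close):</p>

```python
import itertools

class SourceNat:
    """Toy source-NAT table: each outgoing flow gets a unique public
    source port, and replies are mapped back to the private address."""

    def __init__(self, public_ip="198.51.100.42"):
        self.public_ip = public_ip
        self._ports = itertools.count(49152)  # ephemeral port range
        self.out = {}   # (src_ip, src_port, dst) -> public source port
        self.back = {}  # public source port -> (src_ip, src_port)

    def translate_out(self, src_ip, src_port, dst):
        """Rewrite an outgoing packet's source; allocate a port once per flow."""
        key = (src_ip, src_port, dst)
        if key not in self.out:
            port = next(self._ports)  # real conntrack also skips in-use ports
            self.out[key] = port
            self.back[port] = (src_ip, src_port)
        return self.public_ip, self.out[key]

    def translate_in(self, public_port):
        """Map a reply back to the original private address and port."""
        return self.back.get(public_port)
```

<p>Two clients using the same private source port still get distinct public ports, which is exactly the disambiguation that lets one public address serve many users.</p>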
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HjxbjpRIJIPygV4zMo4XL/7ff4e11334e8e64826be1f29f5e5fb17/image2.png" />
          </figure><p><sup>A diagram showing the steps taken by conntrack to validate and rewrite packets</sup></p><p>As a stateful firewall, the conntrack module maintains a table of all connections it has seen. If you know all of the active connections, you can rewrite a new connection to a port that is not in use.</p><p>In the “snat” rule above, Netfilter adds an entry to the rewrite table, but doesn’t change the packet yet. Only <a href="https://wiki.nftables.org/wiki-nftables/index.php/Mangling_packet_headers"><u>basic packet changes are permitted within nftables</u></a>. We must wait for packet processing to reach the conntrack module, which selects a port unused by any active connection, and only then rewrites the packet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qT3d8JXiTYLwQWsVOCtcQ/ff8c8adcb209f2cdc2578dc1218923ca/image4.png" />
          </figure><p><sup>A diagram showing the roles of netfilter and conntrack when applying NAT to traffic</sup></p>
    <div>
      <h2>Marky mark and the firewall bunch</h2>
      <a href="#marky-mark-and-the-firewall-bunch">
        
      </a>
    </div>
    <p>Another mode of conntrack is to assign a persistent mark to packets belonging to a connection. The mark can be referenced in nftables rules to implement different firewall policies, or to control routing decisions.</p><p>Suppose you want to prevent specific addresses (e.g. from a guest network) from accessing certain services on your machine. You could add a firewall rule for each service denying access to those addresses. However, if you need to change the set of addresses to block, you have to update every rule accordingly.</p><p>Alternatively, you could use one rule to apply a mark to packets coming from the addresses you wish to block, and then reference the mark in all the service rules that implement the block. Now if you wish to change the addresses, you need only update a single rule to change the scope of that packet mark.</p><p>This is most beneficial to control routing behaviour, as routing rules cannot make decisions on as many attributes of the packet as Netfilter can. Using marks allows you to select packets based on powerful Netfilter rules.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33J5E9eds0JiGVNqOInJ0K/829d033b3ee255093ff1927c0b03f4fb/image3.png" />
          </figure><p><sup>A diagram showing netfilter marking specific packets to apply special routing rules</sup></p><p>The code powering the WARP service was written by Cloudflare in Rust, a security-focused systems programming language. We took great care implementing <a href="https://github.com/cloudflare/boringtun"><u>boringtun</u></a> - our WireGuard implementation - and <a href="https://blog.cloudflare.com/zero-trust-warp-with-a-masque/"><u>MASQUE</u></a>. But even if you think the front door is impenetrable, it is good security practice to employ defense-in-depth.</p><p>One example is distinguishing IP packets that come from clients vs. packets that originate elsewhere in our network. One common method is to allocate a unique IP space to WARP traffic and distinguish it based on IP address, but this can be fragile if we need to apply a configuration change to renumber our internal networks – remember IPv4’s limited address space! Instead we can do something simpler.</p><p>To bring IP packets from WARP clients into the Linux networking stack, WARP uses a <a href="https://blog.cloudflare.com/virtual-networking-101-understanding-tap/"><u>TUN device</u></a> – Linux’s name for the virtual network device that programs can use to send and receive IP packets. A TUN device can be configured similarly to any other network device like Ethernet or Wi-Fi adapters, including firewall and routing.</p><p>Using nftables, we mark all packets output on WARP’s TUN device. We have to explicitly store the mark in conntrack’s state table on the outgoing path and retrieve it for the incoming packet, as netfilter can use packet marks independently of conntrack.</p>
            <pre><code>table ip mangle {
    chain forward {
        type filter hook forward priority mangle; policy accept;
        oifname "fishtun" counter ct mark set 42
    }
    chain prerouting {
        type filter hook prerouting priority mangle; policy accept;
        counter meta mark set ct mark
    }
}</code></pre>
            <p>We also need to add a routing rule to return marked packets to the TUN device:</p><p><code>ip rule add fwmark 42 table 100 priority 10
ip route add 0.0.0.0/0 proto static dev fishtun table 100</code></p><p>Now we’re done. All connections from WARP are clearly identified and can be firewalled separately from locally-originated connections or other nodes on our network. Conntrack handles NAT for us, and the connection marks tell us which tracked connections were made by WARP clients.</p>
    <div>
      <h2>The end?</h2>
      <a href="#the-end">
        
      </a>
    </div>
    <p>In our first version of WARP, we enabled clients to access arbitrary Internet hosts by combining multiple components of Linux’s networking stack. Each of our edge servers had a single IP address from an allocation dedicated to WARP, and we were able to configure NAT, routing, and appropriate firewall rules using standard and well-documented methods.</p><p>Linux is flexible and easy to configure, but this approach would require one IPv4 address per machine. Due to IPv4 address exhaustion, it would not scale to Cloudflare’s large network. Assigning a dedicated IPv4 address for every machine that runs the WARP server would result in an eye-watering address lease bill. To bring costs down, we would have to limit the number of servers running WARP, increasing the operational complexity of deploying it.</p><p>We had ideas, but we would have to give up the easy path Linux gave us. <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>IP sharing seemed to us the most promising solution</u></a>, but how much has to change if a single machine can only receive packets addressed to a narrow set of ports? We will reveal all in a follow-up blog post, but if you are the kind of curious problem-solving engineer who is already trying to imagine solutions to this problem, look at <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><u>our open positions</u></a> – we’d like to hear from you!</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[WARP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">3ClsS6mSOdk413zjE9GH6t</guid>
            <dc:creator>Chris Branch</dc:creator>
        </item>
        <item>
            <title><![CDATA[Defending QUIC from acknowledgement-based DDoS attacks]]></title>
            <link>https://blog.cloudflare.com/defending-quic-from-acknowledgement-based-ddos-attacks/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We identified and patched two DDoS vulnerabilities in our QUIC implementation related to packet acknowledgements. Cloudflare customers were not affected. We examine the "Optimistic ACK" attack vector and our solution, which dynamically skips packet numbers to validate client behavior.  ]]></description>
            <content:encoded><![CDATA[ <p>On April 10th, 2025 12:10 UTC, a security researcher notified Cloudflare of two vulnerabilities (<a href="https://www.cve.org/CVERecord?id=CVE-2025-4820"><u>CVE-2025-4820</u></a> and <a href="https://www.cve.org/CVERecord?id=CVE-2025-4821"><u>CVE-2025-4821</u></a>) related to QUIC packet acknowledgement (ACK) handling, through our <a href="https://hackerone.com/cloudflare?type=team"><u>Public Bug Bounty</u></a> program. These were DDoS vulnerabilities in the <a href="https://github.com/cloudflare/quiche"><u>quiche</u></a> library and in Cloudflare services that use it. quiche is Cloudflare's open-source implementation of the QUIC protocol, which is the transport protocol behind <a href="https://blog.cloudflare.com/http3-the-past-present-and-future/"><u>HTTP/3</u></a>.</p><p>Upon notification, Cloudflare engineers patched the affected infrastructure, and the researcher confirmed that the DDoS vector was mitigated. <b>Cloudflare’s investigation revealed no evidence that the vulnerabilities were being exploited or that any customers were affected.</b> quiche versions prior to 0.24.4 were affected.</p><p>Here, we’ll explain why ACKs are important to Internet protocol design and how they help ensure fair network usage. Finally, we will explain the vulnerabilities and discuss our mitigation for the Optimistic ACK attack: a dynamic CWND-aware skip frequency that scales with a connection’s send rate.</p>
    <div>
      <h3>Internet protocols and attack vectors</h3>
      <a href="#internet-protocols-and-attack-vectors">
        
      </a>
    </div>
    <p>QUIC is an Internet transport protocol that offers equivalent features to <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/"><u>TCP</u></a> (Transmission Control Protocol) and <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/"><u>TLS</u></a> (Transport Layer Security). QUIC runs over <a href="https://www.cloudflare.com/en-gb/learning/ddos/glossary/user-datagram-protocol-udp/"><u>UDP</u></a> (User Datagram Protocol), is encrypted by default and offers a few benefits over the prior set of protocols (including shorter handshake time, connection migration, and preventing <a href="https://en.wikipedia.org/wiki/Head-of-line_blocking"><u>head-of-line blocking</u></a> that can manifest in TCP). Similar to TCP, QUIC relies on packet acknowledgements to make general progress. For example, ACKs are used for liveness checks, validation, loss recovery signals, and congestion algorithm signals.</p><p>ACKs are an important source of signals for Internet protocols, which necessitates validation to ensure a malicious peer is not subverting these signals. Cloudflare's QUIC implementation, quiche, lacked ACK range validation, which meant a peer could send an ACK range for packets never sent by the endpoint; this was patched in <a href="https://www.cve.org/CVERecord?id=CVE-2025-4821"><u>CVE-2025-4821</u></a>. Additionally, a sophisticated attacker could mount an attack by predicting and preemptively sending ACKs (a technique called Optimistic ACK); this was patched in <a href="https://www.cve.org/CVERecord?id=CVE-2025-4820"><u>CVE-2025-4820</u></a>. By exploiting the lack of ACK validation, an attacker can cause an endpoint to artificially expand its send rate, thereby gaining an unfair advantage over other connections. In the extreme case, this becomes a DDoS attack vector through higher server CPU utilization and amplified network traffic.</p>
    <div>
      <h3>Fairness and congestion control</h3>
      <a href="#fairness-and-congestion-control">
        
      </a>
    </div>
    <p>A typical CDN setup includes hundreds of server processes, serving thousands of concurrent connections. Each connection has its own recovery and congestion control algorithm that is responsible for determining its fair share of the network. The Internet is a shared resource that relies on well-behaved transport protocols correctly implementing congestion control to ensure fairness.</p><p>To illustrate the point, let’s consider a shared network where the first connection (blue) is operating at capacity. When a new connection (green) joins and probes for capacity, it will trigger packet loss, thereby signaling the blue connection to reduce its send rate. The probing can be highly dynamic and although convergence might take time, the hope is that both connections end up sharing equal capacity on the network.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44jjkcx22rpD7VdnZKsnPD/4d514e73c885a729bd973b3efb2564bf/image4.jpg" />
          </figure><p><sup>New connection joining the shared network. Existing flows make room for the new flow.</sup></p><p>In order to ensure fairness and performance, each endpoint uses a <a href="http://blog.cloudflare.com/cubic-and-hystart-support-in-quiche/"><u>Congestion Control</u></a> algorithm. There are various algorithms but for our purposes let's consider <a href="https://www.rfc-editor.org/rfc/rfc9438.html"><u>Cubic</u></a>, a loss-based algorithm. Cubic, when in steady state, periodically explores higher sending rates. As the peer ACKs new packets, Cubic unlocks additional sending capacity (congestion window) to explore even higher send rates. Cubic continues to increase its send rate until it detects congestion signals (e.g., packet loss), indicating that the network is potentially at capacity and the connection should lower its sending rate.</p>
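<p>Cubic takes its name from the shape of that probing curve. A minimal sketch of the window-growth function from RFC 9438, using its recommended constants C = 0.4 and β<sub>cubic</sub> = 0.7 (window in packets, t in seconds since the last congestion event):</p>

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """W_cubic(t) = C*(t - K)^3 + W_max (RFC 9438), where W_max is the
    window at the last loss event and K is the time the curve takes to
    climb back to W_max after the post-loss reduction to beta * W_max."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max
```

<p>The curve starts at β·W_max right after a loss, flattens as it approaches W_max, then accelerates past it — the periodic probing for spare capacity described above.</p>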
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FvyfLs39CrWnHv8JjiJkd/d44cc31229e4dafa062d607c4214cba0/image6.png" />
          </figure><p><sup>Cubic congestion control responding to loss on the network.</sup></p>
    <div>
      <h3>The role of ACKs</h3>
      <a href="#the-role-of-acks">
        
      </a>
    </div>
    <p>ACKs are a feedback mechanism that Internet protocols use to make progress. A server serving a large file download will send that data across multiple packets to the client. Since networks are lossy, the client is responsible for ACKing when it has received a packet from the server, thus confirming delivery and progress. Lack of an ACK suggests that the packet may have been lost and that the data might require retransmission. This feedback allows the server to confirm when the client has received all the data that it requested.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1LMkCz6BB4aUav8pVhM1Mb/30f94cdaa857a08af3b8c0b9bb24de91/Screenshot_2025-10-28_at_15.23.05.png" />
          </figure><p><sup>The server delivers packets and the client responds with ACKs.</sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Sa33xjYHj52KZZTL4ITWv/d0347affc68318b36da988331c55fd6c/Screenshot_2025-10-28_at_15.23.38.png" />
          </figure><p><sup>The server delivers packets, but packet [2] is lost. The client responds with ACKs only for packets [1, 3], thereby signalling that packet [2] was lost.</sup></p><p>In QUIC, packet numbers don't have to be sequential; that means skipping packet numbers is natively supported. Additionally, a <a href="https://www.rfc-editor.org/rfc/rfc9000.html#name-ack-frames"><u>QUIC ACK Frame</u></a> can contain gaps and multiple ACK ranges. As we will see, the built-in support for skipping packet numbers is a unique feature of QUIC (over TCP) that will help us enforce ACK validation.</p>
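<p>As a sketch of what those gap-tolerant acknowledgements carry (an illustration of the RFC 9000 ACK frame idea, not quiche's API), a receiver can collapse the packet numbers it has seen into ranges, reported highest-first:</p>

```python
# Illustrative helper (not from quiche): summarize received packet
# numbers as the (start, end) ranges a QUIC ACK frame reports,
# highest range first, with gaps marking lost packets.
def ack_ranges(received):
    ranges = []
    for pn in sorted(received, reverse=True):
        if ranges and ranges[-1][0] == pn + 1:
            # contiguous with the current range: extend it downwards
            ranges[-1] = (pn, ranges[-1][1])
        else:
            # a gap: start a new range
            ranges.append((pn, pn))
    return ranges
```

<p>For example, receiving packets 0, 1, and 3 (with 2 lost) yields the two ranges <code>[(3, 3), (0, 1)]</code>, and the gap is exactly what signals the loss of packet 2 to the server.</p>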
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2azePr06Z0kGVQwdEaaqbx/5ab6844b4d515444393ab0b8ca33bf1d/Screenshot_2025-10-28_at_15.25.05.png" />
          </figure><p><sup>The server delivers packets but skips packet [4]. The client responds with ACKs only for the packets it received, sending no ACK for packet [4].</sup></p><p>ACKs also provide the signals that control an endpoint's send rate, helping to ensure fairness and performance. The delay between ACKs, variations in that delay, and missing ACKs are all valuable signals that suggest a change in the network, and they are important inputs to a congestion control algorithm.</p>
    <div>
      <h3>Skipping packets to avoid ACK delay</h3>
      <a href="#skipping-packets-to-avoid-ack-delay">
        
      </a>
    </div>
    <p>QUIC allows endpoints to encode the ACK delay: the time by which the ACK for packet number 'X' was intentionally delayed from when the endpoint received packet number 'X'. This delay can result from normal packet processing or be an implementation-specific optimization. For example, since ACK processing can be expensive (for both CPU and network), delaying ACKs allows for batching, reducing the associated overhead.</p><blockquote><p>If the sender wants to elicit a faster acknowledgement on PTO, it can skip a packet number to eliminate the acknowledgement delay. -- <a href="https://www.rfc-editor.org/rfc/rfc9002.html#section-6.2.4">https://www.rfc-editor.org/rfc/rfc9002.html#section-6.2.4</a></p></blockquote><p>However, since delaying an ACK also delays feedback to the peer, it can be detrimental to loss recovery. A QUIC endpoint can therefore skip a packet number to signal its peer not to delay the ACK. This detail will become important later in the post.</p>
    <div>
      <h3>Validating ACK range</h3>
      <a href="#validating-ack-range">
        
      </a>
    </div>
    <p>A well-behaved client should only send ACKs for packets that it has actually received. A lack of validation meant it was possible for a client to send a very large ACK range covering packets the server never sent. For example, with the server having sent packets 0-5, a client was able to send an ACK frame with the range 0-100.</p><p>By itself this is not a huge deal, since quiche is smart enough to only process ACKs for packets it has sent and drop the rest. However, as we will see in the next section, it made the Optimistic ACK vulnerability easier to exploit.</p><p>The fix was to enforce ACK range validation against the largest packet number sent by the server, and to close the connection on violation. This matches the RFC recommendation.</p><blockquote><p>An endpoint SHOULD treat receipt of an acknowledgment for a packet it did not send as a connection error of type PROTOCOL_VIOLATION, if it is able to detect the condition. -- <a href="https://www.rfc-editor.org/rfc/rfc9000#section-13.1">https://www.rfc-editor.org/rfc/rfc9000#section-13.1</a></p></blockquote>
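<p>In pseudocode terms, the check is simple (a hedged sketch of the rule, not quiche's implementation):</p>

```python
# Illustrative sketch of ACK range validation (RFC 9000, Section 13.1):
# an ACK range extending past the largest packet number we have sent is
# a protocol violation, and the connection should be closed.
class ProtocolViolation(Exception):
    pass

def validate_ack_range(largest_sent_pn, ack_start, ack_end):
    if ack_end > largest_sent_pn:
        raise ProtocolViolation(
            "peer acked %d, but largest sent is %d"
            % (ack_end, largest_sent_pn)
        )
```

<p>With packets 0-5 sent, an ACK for 0-5 passes, while an ACK for 0-100 raises the violation and the connection is closed.</p>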
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/aPajmSD1NWaWvFv2aXAhs/480054b6514f3a1ddad219e4e81388f5/Screenshot_2025-10-28_at_15.26.15.png" />
          </figure><p><sup>The server validates ACKs: the client sends an ACK for packets [4..5], which the server never sent, so the server closes the connection when validation fails.</sup></p>
    <div>
      <h3>Optimistic ACK attack</h3>
      <a href="#optimistic-ack-attack">
        
      </a>
    </div>
    <p>In the following scenario, let’s assume the client is trying to mount an Optimistic ACK attack against the server. The goal of a client mounting the attack is to cause the server to send at a high rate. To achieve a high send rate, the client needs to deliver ACKs quickly back to the server, thereby providing an artificially low <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/"><u>RTT</u></a> / high bandwidth signal. Since packet numbers are typically monotonically increasing, a clever client can predict the next packet number and preemptively send ACKs (artificial ACK).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xCY6yXFysB3yPxfa4TjOb/962a74feaf95e520abf037bd12e19db7/Screenshot_2025-10-28_at_15.28.39.png" />
          </figure><p><sup>Optimistic ACK attack: the client predicts packets sent by the server and preemptively sends ACKs. ACK validation does not help here.</sup></p><p>If the server has proper ACK range validation, an invalid ACK for packets not yet sent by the server will trigger a connection close (without ACK range validation, the attack is trivial to execute). A malicious client therefore needs to be clever about pacing the artificial ACKs so they arrive just as the server sends the corresponding packets. If the attack is executed correctly, the server will observe a very low RTT, resulting in an inflated send rate.</p><blockquote><p>An endpoint that acknowledges packets it has not received might cause a congestion controller to permit sending at rates beyond what the network supports. An endpoint MAY skip packet numbers when sending packets to detect this behavior. An endpoint can then immediately close the connection with a connection error of type PROTOCOL_VIOLATION -- <a href="https://www.rfc-editor.org/rfc/rfc9000#section-21.4">https://www.rfc-editor.org/rfc/rfc9000#section-21.4</a></p></blockquote>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fppvXzvdTOugNzCtxgiH5/897da7f980f1de95bdafa1aee423dcf2/Screenshot_2025-10-28_at_15.40.37.png" />
          </figure><p><sup>Preventing an Optimistic ACK attack: the client predicts packets sent by the server and preemptively sends ACKs. Since the server skipped packet [4], it is able to detect the invalid ACK and close the connection.</sup></p><p>The <a href="https://www.rfc-editor.org/rfc/rfc9000#section-21.4"><u>QUIC RFC</u></a> mentions the Optimistic ACK attack and suggests skipping packets to detect it. By skipping packets, the client is unable to easily predict the next packet number and risks having its connection closed by a server that validates ACK ranges. Implementation details, such as how many packet numbers to skip and how often, are left unspecified, however.</p><p>To an outside observer, the malicious client's transmission pattern does not indicate any malicious behavior.</p><blockquote><p>As such, the bit rate towards the server follows normal behavior. Considering that QUIC packets are end-to-end encrypted, a middlebox cannot identify the attack by analyzing the client’s traffic. -- <a href="https://louisna.github.io/files/2025-anrw-oack.pdf">MAY is not enough! QUIC servers SHOULD skip packet numbers</a></p></blockquote><p>Ideally, the attacking client wants to use as few resources as possible while causing the server to use as many as possible. In fact, as the security researchers confirmed in their paper, it is difficult to detect a malicious QUIC client using external traffic analysis, so QUIC implementations must mitigate the Optimistic ACK attack themselves by skipping packets.</p><p>The Optimistic ACK vulnerability is not unique to QUIC; it was first discovered against TCP. However, since TCP does not natively support skipping packet numbers, an Optimistic ACK attack on TCP is harder to mitigate and can require additional DDoS analysis. By allowing packet-number skipping, QUIC can prevent this type of attack at the protocol layer and more effectively ensure correctness and fairness over untrusted networks.</p>
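<p>The detection logic can be sketched as follows (an illustration of the mechanism described above, not quiche's internals): the server remembers which packet numbers it deliberately skipped, and an ACK covering a skipped number can only have come from a client guessing.</p>

```python
# Sketch of Optimistic ACK detection via skipped packet numbers.
class ProtocolViolation(Exception):
    pass

class AckValidator:
    def __init__(self):
        self.skipped = set()   # packet numbers we deliberately never sent
        self.largest_sent = -1

    def on_send(self, pn):
        # record any packet numbers we jumped over, then the packet itself
        self.skipped.update(range(self.largest_sent + 1, pn))
        self.largest_sent = pn

    def on_ack_range(self, start, end):
        if end > self.largest_sent:
            raise ProtocolViolation("ack beyond largest sent packet")
        if any(pn in self.skipped for pn in range(start, end + 1)):
            raise ProtocolViolation("ack for a deliberately skipped packet")
```

<p>Sending packets 0-3 and then 5 (skipping 4), an honest ACK of [5..5] or [0..3] passes, while an optimistic ACK of [0..5] trips the skipped number and closes the connection.</p>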
    <div>
      <h3>How often to skip packet numbers</h3>
      <a href="#how-often-to-skip-packet-numbers">
        
      </a>
    </div>
    <p>According to the QUIC RFC, skipping packet numbers currently serves two purposes: eliciting a faster acknowledgement for loss recovery, and mitigating an Optimistic ACK attack. A QUIC implementation that skips packets to mitigate the Optimistic ACK attack therefore needs to skip frequently enough to be effective, while accounting for the side effect of eliminating ACK delay.</p><p>Since packet skipping needs to be unpredictable, a simple implementation could skip packet numbers at intervals drawn at random from a static range. However, since the number of packets grows with the send rate, a static range has the downside of not adapting to the send rate: at smaller send rates it skips too frequently, while at higher send rates it doesn't skip frequently enough to be effective. And it is arguably most important to validate the send rate when send rates are high. The skip frequency therefore needs to adapt to the send rate.</p><p>The congestion window (CWND) is the parameter a congestion control algorithm uses to determine the number of bytes that can be sent per round trip. Since the send rate increases with the number of bytes ACKed (capped by bytes sent), CWND makes a great proxy for dynamically adjusting the skip frequency. This CWND-aware skip frequency allows all connections, regardless of their current send rate, to effectively mitigate the Optimistic ACK attack.</p>
            <pre><code>// c: the current packet number
// s: range of random packet number to skip from
//
// curr_pn
//  |
//  v                 |--- (upper - lower) ---|
// [c x x x x x x x x s s s s s s s s s s s s s x x]
//    |--min_skip---| |------skip_range-------|

const DEFAULT_INITIAL_CONGESTION_WINDOW_PACKETS: usize = 10;
const MIN_SKIP_COUNTER_VALUE: u64 = DEFAULT_INITIAL_CONGESTION_WINDOW_PACKETS * 2;

let packets_per_cwnd = (cwnd / max_datagram_size) as u64;
let lower = packets_per_cwnd / 2;
let upper = packets_per_cwnd * 2;

let skip_range = upper - lower;
let rand_skip_value = rand(skip_range);

let skip_pn = MIN_SKIP_COUNTER_VALUE + lower + rand_skip_value;</code></pre>
            <p><sup>Skip frequency calculation in quiche.</sup></p>
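<p>To make the CWND scaling easy to experiment with, here is a Python rendering of the calculation above (the translation is ours; names mirror the quiche snippet):</p>

```python
import random

# Mirrors the quiche snippet above: the next skip happens after a
# random number of packets, drawn from a range proportional to the
# number of packets in the current congestion window.
INITIAL_CWND_PACKETS = 10
MIN_SKIP = INITIAL_CWND_PACKETS * 2

def packets_until_next_skip(cwnd, max_datagram_size):
    packets_per_cwnd = cwnd // max_datagram_size
    lower = packets_per_cwnd // 2
    upper = packets_per_cwnd * 2
    # the random offset keeps the skipped packet number unpredictable
    return MIN_SKIP + lower + random.randrange(upper - lower)
```

<p>With a 10-packet window the next skip lands within a few dozen packets; with a 100-packet window the interval grows roughly tenfold, so every connection is validated on a comparable per-round-trip cadence.</p>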
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <p>All timestamps are in UTC.</p><ul><li><p>2025-04-10 12:10 - Cloudflare is notified of an ACK validation and Optimistic ACK vulnerability via the Bug Bounty Program.</p></li><li><p>2025-04-19 00:20 - Cloudflare confirms both vulnerabilities are reproducible and begins working on a fix.</p></li><li><p>2025-05-02 20:12 - Security patch is complete and infrastructure patching starts.</p></li><li><p>2025-05-16 04:52 - Cloudflare infrastructure patching is complete.</p></li><li><p>New quiche version released.</p></li></ul>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We would like to sincerely thank <a href="https://louisna.github.io/"><u>Louis Navarre</u></a> and <a href="https://perso.uclouvain.be/olivier.bonaventure/blog/html/pages/bio.html"><u>Olivier Bonaventure</u></a> from <a href="https://www.uclouvain.be/en"><u>UCLouvain</u></a>, who responsibly disclosed this issue via our <a href="https://www.cloudflare.com/en-gb/disclosure/"><u>Cloudflare Bug Bounty Program</u></a>, allowing us to identify and mitigate the vulnerability. They also published a <a href="https://louisna.github.io/publication/2025-anrw-oack"><u>paper</u></a> with their findings, notifying 10 other QUIC implementations that also suffered from the Optimistic ACK vulnerability. </p><p>We welcome further <a href="https://www.cloudflare.com/en-gb/disclosure/"><u>submissions</u></a> from our community of researchers to continually improve the security of all of our products and open source projects.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">1vU4Xmgau85ysMJVxTEx09</guid>
            <dc:creator>Apoorv Kothari</dc:creator>
            <dc:creator>Louis Navarre (Guest author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[So long, and thanks for all the fish: how to escape the Linux networking stack]]></title>
            <link>https://blog.cloudflare.com/so-long-and-thanks-for-all-the-fish-how-to-escape-the-linux-networking-stack/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities such as soft-unicast, our method for sharing IP subnets across data centers. Happily, most people do not need to know the intricacies of how your operating system handles network and Internet access in general. Yes, even most people within Cloudflare. But sometimes we try to push well beyond the design intentions of Linux’s networking stack. This is a story about one of those attempts. ]]></description>
            <content:encoded><![CDATA[ <p><b></b><a href="https://www.goodreads.com/quotes/2397-there-is-a-theory-which-states-that-if-ever-anyone"><u>There is a theory which states</u></a> that if ever anyone discovers exactly what the Linux networking stack does and why it does it, it will instantly disappear and be replaced by something even more bizarre and inexplicable.</p><p>There is another theory which states that Git was created to track how many times this has already happened.</p><p>Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities such as <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>soft-unicast, our method for sharing IP subnets across data centers</u></a>. Happily, most people do not need to know the intricacies of how your operating system handles network and Internet access in general. Yes, even most people within Cloudflare.</p><p>But sometimes we try to push well beyond the design intentions of Linux’s networking stack. This is a story about one of those attempts.</p>
    <div>
      <h2>Hard solutions for soft problems</h2>
      <a href="#hard-solutions-for-soft-problems">
        
      </a>
    </div>
    <p>My previous blog post about the Linux networking stack teased a problem: matching the ideal model of soft-unicast with the basic reality of IP packet forwarding rules. Soft-unicast is the name given to our method of sharing IP addresses between machines. <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>You can read about all the cool things we do with it</u></a>, but as far as a single machine is concerned, it has dozens to hundreds of combinations of IP address and source-port range, any of which may be chosen for use by outgoing connections.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1NsU3FdxgJ0FNL78SDCo9D/65a27e8fd4339d3318a1b55b5979e3c6/image3.png" />
          </figure><p>The SNAT target in iptables supports a source-port range option to restrict the ports selected during NAT. In theory, we could continue to use iptables for this purpose, and to support multiple IP/port combinations we could use separate packet marks or multiple TUN devices. In actual deployment we would have to overcome challenges such as managing large numbers of iptables rules and possibly network devices, interference with other uses of packet marks, and deployment and reallocation of existing IP ranges.</p><p>Rather than increase the workload on our firewall, we wrote a single-purpose service dedicated to egressing IP packets on soft-unicast address space. For reasons lost in the mists of time, we named it SLATFATF, or “fish” for short. This service’s sole responsibility is to proxy IP packets using soft-unicast address space and manage the lease of those addresses.</p><p>WARP is not the only user of soft-unicast IP space in our network. Many Cloudflare products and services make use of the soft-unicast capability, and many of them use it in scenarios where we create a TCP socket in order to proxy or carry HTTP connections and other TCP-based protocols. Fish therefore needs to lease addresses that are not used by open sockets, and ensure that sockets cannot be opened to addresses leased by fish.</p><p>Our first attempt was to use distinct per-client addresses in fish and continue to let Netfilter/conntrack apply SNAT rules. However, we discovered an unfortunate interaction between Linux’s socket subsystem and the Netfilter conntrack module that reveals itself starkly when you use packet rewriting.</p>
    <div>
      <h2>Collision avoidance</h2>
      <a href="#collision-avoidance">
        
      </a>
    </div>
    <p>Suppose we have a soft-unicast address slice, 198.51.100.10:9000-9009. Then, suppose we have two separate processes that want to bind a TCP socket at 198.51.100.10:9000 and connect it to 203.0.113.1:443. The first process can do this successfully, but the second process will receive an error when it attempts to connect, because there is already a socket matching the requested 5-tuple.</p>
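<p>You can reproduce this collision on a single machine without any special address space (our own loopback demonstration, not from the original setup): bind two sockets to the same local address with <code>SO_REUSEPORT</code>, then try to connect both to the same destination.</p>

```python
import socket

# A local destination standing in for 203.0.113.1:443 in the example
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
dst = listener.getsockname()

def bound_socket(port):
    s = socket.socket()
    # Let the second bind() itself succeed; the conflict only appears
    # when the full 5-tuple is claimed at connect() time.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

first = bound_socket(0)
port = first.getsockname()[1]
second = bound_socket(port)    # same local address as `first`

first.connect(dst)             # succeeds: the 5-tuple is free
try:
    second.connect(dst)        # identical 5-tuple: the kernel refuses
    collided = False
except OSError:
    collided = True
```

<p>The second <code>connect()</code> fails (on Linux, with <code>EADDRNOTAVAIL</code>), because the socket subsystem refuses to create two connections sharing one 5-tuple.</p>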
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2eXmHlyC0pdDUkZ9OI3JI/b83286088b4efa6ddee897e8b5d3b191/image8.png" />
          </figure><p>Instead of creating sockets, what happens when we emit packets on a TUN device with the same destination IP but a unique source IP, and use source NAT to rewrite those packets to an address in this range?</p><p>If we add an nftables “snat” rule that rewrites the source address to 198.51.100.10:9000-9009, Netfilter will create an entry in the conntrack table for each new connection seen on fishtun, mapping the new source address to the original one. If we try to forward more connections on that TUN device to the same destination IP, new source ports will be selected in the requested range, until all ten available ports have been allocated; once this happens, new connections will be dropped until an existing connection expires, freeing an entry in the conntrack table.</p><p>Unlike when binding a socket, Netfilter will simply pick the first free space in the conntrack table. However, if you use up all the possible entries in the table <a href="https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/"><u>you will get an EPERM error when writing an IP packet</u></a>. Either way, whether you bind kernel sockets or you rewrite packets with conntrack, errors will indicate when there isn’t a free entry matching your requirements.</p><p>Now suppose that you combine the two approaches: a first process emits an IP packet on the TUN device that is rewritten to a packet on our soft-unicast port range. Then, a second process binds and connects a TCP socket with the same addresses as that IP packet:</p>
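<p>Concretely, the rule described here might look like the following nftables ruleset fragment (a hedged sketch: the table and chain names are ours, and the device and addresses are the example values from this post):</p>

```
# Hypothetical ruleset, loaded with `nft -f`: rewrite connections
# forwarded from fishtun onto the slice 198.51.100.10:9000-9009.
table ip fishnat {
    chain postrouting {
        type nat hook postrouting priority srcnat;
        iifname "fishtun" snat to 198.51.100.10:9000-9009
    }
}
```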
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57KuCP4vkp4TGPiLwDRPZv/c066279cd8a84a511f09ed5218488cec/image7.png" />
          </figure><p>The first problem is that there is no way for the second process to know that there is an active connection from 198.51.100.10:9000 to 203.0.113.1:443, at the time the <code>connect() </code>call is made. The second problem is that the connection is successful from the point of view of that second process.</p><p>It should not be possible for two connections to share the same 5-tuple. Indeed, they don’t. Instead, the source address of the TCP socket is <a href="https://github.com/torvalds/linux/blob/v6.15/net/netfilter/nf_nat_core.c#L734"><u>silently rewritten to the next free port</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DWpWJ5gBIDoEhimxIR8TT/fd3d8bd46353cd42ed09a527d4841da8/image6.png" />
          </figure><p>This behaviour is present even if you use conntrack without either SNAT or MASQUERADE rules. It usually happens that the lifetime of conntrack entries matches the lifetime of the sockets they’re related to, but this is not guaranteed, and you cannot depend on the source address of your socket matching the source address of the generated IP packets.</p><p>Crucially for soft-unicast, it means conntrack may rewrite our connection to have a source port outside of the port slice assigned to our machine. This will silently break the connection, causing unnecessary delays and false reports of connection timeouts. We need another solution.</p>
    <div>
      <h2>Taking a breather</h2>
      <a href="#taking-a-breather">
        
      </a>
    </div>
    <p>For WARP, the solution we chose was to stop rewriting and forwarding IP packets, and instead terminate all TCP connections within the server, proxying them to a locally-created TCP socket with the correct soft-unicast address. This was an easy and viable solution that we already employed for a portion of our connections, such as those directed at the CDN or intercepted as part of the Zero Trust Secure Web Gateway. However, it does introduce additional resource usage and potentially increased latency compared to the status quo. We wanted to find another way (to) forward.</p>
    <div>
      <h2>An inefficient interface</h2>
      <a href="#an-inefficient-interface">
        
      </a>
    </div>
    <p>If you want to use both packet rewriting and bound sockets, you need to decide on a single source of truth. Netfilter is not aware of the socket subsystem, but most of the code that uses sockets and is also aware of soft-unicast is code that Cloudflare wrote and controls. A slightly younger version of myself therefore thought it made sense to change our code to work correctly in the face of Netfilter’s design.</p><p>Our first attempt was to use the Netlink interface to the conntrack module to inspect and manipulate the connection tracking tables before sockets were created. <a href="https://docs.kernel.org/userspace-api/netlink/intro.html"><u>Netlink is an extensible interface to various Linux subsystems</u></a> and is used by many command-line tools like <a href="https://man7.org/linux/man-pages/man8/ip.8.html"><u>ip</u></a> and, in our case, <a href="https://conntrack-tools.netfilter.org/manual.html"><u>conntrack-tools</u></a>. By creating the conntrack entry for the socket we are about to bind, we can guarantee that conntrack won’t rewrite the connection to an invalid port number, ensuring success every time. Likewise, if creating the entry fails, we can try another valid address. This approach works regardless of whether we are binding a socket or forwarding IP packets.</p><p>There is one problem with this — it’s not terribly efficient. Netlink is slow compared to the bind/connect socket dance, and when creating conntrack entries you have to specify a timeout for the flow and delete the entry if your connection attempt fails, to ensure that the connection table doesn’t fill up too quickly for a given 5-tuple. In other words, you have to manually reimplement the <a href="https://sysctl-explorer.net/net/ipv4/tcp_tw_reuse/"><u>tcp_tw_reuse</u></a> option to support high-traffic destinations with limited resources. In addition, a stray RST packet can erase your connection tracking entry. At our scale, anything like this that can happen, will happen. It is not a place for fragile solutions.</p>
    <div>
      <h2>Socket to ‘em</h2>
      <a href="#socket-to-em">
        
      </a>
    </div>
    <p>Instead of creating conntrack entries, we can abuse kernel features for our own benefit. Some time ago, Linux added <a href="https://lwn.net/Articles/495304/"><u>the TCP_REPAIR socket option</u></a>, ostensibly to support connection migration between servers, e.g. to relocate a VM. The scope of this feature allows you to create a new TCP socket and specify its entire connection state by hand.</p><p>An alternative use is to create a “connected” socket that never performed the TCP three-way handshake needed to establish that connection. At least, the kernel didn’t do that — if you are forwarding the IP packet containing a TCP SYN, you have more certainty about the expected state of the world.</p><p>However, the introduction of <a href="https://en.wikipedia.org/wiki/TCP_Fast_Open"><u>TCP Fast Open</u></a> provides an even simpler way to do this: you can create a “connected” socket that doesn’t perform the traditional three-way handshake, on the assumption that the SYN packet — when sent with its initial payload — contains a valid cookie to immediately establish the connection. Since nothing is sent until you write to the socket, this serves our needs perfectly.</p><p>You can try this yourself:</p>
            <pre><code>from socket import socket, AF_INET, SOCK_STREAM, SOL_TCP

# Not exposed by every Python socket module, so define them by value
TCP_FASTOPEN_CONNECT = 30
TCP_FASTOPEN_NO_COOKIE = 34

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_TCP, TCP_FASTOPEN_CONNECT, 1)
s.setsockopt(SOL_TCP, TCP_FASTOPEN_NO_COOKIE, 1)
s.bind(('198.51.100.10', 9000))
s.connect(('1.1.1.1', 53))  # no packets sent until the first write</code></pre>
            <p>Binding a “connected” socket that nevertheless corresponds to no actual connection has one important feature: if other processes attempt to bind to the same addresses as the socket, they will fail to do so. This solves the problem we started with: making packet forwarding coexist with socket usage.</p>
    <div>
      <h2>Jumping the queue</h2>
      <a href="#jumping-the-queue">
        
      </a>
    </div>
    <p>While this solves one problem, it creates another. By default, you can’t use an IP address for both locally-originated packets and forwarded packets.</p><p>For example, we assign the IP address 198.51.100.10 to a TUN device. This allows any program to create a TCP socket using the address 198.51.100.10:9000. We can also write packets to that TUN device with the address 198.51.100.10:9001, and Linux can be configured to forward those packets to a gateway, following the same route as the TCP socket. So far, so good.</p><p>On the inbound path, TCP packets addressed to 198.51.100.10:9000 will be accepted and data put into the TCP socket. TCP packets addressed to 198.51.100.10:9001, however, will be dropped. They are not forwarded to the TUN device at all.</p><p>Why is this the case? Local routing is special. If packets are received to a local address, they are treated as “input” and not forwarded, regardless of any routing you think should apply. Behold the default routing rules:</p><p><code>cbranch@linux:~$ ip rule
0:        from all lookup local
32766:    from all lookup main
32767:    from all lookup default</code></p><p>The rule priority is a nonnegative integer; the rule with the smallest priority value is evaluated first. Inserting a lookup rule at the beginning, to redirect marked packets to the packet forwarding service’s TUN device, requires some slightly awkward rule manipulation: you have to create the new rules in the right order, then delete the existing rule. That way, the routing rules are never left without a route to the “local” table, which could lose packets while the rules are being manipulated. In the end, the result looks something like this:</p><p><code>ip rule add fwmark 42 table 100 priority 10
ip rule add lookup local priority 11
ip rule del priority 0
ip route add 0.0.0.0/0 proto static dev fishtun table 100</code></p><p>As with WARP, we simplify connection management by assigning a mark to packets coming from the “fishtun” interface, which we can use to route them back there. To prevent locally-originated TCP sockets from having this same mark applied, we assign the IP to the loopback interface instead of fishtun, leaving fishtun with no assigned address. But it doesn’t need one, as we have explicit routing rules now.</p>
    <div>
      <h2>Uncharted territory</h2>
      <a href="#uncharted-territory">
        
      </a>
    </div>
    <p>While testing this last fix, I ran into an unfortunate problem. It did not work in our production environment.</p><p>It is not simple to debug the path of a packet through Linux’s networking stack. There are a few tools you can use, such as setting nftrace in nftables or applying the LOG/TRACE targets in iptables, which help you understand which rules and tables are applied for a given packet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ofuljq2tDVVUyzyPOMYSp/3da5954ef254aa3aae5397b310f6dcad/image5.png" />
          </figure><p><a href="https://en.m.wikipedia.org/wiki/File:Netfilter-packet-flow.svg"><sup><u>Schematic for the packet flow paths through Linux networking and *tables</u></sup></a><sup> by </sup><a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh"><sup>Jan Engelhardt</sup></a></p><p>Our expectation is that the packet will pass the prerouting hook, a routing decision will be made to send the packet to our TUN device, and the packet will then traverse the forward table. By tracing packets originating from the IP of a test host, we could see the packets enter the prerouting phase, but disappear after the ‘routing decision’ block.</p><p>While there is a block in the diagram for “socket lookup”, this occurs after processing the input table. Our packet never enters the input table; the only change we made was to create a local socket. If we stop creating the socket, the packet passes to the forward table as before.</p><p>It turns out that part of the ‘routing decision’ involves some protocol-specific processing. For IP packets, <a href="https://github.com/torvalds/linux/blob/89be9a83ccf1f88522317ce02f854f30d6115c41/net/ipv4/ip_input.c#L317"><u>routing decisions can be cached</u></a>, and some basic address validation is performed. In 2012, an additional feature was added: <a href="https://lore.kernel.org/all/20120619.163911.2094057156011157978.davem@davemloft.net/"><u>early demux</u></a>. The rationale: at this point in packet processing we are already performing a lookup, and the majority of received packets are expected to be for local sockets, rather than unknown packets or ones that need to be forwarded somewhere. So why not look up the socket directly here and save an extra route lookup?</p>
    <div>
      <h2>The workaround at the end of the universe</h2>
      <a href="#the-workaround-at-the-end-of-the-universe">
        
      </a>
    </div>
    <p>Unfortunately for us, we just created a socket and didn’t want it to receive packets. Our adjustment to the routing table is ignored, because the routing lookup is skipped entirely when the socket is found. Raw sockets avoid this by receiving all packets regardless of the routing decision, but the packet rate is too high for this to be efficient. The only way around this is disabling the early demux feature. According to the patch’s claims, though, this feature improves performance: how far will performance regress on our existing workloads if we disable it?</p><p>This calls for a simple experiment: set the <a href="https://docs.kernel.org/6.16/networking/ip-sysctl.html"><u>net.ipv4.tcp_early_demux</u></a> sysctl to 0 on some machines in a data center, let it run for a while, then compare the CPU usage with machines using default settings and the same hardware configuration as the machines under test.</p>
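<p>The experiment's toggle is a single sysctl; as a sketch, the test machines might carry a drop-in like this (the file name is our own illustration):</p>

```
# /etc/sysctl.d/99-early-demux-test.conf (test machines only):
# disable TCP early demux so forwarded packets reach the routing rules
net.ipv4.tcp_early_demux = 0
```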
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ypZGWN811vIQu04YERP8m/709e115068bad3994c88ce899cdfba29/image4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eF441OrGSDwvAFEFYWbtT/40c330d687bf7e30597d046274d959e1/image2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34gBimlHXXvLLbGJpriVJA/39f7408dd6ef37aaff3f0fa50a37518f/image1.png" />
          </figure><p>The key metrics are CPU usage from /proc/stat. If there is a performance degradation, we would expect to see higher CPU usage allocated to “softirq” — the context in which Linux network processing occurs — with little change to either userspace (top) or kernel time (bottom). The observed difference is slight, and mostly appears to reduce efficiency during off-peak hours.</p>
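<p>For reference, the softirq share can be computed from two samples of the aggregate cpu line in /proc/stat. The helper below is a minimal sketch (not our internal tooling), relying only on the documented counter order: user, nice, system, idle, iowait, irq, softirq, …</p>

```python
def softirq_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in softirq between two /proc/stat samples.

    Each argument is the aggregate "cpu ..." line from /proc/stat, whose
    counters are: user nice system idle iowait irq softirq steal ...
    """
    b = [int(x) for x in before.split()[1:]]
    a = [int(x) for x in after.split()[1:]]
    total = sum(a) - sum(b)        # total jiffies elapsed across all states
    softirq = a[6] - b[6]          # softirq is the 7th counter
    return softirq / total

# Example with made-up counters: 100 of 1000 elapsed jiffies in softirq.
share = softirq_share(
    "cpu 1000 0 500 8000 0 100 400 0 0 0",
    "cpu 1400 0 600 8400 0 100 500 0 0 0",
)
print(share)
```

<p>On a live system the two samples would come from reading /proc/stat at the start and end of a measurement interval.</p>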
    <div>
      <h2>Swimming upstream</h2>
      <a href="#swimming-upstream">
        
      </a>
    </div>
    <p>While we tested different solutions to IP packet forwarding, we continued to terminate TCP connections on our network. Despite our initial concerns, the performance impact was small, and the benefits of increased visibility into origin reachability, fast internal routing within our network, and simpler observability of soft-unicast address usage flipped the burden of proof: was it worth trying to implement pure IP forwarding and supporting two different layers of egress?</p><p>So far, the answer is no. Fish runs on our network today, but with the much smaller responsibility of handling ICMP packets. However, when we decide to tunnel all IP packets, we know exactly how to do it.</p><p>A typical engineering role at Cloudflare involves solving many strange and difficult problems at scale. If you are the kind of goal-focused engineer willing to try novel approaches and explore the capabilities of the Linux kernel despite minimal documentation, look at <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><u>our open positions</u></a> — we would love to hear from you!</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Egress]]></category>
            <guid isPermaLink="false">x9Fb6GXRm3RObU5XezhnE</guid>
            <dc:creator>Chris Branch</dc:creator>
        </item>
        <item>
            <title><![CDATA[State of the post-quantum Internet in 2025]]></title>
            <link>https://blog.cloudflare.com/pq-2025/</link>
            <pubDate>Tue, 28 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today over half of human-initiated traffic with Cloudflare is protected against harvest-now/decrypt-later with post-quantum encryption. What once was a cool science project is the new security baseline for the Internet. We’re not done yet: in this blog post we’ll take stock of where we are, what we expect for the coming years, and what you can do today. ]]></description>
            <content:encoded><![CDATA[ <p>This week, the last week of October 2025, we reached a major milestone for Internet security: the majority of human-initiated traffic with Cloudflare is <a href="https://radar.cloudflare.com/adoption-and-usage#post-quantum-encryption"><u>using</u></a> post-quantum encryption, mitigating the <a href="https://blog.cloudflare.com/the-quantum-menace/"><u>threat</u></a> of <a href="https://en.wikipedia.org/wiki/Harvest_now,_decrypt_later"><u>harvest-now/decrypt-later</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1EUFTKSnJptvd5WDGvB9Rf/4865f75c71e43f2c261d393322d24f34/image5.png" />
          </figure><p>We want to use this joyous moment to give an update on the current state of the migration of the Internet to post-quantum cryptography and the long road ahead. Our last <a href="https://blog.cloudflare.com/pq-2024/"><u>overview</u></a> was 21 months ago, and quite a lot has happened since. A lot of it went as we <a href="https://blog.cloudflare.com/pq-2024/"><u>predicted</u></a>: finalization of the NIST standards; broad adoption of post-quantum encryption; more detailed roadmaps from regulators; progress on building quantum computers; some cryptography was broken (not to worry: nothing close to what’s deployed); and new exciting cryptography was proposed.</p><p>But there were also a few surprises: there was a giant leap in progress towards Q-day by improving quantum algorithms, and we had a proper scare because of a new quantum algorithm. We’ll cover all this and more, including what we expect for the coming years and what you can do today.</p>
    <div>
      <h2>The quantum threat</h2>
      <a href="#the-quantum-threat">
        
      </a>
    </div>
    <p>First things first: why are we changing our cryptography? It’s because of <b>quantum computers</b>. <a href="https://www.cloudflare.com/learning/ssl/quantum/what-is-quantum-computing/"><u>These marvelous devices</u></a>, instead of restricting themselves to zeroes and ones, compute using more of what nature actually affords us: quantum superposition, interference, and entanglement. This allows quantum computers to excel at certain very specific computations, notably simulating nature itself, which will be very helpful in developing new materials.</p><p>Quantum computers are not going to replace regular computers, though: they’re actually much worse than regular computers at most tasks that matter for our daily lives. Think of them as graphics cards or neural engines — specialized devices for specific computations, not general-purpose ones.</p><p>Unfortunately, quantum computers also <a href="https://blog.cloudflare.com/the-quantum-menace"><u>excel</u></a> at breaking public-key cryptography that is still in common use today, such as RSA and elliptic curves (ECC). Thus, we are moving to <b>post-quantum cryptography</b>: cryptography designed to be resistant to attack by quantum computers. We’ll discuss the exact impact on the different types of cryptography later on.</p><p>For now, quantum computers are rather anemic: they’re simply not good enough today to crack any real-world cryptographic keys. That doesn’t mean there is nothing to worry about yet: encrypted traffic can be <a href="https://en.wikipedia.org/wiki/Harvest_now,_decrypt_later"><u>harvested today</u></a>, and decrypted after <b>Q-day</b>: the day that quantum computers are capable of breaking today’s still widely used cryptography such as RSA-2048. 
We call that a “harvest-now-decrypt-later” attack.</p><p>Using factoring as a benchmark, quantum computers don’t impress at all: the largest number factored by a quantum computer without cheating is 15, a record that’s easily beaten in a <a href="https://eprint.iacr.org/2025/1237.pdf"><u>variety of funny ways</u></a>. It’s tempting to disregard quantum computers until they start beating classical computers on factoring, but that would be a big mistake. Even conservative estimates place Q-day <a href="https://youtu.be/nJxENYdsB6c?si=doosb_aZRpQgo6X8&amp;t=1302"><u>less than three years</u></a> after the day that quantum computers beat classical computers on factoring. So how do we track progress?</p>
    <div>
      <h3>Quantum numerology</h3>
      <a href="#quantum-numerology">
        
      </a>
    </div>
    <p>There are two categories to consider in the march towards Q-day: progress on quantum hardware, and algorithmic improvements to the software that runs on that hardware. We have seen significant progress on both fronts.</p>
    <div>
      <h4>Progress on quantum hardware</h4>
      <a href="#progress-on-quantum-hardware">
        
      </a>
    </div>
    <p>Like clockwork, every year there are news stories of new quantum computers with record-breaking numbers of qubits. This focus on counting qubits is quite misleading, though. To start, quantum computers are analogue machines, and there is always some noise interfering with the computation.</p><p>There are big differences between the different types of technology used to build quantum computers: <a href="https://en.wikipedia.org/wiki/Transmon"><u>silicon-based</u></a> quantum computers seem to scale well and are quick to execute instructions, but have very noisy qubits. This does not mean they’re useless: with <a href="https://en.wikipedia.org/wiki/Quantum_error_correction"><u>quantum error correcting codes</u></a> one can effectively turn millions of noisy silicon qubits into a few thousand high-fidelity ones, which could be enough to <a href="https://quantum-journal.org/papers/q-2021-04-15-433/"><u>break RSA</u></a>. <a href="https://www.quantinuum.com/products-solutions/quantinuum-systems"><u>Trapped-ion quantum computers</u></a>, on the other hand, have much less noise, but have been harder to scale. Only a few hundred thousand trapped-ion qubits could potentially draw the curtain on RSA-2048.</p><div>
  
</div>
<p><sup>Timelapse of the </sup><a href="https://sam-jaques.appspot.com/quantum_landscape"><sup><u>state of the art</u></sup></a><sup> in quantum computing from 2021 through 2025 by qubit count on the x-axis and noise on the y-axis. The dots in the gray area are the various quantum computers out there. Once the shaded gray area hits the left-most red line, we’re in trouble as that means a quantum computer can break large RSA keys. Compiled by </sup><a href="https://sam-jaques.appspot.com/"><sup><u>Samuel Jaques</u></sup></a><sup> of the University of Waterloo.</sup></p><p>We’re only scratching the surface with the number of qubits and noise. There are low-level details that can make a big difference, such as the interconnectedness of qubits. More importantly, the graph doesn’t capture how scalable the engineering behind the records is.</p><p>Case in point: on these graphs, progress on quantum computers seems to have stalled over the last two years, whereas for experts, Google’s <a href="https://blog.google/technology/research/google-willow-quantum-chip/"><u>December 2024 Willow announcement</u></a>, though unremarkable on the graph, is in reality a <a href="https://scottaaronson.blog/?p=8525"><u>real milestone</u></a>: the first logical qubit in the surface code, achieved in a scalable manner. <a href="https://sam-jaques.appspot.com/quantum_landscape_2024"><u>Quoting</u></a> Sam Jaques:</p><blockquote><p>When I first read these results [Willow’s achievements], I felt chills of “Oh wow, quantum computing is actually real”.</p></blockquote><p>It’s a real milestone, but not an unexpected leap. Quoting Sam again:</p><blockquote><p>Despite my enthusiasm, this is more or less where we should expect to be, and maybe a bit late. All of the big breakthroughs they demonstrated are steps we needed to take to even hope to reach the 20 million qubit machine that could break RSA. There are no unexpected breakthroughs. 
Think of it like the increases in transistor density of classical chips each year: an impressive feat, but ultimately business-as-usual.</p></blockquote><p>Business-as-usual is also the strategy: the superconducting qubit approach pursued by Google for Willow has always had the clearest path forward, attacking the difficulties head-on and requiring the fewest leaps in engineering.</p><p>Microsoft pursues the opposite strategy with their bet on <a href="https://en.wikipedia.org/wiki/Topological_quantum_computer"><u>topological qubits</u></a>. These are qubits that in theory would be mostly unaffected by noise. However, they have not been fully realized in hardware. If these can be built in a scalable way, they’d be far superior to superconducting qubits. But we don’t even know if these can be built to begin with. In early 2025, Microsoft announced the <a href="https://scottaaronson.blog/?p=8669"><u>Majorana 1</u></a> chip, which demonstrates how these could be built. The chip is far from a full demonstrator though: it doesn’t support any computation and hence doesn’t even show up in Sam’s comparison graph earlier.</p><p>In between topological and superconducting qubits, there are many other approaches that labs across the world pursue that do show up in the graph, such as QuEra with <a href="https://www.quera.com/neutral-atom-platform"><u>neutral atoms</u></a> and Quantinuum with <a href="https://www.quantinuum.com/products-solutions/quantinuum-systems/system-model-h2"><u>trapped ions</u></a>.</p><p>Progress on the hardware side of getting to Q-day has received by far the most press interest. The biggest breakthrough in the last two years isn’t on the hardware side though.</p>
    <div>
      <h3>Progress on quantum software</h3>
      <a href="#progress-on-quantum-software">
        
      </a>
    </div>
    
    <div>
      <h4>The biggest breakthrough so far: Craig Gidney’s optimisations</h4>
      <a href="#the-biggest-breakthrough-so-far-craig-gidneys-optimisations">
        
      </a>
    </div>
    <p>We thought we’d need about <a href="https://quantum-journal.org/papers/q-2021-04-15-433/"><u>20 million qubits</u></a> with the superconducting approach to break RSA-2048. It turns out we can do it with far fewer. In a stunningly comprehensive June 2025 paper, <a href="https://algassert.com/about.html"><u>Craig Gidney</u></a> shows that with clever quantum software optimisations we need fewer than <a href="https://arxiv.org/pdf/2505.15917"><u>one million qubits</u></a>. This is the reason the red lines in Sam’s graph above, marking the size of a quantum computer to break RSA, dramatically shift to the left in 2025.</p><p>To put this achievement into perspective, let’s just make a wild guess and say Google can maintain a sort of Moore’s law, doubling the number of physical qubits every one-and-a-half years. That’s a much faster pace than Google has demonstrated so far, but it’s also not unthinkable they could achieve this once the groundwork has been laid. Then it’d take until 2052 to reach 20 million qubits, but only until 2045 to reach one million: Craig single-handedly brought Q-day <b>seven years</b> closer!</p><p>How much further can software optimisations go? Pushing it lower than 100,000 superconducting qubits seems impossible to Sam, and <a href="https://sam-jaques.appspot.com/quantum_landscape_2025"><u>he’d expect</u></a> more than 242,000 superconducting qubits are required to break RSA-2048. With the wild guess on quantum computer progress before, that’d correspond to a Q-day of 2039 and 2041+ respectively.</p><p>Although Craig’s estimate makes detailed and reasonable assumptions on the architecture of a large-scale superconducting-qubit quantum computer, it’s still a guess, and these estimates could be off quite a bit.</p>
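<p>That back-of-envelope can be reproduced in a few lines. The 2025 baseline of roughly 105 physical qubits (Willow’s chip size) and the one-and-a-half-year doubling time are our illustrative assumptions, not figures from the estimates themselves:</p>

```python
import math

def year_reached(target_qubits: int, base_qubits: int = 105,
                 base_year: int = 2025, doubling_time: float = 1.5) -> int:
    """First year a doubling-every-1.5-years trajectory reaches target_qubits."""
    doublings = math.log2(target_qubits / base_qubits)
    return base_year + math.ceil(doublings * doubling_time)

print(year_reached(20_000_000))  # 2052: the old 20-million-qubit estimate
print(year_reached(1_000_000))   # 2045: Gidney's sub-million-qubit estimate
```

<p>Under these assumptions the seven-year gap between the two estimates falls out directly: a factor of 20 fewer qubits is about 4.3 doublings, or roughly 6.5 years.</p>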
    <div>
      <h4>A proper scare: Chen’s algorithm</h4>
      <a href="#a-proper-scare-chens-algorithm">
        
      </a>
    </div>
    <p>On the algorithmic side, we might not only see improvements to existing quantum algorithms, but also the discovery of completely new quantum algorithms. In April 2024, Yilei Chen published <a href="https://eprint.iacr.org/2024/555"><u>a preprint</u></a> claiming to have found such a new quantum algorithm to solve certain lattice problems, which are close to, but not the same as, those we rely on for the post-quantum cryptography we deploy. This caused a proper stir: even if it couldn’t attack our post-quantum algorithms today, could Chen’s algorithm be improved? To get a sense for potential improvements, you need to understand what the algorithm is really doing on a higher level. With Chen’s algorithm that’s hard, as it’s very complex, much more so than Shor’s quantum algorithm that breaks RSA. So it took some time for experts to <a href="https://nigelsmart.github.io/LWE.html"><u>start</u></a> <a href="https://sam-jaques.appspot.com/static/files/555-notes.pdf"><u>seeing</u></a> limitations to Chen’s approach, and in fact, after ten days they discovered a fundamental bug in the algorithm: the approach doesn’t work. Crisis averted.</p><p>What to take from this? Optimistically, this is business as usual for cryptography, and lattices are in better shape now as one avenue of attack turned out to be a dead end. Realistically, it <i>is</i> a reminder that we have a lot of eggs in the lattice basket. As we’ll see later, presently there isn’t a real alternative that works everywhere.</p><p>Proponents of quantum key distribution (QKD) might chime in that QKD solves exactly that by being secure thanks to the laws of nature. 
Well, there are some asterisks to put on that claim, but more fundamentally no one has figured out how to scale QKD beyond point-to-point connections, <a href="https://blog.cloudflare.com/you-dont-need-quantum-hardware/"><u>as we argue in this blog post</u></a>.</p><p>It’s good to speculate about what cryptography might be broken by a completely new attack, but let’s not forget the matter at hand: a lot of cryptography is going to be broken by quantum computers for sure. Q-day is coming; the question is when.</p>
    <div>
      <h2>Is Q-day always fifteen years away?</h2>
      <a href="#is-q-day-always-fifteen-years-away">
        
      </a>
    </div>
    <p>If you've been working on or around cryptography and security long enough, then you have probably heard that "Q-day is X years away" every year for the last several years. This can make it feel like Q-day is always "some time in the future" — until we put such a claim in the proper context.</p>
    <div>
      <h3>What do experts think?</h3>
      <a href="#what-do-experts-think">
        
      </a>
    </div>
    <p>Since 2019, the <a href="https://globalriskinstitute.org/"><u>Global Risk Institute</u></a> has performed a yearly survey amongst experts, asking how probable it is that RSA-2048 will be broken within 5, 10, 15, 20 or 30 years. These are the results <a href="https://globalriskinstitute.org/publication/2024-quantum-threat-timeline-report/"><u>for 2024</u></a>, whose interviews happened before Willow’s release and Gidney’s breakthrough.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3dx58nMhiJJd3DsQkaHwYF/84e9d8781912925d3b745f50291b00df/image6.png" />
          </figure><p><sup>Global Risk Institute expert survey results from 2024 on the likelihood of a quantum computer breaking RSA-2048 within different timelines.</sup></p><p>As the middle column in this chart shows, well over half of the interviewed experts thought there was at least a ~50% chance that a quantum computer will break RSA-2048 within 15 years. Let’s look up the historical answers from <a href="https://globalriskinstitute.org/publication/quantum-threat-timeline/"><u>2019</u></a>, <a href="https://globalriskinstitute.org/publication/quantum-threat-timeline-report-2020/"><u>2020</u></a>, <a href="https://globalriskinstitute.org/publication/2021-quantum-threat-timeline-report-global-risk-institute-global-risk-institute/"><u>2021</u></a>, <a href="https://globalriskinstitute.org/publication/2022-quantum-threat-timeline-report/"><u>2022</u></a>, and <a href="https://globalriskinstitute.org/publication/2023-quantum-threat-timeline-report/"><u>2023</u></a>. Here we plot the likelihood for Q-day within 15 years (of the time of the interview):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4rMVWq9lDr49n9BmDkH2Ye/73d14f83f553becedf29dd11ce25deb1/image10.png" />
          </figure><p><sup>Historical answers in the quantum threat timeline reports for the chance of Q-day within 15 years.</sup></p><p>This shows that answers are slowly trending towards more certainty, but is that the rate we would expect? With six years of answers, we can plot how consistent the predictions are over the years: does the 15-year estimate for 2019 match the 10-year estimate for 2024?</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cc2fWho4kYRjhJebG6Vll/12fdb65939b8e0143606d04747cfcca9/Screenshot_2025-10-28_at_12.28.49.png" />
          </figure><p><sup>Historical answers in the quantum threat timeline report over the years on the date of Q-day. The x-axis is the alleged year for Q-day and the y-axis shows the fraction of interviewed experts that think it’s at least ~50% (left) or 70% (right) likely to happen then.</sup></p><p>If we ask experts when Q-day could be with about even odds (graph on the left), then they mostly keep saying the same thing over the years: yes, could be 15 years away. However, if we press for more certainty, and ask for Q-day with &gt;70% probability (graph on the right), then the experts are mostly consistent over the years. For instance: one-fifth thought 2034 both in the 2019 and 2024 interviews.</p><p>So, if you want a consistent answer from an expert, don’t ask them when Q-day could be, but when it’s probably there. Now, it’s good fun to guess about Q-day, but the honest answer is that no one really knows for sure: there are just too many unknowns. And in the end, the date of Q-day is far less important than the deadlines set by regulators.</p>
    <div>
      <h3>What action do regulators take?</h3>
      <a href="#what-action-do-regulators-take">
        
      </a>
    </div>
    <p>We can also look at the timelines of various regulators. In 2022, the National Security Agency (NSA) released their <a href="https://media.defense.gov/2025/May/30/2003728741/-1/-1/0/CSA_CNSA_2.0_ALGORITHMS.PDF"><u>CNSA 2.0 guidelines</u></a>, which set deadlines between 2030 and 2033 for migrating to post-quantum cryptography. Also in 2022, the US federal government <a href="http://web.archive.org/web/20240422052137/https://www.whitehouse.gov/briefing-room/statements-releases/2022/05/04/national-security-memorandum-on-promoting-united-states-leadership-in-quantum-computing-while-mitigating-risks-to-vulnerable-cryptographic-systems/"><u>set 2035</u></a> as the target to have the United States fully migrated, from which the new administration hasn’t deviated. In 2024, Australia set 2030 as their <a href="https://www.theregister.com/2024/12/17/australia_dropping_crypto_keys/"><u>aggressive deadline</u></a> to migrate. In early 2025, the UK NCSC matched the common <a href="https://www.ncsc.gov.uk/guidance/pqc-migration-timelines"><u>2035</u></a> as the deadline for the United Kingdom. In mid-2025, the European Union published <a href="https://digital-strategy.ec.europa.eu/en/library/coordinated-implementation-roadmap-transition-post-quantum-cryptography"><u>their roadmap</u></a> with 2030 and 2035 as deadlines depending on the application.</p><p>Far from all national regulators have provided post-quantum migration timelines, but those that do generally stick to the 2030–2035 timeframe.</p>
    <div>
      <h3>When is Q-day?</h3>
      <a href="#when-is-q-day">
        
      </a>
    </div>
    <p>So when will quantum computers start causing trouble? Whether it’s 2034 or 2050, for sure it will be <b>too soon</b>. The immense success of cryptography over fifty years means it’s all around us now, from dishwasher to pacemaker to satellite. Most upgrades will be easy and fit naturally into the product’s lifecycle, but there will be a long tail of difficult and costly upgrades.</p><p>Now, let’s take a look at the migration to post-quantum cryptography.</p>
    <div>
      <h2>Mitigating the quantum threat: two migrations</h2>
      <a href="#mitigating-the-quantum-threat-two-migrations">
        
      </a>
    </div>
    <p>To help prioritize, it is important to understand that there is a big difference in the difficulty, impact, and urgency of the post-quantum migration for the different kinds of cryptography required to create secure connections. In fact, for most organizations there will be two post-quantum migrations: <b>key agreement</b> and <b>signatures / certificates</b>. Let’s explain this for the case of creating a secure connection when visiting a website in a browser.</p>
    <div>
      <h3>Already post-quantum secure: symmetric cryptography</h3>
      <a href="#already-post-quantum-secure-symmetric-cryptography">
        
      </a>
    </div>
    <p>The cryptographic workhorse of a connection is a <b>symmetric cipher </b>such as AES-GCM. It’s what you would think of when thinking of cryptography: both parties, in this case the browser and server, have a shared key, and they encrypt / decrypt their messages with the same key. Unless you have that key, you can’t read or modify anything.</p><p>The good news is that symmetric ciphers, such as <a href="https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/"><u>AES-GCM</u></a>, are already post-quantum secure. There is a common misconception that <a href="https://en.wikipedia.org/wiki/Grover%27s_algorithm"><u>Grover’s quantum algorithm</u></a> requires us to double the length of symmetric keys. On closer inspection of the algorithm, it’s clear that it is <a href="https://blog.cloudflare.com/nist-post-quantum-surprise#grover-s-algorithm"><u>not</u></a> <a href="https://www.youtube.com/watch?v=eB4po9Br1YY"><u>practical</u></a>. The way <a href="https://www.nist.gov/"><u>NIST</u></a>, the US National Institute of Standards and Technology (which has been spearheading the standardization of post-quantum cryptography), defines its post-quantum security levels is very telling. 
They define a specific security level by saying the scheme should be as hard to crack using either a classical or quantum computer as an existing symmetric cipher as follows:</p><table><tr><td><p><b>Level</b></p></td><td><p><b>Definition,</b> at least as hard to break as … </p></td><td><p><b>Example</b></p></td></tr><tr><td><p>1</p></td><td><p>To recover the key of <b>AES-128</b> by exhaustive search</p></td><td><p>ML-KEM-512, SLH-DSA-128s</p></td></tr><tr><td><p>2</p></td><td><p>To find a collision in <b>SHA256</b> by exhaustive search</p></td><td><p>ML-DSA-44</p></td></tr><tr><td><p>3</p></td><td><p>To recover the key of <b>AES-192</b> by exhaustive search</p></td><td><p>ML-KEM-768, ML-DSA-65</p></td></tr><tr><td><p>4</p></td><td><p>To find a collision in <b>SHA384</b> by exhaustive search</p></td><td><p></p></td></tr><tr><td><p>5</p></td><td><p>To recover the key of <b>AES-256</b> by exhaustive search</p></td><td><p>ML-KEM-1024, SLH-DSA-256s, ML-DSA-87</p></td></tr></table><p><sup>NIST PQC security levels, higher is harder to break (“more secure”). The examples ML-DSA, SLH-DSA and ML-KEM are covered below.</sup></p><p>There are good intentions behind suggesting doubling the key lengths of symmetric cryptography. In many use cases, the extra cost is not that high, and it mitigates any theoretical risk completely. Scaling symmetric cryptography is cheap: doubling the bits typically costs far less than twice as much. So on the surface, it is simple advice.</p><p>But if we insist on AES-256, it seems only logical to insist on NIST PQC level 5 for the public key cryptography as well. The problem is that public key cryptography does not scale very well. Depending on the scheme, going from level 1 to level 5 typically more than doubles data usage and CPU cost. As we’ll see, deploying post-quantum signatures at level 1 is already painful, and deploying them at level 5 is debilitating.</p><p>But more importantly, organizations only have limited resources. 
We wouldn’t want an organization to prioritize upgrading AES-128 at the cost of leaving the definitely quantum-vulnerable RSA around.</p>
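<p>A back-of-envelope sketch shows why Grover’s algorithm isn’t considered practical here: it reduces AES-128’s 2<sup>128</sup> keyspace to roughly 2<sup>64</sup> iterations, but those iterations must run sequentially and parallelize poorly. The one-iteration-per-nanosecond rate below is a deliberately generous assumption of ours:</p>

```python
# Grover gives a quadratic speedup: ~2**64 iterations on AES-128's
# 2**128 keyspace. Unlike a classical brute-force search, these
# iterations cannot be spread cheaply over many machines, so the
# sequential running time dominates.
iterations = 2**64
rate = 10**9                        # assumed: one Grover iteration per nanosecond
years = iterations / rate / (365 * 24 * 3600)
print(round(years))                 # centuries of strictly sequential quantum work
```
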
    <div>
      <h3>First migration: key agreement</h3>
      <a href="#first-migration-key-agreement">
        
      </a>
    </div>
    <p>Symmetric ciphers are not enough on their own: how do I know which key to use when visiting a website for the first time? The browser can’t just send a random key, as everyone listening in would see that key as well. You’d think it’s impossible, but there is some clever math to solve this, so that the browser and server can agree on a shared key. Such a scheme is called a <b>key agreement </b>mechanism, and is performed in the TLS <a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"><u>handshake</u></a>. In 2024, almost all traffic was secured with <a href="https://en.wikipedia.org/wiki/Curve25519"><u>X25519</u></a>, a Diffie–Hellman-style key agreement, but its security is completely broken by <a href="https://en.wikipedia.org/wiki/Shor%27s_algorithm"><u>Shor’s algorithm</u></a> on a quantum computer. Thus, any communication secured today with Diffie–Hellman, when stored, can be decrypted in the future by a quantum computer.</p><p>This makes it <b>urgent</b> to upgrade key agreement today. Luckily, post-quantum key agreement is relatively straightforward to deploy, and as we saw before, by the end of 2025 half the requests with Cloudflare are already secured with post-quantum key agreement!</p>
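<p>The key-agreement idea can be made concrete with a toy finite-field Diffie–Hellman, which is exactly the kind of construction Shor’s algorithm breaks. This is for intuition only: the prime and generator below are illustrative choices of ours, and real connections use X25519 or its hybrid with ML-KEM:</p>

```python
import secrets

# Toy finite-field Diffie-Hellman. Illustration only: real TLS uses
# X25519 (or a hybrid with ML-KEM), and a quantum computer running
# Shor's algorithm recovers the secret exponents from the public values.
P = 2**127 - 1   # a Mersenne prime; chosen for illustration, not for real use
G = 3

def keypair():
    secret = secrets.randbelow(P - 2) + 1
    return secret, pow(G, secret, P)

a, A = keypair()   # browser: keeps a private, sends A over the wire
b, B = keypair()   # server:  keeps b private, sends B over the wire

# Both sides derive the same key, which is never transmitted:
# (G**b)**a == (G**a)**b  (mod P)
shared_browser = pow(B, a, P)
shared_server = pow(A, b, P)
assert shared_browser == shared_server
```

<p>An eavesdropper sees only A and B; recovering the shared key from them is the discrete-logarithm problem, hard classically but easy for Shor’s algorithm.</p>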
    <div>
      <h3>Second migration: signatures / certificates</h3>
      <a href="#second-migration-signatures-certificates">
        
      </a>
    </div>
    <p>The key agreement allows secure agreement on a key, but there is a big gap: we do not know <i>with whom</i> we agreed on the key. If we only do key agreement, an attacker in the middle can do separate key agreements with the browser and server, and re-encrypt any exchanged messages. To prevent this we need one final ingredient: authentication.</p><p>This is achieved using <b>signatures</b>. When visiting a website, say <a href="https://cloudflare.com"><u>cloudflare.com</u></a>, the web server presents a <b>certificate</b> signed by a <a href="https://en.wikipedia.org/wiki/Certificate_authority"><u>certification authority</u></a> (CA) that vouches that the public key in that certificate is controlled by <a href="https://cloudflare.com"><u>cloudflare.com</u></a>. In turn, the web server signs the handshake and shared key using the private key corresponding to the public key in the certificate. This allows the client to be sure that they’ve done a key agreement with <a href="https://cloudflare.com"><u>cloudflare.com</u></a>.</p><p>RSA and ECDSA are commonly used traditional signature schemes today. Again, Shor’s algorithm makes short work of them, allowing a quantum attacker to forge any signature. That means that an attacker with a quantum computer can impersonate (and <a href="https://en.wikipedia.org/wiki/Man-in-the-middle_attack"><u>MitM</u></a>) any website for which we accept non post-quantum certificates.</p><p>This attack can only be performed after quantum computers are able to crack RSA / ECDSA. This makes upgrading signature schemes for TLS on the face of it less urgent, as we only need to have everyone migrated before Q-day rolls around. Unfortunately, we will see that migration to post-quantum signatures is much <b>more difficult</b>, and will require more time.</p>
    <div>
      <h2>Progress timeline</h2>
      <a href="#progress-timeline">
        
      </a>
    </div>
    <p>Before we dive into the technical challenges of migrating the Internet to post-quantum cryptography, let’s have a look at how we got here, and what to expect in the coming years. Let’s start with how post-quantum cryptography came to be.</p>
    <div>
      <h3>Origin of post-quantum cryptography</h3>
      <a href="#origin-of-post-quantum-cryptography">
        
      </a>
    </div>
    <p>Physicists Feynman and Manin independently proposed quantum computers <a href="https://plato.stanford.edu/entries/qt-quantcomp/"><u>around 1980</u></a>. It took another 14 years before Shor published <a href="https://ieeexplore.ieee.org/abstract/document/365700"><u>his algorithm</u></a> attacking RSA / ECC. Most post-quantum cryptography predates Shor’s famous algorithm.</p><p>There are various branches of post-quantum cryptography, of which the most prominent are lattice-based, hash-based, multivariate, code-based, and isogeny-based. Except for isogeny-based cryptography, none of these were initially conceived as post-quantum cryptography. In fact, early code-based and hash-based schemes are contemporaries of RSA, being proposed in the 1970s, and comfortably predate the publication of Shor’s algorithm in 1994. The first multivariate scheme, from 1988, likewise predates Shor’s algorithm. It is a nice coincidence that the most successful branch, lattice-based cryptography, is Shor’s closest contemporary, being proposed <a href="https://dl.acm.org/doi/pdf/10.1145/237814.237838"><u>in 1996</u></a>. For comparison, elliptic curve cryptography, which is widely used today, was first proposed in 1985.</p><p>In the years after the publication of Shor’s algorithm, cryptographers took stock of the existing cryptography: what’s clearly broken, and what could be post-quantum secure? In 2006, the first annual <a href="https://postquantum.cr.yp.to/"><u>International Workshop on Post-Quantum Cryptography</u></a> took place. From that conference, an introductory text <a href="https://www.researchgate.net/profile/Nicolas-Sendrier-2/publication/226115302_Code-Based_Cryptography/links/540d62d50cf2df04e7549388/Code-Based-Cryptography.pdf"><u>was prepared</u></a>, which holds up rather well as an introduction to the field. 
A notable caveat is the <a href="https://eprint.iacr.org/2022/214.pdf"><u>demise</u></a> of the <a href="https://www.pqcrainbow.org/"><u>Rainbow</u></a> signature scheme. In that same year, 2006, the elliptic-curve key-agreement X25519 <a href="https://cr.yp.to/ecdh/curve25519-20060209.pdf"><u>was proposed</u></a>, which now secures the majority of Internet connections, either on its own or as a hybrid with the post-quantum ML-KEM-768. </p>
    <div>
      <h2>NIST completes the first generation of PQC standards</h2>
      <a href="#nist-completes-the-first-generation-of-pqc-standards">
        
      </a>
    </div>
    <p>Ten years later, in 2016, <a href="https://nist.gov"><u>NIST</u></a>, the US National Institute of Standards and Technology, <a href="https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/call-for-proposals-final-dec-2016.pdf"><u>launched a public competition</u></a> to standardize post-quantum cryptography. They used a similar open format to the one used to standardize <a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard"><u>AES</u></a> in 2001 and <a href="https://en.wikipedia.org/wiki/NIST_hash_function_competition"><u>SHA3</u></a> in 2012: anyone could participate by submitting schemes and evaluating the proposals. Cryptographers from all over the world submitted algorithms. To focus attention, the list of submissions was whittled down over three rounds. From the original 82, based on public feedback, eight made it into the final round. From those eight, in 2022, NIST chose to <a href="https://blog.cloudflare.com/nist-post-quantum-surprise"><u>pick four to standardize first</u></a>: one <b>KEM</b> (for key agreement) and three signature schemes.</p><table><tr><td><p><b>Old name</b></p></td><td><p><b>New name</b></p></td><td><p><b>Branch</b></p></td></tr><tr><td><p>Kyber</p></td><td><p><b>ML-KEM</b> (<a href="https://csrc.nist.gov/pubs/fips/203/final"><u>FIPS 203</u></a>)</p><p>
Module-lattice based Key-Encapsulation Mechanism Standard</p></td><td><p>Lattice-based</p></td></tr><tr><td><p>Dilithium</p></td><td><p><b>ML-DSA</b> (<a href="https://csrc.nist.gov/pubs/fips/204/final"><u>FIPS 204</u></a>)</p><p>Module-lattice based Digital Signature Standard</p></td><td><p>Lattice-based</p></td></tr><tr><td><p>SPHINCS<sup>+</sup></p></td><td><p><b>SLH-DSA</b> (<a href="https://csrc.nist.gov/pubs/fips/205/final"><u>FIPS 205</u></a>)</p><p>Stateless Hash-Based Digital Signature Standard</p></td><td><p>Hash-based</p></td></tr><tr><td><p>Falcon</p></td><td><p><b>FN-DSA</b> (not standardized yet)</p><p>FFT over NTRU lattices Digital Signature Standard</p></td><td><p>Lattice-based</p></td></tr></table><p>The final standards for the first three were published in August 2024. FN-DSA is late, and we’ll discuss why later.</p><p>ML-KEM is the only post-quantum key agreement standardized so far, and despite some occasional difficulty with its larger key sizes, it’s mostly a drop-in upgrade.</p><p>The situation is rather different with the signatures: it’s quite telling that NIST chose to standardize three already. And there are even more signatures set to be <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/"><u>standardized in the future</u></a>. The reason is that none of the proposed signatures are close to ideal: they all have much larger keys and signatures than we’re used to.</p><p>From a security standpoint, SLH-DSA is the most conservative choice, but also the worst performer. For public key and signature sizes, FN-DSA is as good as it gets among these three, but it is difficult to implement signing safely because of its floating-point arithmetic. Due to FN-DSA’s limited applicability and design complexity, NIST chose to focus on the other three schemes first.</p><p>This leaves ML-DSA as the default pick. More in-depth comparisons are included below.</p>
    <div>
      <h2>Adoption of PQC in protocol standards</h2>
      <a href="#adoption-of-pqc-in-protocol-standards">
        
      </a>
    </div>
    <p>Having NIST’s standards is not enough. We also need to standardize how the new algorithms are used in higher-level protocols. In many cases, such as key agreement in TLS, this can be as simple as assigning an identifier to the new algorithms. In other cases, such as <a href="https://www.cloudflare.com/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a>, it requires a bit more thought. Many working groups at the <a href="https://www.ietf.org/"><u>IETF</u></a> have been preparing for years for the arrival of NIST’s final standards, and we expected many protocol integrations to be finalized soon after, before the end of 2024. That was too optimistic: some are done, but many are not finished yet.</p><p>Let’s start with the good news and look at what is done.</p><ul><li><p>The hybrid TLS key agreement <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-ecdhe-mlkem/"><u>X25519MLKEM768</u></a> that combines X25519 and ML-KEM-768 (more about it later) is ready to use and is indeed quite widely deployed. Other protocols are likewise adopting ML-KEM in a hybrid mode of operation, such as <a href="https://datatracker.ietf.org/doc/draft-ietf-ipsecme-ikev2-mlkem/"><u>IPsec</u></a>, which is ready to go for simple setups. (For certain setups, there is a <a href="https://datatracker.ietf.org/doc/draft-ietf-ipsecme-ikev2-downgrade-prevention/"><u>little wrinkle</u></a> that still needs to be figured out. We’ll cover that in a future blog post.)

It might be surprising that the corresponding RFCs have not been published yet. Registering a key agreement for TLS or IPsec does not require an RFC, though. In both cases, the RFC is still being pursued to avoid confusion for those who would expect one, and for TLS an RFC is required to mark the key agreement as recommended.</p></li><li><p>For signatures, ML-DSA’s integrations <a href="https://datatracker.ietf.org/doc/draft-ietf-lamps-dilithium-certificates/"><u>into X.509</u></a> certificates and <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-mldsa/"><u>TLS</u></a> are good to go. The former is a freshly minted RFC, and the latter doesn’t require one.</p></li></ul><p>Now, for the bad news. At the time of writing, October 2025, the IETF hasn’t <a href="https://datatracker.ietf.org/doc/draft-ietf-lamps-pq-composite-sigs/"><u>locked down</u></a> how to do hybrid certificates: certificates where a post-quantum and a traditional signature scheme are combined. But it’s close. We hope this will be figured out in early 2026.</p><p>But if it’s just assigning some identifiers, what’s the cause of the delay? Mostly, it’s about choice. Let’s start with the choices that had to be made in ML-DSA.</p>
    <div>
      <h4>ML-DSA delays: much ado about prehashing and private key formats</h4>
      <a href="#ml-dsa-delays-much-ado-about-prehashing-and-private-key-formats">
        
      </a>
    </div>
    <p>The two major topics of discussion for ML-DSA certificates were prehashing and the private key format.</p><p>Prehashing is where one part of the system hashes the message, and another creates the final signature. This is useful if you don’t want to send a big file to an <a href="https://en.wikipedia.org/wiki/Hardware_security_module"><u>HSM</u></a> to sign. Early drafts of ML-DSA supported prehashing with SHAKE256, but that <a href="https://csrc.nist.gov/csrc/media/Projects/post-quantum-cryptography/documents/faq/fips204-sec6-03192025.pdf"><u>was</u></a> not obvious. In the final version of ML-DSA, NIST included two variants: regular ML-DSA, and an explicitly prehashed version, where you are allowed to choose any hash. Having different variants is not ideal: users have to choose which one to pick; not all software might support all variants; and testing and validation have to be done for each. It’s not controversial to want to pick just one variant, but the issue <a href="https://globalplatform.org/wp-content/uploads/2025/01/4_ML-DSA-and-ML-KEM-Landmines-1.pdf"><u>is</u></a> <a href="https://keymaterial.net/2024/11/05/hashml-dsa-considered-harmful/"><u>which</u></a>. After plenty of debate, regular ML-DSA was chosen.</p><p>The second matter is the <a href="https://datatracker.ietf.org/meeting/122/materials/slides-122-pquip-the-great-private-key-war-of-25-02.pdf"><u>private key format</u></a>. Because of the way candidates are compared on performance benchmarks, it looks good for the original ML-DSA submission to cache some computation in the private key. This means that the private key is larger (several kilobytes) than it needs to be and requires more validation steps. It was suggested to cut the private key down to its bare essentials: just a 32-byte <i>seed</i>. For the final standard, NIST decided to allow both the seed and the original, larger private key. 
This is not <a href="https://keymaterial.net/2025/02/19/how-not-to-format-a-private-key/"><u>ideal</u></a>: better to stick to one of the two. In this case, the IETF wasn’t able to make a choice, and even added a third option: a pair of both the seed and the expanded private key. Technically, almost everyone agreed that the <i>seed</i> is the superior choice, but the reason it wasn’t palatable is that some vendors had already created keys for which they didn’t keep the <i>seed</i> around. Yes, we already have post-quantum legacy. It took almost a year to make these two choices.</p>
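<p>The appeal of the seed format can be seen in a few lines of Python. Because key generation is deterministic, a 32-byte seed pins down the entire key pair. The sketch below uses SHAKE256 as a stand-in expander; the real ML-DSA key derivation is more involved, but it is likewise deterministic in the seed, and the 4,032-byte output length matches the ML-DSA-65 expanded private key:</p>

```python
import hashlib
import os

def keygen_from_seed(seed: bytes, expanded_len: int = 4032) -> bytes:
    """Toy illustration: deterministically expand a 32-byte seed into a
    larger 'expanded private key' with SHAKE256. Real ML-DSA key
    generation is more involved, but is likewise a deterministic function
    of the seed, which is why storing only the seed suffices."""
    assert len(seed) == 32
    return hashlib.shake_256(seed).digest(expanded_len)

seed = os.urandom(32)
# The same seed always yields the same expanded key, so the 32-byte seed
# alone is a complete (and much smaller) private key format.
assert keygen_from_seed(seed) == keygen_from_seed(seed)
assert len(keygen_from_seed(seed)) == 4032
```

<p>The catch, as described above, is the reverse direction: an expanded key that was generated without keeping the seed around cannot be turned back into one.</p>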
    <div>
      <h4>Hybrids require many choices</h4>
      <a href="#hybrids-require-many-choices">
        
      </a>
    </div>
    <p>To define an ML-DSA hybrid signature scheme, there are many more choices to make. Which traditional scheme should be combined with ML-DSA? Which security levels on each side? Then we also need to make choices for both schemes: which private key format to use? Which hash to use with ECDSA? Hybrids raise new questions of their own. Do we allow reuse of the keys in the hybrid, and if so, do we want to prevent stripping attacks? Also, the question of prehashing returns with a third option: prehash at the hybrid level.</p><p>The <a href="https://datatracker.ietf.org/doc/draft-ietf-lamps-pq-composite-sigs/12/"><u>October 2025 draft</u></a> for ML-DSA hybrid signatures contains 18 variants, down from <a href="https://datatracker.ietf.org/doc/draft-ietf-lamps-pq-composite-sigs/03/"><u>26</u></a> a year earlier. Again, everyone agrees that this is too many, but it’s been hard to whittle it down further. To help end-users choose, a short list was added, which started with three options, and of course grew to <a href="https://www.ietf.org/archive/id/draft-ietf-lamps-pq-composite-sigs-12.html#section-11.3"><u>six</u></a>. Of those, we think MLDSA44-ECDSA-P256-SHA256 will see wide support and use on the Internet.</p><p>Now, let’s return to key agreement, for which the standards have been set.</p>
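<p>Before moving on, the core combiner logic of such a hybrid signature can be sketched in a few lines of Python. The component schemes below are toy HMAC-based stand-ins (a real composite such as MLDSA44-ECDSA-P256-SHA256 pairs ML-DSA with ECDSA), and the domain separator is hypothetical, but the two essential ideas are visible: both components must verify, and a label binds the signatures to the composite context:</p>

```python
import hashlib
import hmac
import os

# Toy stand-ins for the two component schemes. HMAC is a MAC, not a
# public-key signature; it only serves to make the combiner runnable.
def toy_sign(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def toy_verify(key: bytes, msg: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(toy_sign(key, msg), sig)

# Hypothetical domain separator. Real composite drafts bind a label for
# the combination used, which prevents stripping one component signature
# out of the pair and presenting it as a standalone signature.
DOMAIN = b"EXAMPLE-COMPOSITE:"

def composite_sign(pq_key: bytes, trad_key: bytes, msg: bytes):
    m = DOMAIN + msg
    return toy_sign(pq_key, m), toy_sign(trad_key, m)

def composite_verify(pq_key: bytes, trad_key: bytes, msg: bytes, sig) -> bool:
    # Valid only if BOTH components verify: a forger must break both the
    # post-quantum and the traditional scheme.
    m = DOMAIN + msg
    pq_sig, trad_sig = sig
    return toy_verify(pq_key, m, pq_sig) and toy_verify(trad_key, m, trad_sig)

pq_key, trad_key = os.urandom(32), os.urandom(32)
sig = composite_sign(pq_key, trad_key, b"hello")
assert composite_verify(pq_key, trad_key, b"hello", sig)
assert not composite_verify(pq_key, trad_key, b"tampered", sig)
```

<p>Most of the 18 variants in the draft differ only in which components and hashes are plugged into this kind of combiner, which is why whittling them down is a matter of taste rather than of cryptographic substance.</p>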
    <div>
      <h2>TLS stacks get support for ML-KEM</h2>
      <a href="#tls-stacks-get-support-for-ml-kem">
        
      </a>
    </div>
    <p>The next step is software support. Not all ecosystems can move at the same speed, but we’ve already seen major adoption of post-quantum key agreement to counter store-now/decrypt-later. Recent versions of all major browsers, and many TLS libraries and platforms (notably OpenSSL, Go, and recent Apple OSes), have enabled X25519MLKEM768 by default. We keep an overview <a href="https://developers.cloudflare.com/ssl/post-quantum-cryptography/pqc-support/"><u>here</u></a>.</p><p>For TLS there is a big difference between key agreement and signatures. For key agreement, the server and client can add and enable support for post-quantum key agreement independently. Once enabled on both sides, TLS negotiation will use post-quantum key agreement. We go into detail on TLS negotiation in <a href="https://blog.cloudflare.com/post-quantum-for-all#tls-anchor"><u>this blog post</u></a>. If your product just uses TLS, your store-now/decrypt-later problem could be solved by a simple software update of the TLS library.</p><p>Post-quantum <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificates</a> are more of a hassle. Unless you control both ends, you’ll need to install two certificates: a post-quantum certificate for new clients, and a traditional one for old clients. If you aren’t using <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/">automated issuance of certificates</a> yet, this might be a good reason to <a href="https://letsencrypt.org/docs/client-options/"><u>check that out</u></a>. TLS allows the client to signal which signature schemes it supports, so that the server can choose to serve a post-quantum certificate only to those clients that support it. Unfortunately, although almost all TLS libraries support setting up multiple certificates, not all servers expose that configuration. If they do, it will still require a configuration change in most cases. 
(Although undoubtedly <a href="https://caddyserver.com/"><u>Caddy</u></a> will do it for you.)</p><p>Speaking of post-quantum certificates: it will take some time before Certification Authorities (CAs) can issue them. Their <a href="https://csrc.nist.gov/glossary/term/hardware_security_module_hsm"><u>HSMs</u></a> will first need (hardware) support, which then will need to be audited. Also, the <a href="https://cabforum.org/"><u>CA/Browser Forum</u></a> needs to approve the use of the new algorithms. Root programs have different opinions about timelines. Through the grapevine, we hear one of the root programs is preparing a pilot to accept one-year ML-DSA-87 certificates, perhaps even before the end of 2025. A CA/Browser Forum ballot is <a href="https://github.com/cabforum/servercert/pull/624"><u>being drafted</u></a> to support this. Chrome, on the other hand, <a href="https://www.youtube.com/live/O_BXzJv16zQ?t=19274s"><u>prefers</u></a> to solve the large-certificate issue first. For the early movers, the audits are likely to be the bottleneck, as there will be a lot of submissions after the publication of the NIST standards. Although we’ll see the first post-quantum certificates in 2026, it’s unlikely they will be broadly available or trusted by all browsers before 2027.</p><p>We are in an interesting in-between time, where a lot of Internet traffic is protected by post-quantum key agreement, but not a single public post-quantum certificate is in use.</p>
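<p>The certificate-selection mechanism described above is straightforward to sketch. In the hypothetical snippet below, the scheme names are illustrative stand-ins for the real TLS <i>signature_algorithms</i> codepoints; the point is only that a server holding two certificates can serve the post-quantum one exactly to clients that advertise support for it:</p>

```python
# Sketch: pick between two installed certificates based on the signature
# schemes the client advertises in its signature_algorithms extension.
# Scheme names here are illustrative, not real TLS codepoints.
def select_certificate(client_sig_algs: set) -> str:
    pq_schemes = {"mldsa44", "mldsa65", "mldsa87"}
    if client_sig_algs & pq_schemes:
        return "post-quantum certificate"
    return "traditional certificate"

# A new client advertising ML-DSA gets the post-quantum chain; an old
# client silently keeps getting the traditional one.
assert select_certificate({"ecdsa_secp256r1_sha256", "mldsa65"}) == "post-quantum certificate"
assert select_certificate({"rsa_pss_rsae_sha256"}) == "traditional certificate"
```

<p>This is why nothing breaks for old clients during the transition: the negotiation is entirely server-side, driven by what each client says it supports.</p>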
    <div>
      <h2>The search continues for more schemes</h2>
      <a href="#the-search-continues-for-more-schemes">
        
      </a>
    </div>
    <p>NIST is not quite done standardizing post-quantum cryptography. There are two more post-quantum competitions running: <b>round 4</b> and the <b>signatures onramp</b>.</p>
    <div>
      <h3>Round 4 winner: HQC</h3>
      <a href="#round-4-winner-hqc">
        
      </a>
    </div>
    <p>NIST has standardized only one post-quantum key agreement so far: ML-KEM. They’d like to have a second one, a <b>backup KEM</b>, not based on lattices, in case those turn out to be weaker than expected. To find it, they extended the original competition with a fourth round to pick a backup KEM among the finalists. In March 2025, <a href="https://www.nist.gov/news-events/news/2025/03/nist-selects-hqc-fifth-algorithm-post-quantum-encryption#:~:text=NIST%20has%20chosen%20a%20new,were%20discovered%20in%20ML%2DKEM."><u>HQC</u></a> was <a href="https://www.nist.gov/news-events/news/2025/03/nist-selects-hqc-fifth-algorithm-post-quantum-encryption#:~:text=NIST%20has%20chosen%20a%20new,were%20discovered%20in%20ML%2DKEM."><u>selected</u></a> to be standardized.</p><p>HQC performs much worse than ML-KEM on every single metric. HQC-1, the lowest security level variant, requires 7 kB of data on the wire. This is almost double the 3 kB required for ML-KEM-1024, the highest security level variant. There is a similar gap in CPU performance. HQC also scales worse with security level: where ML-KEM-1024 is about double the cost of ML-KEM-512, the highest security level of HQC requires three times the data (21 kB!) and more than four times the compute of HQC-1.</p><p>What about security? As a hedge against gradually improving attacks, ML-KEM-768 has a clear edge over HQC-1: it performs much better, and it has a huge security margin, targeting level 3 rather than level 1. What about leaps? Both ML-KEM and HQC use a similar algebraic structure on top of plain lattices and codes, respectively: it is not inconceivable that a breakthrough there could apply to both. Even without the algebraic structure, codes and lattices feel related. We’re well into speculation here: a catastrophic attack on lattices might not affect codes, but it wouldn’t be surprising if it did. 
After all, RSA and ECC, which are more dissimilar, are both broken by quantum computers.</p><p>Still, there might be peace of mind in keeping HQC around, just in case. Here, we’d like to share an anecdote from the chaotic week when it was not yet clear that Chen’s quantum algorithm against lattices was flawed. What would we replace ML-KEM with if it were affected? HQC was briefly considered, but it was clear that an adjusted variant of ML-KEM would still be much more performant.</p><p>Stepping back: that we’re looking for a <i>second</i> efficient KEM is a luxury position. If I were granted a wish for a new post-quantum scheme, I wouldn’t ask for a better KEM, but for a better signature scheme. Let’s see if I get lucky.</p>
    <div>
      <h3>Signatures onramp</h3>
      <a href="#signatures-onramp">
        
      </a>
    </div>
    <p>In late 2022, after announcing the first four picks, NIST also called a new competition, dubbed the <i>signatures onramp</i>, to find <a href="https://csrc.nist.gov/projects/pqc-dig-sig"><u>additional signature schemes</u></a>. The competition has two goals. The first is hedging against cryptanalytic breakthroughs against lattice-based cryptography. NIST would like to standardize a signature that performs better than SLH-DSA (both in size and compute), but is not based on lattices. Secondly, they’re looking for a signature scheme that might do well in use cases where the current roster doesn’t do well: we will discuss those at length later on in this post.</p><p>In July 2023, NIST posted the <a href="https://csrc.nist.gov/news/2023/additional-pqc-digital-signature-candidates"><u>40 submissions</u></a> they received for a first round of public review. The cryptographic community got to work, and as is quite normal for a first round, many of the schemes were broken within a week. By February 2024, ten submissions were broken completely, and several others were weakened drastically. Out of the standing candidates, in October 2024, NIST selected 14 submissions for the second round.</p><p>A year ago, we wrote <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/"><u>a blog post</u></a> covering these 14 submissions in great detail. The short of it: there has been amazing progress on post-quantum signature schemes. We will touch briefly upon them later on, and give some updates on the advances since last year. It is worth mentioning that just like the main post-quantum competition, the selection process will take many years. It is unlikely that any of these onramp signature schemes will be standardized before 2028 — if they’re not broken in the first place. That means that although they’re very welcome in the future, we can’t trust that better signature schemes will solve our problems today. 
As Eric Rescorla, the editor of TLS 1.3, <a href="https://educatedguesswork.org/posts/pq-emergency/"><u>writes</u></a>: “You go to war with the algorithms you have, not the ones you wish you had.”</p><p>With that in mind, let's look at the progress of deployments.</p>
    <div>
      <h2>Migrating the Internet to post-quantum key agreement</h2>
      <a href="#migrating-the-internet-to-post-quantum-key-agreement">
        
      </a>
    </div>
    <p>Now that we have the big picture, let’s dive into some finer details about this X25519MLKEM768 that’s widely deployed now.</p><p>First the post-quantum part. ML-KEM was submitted under the name <a href="https://pq-crystals.org/kyber/index.shtml"><u>CRYSTALS-Kyber</u></a>. Even though it’s a US standard, its designers work in industry and academia across France, Switzerland, the Netherlands, Belgium, Germany, Canada, China, and the United States. Let’s have a look at its performance.</p>
    <div>
      <h2>ML-KEM versus X25519</h2>
      <a href="#ml-kem-versus-x25519">
        
      </a>
    </div>
    <p>Today the vast majority of clients use the traditional key agreement X25519. Let’s compare that to ML-KEM.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VCx6lbwzhKt4FywhRAZbk/4b7956adbb9a7690d3c3c6ce5d830fe1/Screenshot_2025-10-28_at_13.41.31.png" />
          </figure><p><sup>Size and CPU compared between X25519 and ML-KEM. Performance varies considerably by hardware platform and implementation constraints, and should be taken as a rough indication only.</sup></p><p>ML-KEM-512, -768 and -1024 aim to be as resistant to (quantum) attack as AES-128, -192 and -256 respectively. Even at the AES-128 level, ML-KEM is much bigger than X25519: it requires 800+768=1,568 bytes over the wire (an 800-byte encapsulation key plus a 768-byte ciphertext), whereas X25519 requires a mere 64 bytes.</p><p>On the other hand, even ML-KEM-1024 is typically significantly faster than X25519, although this can vary quite a bit depending on your platform and implementation.</p>
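<p>The wire cost for each variant is simply the sum of the two shares the peers exchange. Using the public key and ciphertext sizes from FIPS 203, and the 32-byte X25519 public keys from RFC 7748:</p>

```python
# Bytes on the wire for one key agreement: the client's share plus the
# server's share. ML-KEM sizes are from FIPS 203; X25519 from RFC 7748.
wire_bytes = {
    "X25519":      32 + 32,      # public key + public key
    "ML-KEM-512":  800 + 768,    # encapsulation key + ciphertext
    "ML-KEM-768":  1184 + 1088,
    "ML-KEM-1024": 1568 + 1568,
}
for name, size in wire_bytes.items():
    print(f"{name:>11}: {size:,} bytes")
```

<p>Note that, unlike X25519, ML-KEM is asymmetric on the wire: the encapsulation key sent by the client is larger than the ciphertext sent back, except at the highest security level.</p>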
    <div>
      <h2>ML-KEM-768 and X25519</h2>
      <a href="#ml-kem-768-and-x25519">
        
      </a>
    </div>
    <p>We are not taking advantage of that speed boost just yet. Like many other early adopters, we like to play it safe and deploy a <b>hybrid</b> key agreement <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-ecdhe-mlkem/"><u>combining</u></a> X25519 and ML-KEM-768. This combination might surprise you for two reasons.</p><ol><li><p>Why combine X25519 (“128 bits of security”) with ML-KEM-768 (“192 bits of security”)?</p></li><li><p>Why bother with the non-post-quantum X25519?</p></li></ol><p>The apparent security level mismatch is a hedge against improvements in the cryptanalysis of lattice-based cryptography. There is a lot of trust in the (non-post-quantum) security of X25519: matching AES-128 is more than enough. Although we are comfortable with the security of ML-KEM-512 today, over the coming decades cryptanalysis could improve. Thus, we’d like to keep a margin for now.</p><p>There are two reasons to include X25519. First, there is always a remote chance that a breakthrough renders all variants of ML-KEM insecure. In that case, X25519 still provides non-post-quantum security, and our post-quantum migration didn’t make things worse.</p><p>More important is that we worry not only about attacks on the algorithm, but also on the implementation. A noteworthy example where we dodged a bullet is that of <a href="https://kyberslash.cr.yp.to/"><u>KyberSlash</u></a>, a timing attack that affected many implementations of Kyber (an earlier version of ML-KEM), including <a href="https://github.com/cloudflare/circl/security/advisories/GHSA-9763-4f94-gfch"><u>our own</u></a>. Luckily, KyberSlash does not affect Kyber as it is used in TLS. A similar implementation mistake that would actually affect TLS is likely to require an active attacker. In that case, the likely aim of the attacker wouldn’t be to decrypt data decades down the line, but to steal a cookie or other token, or inject a payload. 
Including X25519 prevents such an attack.</p><p>So how well do ML-KEM-768 and X25519 together perform in practice?</p>
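<p>Before looking at the measurements, it helps to see how simple the hybrid construction itself is: each side's key share is the concatenation of the two components' shares, and the two shared secrets are concatenated before entering the TLS key schedule, so the connection stays secure as long as either component does. A sketch, with random bytes standing in for the real X25519 and ML-KEM-768 operations (the share ordering follows the X25519MLKEM768 draft, which puts the ML-KEM component first):</p>

```python
import os

# Stand-ins for the real cryptographic values; only the sizes and the
# concatenation structure are meaningful here.
mlkem_encaps_key = os.urandom(1184)   # ML-KEM-768 encapsulation key
x25519_client_pub = os.urandom(32)    # X25519 ephemeral public key
client_share = mlkem_encaps_key + x25519_client_pub  # 1,216 bytes in ClientHello

mlkem_ciphertext = os.urandom(1088)   # ML-KEM-768 ciphertext
x25519_server_pub = os.urandom(32)
server_share = mlkem_ciphertext + x25519_server_pub  # 1,120 bytes back

mlkem_ss = os.urandom(32)    # stand-in for the ML-KEM shared secret
x25519_ss = os.urandom(32)   # stand-in for the X25519 shared secret
hybrid_secret = mlkem_ss + x25519_ss  # fed into the TLS key schedule
```

<p>Because the TLS key schedule hashes the whole concatenation, an attacker must recover <i>both</i> component secrets to learn the session keys.</p>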
    <div>
      <h2>Performance and protocol ossification</h2>
      <a href="#performance-and-protocol-ossification">
        
      </a>
    </div>
    
    <div>
      <h3>Browser experiments</h3>
      <a href="#browser-experiments">
        
      </a>
    </div>
    <p>Being well aware of potential compatibility and performance issues, Google started <a href="https://security.googleblog.com/2016/07/experimenting-with-post-quantum.html"><u>a first experiment</u></a> with post-quantum cryptography back in 2016, the same year NIST started their competition. This was followed up by a second, larger joint experiment by <a href="https://blog.cloudflare.com/towards-post-quantum-cryptography-in-tls/"><u>Cloudflare</u></a> and <a href="https://www.imperialviolet.org/2018/12/12/cecpq2.html"><u>Google</u></a> in 2018. We tested two different hybrid post-quantum key agreements: CECPQ2, which is a combination of the lattice-based NTRU-HRSS and X25519, and CECPQ2b, a combination of the isogeny-based SIKE and again X25519. NTRU-HRSS is very similar to ML-KEM in size, but is computationally somewhat more taxing on the client side. SIKE, on the other hand, has very small keys, is computationally very expensive, and was <a href="https://eprint.iacr.org/2022/975.pdf"><u>completely broken</u></a> in 2022. With respect to TLS handshake times, X25519+NTRU-HRSS performed very well.</p><p>Unfortunately, a small but significant fraction of clients experienced broken connections with NTRU-HRSS. The reason: the size of the NTRU-HRSS keyshares. In the past, when creating a TLS connection, the first message sent by the client, the so-called <i>ClientHello</i>, almost always fit within a single network packet. The TLS specification allows for a larger <i>ClientHello</i>, but no one really made use of that. Thus protocol ossification strikes again: some middleboxes, load balancers, and other software tacitly assume the <i>ClientHello</i> always fits in a single packet.</p>
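<p>Back-of-the-envelope arithmetic shows why the assumption breaks. The numbers below are illustrative (real <i>ClientHello</i> sizes vary per connection), but the conclusion holds: adding an ML-KEM-768-sized key share pushes a typical <i>ClientHello</i> past what fits in one packet:</p>

```python
# Rough arithmetic behind the ossification problem. Sizes are
# illustrative; exact ClientHello sizes vary per connection.
MTU = 1500                # typical Ethernet MTU
IP_TCP_OVERHEAD = 40      # IPv4 + TCP headers, no options
available = MTU - IP_TCP_OVERHEAD          # bytes left for TLS records

classical_hello = 300 + 32         # typical ClientHello + X25519 share
pq_hello        = 300 + 1184 + 32  # same, plus an ML-KEM-768 encaps key

# The classical hello fits in one packet with room to spare; the
# post-quantum hello does not.
assert classical_hello < available < pq_hello
```
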
    <div>
      <h2>Long road to 50%</h2>
      <a href="#long-road-to-50">
        
      </a>
    </div>
    <p>Over the subsequent years, we kept experimenting with PQ, switching to Kyber in 2022, and ML-KEM in 2024. Chrome did a great job reaching out to vendors whose products were <a href="https://tldr.fail/"><u>incompatible</u></a>. If it were not for these compatibility issues, we would likely have seen Chrome ramp up post-quantum key agreement five years earlier. It took until March 2024 before Chrome felt comfortable enough to enable post-quantum key agreement by default on Desktop. After that, many other clients and all major browsers joined Chrome in enabling post-quantum key agreement by default. An incomplete timeline:</p><table><tr><td><p>July 2016</p></td><td><p>Chrome’s <a href="https://security.googleblog.com/2016/07/experimenting-with-post-quantum.html"><u>first experiment with PQ</u></a> (CECPQ)</p></td></tr><tr><td><p>June 2018</p></td><td><p><a href="https://blog.cloudflare.com/the-tls-post-quantum-experiment/"><u>Cloudflare</u></a> / <a href="https://www.imperialviolet.org/2018/12/12/cecpq2.html"><u>Google</u></a> experiment (CECPQ2)</p></td></tr><tr><td><p>October 2022</p></td><td><p>Cloudflare <a href="https://blog.cloudflare.com/post-quantum-for-all/"><u>enables</u></a> PQ by default server side</p></td></tr><tr><td><p>November 2023</p></td><td><p>Chrome ramps up PQ to 10% on Desktop</p></td></tr><tr><td><p>March 2024</p></td><td><p>Chrome <a href="https://blog.chromium.org/2024/05/advancing-our-amazing-bet-on-asymmetric.html"><u>enables</u></a> PQ by default on Desktop</p></td></tr><tr><td><p>August 2024</p></td><td><p>Go <a href="https://github.com/golang/go/issues/67061"><u>enables</u></a> PQ by default</p></td></tr><tr><td><p>November 2024</p></td><td><p>Chrome enables PQ by default on Android; Firefox enables it by default on Desktop</p></td></tr><tr><td><p>April 2025</p></td><td><p><a href="https://openssl-library.org/post/2025-04-08-openssl-35-final-release/"><u>OpenSSL</u></a> enables PQ by default</p></td></tr><tr><td><p>October 
2025</p></td><td><p>Apple is <a href="https://support.apple.com/en-us/122756"><u>rolling out</u></a> PQ by default with the release of iOS / iPadOS / macOS 26.</p></td></tr></table><p>It’s noteworthy that there is a gap between Chrome enabling PQ on Desktop and on Android. Although ML-KEM doesn’t have a large performance impact, as seen in the graphs, it’s certainly not negligible, especially on the long tail of slower connections more prevalent on mobile, so proceeding there required more consideration.</p><p>But we’re finally here now: over 50% (and rising!) of human traffic is protected against store-now/decrypt-later, making post-quantum key agreement the new security baseline for the Web.</p><p>Browsers are one side of the equation; what about servers?</p>
    <div>
      <h3>Server-side support</h3>
      <a href="#server-side-support">
        
      </a>
    </div>
    <p>Back in 2022, we <a href="https://blog.cloudflare.com/post-quantum-for-all/"><u>enabled</u></a> post-quantum key agreement server side for basically all customers. Google did the same for most of their servers (except GCP) in 2023. Since then, many have followed. Jan Schaumann has been posting regular scans of the top 100k domains. In his September 2025 post, <a href="https://www.netmeister.org/blog/pqc-use-2025-09.html"><u>he reports</u></a> that 39% support PQ now, up from 28% only six months earlier. In his survey, we see not only support rolling out on large service providers, such as Amazon, Fastly, Squarespace, Google, and Microsoft, but also a trickle of self-hosted servers at Hetzner and OVHcloud adding support.</p><p>This is the publicly accessible web. What about servers behind a service like Cloudflare?</p>
    <div>
      <h3>Support at origins</h3>
      <a href="#support-at-origins">
        
      </a>
    </div>
    <p>In <a href="https://blog.cloudflare.com/post-quantum-to-origins"><u>September 2023</u></a>, we added support for our customers to enable post-quantum key agreement on connections from Cloudflare to their origins. That’s connection (3) in the following diagram:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7dRJxj1f2otMM41sEKFoFG/d722378a6f74c4033787897334bb4e7a/image12.png" />
          </figure><p><sup>Typical connection flow when a visitor requests an uncached page.</sup></p><p>Back in 2023, only 0.5% of origins supported post-quantum key agreement. Through 2024, that barely changed. This year, in 2025, we see support slowly pick up as software support rolls out, and we’re now at 3.7%.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LaKKWKWTli5NETFHlQ1za/e9eb1e750a72e62bdc522207451e7085/image7.png" />
          </figure><p><sup>Fraction of origins that support the post-quantum key agreement X25519MLKEM768.</sup></p><p>3.7% doesn’t sound impressive at all compared to the previous 50% and 39% for clients and public servers respectively, but it’s nothing to scoff at. There is much more diversity in origins than in clients: many more people have to do something to make that number move up. But it’s still a more than seven-fold increase, and let’s not forget that back in 2024 we celebrated reaching 1.8% of client support.</p><p>For customers, origins aren’t always easy to upgrade. Does that mean missing out on post-quantum security? No, not necessarily: you can secure the connection between Cloudflare and your origin by setting up <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Cloudflare Tunnel</u></a> as a sidecar to your origin.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3bfxtdySAPtc9hn9Qroztz/8233f1fecbed214b9584af6648488587/image3.png" />
          </figure>
    <div>
      <h3>Ossification</h3>
      <a href="#ossification">
        
      </a>
    </div>
    <p>Support is all well and good, but as we saw with the browser experiments, protocol ossification is a big concern. What does it look like with origins? Well, it depends.</p><p>There are two ways to enable post-quantum key agreement: the fast way, and the slow but safer way. In both cases, if the origin doesn’t support post-quantum, we fall back safely to traditional key agreement. We explain the details in this <a href="https://blog.cloudflare.com/post-quantum-to-origins"><u>blog post</u></a>, but in short: in the fast way we send the post-quantum keys immediately, and in the safer way we postpone them by one roundtrip using <i>HelloRetryRequest</i>. All major browsers use the fast way.</p><p>We have been regularly scanning all origins to see what they support. The good news is that all origins supported the safe but slow method. The fast method didn’t fare as well: we found that 0.05% of connections would break. That’s too high to enable the fast method by default. We did enable PQ to origins using the safer method by default for all non-enterprise customers; enterprise customers can opt in.</p><p>We won’t be satisfied, though, until it’s fast and enabled for everyone. That’s why we’ll <a href="https://blog.cloudflare.com/automatically-secure/#post-quantum-era"><u>automatically enable</u></a> post-quantum to origins using the fast method for all customers, if our scans show it’s safe.</p>
    <div>
      <h3>Internal connections</h3>
      <a href="#internal-connections">
        
      </a>
    </div>
    <p>So far, all the connections we’ve been talking about are between Cloudflare and external parties. There are also a lot of internal connections within Cloudflare (marked 2 in the two diagrams above). In 2023 we <a href="https://blog.cloudflare.com/post-quantum-cryptography-ga/"><u>made a big push</u></a> to upgrade our internal connections to post-quantum key agreement. Compared to all the other post-quantum efforts we pursue, this has been, by far, the biggest job: we asked every engineering team in the company to stop what they were doing; take stock of the data and connections that their products secure; and upgrade them to post-quantum key agreement. In most cases the upgrade was simple. In fact, many teams had already been upgraded by pulling in software updates. Still, figuring out that you’re already done can take quite some time! On a positive note, we didn’t see any performance or ossification issues in this push.</p><p>We have upgraded the majority of internal connections, but a long tail remains, which we continue to work on. The most important connection that we didn’t get to upgrade in 2023 is the one between the WARP client and Cloudflare. In September 2025 we <a href="https://blog.cloudflare.com/post-quantum-warp/"><u>upgraded it</u></a> by moving from WireGuard to QUIC.</p>
    <div>
      <h2>Outlook</h2>
      <a href="#outlook">
        
      </a>
    </div>
    <p>As we’ve seen, post-quantum key agreement, despite initial trouble with protocol ossification, has been straightforward to deploy. In the vast majority of cases it’s an uneventful software update. And with 50% deployment (and rising), it’s the new security baseline for the Internet.</p><p>Let’s turn to the second, more difficult migration.</p>
    <div>
      <h2>Migrating the Internet to post-quantum signatures</h2>
      <a href="#migrating-the-internet-to-post-quantum-signatures">
        
      </a>
    </div>
    <p>Now, we’ll turn our attention to upgrading the signatures used on the Internet.</p>
    <div>
      <h2>The zoo of post-quantum signatures</h2>
      <a href="#the-zoo-of-post-quantum-signatures">
        
      </a>
    </div>
    <p>We wrote a <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/"><u>long deep dive</u></a> into the field of post-quantum signature schemes in November 2024. Most of it is still up-to-date, but there have been some exciting developments. Here we’ll go over some highlights and updates from the past year.</p><p>Let’s start by sizing up the post-quantum signatures we have available today at the AES-128 security level: ML-DSA-44 and the two variants of SLH-DSA. We use ML-DSA-44 as the baseline, as that’s the scheme that’s going to see the most widespread use initially. As a comparison, we also include the venerable Ed25519 and RSA-2048 in wide use today, as well as FN-DSA-512, which will be standardised soon, and a sample of nine promising signature schemes for TLS from the signatures onramp.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NC2lO6hXKEFOgaVgQ7ExO/a0a65ddbb24d11ad96405f19aa344f4b/Screenshot_2025-10-28_at_13.18.54.png" />
          </figure><p><sup>Comparison of various signature schemes at the security level of AES-128. CPU times vary significantly by platform and implementation constraints and should be taken as a rough indication only. ⚠️ FN-DSA signing time when using fast but dangerous floating-point arithmetic — see warning below. ⚠️ SQISign signing is not timing side-channel secure.</sup></p><p>It is immediately clear that none of the post-quantum signature schemes comes even close to being a drop-in replacement for Ed25519 (which is comparable to ECDSA P-256) as most of the signatures are simply much bigger. The exceptions are SQISign, MAYO, SNOVA, and UOV from the onramp, but they’re far from ideal. MAYO, SNOVA, and UOV have large public keys, and SQISign requires a great amount of computation.</p>
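To make the size gap concrete, here is a quick sketch tallying public-key plus signature bytes for a few of the schemes in the table, using the commonly cited sizes from their specifications (the RSA-2048 public key size is approximate, since it depends on encoding):

```python
# Approximate public-key and signature sizes (bytes) at the AES-128
# security level, as given in the respective specifications.
schemes = {
    "Ed25519":      {"pk": 32,   "sig": 64},
    "RSA-2048":     {"pk": 270,  "sig": 256},   # pk size is encoding-dependent
    "ML-DSA-44":    {"pk": 1312, "sig": 2420},
    "FN-DSA-512":   {"pk": 897,  "sig": 666},
    "SLH-DSA-128s": {"pk": 32,   "sig": 7856},
}

# Sort by combined size: the cost when both pk and sig travel in the handshake.
for name, s in sorted(schemes.items(), key=lambda kv: kv[1]["pk"] + kv[1]["sig"]):
    print(f"{name:>13}: pk={s['pk']:>5}  sig={s['sig']:>5}  pk+sig={s['pk'] + s['sig']:>5}")
```

Note how FN-DSA-512’s 1,563 combined bytes compare to ML-DSA-44’s 3,732, which is why FN-DSA is so tempting despite the caveats below.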
    <div>
      <h3>Be careful with FN-DSA</h3>
      <a href="#be-careful-with-fn-dsa">
        
      </a>
    </div>
    <p>Looking ahead a bit: the best of the first competition seems to be FN-DSA-512. FN-DSA-512’s signatures and public key together are <i>only</i> 1,563 bytes, with somewhat reasonable signing time. FN-DSA has an <b>Achilles heel</b> though — for acceptable signing performance, it requires fast floating-point arithmetic. Without it, signing is about 20 times slower. But speed is not enough: the floating-point arithmetic also has to run in constant time — without that, the FN-DSA private key can be recovered by timing signature creation. Writing safe FN-DSA implementations has turned out to be quite challenging, which makes FN-DSA dangerous when signatures are generated on the fly, such as in a TLS handshake. It is good to stress that this only affects signing. FN-DSA verification does not require floating-point arithmetic (and during verification there wouldn’t be a private key to leak anyway).</p>
    <div>
      <h2>There are many signatures on the web</h2>
      <a href="#there-are-many-signatures-on-the-web">
        
      </a>
    </div>
    <p>The biggest pain point of migrating the Internet to post-quantum signatures is that there are a lot of signatures even in a single connection. When you visit this very website for the first time, we send <b>five signatures and two public keys</b>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6frWEoCLnBEZ5qztV8XoT4/25bd315190d8914f42f282679d6f525a/image9.png" />
          </figure><p>The majority of these are for the <b>certificate chain</b>: the CA signs the intermediate certificate, which signs the leaf certificate, which in turn signs the TLS transcript to prove the authenticity of the server. If you’re keeping count: we’re still two signatures short.</p><p>These are for <b>SCTs</b> required for <a href="https://certificate.transparency.dev/howctworks/"><u>certificate transparency</u></a>. Certificate transparency (CT) is a key, but lesser known, part of the <a href="https://smallstep.com/blog/everything-pki/#web-pki-vs-internal-pki"><u>Web PKI</u></a>, the ecosystem that secures browser connections. Its goal is to publicly log every certificate issued, so that misissuances can be detected after the fact. It’s the system that’s behind <a href="http://crt.sh"><u>crt.sh</u></a> and <a href="https://blog.cloudflare.com/new-regional-internet-traffic-and-certificate-transparency-insights-on-radar/"><u>Cloudflare Radar</u></a>. CT has shown its value once more very recently by surfacing a <a href="https://blog.cloudflare.com/unauthorized-issuance-of-certificates-for-1-1-1-1/"><u>rogue certificate for 1.1.1.1</u></a>.</p><p>Certificate transparency works by having independent parties run <i>CT logs</i>. Before issuing a certificate, a CA must first submit it to at least two different CT logs. An SCT is a signature of a CT log that acts as a proof, a <i>receipt</i>, that the certificate has been logged.</p>
    <div>
      <h3>Tailoring signature schemes</h3>
      <a href="#tailoring-signature-schemes">
        
      </a>
    </div>
    <p>There are two aspects of how a signature can be used that are worthwhile to highlight: whether the <b>public key is included</b> with the signature, and whether the signature is <b>online</b> or <b>offline</b>.</p><p>For the SCTs and the signature of the root on the intermediate, the public key is not transmitted during the handshake. Thus, for those, a signature scheme with smaller signatures but larger public keys, such as MAYO, SNOVA, or UOV, would be particularly well-suited. For the other signatures, the public key is included, and it’s more important to minimize the sizes of the combined public key and signature.</p><p>The handshake signature is the only signature that is created online — all the other signatures are created ahead of time.  The handshake signature is created and verified only once, whereas the other signatures are typically verified many times by different clients. This means that for the handshake signature, it’s advantageous to balance signing and verification time which are both in the <i>hot path</i>, whereas for the other signatures having better verification time at the cost of slower signing is worthwhile. This is one of the advantages RSA still enjoys over elliptic curve signatures today.</p><p>Putting together different signature schemes is a fun puzzle, but it also comes with some drawbacks. Using multiple different schemes increases the attack surface because an algorithmic or implementation vulnerability in one compromises the whole. Also, the whole ecosystem needs to implement and optimize multiple algorithms, which is a significant burden.</p>
    <div>
      <h2>Putting it together</h2>
      <a href="#putting-it-together">
        
      </a>
    </div>
    <p>So, what are some reasonable combinations to try?</p>
    <div>
      <h3>With NIST’s current picks</h3>
      <a href="#with-nists-current-picks">
        
      </a>
    </div>
    <p>With the draft standards available today, we do not have a lot of options.</p><p>If we simply switch to ML-DSA-44 for all signatures, we’re adding 15kB of data that needs to be transmitted from the server to the client during the TLS handshake. Is that a lot? Probably. We will address that later on.</p><p>If we wait a bit and replace all but the handshake signature with FN-DSA-512, we’re looking at adding only 7kB. That’s much better, but I have to repeat that it’s difficult to implement FN-DSA-512 signing safely without timing side channels, and there is a good chance we’ll shoot ourselves in the foot if we’re not careful. Another way to shoot ourselves in the foot <i>today</i> is with stateful hash-based signatures, as we explain <a href="https://blog.cloudflare.com/pq-2024/#stateful-hash-based-signatures"><u>here</u></a>. All in all, FN-DSA-512 and stateful hash-based signatures tempt us with a similar and clear performance benefit over ML-DSA-44, but are difficult to use safely.</p>
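The 15kB figure can be reproduced from the handshake contents described above: five ML-DSA-44 signatures (two SCTs, the root’s signature on the intermediate, the intermediate’s on the leaf, and the handshake signature) and two public keys (intermediate and leaf):

```python
ML_DSA_44_SIG = 2420  # bytes per ML-DSA-44 signature (FIPS 204)
ML_DSA_44_PK = 1312   # bytes per ML-DSA-44 public key

# Five signatures: 2 SCTs, root->intermediate, intermediate->leaf, and the
# handshake signature. Two transmitted public keys: intermediate and leaf.
overhead = 5 * ML_DSA_44_SIG + 2 * ML_DSA_44_PK
print(overhead)  # 14724 bytes, roughly 15kB
```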
    <div>
      <h3>Signatures on the horizon</h3>
      <a href="#signatures-on-the-horizon">
        
      </a>
    </div>
    <p>There are some promising new signature schemes submitted to the NIST onramp.</p><p>Purely looking at sizes, SQISign I is the clear winner, even beating RSA-2048. Unfortunately, the computation required for signing, and crucially verification, are too high. SQISign is in a worse position than FN-DSA with implementation security: it’s very complicated and it’s unclear how to perform signing in <i>constant time</i>. For niche applications, SQISign might be useful, but for general adoption verification times need to improve significantly, even if that requires a larger signature. Over the last few years there has been amazing progress in improving verification time; simplifying the algorithm; and <a href="https://eprint.iacr.org/2025/832"><u>implementation security</u></a> for (variants of) SQISign. They’re not there yet, but the gap has shrunk much more than we’d have expected. If the pace of improvement holds, then a future SQISign could well be viable for TLS.</p><p>One conservative contender is <a href="https://link.springer.com/chapter/10.1007/3-540-48910-X_15"><u>UOV (unbalanced oil and vinegar)</u></a>. It is an old multivariate scheme with a large public key (66.5kB), but small signatures (96 bytes). Over the decades, there have been many attempts to add some structure to UOV public keys, to get a better balance between public key and signature size. Many of these so-called <i>structured multivariate </i>schemes, which includes Rainbow and GeMMS, unfortunately have been broken dramatically <a href="https://eprint.iacr.org/2022/214.pdf"><u>“with a laptop over the weekend”</u></a>. MAYO and SNOVA, which we’ll get to in a bit, are the latest attempts at structured multivariate. UOV itself has remained mostly unscathed. Surprisingly in 2025, Lars Ran found a completely new <a href="https://eprint.iacr.org/2025/1143"><u>“wedges” attack</u></a> on UOV. It doesn’t affect UOV much, although SNOVA and MAYO are hit harder. 
The attack is noteworthy because it’s based on a relatively simple idea: it is surprising it wasn’t found before. Getting back to performance: if we combine UOV for the root and SCTs with ML-DSA-44 for the others, we’re looking at only 10kB — close to FN-DSA-512.</p><p>Now, let’s turn to the main event:</p>
    <div>
      <h3>The fight between MAYO and SNOVA</h3>
      <a href="#the-fight-between-mayo-versus-snova">
        
      </a>
    </div>
    <p>Looking at the roster today, MAYO and particularly SNOVA look great from a performance standpoint. Last year, SNOVA and MAYO were closer in performance, but they have diverged quite a bit.</p><p><a href="https://pqmayo.org/"><u>MAYO</u></a> is designed by the cryptographer who broke <a href="https://eprint.iacr.org/2022/214.pdf"><u>Rainbow</u></a>. As a structured multivariate scheme, its security requires careful scrutiny, but its utility (assuming it is not broken) is very appealing. MAYO allows for a fine-grained tradeoff between signature and public key size. For the submission, to keep things simple, the authors proposed two concrete variants: MAYO<sub>one</sub> with balanced signature (454 bytes) and public key (1.4kB) sizes, and MAYO<sub>two</sub> with signatures of 216 bytes, while keeping the public key manageable at 4.3kB. Verification times are excellent, while signing times are somewhat slower than ECDSA, but far better than RSA. Combining both variants in the obvious way, we’re only looking at 4.3kB. These numbers are a bit higher than last year, as MAYO adjusted its parameters again slightly to account for newly discovered attacks.</p><p>Over the competition, <a href="https://snova.pqclab.org/"><u>SNOVA</u></a> has been hit harder by attacks than MAYO. SNOVA’s response has been more aggressive: instead of just tweaking parameters, they have also made larger changes to the internals of the scheme, to counter the attacks and to get a performance improvement to boot. Combining SNOVA<sub>(37,17,16,2)</sub> and SNOVA<sub>(24,5,23,4)</sub> in the obvious way, we’re looking at adding an amazing 2.1kB.</p><p>We see a face-off shaping up between the risky but much smaller SNOVA, and the conservative but slower MAYO. Zooming out, both have very welcome performance, and both are too risky to deploy now. 
Ran’s new wedges attack is an example that the field of multivariate cryptanalysis still holds surprises, and needs more eyes and time. It’s too soon to pick a winner between SNOVA and MAYO: let them continue to compete. Even if they turn out to be secure, neither is likely to be standardized by 2029, which means we cannot rely on them for the initial migration to post-quantum authentication.</p><p>Stepping back, is the 15kB for ML-DSA-44 actually that bad?</p>
    <div>
      <h2>Do we really care about the extra bytes?</h2>
      <a href="#do-we-really-care-about-the-extra-bytes">
        
      </a>
    </div>
    <p>On average, around 18 million TLS connections are established with Cloudflare per second. Upgrading each to ML-DSA would take 2.1 Tbps, which is 0.5% of our current total network capacity. No problem so far. The question is how these extra bytes affect performance.</p><p>It will take 15kB extra to swap in ML-DSA-44. That’s a lot compared to the typical handshake today, but it’s not a lot compared to the JavaScript and images served on many web pages. The key point is that the change we must make here affects every single TLS connection, whether it’s used for a bloated website, or a time-critical API call. Also, it’s not just about waiting a bit longer. If you have spotty cellular reception, that extra data can make the difference between being able to load a page, and having the connection time out. (As an aside, talking about bloat: many apps perform a <a href="https://thomwiggers.nl/publication/tls-on-android/tls-on-android.pdf"><u>surprisingly high number of TLS handshakes</u></a>.)</p><p>Just like with key agreement, performance isn’t our only concern: we also want the connection to succeed in the first place. Back in 2021, <a href="https://blog.cloudflare.com/sizing-up-post-quantum-signatures/"><u>we ran an experiment</u></a> artificially enlarging the certificate chain to simulate larger post-quantum certificates. We summarize the result <a href="https://blog.cloudflare.com/pq-2024/#do-we-really-care-about-the-extra-bytes"><u>here</u></a>. One key take-away is that some clients or middleboxes don’t like certificate chains larger than 10kB. This is problematic for a <a href="https://eprint.iacr.org/2018/063.pdf"><u>single-certificate migration</u></a> strategy. In this approach, the server installs a single traditional certificate that contains a separate post-quantum certificate in a so-called non-critical extension. A client that does not support post-quantum certificates will ignore the extension. 
Installing such a single certificate would immediately break every client with those compatibility issues, making this strategy a non-starter. On the performance side there is also a steep drop in performance at 10kB because of the initial congestion window.</p><p>Is 9kB too much? The slowdown in TLS handshake time would be approximately 15%. We felt that is workable, but far from ideal: such a slowdown is noticeable, and people might hold off on deploying post-quantum certificates until it’s too late.
</p><p>Chrome is more cautious and set 10% as their target for maximum TLS handshake time regression. They <a href="https://dadrian.io/blog/posts/pqc-signatures-2024/#fnref:3"><u>report</u></a> that deploying post-quantum key agreement has already incurred a 4% slowdown in TLS handshake time, for the extra 1.1kB from server-to-client and 1.2kB from client-to-server. That slowdown is proportionally larger than the 15% we found for 9kB, but that could be explained by slower upload speeds than download speeds. </p><p>There has been pushback against the focus on TLS handshake times. One argument is that session resumption alleviates the need for sending the certificates again. A second argument is that the data required to visit a typical website dwarfs the additional bytes for post-quantum certificates. One example is this <a href="https://www.amazon.science/publications/the-impact-of-data-heavy-post-quantum-tls-1-3-on-the-time-to-last-byte-of-real-world-connections"><u>2024 publication</u></a>, where Amazon researchers have simulated the impact of large post-quantum certificates on data-heavy TLS connections. They argue that typical connections transfer multiple requests and hundreds of kilobytes, and for those the TLS handshake slowdown disappears in the margin.</p><p>Are session resumption and hundreds of kilobytes over a connection typical though? We’d like to share what we see. We focus on QUIC connections, which are likely initiated by browsers or browser-like clients. Of all QUIC connections with Cloudflare that carry at least one HTTP request, 27% are <a href="https://blog.cloudflare.com/even-faster-connection-establishment-with-quic-0-rtt-resumption/"><u>resumptions</u></a>, meaning that key material from a previous TLS connection is reused, avoiding the need to transmit certificates. The median number of bytes transferred from server-to-client over a resumed QUIC connection is 4.4kB, while the average is 259kB. 
For non-resumptions the median is 8.1kB and the average is 583kB. This vast difference between median and average indicates that a small fraction of data-heavy connections skew the average. In fact, only 15.5% of all QUIC connections transfer more than 100kB.</p><p>The median certificate chain today (with compression) is <a href="https://datatracker.ietf.org/doc/html/draft-ietf-tls-cert-abridge-02#section-4"><u>3.2kB</u></a>. That means that almost 40% of all data transferred from server to client on more than half of the non-resumed QUIC connections is just for the certificates, and this only gets worse with post-quantum algorithms. For the majority of QUIC connections, using ML-DSA-44 as a drop-in replacement for classical signatures would more than double the number of bytes transmitted over the lifetime of the connection.</p><p>It sounds quite bad if the vast majority of data transferred over a typical connection is just for the post-quantum certificates. Still, it’s only a proxy for what is actually important: the effect on metrics relevant to the end user, such as the browsing experience (e.g. <a href="https://web.dev/articles/optimize-lcp"><u>largest contentful paint</u></a>) and the amount of data those certificates take from a user’s monthly data cap. We will continue to investigate and get a better understanding of the impact.</p>
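The back-of-the-envelope numbers in this section are easy to check, using the figures quoted in this post: roughly 15kB (14,724 bytes) of ML-DSA-44 overhead, 18 million connections per second, and the 3.2kB median chain against the 8.1kB median non-resumed transfer:

```python
# Figures quoted in this post; sizes in bytes.
CONNS_PER_S = 18e6    # average TLS connections/s across Cloudflare
PQ_OVERHEAD = 14724   # 5 ML-DSA-44 signatures + 2 public keys (~15kB)
MEDIAN_CHAIN = 3.2e3  # median compressed certificate chain today
MEDIAN_BYTES = 8.1e3  # median server-to-client bytes, non-resumed QUIC

# Extra network capacity needed if every connection carried the overhead.
extra_tbps = CONNS_PER_S * PQ_OVERHEAD * 8 / 1e12
print(f"extra bandwidth: {extra_tbps:.1f} Tbps")  # ~2.1 Tbps

# Share of the median non-resumed connection spent on certificates.
chain_share = MEDIAN_CHAIN / MEDIAN_BYTES
print(f"chain share of median connection: {chain_share:.0%}")  # ~40%
```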
    <div>
      <h2>Way forward for post-quantum authentication</h2>
      <a href="#way-forward-for-post-quantum-authentication">
        
      </a>
    </div>
    <p>The path for migrating the Internet to post-quantum authentication is much less clear than with key agreement. Unless we can get performance much closer to today’s authentication, we expect the vast majority to keep post-quantum authentication disabled. Postponing enabling post-quantum authentication until Q-day draws near carries a real risk that we will not see the issues before it’s too late to fix. That’s why it’s essential to make post-quantum authentication performant enough to be turned on by default.</p><p>We’re exploring various ideas to reduce the number of signatures, in increasing order of ambition: leaving out intermediates; KEMTLS; and Merkle Tree Certificates. We covered these in <a href="https://blog.cloudflare.com/pq-2024/#reducing-number-of-signatures"><u>detail last year</u></a>. Most progress has been made on the last one: <a href="https://datatracker.ietf.org/doc/draft-davidben-tls-merkle-tree-certs/"><u>Merkle Tree Certificates</u></a> (MTC). In this proposal, in the common case, all signatures except the handshake signature are replaced by a short &lt;800 byte Merkle tree proof. This could well allow for post-quantum authentication that’s actually faster than using traditional certificates today! Together with Chrome, we’re going to try it out by the end of the year: read about it in <a href="https://blog.cloudflare.com/bootstrap-mtc/"><u>this blog</u></a> post.</p>
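To give a feel for why Merkle tree proofs are so compact, here is a toy inclusion proof in Python. This is an illustrative sketch only, not the encoding or tree structure from the MTC draft: the domain-separation bytes and the duplication of an odd trailing node are our own simplifying assumptions. The key property it demonstrates is that proof size grows with the logarithm of the number of certificates.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """All levels of the tree, from leaf hashes up to the root."""
    level = [h(b"\x00" + leaf) for leaf in leaves]  # domain-separated leaves
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]             # duplicate odd tail node
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append(level[index ^ 1])              # sibling at this level
        index //= 2
    return proof

def verify(root, leaf, index, proof):
    node = h(b"\x00" + leaf)
    for sibling in proof:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == root

certs = [b"cert-%d" % i for i in range(1000)]       # a toy batch of certs
levels = build_levels(certs)
root = levels[-1][0]
proof = prove(levels, 123)
assert verify(root, b"cert-123", 123, proof)
print(len(proof) * 32, "proof bytes")               # 10 levels * 32 = 320
```

Even for a batch of a billion entries, the proof would only be about 30 hashes, which is why an MTC proof can stay under a kilobyte where a full post-quantum certificate chain costs 15kB.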
    <div>
      <h3>Not just TLS, authentication, and key agreement</h3>
      <a href="#not-just-tls-authentication-and-key-agreement">
        
      </a>
    </div>
    <p>Despite the length of this blog post, we have only really touched upon migrating TLS. And even TLS we did not cover completely, as we have not discussed <a href="https://blog.cloudflare.com/announcing-encrypted-client-hello"><u>Encrypted ClientHello</u></a> (we didn’t forget about it). Although important, TLS is not the only protocol key to the security of the Internet. We want to briefly mention a few other challenges, but cannot go into detail. One particular challenge is DNSSEC, which is responsible for securing the resolution of domain names.</p><p>Although key agreement and signatures are the most widely used cryptographic primitives, over the last few years we have seen the adoption of more <a href="https://github.com/fancy-cryptography/fancy-cryptography"><u>esoteric cryptography</u></a> to serve more advanced use cases, such as unlinkable tokens with <a href="https://blog.cloudflare.com/privacy-pass-standard"><u>Privacy Pass</u></a> / <a href="https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard"><u>PAT</u></a>, anonymous credentials, and <a href="https://blog.cloudflare.com/inside-geo-key-manager-v2"><u>attribute-based encryption</u></a>, to name a few. For most of these advanced cryptographic schemes, there is no known practical post-quantum alternative yet, although to our delight there have been great advances in post-quantum anonymous credentials.</p>
    <div>
      <h2>What you can do today to stay safe against quantum attacks</h2>
      <a href="#what-you-can-do-today-to-stay-safe-against-quantum-attacks">
        
      </a>
    </div>
    <p>To summarize, there are two main post-quantum migrations to keep an eye on: key agreement, and certificates.</p><p>We recommend moving to <b>post-quantum key agreement</b> to counter store-now/decrypt-later attacks, which only requires a software update on both sides. That means that with the quick adoption (we’re <a href="https://developers.cloudflare.com/ssl/post-quantum-cryptography/pqc-support/"><u>keeping a list</u></a>) of X25519MLKEM768 across software and services, you might well already be secure against store-now/decrypt-later! On Cloudflare Radar you can <a href="https://radar.cloudflare.com/adoption-and-usage#browser-support"><u>check</u></a> whether your browser supports X25519MLKEM768; if you use Firefox, there is <a href="https://addons.mozilla.org/en-US/firefox/addon/pqspy/"><u>an extension</u></a> to check support of websites as you visit them; you can scan whether your website supports it <a href="https://pqscan.io/"><u>here</u></a>; and you can use Wireshark to check for it <a href="https://www.netmeister.org/blog/tls-hybrid-kex.html"><u>on the wire</u></a>.</p><p>Those are just spot checks. For a proper migration, you’ll need to figure out where cryptography is used. That’s a tall order, as most organizations have a hard time tracking all the software, services, and external vendors they use in the first place. There will be systems that are difficult to upgrade or have external dependencies, but in many cases it’s simple. In fact, in many cases you’ll spend most of the time finding out that a system is already done.</p><p>As figuring out <i>what to do</i> is the bulk of the work, it’s perhaps tempting to split that out as a first milestone: create a detailed inventory first, the so-called <a href="https://github.com/IBM/CBOM"><u>cryptographic bill of materials</u></a> (CBOM). Don’t let an inventory become a goal on its own: we need to keep our eyes on the ball. 
Most cases are easy: once you’ve figured out how to migrate one case, don’t wait and context switch, but just do it. That doesn’t mean it’ll be fast: this is a marathon, not a sprint, but you’ll be surprised how much ground can be covered by getting started.</p><p><b>Certificates.</b> At the time of writing, in October 2025, the final standards for post-quantum certificates are not set yet. Hopefully that won’t take too long to resolve. But there is much that you can do now to prepare for post-quantum certificates that you won’t regret. Keep software up-to-date. Automate certificate issuance. Ensure you can install multiple certificates.</p><p>In case you’re worried about protocol ossification, there is no reason to wait: the final post-quantum standards will not be very different from the drafts. You can test with preliminary implementations (or large dummy certificates) today.</p><p>The post-quantum migration is quite unique. Typically, when cryptography breaks, it breaks either suddenly, or so gradually that it’s easy to ignore for a while. In both cases, the migration ends up rushed. With the quantum threat, we know for sure that we’ll need to replace a lot of cryptography, but we also have time. Instead of just a chore, we invite you to see this as an opportunity: we have to do maintenance now on many systems that rarely get touched. Instead of just hotfixes, now is the opportunity to rethink past choices.</p><p>At least, if you start now. Good luck with your migration, and if you hit any issues, do reach out: ask-research@cloudflare.com</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <guid isPermaLink="false">7nIcJ4ZbXuMXHQ9tPi2P4f</guid>
            <dc:creator>Bas Westerbaan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keeping the Internet fast and secure: introducing Merkle Tree Certificates]]></title>
            <link>https://blog.cloudflare.com/bootstrap-mtc/</link>
            <pubDate>Tue, 28 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is launching an experiment with Chrome to evaluate fast, scalable, and quantum-ready Merkle Tree Certificates, all without degrading performance or changing WebPKI trust relationships. ]]></description>
            <content:encoded><![CDATA[ <p>The world is in a race to build its first quantum computer capable of solving practical problems not feasible on even the largest conventional supercomputers. While the quantum computing paradigm promises many benefits, it also threatens the security of the Internet by breaking much of the cryptography we have come to rely on.</p><p>To mitigate this threat, Cloudflare is helping to migrate the Internet to Post-Quantum (PQ) cryptography. Today, <a href="https://radar.cloudflare.com/adoption-and-usage#post-quantum-encryption"><u>about 50%</u></a> of traffic to Cloudflare's edge network is protected against the most urgent threat: an attacker who can intercept and store encrypted traffic today and then decrypt it in the future with the help of a quantum computer. This is referred to as the <a href="https://en.wikipedia.org/wiki/Harvest_now,_decrypt_later"><u>harvest now, decrypt later</u></a><i> </i>threat.</p><p>However, this is just one of the threats we need to address. A quantum computer can also be used to crack a server's <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificate</a>, allowing an attacker to impersonate the server to unsuspecting clients. The good news is that we already have PQ algorithms we can use for quantum-safe authentication. The bad news is that adoption of these algorithms in TLS will require significant changes to one of the most complex and security-critical systems on the Internet: the Web Public-Key Infrastructure (WebPKI).</p><p>The central problem is the sheer size of these new algorithms: signatures for ML-DSA-44, one of the most performant PQ algorithms standardized by NIST, are 2,420 bytes long, compared to just 64 bytes for ECDSA-P256, the most popular non-PQ signature in use today; and its public keys are 1,312 bytes long, compared to just 64 bytes for ECDSA. That's a roughly 20-fold increase in size. 
Worse yet, the average TLS handshake includes a number of public keys and signatures, adding up to tens of kilobytes of overhead per handshake. This is enough to have a <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/#how-many-added-bytes-are-too-many-for-tls"><u>noticeable impact</u></a> on the performance of TLS.</p><p>That makes drop-in PQ certificates a tough sell today: they don’t bring any security benefit before Q-day — the day a cryptographically relevant quantum computer arrives — but they do degrade performance. We could sit and wait until Q-day is a year away, but that’s playing with fire. Migrations always take longer than expected, and by waiting we risk the security and privacy of the Internet, which is <a href="https://developers.cloudflare.com/ssl/edge-certificates/universal-ssl/"><u>dear to us</u></a>.</p><p>It's clear that we must find a way to make post-quantum certificates cheap enough to deploy today by default for everyone — not just those who can afford it. In this post, we'll introduce the plan that we, together with industry partners, have brought to the <a href="https://datatracker.ietf.org/group/plants/about/"><u>IETF</u></a> to redesign the WebPKI in order to allow a smooth transition to PQ authentication with no performance impact (and perhaps a performance improvement!). We'll provide an overview of one concrete proposal, called <a href="https://datatracker.ietf.org/doc/draft-davidben-tls-merkle-tree-certs/"><u>Merkle Tree Certificates (MTCs)</u></a>, whose goal is to whittle down the number of public keys and signatures in the TLS handshake to the bare minimum required.</p><p>But talk is cheap. 
We <a href="https://blog.cloudflare.com/experiment-with-pq/"><u>know</u></a> <a href="https://blog.cloudflare.com/announcing-encrypted-client-hello/"><u>from</u></a> <a href="https://blog.cloudflare.com/why-tls-1-3-isnt-in-browsers-yet/"><u>experience</u></a> that, as with any change to the Internet, it's crucial to test early and often. <b>Today we're announcing our intent to deploy MTCs on an experimental basis in collaboration with Chrome Security.</b> In this post, we'll describe the scope of this experiment, what we hope to learn from it, and how we'll make sure it's done safely.</p>
    <div>
      <h2>The WebPKI today — an old system with many patches</h2>
      <a href="#the-webpki-today-an-old-system-with-many-patches">
        
      </a>
    </div>
    <p>Why does the TLS handshake have so many public keys and signatures?</p><p>Let's start with Cryptography 101. When your browser connects to a website, it asks the server to <b>authenticate</b> itself to make sure it's talking to the real server and not an impersonator. This is usually achieved with a cryptographic primitive known as a digital signature scheme (e.g., ECDSA or ML-DSA). In TLS, the server signs the messages exchanged between the client and server using its <b>secret key</b>, and the client verifies the signature using the server's <b>public key</b>. In this way, the server confirms to the client that they've had the same conversation, since only the server could have produced a valid signature.</p><p>If the client already knows the server's public key, then only <b>1 signature</b> is required to authenticate the server. In practice, however, this is not really an option. The web today is made up of around a billion TLS servers, so it would be unrealistic to provision every client with the public key of every server. What's more, the set of public keys will change over time as new servers come online and existing ones rotate their keys, so we would need some way of pushing these changes to clients.</p><p>This scaling problem is at the heart of the design of all PKIs.</p>
    <div>
      <h3>Trust is transitive</h3>
      <a href="#trust-is-transitive">
        
      </a>
    </div>
    <p>Instead of expecting the client to know the server's public key in advance, the server might just send its public key during the TLS handshake. But how does the client know that the public key actually belongs to the server? This is the job of a <b>certificate</b>.</p><p>A certificate binds a public key to the identity of the server — usually its DNS name, e.g., <code>cloudflareresearch.com</code>. The certificate is signed by a Certification Authority (CA) whose public key is known to the client. In addition to verifying the server's handshake signature, the client verifies the signature of this certificate. This establishes a chain of trust: by accepting the certificate, the client is trusting that the CA verified that the public key actually belongs to the server with that identity.</p><p>Clients are typically configured to trust many CAs and must be provisioned with a public key for each. Things are much easier, however, since there are only hundreds of CAs instead of billions. In addition, new certificates can be created without having to update clients.</p><p>These efficiencies come at a relatively low cost: for those counting at home, that's <b>+1</b> signature and <b>+1</b> public key, for a total of <b>2 signatures and 1 public key</b> per TLS handshake.</p><p>That's not the end of the story, however. As the WebPKI has evolved, these chains of trust have grown a bit longer. These days it's common for a chain to consist of two or more certificates rather than just one. This is because CAs sometimes need to rotate their keys, just as servers do. But before they can start using the new key, they must distribute the corresponding public key to clients. This takes time, since it requires billions of clients to update their trust stores. 
To bridge the gap, the CA will sometimes use the old key to issue a certificate for the new one and append this certificate to the end of the chain.</p><p>That's <b>+1</b> signature and <b>+1</b> public key, which brings us to <b>3 signatures and 2 public keys</b>. And we still have a little ways to go.</p>
    <div>
      <h3>Trust but verify</h3>
      <a href="#trust-but-verify">
        
      </a>
    </div>
    <p>The main job of a CA is to verify that a server has control over the domain for which it’s requesting a certificate. This process has evolved over the years from a high-touch, CA-specific process to a standardized, <a href="https://datatracker.ietf.org/doc/html/rfc8555/"><u>mostly automated process</u></a> used for issuing most certificates on the web. (Not all CAs fully support automation, however.) This evolution is marked by a number of security incidents in which a certificate was <b>mis-issued </b>to a party other than the server, allowing that party to impersonate the server to any client that trusts the CA.</p><p>Automation helps, but <a href="https://en.wikipedia.org/wiki/DigiNotar#Issuance_of_fraudulent_certificates"><u>attacks</u></a> are still possible, and mistakes are almost inevitable. <a href="https://blog.cloudflare.com/unauthorized-issuance-of-certificates-for-1-1-1-1/"><u>Earlier this year</u></a>, several certificates for Cloudflare's encrypted 1.1.1.1 resolver were issued without our involvement or authorization. This apparently occurred by accident, but it nonetheless put users of 1.1.1.1 at risk. (The mis-issued certificates have since been revoked.)</p><p>Ensuring mis-issuance is detectable is the job of the Certificate Transparency (CT) ecosystem. The basic idea is that each certificate issued by a CA gets added to a public <b>log</b>. Servers can audit these logs for certificates issued in their name. If a certificate is ever issued that they didn't request, the server operator can prove the issuance happened, and the PKI ecosystem can take action to prevent the certificate from being trusted by clients.</p><p>Major browsers, including Firefox, and Chrome and its derivatives, require certificates to be logged before they can be trusted. For example, Chrome, Safari, and Firefox will only accept the server's certificate if it appears in at least two logs the browser is configured to trust. 
This policy is easy to state, but tricky to implement in practice:</p><ol><li><p>Operating a CT log has historically been fairly expensive. Logs ingest billions of certificates over their lifetimes: when an incident happens, or even just under high load, it can take some time for a log to make a new entry available for auditors.</p></li><li><p>Clients can't really audit logs themselves, since this would expose their browsing history (i.e., the servers they wanted to connect to) to the log operators.</p></li></ol><p>The solution to both problems is to include a signature from the CT log along with the certificate. The signature is produced immediately in response to a request to log a certificate, and attests to the log's intent to include the certificate in the log within 24 hours.</p><p>Per browser policy, certificate transparency adds <b>+2</b> signatures to the TLS handshake, one for each log. This brings us to a total of <b>5 signatures and 2 public keys</b> in a typical handshake on the public web.</p>
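<p>To put this running tally in perspective, here is a quick back-of-the-envelope calculation, a sketch using the sizes quoted earlier (64-byte ECDSA-P256 signatures and public keys, versus 2,420-byte signatures and 1,312-byte public keys for ML-DSA-44):</p>

```python
# Back-of-the-envelope handshake overhead using the sizes quoted in the post.
SIZES = {
    "ECDSA-P256": {"sig": 64, "pub": 64},      # classical
    "ML-DSA-44":  {"sig": 2420, "pub": 1312},  # post-quantum (NIST)
}

def handshake_overhead(alg: str, sigs: int = 5, pubs: int = 2) -> int:
    """Bytes of authentication material in a typical WebPKI handshake."""
    s = SIZES[alg]
    return sigs * s["sig"] + pubs * s["pub"]

for alg in SIZES:
    print(f"{alg}: {handshake_overhead(alg)} bytes")
# ECDSA-P256: 448 bytes; ML-DSA-44: 14724 bytes
```

<p>A naive swap of the 5 signatures and 2 public keys takes us from under half a kilobyte to well over 14 KB of authentication material per handshake.</p>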
    <div>
      <h3>The future WebPKI</h3>
      <a href="#the-future-webpki">
        
      </a>
    </div>
    <p>The WebPKI is a living, breathing, and highly distributed system. We've had to patch it a number of times over the years to keep it going, but on balance it has served our needs quite well — until now.</p><p>Previously, whenever we needed to update something in the WebPKI, we would tack on another signature. This strategy has worked because conventional cryptography is so cheap. But <b>5 signatures and 2 public keys </b>on average for each TLS handshake is simply too much to cope with for the larger PQ signatures that are coming.</p><p>The good news is that by moving what we already have around in clever ways, we can drastically reduce the number of signatures we need.</p>
    <div>
      <h3>Crash course on Merkle Tree Certificates</h3>
      <a href="#crash-course-on-merkle-tree-certificates">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/draft-davidben-tls-merkle-tree-certs/"><u>Merkle Tree Certificates (MTCs)</u></a> is a proposal for the next generation of the WebPKI that we are implementing and plan to deploy on an experimental basis. Its key features are as follows:</p><ol><li><p>All the information a client needs to validate a Merkle Tree Certificate can be disseminated out-of-band. If the client is sufficiently up-to-date, then the TLS handshake needs just <b>1 signature, 1 public key, and 1 Merkle tree inclusion proof</b>. This is quite small, even if we use post-quantum algorithms.</p></li><li><p>The MTC specification makes certificate transparency a first class feature of the PKI by having each CA run its own log of exactly the certificates they issue.</p></li></ol><p>Let's poke our head under the hood a little. Below we have an MTC generated by one of our internal tests. This would be transmitted from the server to the client in the TLS handshake:</p>
            <pre><code>-----BEGIN CERTIFICATE-----
MIICSzCCAUGgAwIBAgICAhMwDAYKKwYBBAGC2ksvADAcMRowGAYKKwYBBAGC2ksv
AQwKNDQzNjMuNDguMzAeFw0yNTEwMjExNTMzMjZaFw0yNTEwMjgxNTMzMjZaMCEx
HzAdBgNVBAMTFmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wWTATBgcqhkjOPQIBBggq
hkjOPQMBBwNCAARw7eGWh7Qi7/vcqc2cXO8enqsbbdcRdHt2yDyhX5Q3RZnYgONc
JE8oRrW/hGDY/OuCWsROM5DHszZRDJJtv4gno2wwajAOBgNVHQ8BAf8EBAMCB4Aw
EwYDVR0lBAwwCgYIKwYBBQUHAwEwQwYDVR0RBDwwOoIWY2xvdWRmbGFyZXJlc2Vh
cmNoLmNvbYIgc3RhdGljLWN0LmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wDAYKKwYB
BAGC2ksvAAOB9QAAAAAAAAACAAAAAAAAAAJYAOBEvgOlvWq38p45d0wWTPgG5eFV
wJMhxnmDPN1b5leJwHWzTOx1igtToMocBwwakt3HfKIjXYMO5CNDOK9DIKhmRDSV
h+or8A8WUrvqZ2ceiTZPkNQFVYlG8be2aITTVzGuK8N5MYaFnSTtzyWkXP2P9nYU
Vd1nLt/WjCUNUkjI4/75fOalMFKltcc6iaXB9ktble9wuJH8YQ9tFt456aBZSSs0
cXwqFtrHr973AZQQxGLR9QCHveii9N87NXknDvzMQ+dgWt/fBujTfuuzv3slQw80
mibA021dDCi8h1hYFQAA
-----END CERTIFICATE-----</code></pre>
            <p>Looks like your average PEM-encoded certificate. Let's decode it and look at the parameters:</p>
            <pre><code>$ openssl x509 -in merkle-tree-cert.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 531 (0x213)
        Signature Algorithm: 1.3.6.1.4.1.44363.47.0
        Issuer: 1.3.6.1.4.1.44363.47.1=44363.48.3
        Validity
            Not Before: Oct 21 15:33:26 2025 GMT
            Not After : Oct 28 15:33:26 2025 GMT
        Subject: CN=cloudflareresearch.com
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:70:ed:e1:96:87:b4:22:ef:fb:dc:a9:cd:9c:5c:
                    ef:1e:9e:ab:1b:6d:d7:11:74:7b:76:c8:3c:a1:5f:
                    94:37:45:99:d8:80:e3:5c:24:4f:28:46:b5:bf:84:
                    60:d8:fc:eb:82:5a:c4:4e:33:90:c7:b3:36:51:0c:
                    92:6d:bf:88:27
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Subject Alternative Name:
                DNS:cloudflareresearch.com, DNS:static-ct.cloudflareresearch.com
    Signature Algorithm: 1.3.6.1.4.1.44363.47.0
    Signature Value:
        00:00:00:00:00:00:02:00:00:00:00:00:00:00:02:58:00:e0:
        44:be:03:a5:bd:6a:b7:f2:9e:39:77:4c:16:4c:f8:06:e5:e1:
        55:c0:93:21:c6:79:83:3c:dd:5b:e6:57:89:c0:75:b3:4c:ec:
        75:8a:0b:53:a0:ca:1c:07:0c:1a:92:dd:c7:7c:a2:23:5d:83:
        0e:e4:23:43:38:af:43:20:a8:66:44:34:95:87:ea:2b:f0:0f:
        16:52:bb:ea:67:67:1e:89:36:4f:90:d4:05:55:89:46:f1:b7:
        b6:68:84:d3:57:31:ae:2b:c3:79:31:86:85:9d:24:ed:cf:25:
        a4:5c:fd:8f:f6:76:14:55:dd:67:2e:df:d6:8c:25:0d:52:48:
        c8:e3:fe:f9:7c:e6:a5:30:52:a5:b5:c7:3a:89:a5:c1:f6:4b:
        5b:95:ef:70:b8:91:fc:61:0f:6d:16:de:39:e9:a0:59:49:2b:
        34:71:7c:2a:16:da:c7:af:de:f7:01:94:10:c4:62:d1:f5:00:
        87:bd:e8:a2:f4:df:3b:35:79:27:0e:fc:cc:43:e7:60:5a:df:
        df:06:e8:d3:7e:eb:b3:bf:7b:25:43:0f:34:9a:26:c0:d3:6d:
        5d:0c:28:bc:87:58:58:15:00:00</code></pre>
            <p>While some of the parameters probably look familiar, others will look unusual. On the familiar side, the subject and public key are exactly what we might expect: the DNS name is <code>cloudflareresearch.com</code> and the public key is for a familiar signature algorithm, ECDSA-P256. This algorithm is not PQ, of course — in the future we would put ML-DSA-44 there instead.</p><p>On the unusual side, OpenSSL doesn't recognize the signature algorithm of the issuer and just prints the raw OID and bytes of the signature. There's a good reason for this: the MTC does not have a signature in it at all! So what exactly are we looking at?</p><p>The trick that allows us to leave out signatures is that a Merkle Tree Certification Authority (MTCA) produces its <i>signatureless</i> certificates <i>in batches</i> rather than individually. In place of a signature, the certificate has an <b>inclusion proof</b> of the certificate in a batch of certificates signed by the MTCA.</p><p>To understand how inclusion proofs work, let's think about a slightly simplified version of the MTC specification. To issue a batch, the MTCA arranges the unsigned certificates into a data structure called a <b>Merkle tree</b> that looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LGhISsS07kbpSgDkqx8p2/68e3b36deeca7f97139654d2c769df68/image3.png" />
          </figure><p>Each leaf of the tree corresponds to a certificate, and each inner node is equal to the hash of its children. To sign the batch, the MTCA uses its secret key to sign the head of the tree. The structure of the tree guarantees that each certificate in the batch was signed by the MTCA: if we tried to tweak the bits of any one of the certificates, the treehead would end up having a different value, which would cause the signature to fail.</p><p>An inclusion proof for a certificate consists of the hash of each sibling node along the path from the certificate to the treehead:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UZZHkRwsBLWXRYeop4rXv/8598cde48c27c112bc4992889f3d5799/image1.gif" />
          </figure><p>Given a validated treehead, this sequence of hashes is sufficient to prove inclusion of the certificate in the tree. This means that, in order to validate an MTC, the client also needs to obtain the signed treehead from the MTCA.</p><p>This is the key to MTC's efficiency:</p><ol><li><p>Signed treeheads can be disseminated to clients out-of-band and validated offline. Each validated treehead can then be used to validate any certificate in the corresponding batch, eliminating the need to obtain a signature for each server certificate.</p></li><li><p>During the TLS handshake, the client tells the server which treeheads it has. If the server has a signatureless certificate covered by one of those treeheads, then it can use that certificate to authenticate itself. That's <b>1 signature, 1 public key, and 1 inclusion proof</b> per handshake, all for the server being authenticated.</p></li></ol><p>Now, that's the simplified version. MTC proper has some more bells and whistles. To start, it doesn’t create a separate Merkle tree for each batch, but grows a single large tree, which also improves transparency. As this tree grows, (sub)tree heads, which we call <b>landmarks</b>, are periodically selected to be shipped to browsers. In the common case, browsers will be able to fetch the most recent landmarks, and servers can wait for batch issuance, but we need a fallback: MTC also supports certificates that can be issued immediately and don’t require landmarks to be validated, but these are not as small. A server would provision both types of Merkle tree certificates, so that the common case is fast, and the exceptional case is slow, but at least it’ll work.</p>
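<p>The tree-hashing and proof-checking described above are easy to sketch. The following toy example is illustrative only (it does not follow the MTC draft's exact encodings): it builds a Merkle tree over a batch of certificates, produces an inclusion proof, and verifies it against the tree head:</p>

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return all levels of the Merkle tree, from leaf hashes up to the head."""
    level = [h(b"\x00" + leaf) for leaf in leaves]  # domain-separate leaves
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def inclusion_proof(levels, index):
    """Sibling hashes along the path from leaf `index` to the tree head."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append(level[index ^ 1])     # the sibling at this level
        index //= 2
    return proof

def verify(leaf, index, proof, head) -> bool:
    """Recompute the path to the head; any tampering changes the result."""
    node = h(b"\x00" + leaf)
    for sibling in proof:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == head

batch = [b"cert-%d" % i for i in range(5)]   # a batch of unsigned certificates
levels = build_tree(batch)
head = levels[-1][0]                          # the MTCA signs only this value
proof = inclusion_proof(levels, 3)
assert verify(batch[3], 3, proof, head)       # valid proof accepted
assert not verify(b"forged", 3, proof, head)  # tampering detected
```

<p>Note that the proof size grows only logarithmically with the batch size, which is why a single signed treehead can cheaply cover a very large batch of certificates.</p>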
    <div>
      <h2>Experimental deployment</h2>
      <a href="#experimental-deployment">
        
      </a>
    </div>
    <p>Ever since early designs for MTCs emerged, we’ve been eager to experiment with the idea. In line with the IETF principle of “<a href="https://www.ietf.org/runningcode/"><u>running code</u></a>”, it often takes implementing a protocol to work out kinks in the design. At the same time, we cannot risk the security of users. In this section, we describe our approach to experimenting with aspects of the Merkle Tree Certificates design <i>without</i> changing any trust relationships.</p><p>Let’s start with what we hope to learn. We have lots of questions whose answers can help to either validate the approach, or uncover pitfalls that require reshaping the protocol — in fact, an implementation of an early MTC draft by <a href="https://www.cs.ru.nl/masters-theses/2025/M_Pohl___Implementation_and_Analysis_of_Merkle_Tree_Certificates_for_Post-Quantum_Secure_Authentication_in_TLS.pdf"><u>Maximilian Pohl</u></a> and <a href="https://www.ietf.org/archive/id/draft-davidben-tls-merkle-tree-certs-07.html#name-acknowledgements"><u>Mia Celeste</u></a> did exactly this. We’d like to know:</p><p><b>What breaks?</b> Protocol ossification (the tendency of implementation bugs to make it harder to change a protocol) is an ever-present issue with deploying protocol changes. For TLS in particular, despite having built-in flexibility, time after time we’ve found that if that flexibility is not regularly used, there will be buggy implementations and middleboxes that break when they see things they don’t recognize. TLS 1.3 deployment <a href="https://blog.cloudflare.com/why-tls-1-3-isnt-in-browsers-yet/"><u>took years longer</u></a> than we hoped for this very reason. 
And more recently, the rollout of PQ key exchange in TLS caused the Client Hello to be split over multiple TCP packets, something that many middleboxes <a href="https://tldr.fail/"><u>weren't ready for</u></a>.</p><p><b>What is the performance impact?</b> In fact, we expect MTCs to <i>reduce </i>the size of the handshake, even compared to today's non-PQ certificates. They will also reduce CPU cost: ML-DSA signature verification is about as fast as ECDSA, and there will be far fewer signatures to verify. We therefore expect to see a <i>reduction in latency</i>. We would like to see if there is a measurable performance improvement.</p><p><b>What fraction of clients will stay up to date? </b>Getting the performance benefit of MTCs requires the clients and servers to be roughly in sync with one another. We expect MTCs to have fairly short lifetimes, a week or so. This means that if the client's latest landmark is older than a week, the server would have to fall back to a larger certificate. Knowing how often this fallback happens will help us tune the parameters of the protocol to make fallbacks less likely.</p><p>In order to answer these questions, we are implementing MTC support in our TLS stack and in our certificate issuance infrastructure. For their part, Chrome is implementing MTC support in their own TLS stack and will stand up infrastructure to disseminate landmarks to their users.</p><p>As we've done in past experiments, we plan to enable MTCs for a subset of our free customers with enough traffic that we will be able to get useful measurements. Chrome will control the experimental rollout: they can ramp up slowly, measuring as they go and rolling back if and when bugs are found.</p><p>Which leaves us with one last question: who will run the Merkle Tree CA?</p>
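<p>The fallback behavior we want to measure can be sketched as follows. This is purely illustrative (the names, structure, and selection logic are ours, not the draft's): the client advertises the landmarks it knows, and the server serves the small signatureless certificate only when it can, falling back to a full-size certificate otherwise:</p>

```python
from dataclasses import dataclass

@dataclass
class Cert:
    kind: str       # "mtc" (signatureless) or "fallback" (full certificate)
    landmark: int   # landmark covering the MTC; -1 for the fallback cert
    size: int       # bytes on the wire

def select_certificate(client_landmarks: set, certs: list) -> Cert:
    """Prefer the smallest MTC the client can validate; otherwise fall back."""
    usable = [c for c in certs
              if c.kind == "mtc" and c.landmark in client_landmarks]
    if usable:
        return min(usable, key=lambda c: c.size)
    return next(c for c in certs if c.kind == "fallback")

certs = [
    Cert("mtc", landmark=41, size=3_000),        # last week's batch
    Cert("mtc", landmark=42, size=3_000),        # current batch
    Cert("fallback", landmark=-1, size=15_000),  # larger, but always validates
]
print(select_certificate({42}, certs).kind)   # up-to-date client: "mtc"
print(select_certificate(set(), certs).kind)  # stale client: "fallback"
```

<p>Measuring how often the second branch is taken in the wild is exactly the kind of data the experiment is designed to collect.</p>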
    <div>
      <h3>Bootstrapping trust from the existing WebPKI</h3>
      <a href="#bootstrapping-trust-from-the-existing-webpki">
        
      </a>
    </div>
    <p>Standing up a proper CA is no small task: it takes years to be trusted by major browsers. That’s why Cloudflare isn’t going to become a “real” CA for this experiment, and Chrome isn’t going to trust us directly.</p><p>Instead, to make progress within a reasonable timeframe, without sacrificing due diligence, we plan to "mock" the role of the MTCA. We will run an MTCA (on <a href="https://github.com/cloudflare/azul/"><u>Workers</u></a> based on our <a href="https://blog.cloudflare.com/azul-certificate-transparency-log/"><u>StaticCT logs</u></a>), but for each MTC we issue, we also publish an existing certificate from a trusted CA that agrees with it. We call this the <b>bootstrap certificate</b>. When Chrome’s infrastructure pulls updates from our MTCA log, they will also pull these bootstrap certificates, and check whether they agree. Only if they do will they push the corresponding landmarks to Chrome clients. In other words, Cloudflare is effectively just “re-encoding” an existing certificate (with domain validation performed by a trusted CA) as an MTC, and Chrome is using certificate transparency to keep us honest.</p>
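<p>What might that agreement check look like? Here is a hypothetical sketch (the properties checked, the field names, and the structure are our own illustration; the experiment defines the real rules): the bootstrap certificate should cover the same identities and the same key, for at least as long as the MTC claims:</p>

```python
from datetime import datetime, timezone

def agrees(mtc: dict, bootstrap: dict) -> bool:
    """Hypothetical check: does the trusted CA's bootstrap certificate
    vouch for everything the MTC asserts? (Illustrative properties only.)"""
    return (
        set(mtc["dns_names"]) <= set(bootstrap["dns_names"])  # same identities
        and mtc["public_key"] == bootstrap["public_key"]      # same key
        and mtc["not_after"] <= bootstrap["not_after"]        # no extended lifetime
    )

mtc = {
    "dns_names": ["cloudflareresearch.com"],
    "public_key": "pk1",  # placeholder for the real key bytes
    "not_after": datetime(2025, 10, 28, tzinfo=timezone.utc),
}
bootstrap = {
    "dns_names": ["cloudflareresearch.com", "static-ct.cloudflareresearch.com"],
    "public_key": "pk1",
    "not_after": datetime(2025, 12, 1, tzinfo=timezone.utc),
}
print(agrees(mtc, bootstrap))  # True: the bootstrap certificate vouches for the MTC
```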
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With almost 50% of our traffic already protected by post-quantum encryption, we’re halfway to a fully post-quantum secure Internet. The second part of our journey, post-quantum certificates, is the harder one, though. A simple drop-in upgrade has a noticeable performance impact and no security benefit before Q-day. This means it’s a hard sell to enable today by default. But by waiting we are playing with fire: migrations always take longer than expected. If we want to keep a ubiquitously private and secure Internet, we need a post-quantum solution that’s performant enough to be enabled by default <b>today</b>.</p><p>Merkle Tree Certificates (MTCs) solve this problem by reducing the number of signatures and public keys to the bare minimum while maintaining the WebPKI's essential properties. We plan to roll out MTCs to a fraction of free accounts by early next year. This does not affect any visitors that are not part of the Chrome experiment. For those that are, thanks to the bootstrap certificates, there is no impact on security.</p><p>We’re excited to keep the Internet fast <i>and</i> secure, and will report back soon on the results of this experiment: watch this space! MTC is evolving as we speak; if you want to get involved, please join the IETF <a href="https://mailman3.ietf.org/mailman3/lists/plants@ietf.org/"><u>PLANTS mailing list</u></a>.</p>
            <category><![CDATA[Post-Quantum]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[IETF]]></category>
            <category><![CDATA[Transparency]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <guid isPermaLink="false">4jURWdZzyjdrcurJ4LlJ1z</guid>
            <dc:creator>Luke Valenta</dc:creator>
            <dc:creator>Christopher Patton</dc:creator>
            <dc:creator>Vânia Gonçalves</dc:creator>
            <dc:creator>Bas Westerbaan</dc:creator>
        </item>
        <item>
            <title><![CDATA[A framework for measuring Internet resilience]]></title>
            <link>https://blog.cloudflare.com/a-framework-for-measuring-internet-resilience/</link>
            <pubDate>Tue, 28 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We present a data-driven framework to quantify cross-layer Internet resilience. We also share a list of measurements with which to quantify facets of Internet resilience for geographical areas. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On July 8, 2022, a massive outage at Rogers, one of Canada's largest telecom providers, knocked out Internet and mobile services for over 12 million users. Why did this single event have such a catastrophic impact? And more importantly, why do some networks crumble in the face of disruption while others barely stumble?</p><p>The answer lies in a concept we call <b>Internet resilience</b>: a network's ability not just to stay online, but to withstand, adapt to, and rapidly recover from failures.</p><p>It’s a quality that goes far beyond simple "uptime." True resilience is a multi-layered capability, built on everything from the diversity of physical subsea cables to the security of BGP routing and the health of a competitive market. It's an emergent property much like <a href="https://en.wikipedia.org/wiki/Psychological_resilience"><u>psychological resilience</u></a>: while each individual network must be robust, true resilience only arises from the collective, interoperable actions of the entire ecosystem. In this post, we'll introduce a data-driven framework to move beyond abstract definitions and start quantifying what makes a network resilient. All of our work is based on public data sources, and we're sharing our metrics to help the entire community build a more reliable and secure Internet for everyone.</p>
    <div>
      <h2>What is Internet resilience?</h2>
      <a href="#what-is-internet-resilience">
        
      </a>
    </div>
    <p>In networking, we often talk about "reliability" (does it work under normal conditions?) and "robustness" (can it handle a sudden traffic surge?). But resilience is more dynamic. It's the ability to gracefully degrade, adapt, and most importantly, recover. For our work, we've adopted a pragmatic definition:</p><p><b><i>Internet resilience</i></b><i> is the measurable capability of a national or regional network ecosystem to maintain diverse and secure routing paths in the face of challenges, and to rapidly restore connectivity following a disruption.</i></p><p>This definition links the abstract goal of resilience to the concrete, measurable metrics that form the basis of our analysis.</p>
    <div>
      <h3>Local decisions have global impact</h3>
      <a href="#local-decisions-have-global-impact">
        
      </a>
    </div>
    <p>The Internet is a global system but is built out of thousands of local pieces. Every country depends on the global Internet for economic activity, communication, and critical services, yet most of the decisions that shape how traffic flows are made locally by individual networks.</p><p>In most national infrastructures like water or power grids, a central authority can plan, monitor, and coordinate how the system behaves. The Internet works very differently. Its core building blocks are Autonomous Systems (ASes), which are networks like ISPs, universities, cloud providers, or enterprises. Each AS autonomously controls how it connects to the rest of the Internet, which routes it accepts or rejects, how it prefers to forward traffic, and with whom it interconnects. That’s why they’re called Autonomous Systems in the first place! There’s no global controller. Instead, the Internet’s routing fabric emerges from the collective interaction of thousands of independent networks, each optimizing for its own goals.</p><p>This decentralized structure is one of the Internet’s greatest strengths: no single failure can bring the whole system down. But it also makes measuring resilience at a country level tricky. National statistics can hide local structures that are crucial to global connectivity. For example, a country might appear to have many international connections overall, but those connections could be concentrated in just a handful of networks. If one of those fails, the whole country could be affected.</p><p>For resilience, the goal isn’t to isolate national infrastructure from the global Internet. In fact, the opposite is true: healthy integration with diverse partners is what makes both local and global connectivity stronger. 
When local networks invest in secure, redundant, and diverse interconnections, they improve their own resilience and contribute to the stability of the Internet as a whole.</p><p>This perspective shapes how we design and interpret resilience metrics. Rather than treating countries as isolated units, we look at how well their networks are woven into the global fabric: the number and diversity of upstream providers, the extent of international peering, and the richness of local interconnections. These are the building blocks of a resilient Internet.</p>
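<p>One way to quantify the concentration concern described above is a standard market-concentration measure, such as the Herfindahl–Hirschman index, applied to upstream transit relationships. A sketch with hypothetical data (the post does not prescribe this particular metric):</p>

```python
from collections import Counter

# Hypothetical: each national AS mapped to its upstream transit providers.
upstreams = {
    "AS-A": ["Transit-1"],
    "AS-B": ["Transit-1"],
    "AS-C": ["Transit-1", "Transit-2"],
    "AS-D": ["Transit-2", "Transit-3"],
}

def concentration(upstreams: dict) -> float:
    """Herfindahl-Hirschman index over upstream shares: 1.0 means a single
    provider carries everything; values near 0 mean a diverse market."""
    counts = Counter(t for providers in upstreams.values() for t in providers)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

print(f"{concentration(upstreams):.3f}")  # prints 0.389: moderately concentrated
```

<p>A country could have many ASes and many links, yet a high index would reveal that most of them depend on the same one or two transit providers, exactly the hidden fragility that aggregate statistics can miss.</p>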
    <div>
      <h3>Route hygiene: Keeping the Internet healthy</h3>
      <a href="#route-hygiene-keeping-the-internet-healthy">
        
      </a>
    </div>
    <p>The Internet is constructed according to a <i>layered</i> model, by design, so that different Internet components and features can evolve independently of the others. The Physical layer stores, carries, and forwards all the bits and bytes transmitted in packets between devices. It consists of cables, routers and switches, but also buildings that house interconnection facilities. The Application layer sits above all others and has virtually no information about the network, so that applications can communicate without having to worry about the underlying details, for example, whether a network is Ethernet or Wi-Fi. The application layer includes web browsers, web servers, as well as caching, security, and other features provided by Content Distribution Networks (CDNs). Between the physical and application layers is the Network layer, responsible for Internet routing. It is ‘logical’, consisting of software that learns about interconnection and routes, and makes (local) forwarding decisions that deliver packets to their destinations.</p><p>Good route hygiene works like personal hygiene: it prevents problems before they spread. The Internet relies on the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>Border Gateway Protocol</u></a> (BGP) to exchange routes between networks, but BGP wasn’t built with security in mind. A single bad route announcement, whether by mistake or attack, can send traffic the wrong way or cause widespread outages.</p><p>Two practices help stop this: the <b>RPKI</b> (Resource Public Key Infrastructure) lets networks publish cryptographic proof that they’re allowed to announce specific IP prefixes. 
<b>ROV </b>(Route Origin Validation) checks those proofs before accepting routes.</p><p>Together, they act like passports and border checks for Internet routes, helping filter out hijacks and leaks early.</p><p>Hygiene doesn’t just happen in the routing table – it spans multiple layers of the Internet’s architecture, and weaknesses in one layer can ripple through the rest. At the physical layer, having multiple, geographically diverse cable routes ensures that a single cut or disaster doesn’t isolate an entire region. For example, distributing submarine landing stations along different coastlines can protect international connectivity when one corridor fails. At the network layer, practices like multi-homing and participation in Internet Exchange Points (IXPs) give operators more options to reroute traffic during incidents, reducing reliance on any single upstream provider. At the application layer, Content Delivery Networks (CDNs) and caching keep popular content close to users, so even if upstream routes are disrupted, many services remain accessible. Finally, policy and market structure also play a role: open peering policies and competitive markets foster diversity, while dependence on a single ISP or cable system creates fragility.</p><p>Resilience emerges when these layers work together. If one layer is weak, the whole system becomes more vulnerable to disruption.</p><p>The more networks adopt these practices, the stronger and more resilient the Internet becomes. We actively support the deployment of RPKI, ROV, and diverse routing to keep the global Internet healthy.</p>
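<p>Conceptually, the ROV check is simple: a route announcement is <i>valid</i> if a ROA covering its prefix authorizes its origin AS, <i>invalid</i> if covering ROAs exist but none match, and <i>not-found</i> if no ROA covers the prefix at all. A minimal sketch of these RFC 6811 outcomes, using hypothetical ROAs and Python's standard <code>ipaddress</code> module:</p>

```python
import ipaddress

# Hypothetical ROAs: (authorized prefix, max prefix length, authorized origin AS)
ROAS = [
    (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/24"), 24, 64501),
]

def rov(prefix: str, origin_as: int) -> str:
    """Route Origin Validation outcome: 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, asn in ROAS:
        if route.subnet_of(roa_prefix):  # a ROA covers this announcement
            covered = True
            if route.prefixlen <= max_len and origin_as == asn:
                return "valid"
    return "invalid" if covered else "not-found"

print(rov("203.0.113.0/24", 64500))  # valid: ROA authorizes this origin
print(rov("203.0.113.0/24", 64666))  # invalid: wrong origin, likely hijack or leak
print(rov("192.0.2.0/24", 64500))    # not-found: no ROA covers this prefix
```

<p>Real validators apply this logic to the full global routing table, and operators then decide policy, typically rejecting invalid routes while still accepting not-found ones during the transition.</p>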
    <div>
      <h2>Measuring resilience is harder than it sounds</h2>
      <a href="#measuring-resilience-is-harder-than-it-sounds">
        
      </a>
    </div>
    <p>The biggest hurdle in measuring resilience is data access. The most valuable information, like internal network topologies, the physical paths of fiber cables, or specific peering agreements, is held by private network operators. This is the ground truth of the network.</p><p>However, operators view this information as a highly sensitive competitive asset. Revealing detailed network maps could expose strategic vulnerabilities or undermine business negotiations. Without access to this ground truth data, we're forced to rely on inference, approximation, and the clever use of publicly available data sources. Our framework is built entirely on these public sources to ensure anyone can reproduce and build upon our findings.</p><p>Projects like RouteViews and RIPE RIS collect BGP routing data that shows how networks connect. <a href="https://www.cloudflare.com/en-in/learning/network-layer/what-is-mtr/"><u>Traceroute</u></a> measurements reveal paths at the router level. IXP and submarine cable maps give partial views of the physical layer. But each of these sources has blind spots: peering links often don’t appear in BGP data, backup paths may remain hidden, and physical routes are hard to map precisely. This lack of a single, complete dataset means that resilience measurement relies on combining many partial perspectives, a bit like reconstructing a city map from scattered satellite images, traffic reports, and public utility filings. It’s challenging, but it’s also what makes this field so interesting.</p>
    <div>
      <h3>Translating resilience into quantifiable metrics</h3>
      <a href="#translating-resilience-into-quantifiable-metrics">
        
      </a>
    </div>
    <p>Once we understand why resilience matters and what makes it hard to measure, the next step is to translate these ideas into concrete metrics. These metrics give us a way to evaluate how well different parts of the Internet can withstand disruptions and to identify where the weak points are. No single metric can capture Internet resilience on its own. Instead, we look at it from multiple angles: physical infrastructure, network topology, interconnection patterns, and routing behavior. Below are some of the key dimensions we use. Some of these metrics are inspired by existing research, like the <a href="https://pulse.internetsociety.org/en/resilience/"><u>ISOC Pulse</u></a> framework. All described methods rely on public data sources and are fully reproducible. As a result, in our visualizations we intentionally omit country and region names to maintain focus on the methodology and interpretation of the results. </p>
    <div>
      <h3>IXPs and colocation facilities</h3>
      <a href="#ixps-and-colocation-facilities">
        
      </a>
    </div>
    <p>Networks primarily interconnect in two types of physical facilities: colocation facilities (colos) and Internet Exchange Points (IXPs), the latter often housed within colos. Although symbiotically linked, they serve distinct functions in a nation’s digital ecosystem. A colocation facility provides the foundational infrastructure – secure space, power, and cooling – for network operators to place their equipment. The IXP builds upon this physical base to provide the logical interconnection fabric, a role that is transformative for a region’s Internet development and resilience. The networks that connect at an IXP are its members. </p><p>Metrics that reflect resilience include:</p><ul><li><p><b>Number and distribution of IXPs</b>, normalized by population or geography. A higher IXP count, weighted by population or geographic coverage, is associated with improved local connectivity.</p></li><li><p><b>Peering participation rates</b> — the percentage of local networks connected to domestic IXPs. This metric reflects the extent to which local networks rely on regional interconnection rather than routing traffic through distant upstream providers.</p></li><li><p><b>Diversity of IXP membership</b>, including ISPs, CDNs, and cloud providers, which indicates how much critical content is available locally, making it accessible to domestic users even if international connectivity is severely degraded.</p></li></ul><p>Resilience also depends on how well local networks connect globally:</p><ul><li><p>How many <b>local networks peer at international IXPs</b>, increasing their routing options</p></li><li><p>How many <b>international networks peer at local IXPs</b>, bringing content closer to users</p></li></ul><p>A balanced flow in both directions strengthens resilience by ensuring multiple independent paths in and out of a region.</p><p>The geographic distribution of IXPs further enhances resilience. 
A resilient IXP ecosystem should be geographically dispersed to serve different regions within a country effectively, reducing the risk that a localized infrastructure failure affects the connectivity of an entire country. Spatial distribution metrics help evaluate how infrastructure is spread across a country’s geography or its population. Key spatial metrics include:</p><ul><li><p><b>Infrastructure per Capita</b>: This metric – inspired by <a href="https://en.wikipedia.org/wiki/Telephone_density"><u>teledensity</u></a> – measures infrastructure relative to the population size of a sub-region, providing a per-person availability indicator. A low IXP-per-population ratio in a region suggests that users there rely on distant exchanges, increasing the bit-risk miles.</p></li><li><p><b>Infrastructure per Area (Density)</b>: This metric evaluates how infrastructure is distributed per unit of geographic area, highlighting spatial coverage. Such area-based metrics are crucial for critical infrastructures to ensure remote areas are not left inaccessible.</p></li></ul><p>These metrics can be summarized using the <a href="https://www.bls.gov/k12/students/economics-made-easy/location-quotients.pdf"><u>Location Quotient (LQ)</u></a>. The location quotient is a widely used geographic index that measures a region’s share of infrastructure relative to its share of a baseline (such as population or area).</p>
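<p>The LQ reduces to a ratio of two shares. A minimal Python sketch, using hypothetical facility and population counts (not measured data):</p>

```python
def location_quotient(region_facilities, total_facilities,
                      region_population, total_population):
    """LQ = (region's share of infrastructure) / (region's share of baseline).

    LQ > 1: the region hosts more infrastructure than its population
    share predicts; LQ < 1: less."""
    infra_share = region_facilities / total_facilities
    baseline_share = region_population / total_population
    return infra_share / baseline_share

# Hypothetical example: a region hosting 12 of 200 national facilities,
# but 10% of the national population.
lq = location_quotient(12, 200, 1_000_000, 10_000_000)
print(round(lq, 2))  # 0.06 / 0.10 = 0.6: under-served relative to population
```

<p>An LQ below 1 flags a region hosting less infrastructure than its population share predicts, which is how the per-state comparison in the figure is derived.</p>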
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4S52jlwCpQ8WVS6gRSdNqp/4722abb10331624a54b411708f1e576b/image5.png" />
          </figure><p>For example, the figure above shows, for US states, whether each hosts more or less infrastructure than is expected for its population, based on its LQ score. It illustrates that even the states with the highest number of facilities <i>still</i> host fewer than would be expected given their population size.</p>
    <div>
      <h4>Economic-weighted metrics</h4>
      <a href="#economic-weighted-metrics">
        
      </a>
    </div>
    <p>While spatial metrics capture the physical distribution of infrastructure, economic and usage-weighted metrics reveal how infrastructure is actually used. These account for traffic, capacity, or economic activity, exposing imbalances that spatial counts miss. <b>Infrastructure Utilization Concentration</b> measures how usage is distributed across facilities, using indices like the <b>Herfindahl–Hirschman Index (HHI)</b>. HHI sums the squared market shares of entities, ranging from 0 (competitive) to 10,000 (highly concentrated). For IXPs, market share is defined through operational metrics such as:</p><ul><li><p><b>Peak/Average Traffic Volume</b> (Gbps/Tbps): indicates operational significance</p></li><li><p><b>Number of Connected ASNs</b>: reflects network reach</p></li><li><p><b>Total Port Capacity</b>: shows physical scale</p></li></ul><p>The chosen metric affects results. For example, using connected ASNs yields an HHI of 1,316 (unconcentrated) for a Central European country, whereas using port capacity gives 1,809 (moderately concentrated).</p><p>The <b>Gini coefficient</b> measures inequality in resource or traffic distribution (0 = equal, 1 = fully concentrated). The <b>Lorenz curve</b> visualizes this: a straight 45° line indicates perfect equality, while deviations show concentration.</p>
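<p>Both indices are a few lines of arithmetic over per-facility shares. A minimal sketch with hypothetical port-capacity figures (not measured data):</p>

```python
def hhi(values):
    """Herfindahl–Hirschman Index: sum of squared percentage shares.
    0 = perfectly competitive, 10,000 = a single entity."""
    total = sum(values)
    return sum((100 * v / total) ** 2 for v in values)

def gini(values):
    """Gini coefficient: 0 = perfectly equal, 1 = fully concentrated."""
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

# Hypothetical IXP port capacities (Gbps) in one country:
capacity = [400, 300, 200, 100]
print(round(hhi(capacity)))      # 3000 — concentrated, with only four facilities
print(round(gini(capacity), 2))  # 0.25
```

<p>The same two functions apply unchanged whether the share metric is traffic volume, connected ASNs, or port capacity — which, as noted above, can produce materially different concentration scores for the same country.</p>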
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30bh4nVHRX5O3HMKvGRYh7/e0c5b3a7cb8294dfe3caaec98a0557d0/Screenshot_2025-10-27_at_14.10.57.png" />
          </figure><p>The chart on the left suggests substantial geographical inequality in colocation facility distribution across the US states. However, the population-weighted analysis in the chart on the right demonstrates that much of that geographic concentration can be explained by population distribution.</p>
    <div>
      <h3>Submarine cables</h3>
      <a href="#submarine-cables">
        
      </a>
    </div>
    <p>Internet resilience, in the context of undersea cables, is defined by the global network’s capacity to withstand physical infrastructure damage and to recover swiftly from faults, thereby ensuring the continuity of intercontinental data flow. The metrics for quantifying this resilience are multifaceted, encompassing the frequency and nature of faults, the efficiency of repair operations, and the inherent robustness of both the network’s topology and its dedicated maintenance resources. Such metrics include:</p><ul><li><p>Number of <b>landing stations</b>, cable corridors, and operators. The goal is to ensure that national connectivity should withstand single failure events, be they natural disaster, targeted attack, or major power outage. A lack of diversity creates single points of failure, as highlighted by <a href="https://www.theguardian.com/news/2025/sep/30/tonga-pacific-island-internet-underwater-cables-volcanic-eruption"><u>incidents in Tonga</u></a> where damage to the only available cable led to a total outage.</p></li><li><p><b>Fault rates</b> and <b>mean time to repair (MTTR)</b>, which indicate how quickly service can be restored. These metrics measure a country’s ability to prevent, detect, and recover from cable incidents, focusing on downtime reduction and protection of critical assets. Repair times hinge on <b>vessel mobilization</b> and <b>government permits</b>, the latter often the main bottleneck.</p></li><li><p>Availability of <b>satellite backup capacity</b> as an emergency fallback. While cable diversity is essential, resilience planning must also cover worst-case outages. The Non-Terrestrial Backup System Readiness metric measures a nation’s ability to sustain essential connectivity during major cable disruptions. LEO and MEO satellites, though costlier and lower capacity than cables, offer proven emergency backup during conflicts or disasters. Projects like HEIST explore hybrid space-submarine architectures to boost resilience. 
Key indicators include available satellite bandwidth, the number of NGSO providers under contract (for diversity), and the deployment of satellite terminals for public and critical infrastructure. Tracking these shows how well a nation can maintain command, relief operations, and basic connectivity if cables fail.</p></li></ul>
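<p>Fault rates and MTTR are simple aggregates over incident records. A sketch over a hypothetical incident log (the dates below are illustrative, not real cable faults):</p>

```python
from datetime import datetime, timedelta

# Hypothetical incident log for one cable system: (fault detected, service restored)
incidents = [
    (datetime(2025, 1, 10), datetime(2025, 2, 3)),   # 24-day repair
    (datetime(2025, 5, 2),  datetime(2025, 5, 19)),  # 17-day repair
    (datetime(2025, 9, 14), datetime(2025, 10, 5)),  # 21-day repair
]

downtimes = [restored - detected for detected, restored in incidents]
mttr = sum(downtimes, timedelta()) / len(downtimes)
faults_per_year = len(incidents)  # over a one-year observation window

print(f"{faults_per_year} faults/year, MTTR = {mttr.days} days")  # 3 faults/year, MTTR = 20 days
```

<p>In practice the repair-time component would be broken down further — vessel mobilization versus permitting delay — since, as noted above, permits are often the dominant bottleneck.</p>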
    <div>
      <h3>Inter-domain routing</h3>
      <a href="#inter-domain-routing">
        
      </a>
    </div>
    <p>The network layer above the physical interconnection infrastructure governs how traffic is routed across Autonomous Systems (ASes). Failures or instability at this layer – such as misconfigurations, attacks, or control-plane outages – can disrupt connectivity even when the underlying physical infrastructure remains intact. In this layer, we look at resilience metrics that characterize the robustness and fault tolerance of AS-level routing and BGP behavior.</p><p><b>AS Path Diversity</b> measures the number and independence of AS-level routes between two points. High diversity provides alternative paths during failures, enabling BGP rerouting and maintaining connectivity. Low diversity leaves networks vulnerable to outages if a critical AS or link fails. Resilience depends on upstream topology.</p><ul><li><p>Single-homed ASes rely on one provider, which is cheaper and simpler but more fragile.</p></li><li><p>Multi-homed ASes use multiple upstreams, requiring BGP but offering far greater redundancy and performance at higher cost.</p></li></ul><p>The <b>share of multi-homed ASes</b> reflects an ecosystem’s overall resilience: higher rates signal greater protection from single-provider failures. This metric is easy to measure using <b>public BGP data</b> (e.g., RouteViews, RIPE RIS, CAIDA). Longitudinal BGP monitoring helps reveal hidden backup links that snapshots might miss.</p><p>Beyond multi-homing rates, <b>the distribution of single-homed ASes per transit provider</b> highlights systemic weak points. For each provider, counting the customer ASes that rely exclusively on it reveals how many networks would be cut off if that provider fails. </p>
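<p>The multi-homing share can be estimated from public BGP AS paths by counting the distinct upstreams seen for each origin AS. The sketch below is a deliberate simplification: it treats the second-to-last AS on a (prepending-deduplicated) path as the origin’s transit provider, and ignores real-world complications such as route-server paths, sibling ASes, and collector vantage-point bias. The AS numbers are hypothetical:</p>

```python
from collections import defaultdict

def upstreams_per_origin(as_paths):
    """Map each origin AS to the set of ASes seen immediately before it
    on any AS path — a rough proxy for its transit providers."""
    providers = defaultdict(set)
    for path in as_paths:
        # Collapse BGP path prepending (repeated consecutive ASNs).
        dedup = [asn for i, asn in enumerate(path) if i == 0 or asn != path[i - 1]]
        if len(dedup) >= 2:
            origin, upstream = dedup[-1], dedup[-2]
            providers[origin].add(upstream)
    return providers

# Hypothetical AS paths from a route collector (origin AS is last):
paths = [
    [3356, 174, 64500],
    [1299, 174, 64500],            # 64500 is still single-homed behind AS174
    [3356, 64501], [1299, 64501],  # 64501 is multi-homed
]
prov = upstreams_per_origin(paths)
multi = sum(1 for upstreams in prov.values() if len(upstreams) > 1)
print(f"multi-homed share: {multi}/{len(prov)}")  # multi-homed share: 1/2
```

<p>Run longitudinally over months of RIB snapshots, the same counting logic also surfaces backup providers that appear only during failures.</p>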
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ECZveUVwyM6TmGa1SaZnl/1222c7579c81fd62a5d8d80d63000ec3/image1.png" />
          </figure><p>The figure above shows Canadian transit providers for July 2025: the x-axis is total customer ASes, the y-axis is single-homed customers. Canada’s overall single-homing rate is 30%, with some providers serving many single-homed ASes, mirroring vulnerabilities seen during the <a href="https://en.wikipedia.org/wiki/2022_Rogers_Communications_outage"><u>2022 Rogers outage</u></a>, which disrupted over 12 million users.</p><p>While multi-homing metrics provide a valuable, static view of an ecosystem’s upstream topology, a more dynamic and nuanced understanding of resilience can be achieved by analyzing the characteristics of the actual BGP paths observed from global vantage points. These path-centric metrics move beyond simply counting connections to assess the diversity and independence of the routes to and from a country’s networks. These metrics include:</p><ul><li><p><b>Path independence</b> measures whether those alternative routes truly avoid shared bottlenecks. Multi-homing only helps if upstream paths are truly distinct. If two providers share upstream transit ASes, redundancy is weak. Independence can be measured with the Jaccard distance between AS paths. A stricter <b>path disjointness score</b> calculates the share of path pairs with no common ASes, directly quantifying true redundancy.</p></li><li><p><b>Transit entropy</b> measures how evenly traffic is distributed across transit providers. High Shannon entropy signals a decentralized, resilient ecosystem; low entropy shows dependence on few providers, even if nominal path diversity is high.</p></li><li><p><b>International connectivity ratios</b> evaluate the share of domestic ASes with direct international links. High percentages reflect a mature, distributed ecosystem; low values indicate reliance on a few gateways.</p></li></ul><p>The figure below encapsulates the aforementioned AS-level resilience metrics into single polar pie charts. 
For the purpose of exposition we plot the metrics for infrastructure from two different nations with very different resilience profiles.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PKxDcl4m1XXCAuvFUcTdZ/d0bce797dcbd5e1baf39ca66e7ac0056/image4.png" />
          </figure><p>To pinpoint critical ASes and potential single points of failure, graph centrality metrics can provide useful insights. <b>Betweenness Centrality (BC)</b> identifies nodes lying on many shortest paths, but applying it to BGP data suffers from vantage point bias. ASes that provide BGP data to the RouteViews and RIS collectors appear falsely central. <b>AS Hegemony</b>, developed by <a href="https://dl.acm.org/doi/10.1145/3123878.3131982"><u>Fontugne et al.</u></a>, corrects this by filtering biased viewpoints, producing a 0–1 score that reflects the true fraction of paths crossing an AS. It can be applied globally or locally to reveal Internet-wide or AS-specific dependencies.</p><p><b>Customer cone size</b>, developed by <a href="https://asrank.caida.org/about#customer-cone"><u>CAIDA</u></a>, offers another perspective, capturing an AS’s economic and routing influence via the set of networks it serves through customer links. Large cones indicate major transit hubs whose failure affects many downstream networks. However, global cone rankings can obscure regional importance, so <a href="https://www.caida.org/catalog/papers/2023_on_importance_being_as/on_importance_being_as.pdf"><u>country-level adaptations</u></a> give more accurate resilience assessments.</p>
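<p>Two of the path-centric metrics described earlier — Jaccard-based path independence and transit entropy — reduce to simple set and information-theoretic operations. A minimal sketch with hypothetical AS paths and transit shares:</p>

```python
import math

def jaccard_distance(path_a, path_b):
    """1 − |A∩B| / |A∪B| over the sets of ASes on two paths:
    1.0 means fully disjoint routes, 0.0 identical ones."""
    a, b = set(path_a), set(path_b)
    return 1 - len(a & b) / len(a | b)

def transit_entropy(provider_shares):
    """Shannon entropy (bits) of traffic or customer shares across transit
    providers; higher means less dependence on any single provider."""
    return -sum(p * math.log2(p) for p in provider_shares if p > 0)

# Hypothetical: two AS paths out of a country that share both endpoints.
print(round(jaccard_distance([64500, 174, 3356], [64500, 1299, 3356]), 2))  # 0.5

# Hypothetical transit shares: one dominant provider vs. an even split.
skewed, even = [0.7, 0.2, 0.1], [1 / 3] * 3
print(round(transit_entropy(skewed), 2), round(transit_entropy(even), 2))
```

<p>A stricter disjointness score would count only path pairs whose Jaccard distance is exactly 1.0, i.e. pairs with no common intermediate AS at all.</p>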
    <div>
      <h4>Impact-Weighted Resilience Assessment</h4>
      <a href="#impact-weighted-resilience-assessment">
        
      </a>
    </div>
    <p>Not all networks have the same impact when they fail. A small hosting provider going offline affects far fewer people than a national ISP does. Traditional resilience metrics treat all networks equally, which can mask where the real risks are. To address this, we use impact-weighted metrics that factor in a network’s user base or infrastructure footprint. For example, by weighting multi-homing rates or path diversity by user population, we can see how many people actually benefit from redundancy — not just how many networks have it. Similarly, weighting by the number of announced prefixes highlights networks that carry more traffic or control more address space.</p><p>This approach helps separate theoretical resilience from practical resilience. A country might have many multi-homed networks, but if most users rely on just one single-homed ISP, its resilience is weaker than it looks. Impact weighting helps surface these kinds of structural risks so that operators and policymakers can prioritize improvements where they matter most.</p>
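<p>The gap between theoretical and practical resilience is easy to see in code: weight the multi-homing rate by users instead of counting networks. The ISP figures below are hypothetical:</p>

```python
# Hypothetical national ISPs: (subscriber count, is_multi_homed)
networks = [
    (8_000_000, False),  # dominant, single-homed ISP
    (500_000, True),
    (300_000, True),
    (200_000, True),
]

# Unweighted: three of four networks are multi-homed.
unweighted = sum(multi for _, multi in networks) / len(networks)

# Impact-weighted: share of *users* behind a multi-homed network.
total_users = sum(users for users, _ in networks)
weighted = sum(users for users, multi in networks if multi) / total_users

print(f"share of multi-homed networks: {unweighted:.0%}")  # 75%
print(f"share of users behind them:    {weighted:.0%}")    # 11%
```

<p>The same ecosystem scores 75% or 11% depending on the weighting — exactly the structural risk the unweighted count hides.</p>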
    <div>
      <h3>Metrics of network hygiene</h3>
      <a href="#metrics-of-network-hygiene">
        
      </a>
    </div>
    <p>Large Internet outages aren’t always caused by cable cuts or natural disasters — sometimes, they stem from routing mistakes or security gaps. Route hijacks, leaks, and spoofed announcements can disrupt traffic on a national scale. How well networks protect themselves against these incidents is a key part of resilience, and that’s where network hygiene comes in.</p><p>Network hygiene refers to the security and operational practices that make the global routing system more trustworthy. This includes:</p><ul><li><p><b>Cryptographic validation</b>, like RPKI, to prevent unauthorized route announcements. <b>ROA Coverage</b> measures the share of announced IPv4/IPv6 space with valid Route Origin Authorizations (ROAs), indicating participation in the RPKI ecosystem. <b>ROV Deployment</b> gauges how many networks drop invalid routes, but detecting active filtering is difficult. Policymakers can improve visibility by supporting independent measurements, data transparency, and standardized reporting.</p></li><li><p><b>Filtering and cooperative norms</b>, where networks block bogus routes and follow best practices when sharing routing information.</p></li><li><p><b>Consistent deployment across both domestic networks and their international upstreams</b>, since traffic often crosses multiple jurisdictions.</p></li></ul><p>Strong hygiene practices reduce the likelihood of systemic routing failures and limit their impact when they occur. We actively support and monitor the adoption of these mechanisms, for instance through <a href="https://isbgpsafeyet.com/"><u>crowd-sourced measurements</u></a> and public advocacy, because every additional network that validates routes and filters traffic contributes to a safer and more resilient Internet for everyone.</p><p>Another critical aspect of Internet hygiene is mitigating DDoS attacks, which often rely on IP address spoofing to amplify traffic and obscure the attacker’s origin. 
<a href="https://datatracker.ietf.org/doc/bcp38/"><u>BCP-38</u></a>, the IETF’s network ingress filtering recommendation, addresses this by requiring operators to block packets with spoofed source addresses, reducing a region’s role as a launchpad for global attacks. While BCP-38 does not prevent a network from being targeted, its deployment is a key indicator of collective security responsibility. Measuring compliance requires active testing from inside networks, which is carried out by the <a href="https://spoofer.caida.org/summary.php"><u>CAIDA Spoofer Project</u></a>. Although the global sample remains limited, these metrics offer valuable insight into both the technical effectiveness and the security engagement of a nation’s network community, complementing RPKI in strengthening the overall routing security posture.</p>
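<p>ROA coverage, as described above, is best computed as an address-space-weighted fraction rather than a raw prefix count, since prefixes differ enormously in size. A sketch over hypothetical announcements and RPKI validation states:</p>

```python
import ipaddress

# Hypothetical announced IPv4 prefixes and their RPKI validity state.
announcements = [
    ("198.51.100.0/22", "valid"),
    ("203.0.113.0/24",  "valid"),
    ("192.0.2.0/24",    "unknown"),  # no ROA published
]

def covered_share(anns):
    """Share of announced address space covered by a valid ROA."""
    total = covered = 0
    for prefix, state in anns:
        size = ipaddress.ip_network(prefix).num_addresses
        total += size
        if state == "valid":
            covered += size
    return covered / total

print(f"ROA coverage: {covered_share(announcements):.1%}")  # ROA coverage: 83.3%
```

<p>A prefix count would report 2/3 coverage here; weighting by address space gives 83.3%, because the covered /22 dwarfs the uncovered /24.</p>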
    <div>
      <h3>Measuring the collective security posture</h3>
      <a href="#measuring-the-collective-security-posture">
        
      </a>
    </div>
    <p>Beyond securing individual networks through mechanisms like RPKI and BCP-38, strengthening the Internet’s resilience also depends on collective action and visibility. While origin validation and anti-spoofing reduce specific classes of threats, broader frameworks and shared measurement infrastructures are essential to address systemic risks and enable coordinated responses.</p><p>The <a href="https://manrs.org/"><u>Mutually Agreed Norms for Routing Security (MANRS)</u></a> initiative promotes Internet resilience by defining a clear baseline of best practices. It is not a new technology but a framework fostering collective responsibility for global routing security. MANRS focuses on four key actions: filtering incorrect routes, anti-spoofing, coordination through accurate contact information, and global validation using RPKI and IRRs. While many networks implement these independently, MANRS participation signals a public commitment to these norms and to strengthening the shared security ecosystem.</p><p>Additionally, a region’s participation in public measurement platforms reflects its Internet observability, which is essential for fault detection, impact assessment, and incident response. <a href="https://atlas.ripe.net/"><u>RIPE Atlas</u></a> and <a href="https://www.caida.org/projects/ark/"><u>CAIDA Ark</u></a> provide dense data-plane measurements; <a href="https://www.routeviews.org/routeviews/"><u>RouteViews</u></a> and <a href="https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/"><u>RIPE RIS</u></a> collect BGP routing data to detect anomalies; and <a href="https://www.peeringdb.com/"><u>PeeringDB</u></a> documents interconnection details, reflecting operational maturity and integration into the global peering fabric. 
Together, these platforms underpin observatories like <a href="https://ioda.inetintel.cc.gatech.edu/"><u>IODA</u></a> and <a href="https://grip.oie.gatech.edu/home"><u>GRIP</u></a>, which combine BGP and active data to detect outages and routing incidents in near real time, offering critical visibility into Internet health and security.</p>
    <div>
      <h2>Building a more resilient Internet, together</h2>
      <a href="#building-a-more-resilient-internet-together">
        
      </a>
    </div>
    <p>Measuring Internet resilience is complex, but it's not impossible. By using publicly available data, we can create a transparent and reproducible framework to identify strengths, weaknesses, and single points of failure in any network ecosystem.</p><p>This isn't just a theoretical exercise. For policymakers, this data can inform infrastructure investment and pro-competitive policies that encourage diversity. For network operators, it provides a benchmark to assess their own resilience and that of their partners. And for everyone who relies on the Internet, it's a critical step toward building a more stable, secure, and reliable global network.</p><p><i>For more details of the framework, including a full table of the metrics and links to source code, please refer to the full paper: </i> <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5376106"><u>Regional Perspectives for Route Resilience in a Global Internet: Metrics, Methodology, and Pathways for Transparency</u></a> published at <a href="https://www.tprcweb.com/tprc23program"><u>TPRC23</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Routing Security]]></category>
            <category><![CDATA[Insights]]></category>
            <guid isPermaLink="false">48ry6RI3JhA9H3t280EWUX</guid>
            <dc:creator>Vasilis Giotsas</dc:creator>
            <dc:creator>Cefan Daniel Rubin</dc:creator>
            <dc:creator>Marwan Fayed</dc:creator>
        </item>
        <item>
            <title><![CDATA[The tricky science of Internet measurement]]></title>
            <link>https://blog.cloudflare.com/tricky-internet-measurement/</link>
            <pubDate>Mon, 27 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Internet is one big open system composed of many closed boxes — which makes measuring the Internet difficult. In this post we explore Internet measurement as a science. ]]></description>
            <content:encoded><![CDATA[ <p>Measurement is critical to our understanding not just of the world and the universe, but also the systems we design and deploy. The Internet is no exception but the challenges of measuring the Internet are unique.</p><p>The Internet is remarkably opaque, which is counter-intuitive given its open and multi-stakeholder model. It’s opaque because ultimately the Internet joins many networks and services that are each owned and operated by unrelated entities, and that rarely share or report about their systems. Every network may carry and forward what other systems produce, but each system is entirely independent — which, to be honest, is the magic of the Internet. It’s in this opaque-yet-critical context that Internet measurement must exist as a scientific practice, with all the associated rigor, repeatability, and reproduction.</p><p>Measurement as a scientific practice can be exciting — for what it gets right as well as wrong. The following statement encapsulates some of the subtleties:</p><blockquote><p>“<b>5 out of 6 scientists say that </b><a href="https://en.wikipedia.org/wiki/Russian_roulette"><b><u>Russian Roulette</u></b></a><b> is safe.”</b></p></blockquote><p>The statement is absurd! Laugh as we might, the statement is also logical. It’s trivially easy to design an experiment that leads to the above statement. However, the only way this experiment could succeed is if the “actor” — that is, whoever conducts the experiment — ignores every aspect of measurement science that makes the practice credible, as follows.</p><ul><li><p><b>Methodology</b>: a cycle consisting of data curation, modeling, and validation. Here, the experiment (data curation) could only succeed if each participant is prevented from seeing others’ injuries. 
More importantly, no measurement is needed because the actor can calculate probabilities with available numbers, without the experiment!</p></li><li><p><b>Ethics</b>: the way we measure can have undue, undesirable consequences. A bare minimum principle is <i>do no harm.</i></p></li><li><p><b>Representation</b>: clear and complete statements or visualizations should be at least informative and ideally actionable; otherwise, they can be misleading. Say each participant answered with yes to the question, “are you safe?” They are answering a different question than “is the game safe?”</p></li></ul><p>In this blog we look at each of the above aspects of measurement, describe how they manifest in the Internet space, and relate them to examples from work that will be featured throughout <a href="https://blog.cloudflare.com/internet-measurement-resilience-transparency-week"><u>the week</u></a>. Let’s first start with some background.</p>
    <div>
      <h2>Preface: A motivating example from inside Cloudflare</h2>
      <a href="#preface-a-motivating-example-from-inside-cloudflare">
        
      </a>
    </div>
    <p>High quality measurements help to identify, understand, even explain our experiences, environments, and systems. However, observation in isolation, without context, can be perilous. The following is a time series from an internal graph of HTTP requests from Lviv, Ukraine, leading up to the evening of 28 February 2022:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7D1hr8mMykICnj7Rh1Apyf/9b50dd98d996ed296fbad64cdbada497/image9.png" />
          </figure><p>On that day, traffic from the region increased by 3-4X. For context, the Russian incursion into Ukraine began four days earlier. The world was watching events closely. Cloudflare was no exception, helping both to <a href="https://blog.cloudflare.com/internet-traffic-patterns-in-ukraine-since-february-21-2022/"><u>report</u></a> and to <a href="https://blog.cloudflare.com/steps-taken-around-cloudflares-services-in-ukraine-belarus-and-russia/"><u>mitigate</u></a> network effects.</p><p>Upon observing that abnormal spike, we at Cloudflare <i>could have</i> mistakenly reported the increase as a potential DoS attack. However, there were counter-indications. First, no attack was flagged by the DoS defense and mitigation systems. In addition, the profile was atypical of attack traffic, which tends to be either single source from a single location or multiple sources from multiple locations. In this instance the increase came from multiple source networks but in a single location (Lviv).</p><p>Cloudflare had the tools to avoid erroneous reporting and later <a href="https://blog.cloudflare.com/internet-traffic-patterns-in-ukraine-since-february-21-2022/#internet-traffic"><u>correctly reported</u></a> that the increase was due to a mass of people converging in Lviv, the city with the last train station on the westward journey out of Ukraine. But — and this is important in a measurement context — nothing visible from Cloudflare’s perspective could provide an explanation. In the end, an employee saw a report on BBC about the massive movement of people in that part of Ukraine, which enabled us to better explain the traffic shift.</p><p>This example is an important reminder to always look for alternative explanations. It also shows how observations alone can lead to wrong conclusions, due to missing information or unrecognized biases. 
But good numbers without bias <a href="https://blog.cloudflare.com/loving-performance-measurements/"><u>can be misunderstood</u></a>, too.</p>
    <div>
      <h2>Measurement vocabulary and jargon</h2>
      <a href="#measurement-vocabulary-and-jargon">
        
      </a>
    </div>
    <p>In the measurement context there is a vocabulary of common words with specific meanings that are useful to know before diving into practice and examples.</p>
    <div>
      <h3>Active and passive measurement </h3>
      <a href="#active-and-passive-measurement">
        
      </a>
    </div>
    <p>These describe the “how.” In an <i>active</i> measurement, an actor initiates some action designed to trigger a response. The response may be data, such as latency returned from a ping or a DNS answer in response to a query. The response may be an observable change in a mechanism or system triggered by an action, such as well-crafted probe packets that prompt reactions from and expose middleboxes.</p><p>In a <i>passive</i> measurement, the actor only observes. No action is taken. As a result, no response is triggered; the system and its behaviour are unaltered. Logs are typically compiled from passive observations, and Cloudflare’s own are no exception. The vast majority of data shown in <a href="https://radar.cloudflare.com"><u>Cloudflare Radar</u></a> derives from those logs.</p><p>Each has its trade-offs. Active measurements are targeted and can be controlled. They are also exceptionally difficult (and often costly) to scale and, as a result, are only able to observe the parts of a system where they are deployed. Conversely, passive measurements tend to be lighter weight, but only succeed if the observer is at the right place at the right time. </p><p>Effectively, the two methods complement each other, and that makes them most powerful when orchestrated so that the knowledge from one feeds into the other. For example, in our own prior attempts to <a href="https://blog.cloudflare.com/cdn-latency-passive-measurement/"><u>understand performance across CDNs</u></a>, we interrogated the (passive) request logs to get insights, which helped inform later (active) pings using RIPE’s Atlas that we used to confirm our insights and results. 
In the opposite direction, our efforts to (passively) <a href="https://blog.cloudflare.com/connection-tampering/"><u>detect and understand connection failures</u></a> were informed by, and arguably only possible because of, a large body of (active) measurements in the research community to understand wide-scale connection tampering.</p><p>For more on the interplay between active and passive, you can read about the experience of a researcher who was equipped to <a href="https://blog.cloudflare.com/experience-of-data-at-scale"><u>dig deep</u></a> into Cloudflare’s vast troves of data because of insights from prior active measurements in the research community.</p>
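The distinction can be made concrete with a small sketch. The helper below is hypothetical (not a Cloudflare tool): it performs an *active* measurement by initiating a TCP handshake purely to time the response. A passive equivalent would compute the same latency from connections users opened anyway, without sending a single extra packet.

```python
import socket
import time

def connect_latency_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Active measurement: open a TCP connection purely to time the handshake.

    The connection attempt is the "action designed to trigger a response";
    the response is the completed handshake, whose duration we record.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # we only wanted the handshake timing, not the connection itself
    return (time.monotonic() - start) * 1000.0
```

Note the trade-off described above baked into even this tiny probe: it consumes a connection slot on the target and only observes the one path it is pointed at.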
    <div>
      <h3>Direct and indirect measurement </h3>
      <a href="#direct-and-indirect-measurement">
        
      </a>
    </div>
    <p>It is possible to gain insights about something without directly observing it. Consider, for example, the capacity of a path, better known as the <i>bandwidth</i>. The common method to <i>directly</i> observe bandwidth is to launch a <a href="https://speed.cloudflare.com/"><u>speed test</u></a>. It’s a simple test, but it has two problems.</p><p>The first is that it works by consuming as much of the bandwidth as it can (which creates an ethical dilemma we later revisit). The second is that it actually measures throughput from a sender to a receiver, which is the available bandwidth (or, alternatively, the residual capacity) of the <i>bottleneck</i> link. If two speed tests share a bottleneck then each might observe throughput that is ½ of the actual bandwidth. The evidence is in the numbers, as seen below, where observations of a speed test range from 69-85 Mbps — a spread of roughly 20% around the median, and far from a fixed value!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OpXXM8CqkhWkbavw9RgMv/395827e2390fa145650703905c4abdb4/image2.png" />
          </figure><p>There is instead a 25+ year-old <i>indirect</i> alternative to speed tests called <a href="https://www.usenix.org/legacy/publications/library/proceedings/usits01/full_papers/lai/lai_html/node2.html"><u>the packet pair</u></a>, or packet train. It works by transmitting pairs of packets back-to-back, with no delay between them, and recording the spacing between the two packets when they are sent and again when they arrive. The change in that spacing, introduced as the pair is serialized over the bottleneck link, gives an indication of the bottleneck bandwidth. Repeat the packet pair probes and, with some statistical analysis, a good estimate of the true bottleneck bandwidth emerges. Instead of directly observing bandwidth by pushing and counting bytes over time, the packet pair technique uses the time between two packets to <i>indirectly</i> calculate — or infer — the metric.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LMXzWWY1rbU0Tb02v7uzt/6e83b407e8cece51c3fa42b91bd036b3/image5.png" />
          </figure>
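The inference step can be sketched with hypothetical numbers: if 1,500-byte packets sent back-to-back arrive 120 µs apart, the bottleneck can serialize 1500 × 8 bits in 120 µs, i.e. roughly 100 Mbps. A minimal sketch of the estimator (illustrative only; real implementations filter much more carefully):

```python
import statistics

def estimate_bottleneck_bps(packet_size_bytes: int, arrival_gaps_s: list[float]) -> float:
    """Infer bottleneck bandwidth from packet-pair dispersion.

    Packets sent back-to-back leave the bottleneck spaced by the time that
    link needs to serialize one packet, so bandwidth ~= packet size / gap.
    The median over repeated probes damps cross-traffic noise.
    """
    return packet_size_bytes * 8 / statistics.median(arrival_gaps_s)

# Hypothetical probes: most pairs arrive ~120 microseconds apart, implying a
# bottleneck of about 1500 * 8 / 120e-6 = 100 Mbps.
gaps = [118e-6, 120e-6, 250e-6, 121e-6, 119e-6]  # one pair hit cross-traffic
print(round(estimate_bottleneck_bps(1500, gaps) / 1e6), "Mbps")  # 100 Mbps
```

Taking the median rather than the mean is what makes the outlier probe (the 250 µs gap, stretched by queued cross-traffic) harmless.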
    <div>
      <h2>The (Network) Measurement Lifecycle</h2>
      <a href="#the-network-measurement-lifecycle">
        
      </a>
    </div>
    <p>Measurements are most powerful when they lead to reasonable predictions. Sometimes the predictions confirm our understanding of the world and systems we deploy into it. Occasionally, the predictions reveal something new. Either way, predictive measurements emerge by following a simple pattern: curate data, construct a model based on the data, then validate the model with (ideally) different data. Together, these create a measurement lifecycle.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2bnJ1aWYRag3edCAUfEnkj/87ee90c52223d03120e7bc2d7df5c72b/image8.png" />
          </figure><p>Ideally a measurement exercise encompasses the lifecycle from beginning to end, but there can be extremely valuable contributions and advances within each in isolation. Individual high-quality datasets are so difficult to curate that each can be a valid contribution. The same is true of modeling techniques and of tools for validation. Measurement spans expert domains, and benefits from diverse skill sets.</p><p>Let’s look at each step in order, beginning with data curation.</p>
    <div>
      <h2>Data curation</h2>
      <a href="#data-curation">
        
      </a>
    </div>
    <p>The most common and familiar measurement exercise — often synonymous with measurement — is data gathering and curation. Data on its own can be fascinating and useful; <a href="https://radar.cloudflare.com"><u>Cloudflare Radar</u></a> is clear evidence of that! In many settings, simple counting can help us relate to our environments and place them in context.</p><p>Data gathering and curation consumes more energy, time, and resources than modeling or validation. The explanation is implied by the cyclical measurement pattern: validation requires a preceding model, and models are constructed using data. No data, no model, no validation, no insight nor prediction nor learning. The quality of each step in the cycle <i>depends</i> on the quality of the previous step — high-quality data is <i>the</i> linchpin in measurement practices. The <a href="https://en.wikipedia.org/wiki/Large_Hadron_Collider"><u>Large Hadron Collider</u></a> and the <a href="https://en.wikipedia.org/wiki/James_Webb_Space_Telescope"><u>James Webb Space Telescope</u></a> are great examples of how much we can do, and need to do — they operate relentlessly in pursuit of high-quality data. Similar “always-on” tools in the Internet measurement community are much less glamorous, but no less important. <a href="https://www.caida.org/about/"><u>CAIDA</u></a> and <a href="https://atlas.ripe.net/"><u>RIPE’s Atlas</u></a> are just two examples of longstanding projects that gather telemetry and curate datasets.</p><p>Make no mistake: High-quality data gathering and curation is <i>hard</i>.</p><p>Luckily, “high-quality” does not mean perfect; it does mean <i>representative</i>. For example, if we’re counting distance or time, the accuracy must reflect the true value. Large populations can be reasonably studied using much smaller numbers of samples. 
For example, our global assessment of connection tampering revealed valuable insights with a sample of <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>1 in 10,000</u></a> (or 0.01%). The low sampling rate works at Cloudflare in part because of the immense diversity of Cloudflare’s customers, which attracts traffic for all kinds of content and purposes. Later this week, we’ll share in a blog post how imperfect signals used to find a sample of around 180,000 carrier-grade NATs in Cloudflare’s request logs are “good enough” to identify more than 12,000,000 others that cannot be directly observed.</p><p>Another important, and arguably counterintuitive, misconception is that more data naturally reveals more detail and answers to more questions. As Ram Sundaran writes in a <a href="https://blog.cloudflare.com/experience-of-data-at-scale"><u>guest post</u></a>, sometimes there is so much noise that finding answers in large datasets can seem like a small miracle.</p>
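Why such sparse sampling can still be representative comes down to basic statistics: the error of a proportion estimate depends on the number of samples, not on the population size. A toy sketch with synthetic data (not Cloudflare's pipeline):

```python
import random

random.seed(7)

# Hypothetical population: 50 million connections, of which 2% are anomalous.
true_rate = 0.02
sample_size = 50_000_000 // 10_000  # a 1-in-10,000 (0.01%) sample: 5,000 items

# Each sampled connection is anomalous with the true underlying probability.
anomalous = sum(random.random() < true_rate for _ in range(sample_size))
estimate = anomalous / sample_size
print(f"estimated anomaly rate: {estimate:.4f}")  # close to the true 2%
```

With 5,000 samples the standard error of the estimate is about 0.2 percentage points, regardless of whether the population is 50 million or 50 billion; the catch, as the text notes, is that the sample must be representative.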
    <div>
      <h2>Modeling</h2>
      <a href="#modeling">
        
      </a>
    </div>
    <p>Models may be conceptual, and describe aspects of an environment or system. The most useful can be expressed as simple statements about our understanding or our assumptions. In effect, they encapsulate a hypothesis that can be tested. For example, we might believe or assume that an ISP or network <a href="https://blog.cloudflare.com/cdn-latency-passive-measurement/#example"><u>will typically prefer</u></a> a direct no-cost peering path to a CDN over transit network paths that incur a cost, even when the direct path is longer. This forms a model that can be validated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rnYBJIYzjpAoerH3Z2bDL/2e2bcf155e2b1c3f6abe46a72fe129f5/image3.png" />
          </figure><p>Predictive models push beyond our boundaries of understanding to help identify, explain, or understand aspects of systems that are not obvious or directly observable, or are difficult to ascertain. Predictive models often use statistical techniques to, for example, identify underlying stochastic processes or to create machine learning classifiers. A more common use of the statistical tools is to characterize the curated data itself. Remarkably powerful models can be simple probability distributions with means, medians, variance, and confidence indicators.</p><p>One aspect of the Internet that has attracted a lot of attention is how its networks choose to connect to one another. Understanding how the Internet forms and grows is crucial for simulation, but also helps to predict ways in which networks might fail. The equation below on the left comes from the <a href="https://en.wikipedia.org/wiki/Barab%C3%A1si%E2%80%93Albert_model"><u>Barabási–Albert (BA) model</u></a>, an early model that assumes <i>preferential connectivity</i> or, in more familiar terms, “rich get richer.”</p><p>In its simplest version, a new network in the BA model chooses to connect to an existing network with a probability that is proportional to the number of connections that network already has. Later models did away with ‘intelligent’ selection mechanisms. The equation below on the right is based on the <a href="https://dl.acm.org/doi/pdf/10.1145/956981.956986"><u>sizes of networks</u></a>, a more general mechanism similar to the way celestial bodies form in the universe.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5h8V0ABHULoh2vRaa2wQhn/baf190909036b7f4f15fa506754784c5/Screenshot_2025-10-27_at_10.51.04.png" />
          </figure><p>Sometimes knowing which tool to use and when is a skill in itself. One such example is throwing ML and AI at problems that are tractable with mechanisms that are simpler and far more transparent. This <a href="https://blog.cloudflare.com/experience-of-data-at-scale"><u>guest blog</u></a>, for example, explains that ML was ruled out to understand anomalous TCP behaviour because TCP is tightly specified, which suggested that a full enumeration of various packet sequences was possible—and proved correct.</p><p>An understanding of the domain is often critical to our ability to construct accurate models. Machine learning, for example, is a useful tool to help make sense of large unstructured data, but can be remarkably powerful with some domain expertise. Our work featured later this week on detection of multi-user IPs provides one such example. In particular, we sought to detect carrier-grade NAT devices (CGNATs). They are unique among large-scale multiuser IPs because, unlike VPNs and proxies, users neither choose to use CGNATs nor are aware of their existence.</p><p>The ML models successfully identified multiuser IPs, but disambiguating CGNATs proved elusive until we applied domain knowledge. For example, CGNATs are typically deployed across a range of contiguous IPs (e.g. in a /24 block) and, as shown below, this deployment pattern turns out to be a very important feature in the model.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15WI2U2JnD12WOcaCD9wQN/7bdf0adb9ade2444f7b3837f75c7f109/unnamed__1_.png" />
          </figure>
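The contiguity signal just described can be sketched as a feature computation. The helper below is a simplified illustration, not the production model: for each flagged IP, it counts how many other flagged IPs fall in the same /24 block.

```python
from collections import Counter
from ipaddress import IPv4Address

def slash24(ip: str) -> str:
    """Collapse an IPv4 address to its covering /24 block."""
    return str(IPv4Address(int(IPv4Address(ip)) & 0xFFFFFF00)) + "/24"

def contiguity_feature(flagged_ips: list[str]) -> dict[str, int]:
    """For each flagged IP, count the other flagged IPs in the same /24.

    CGNAT pools tend to span contiguous addresses, so a high count is
    evidence for a CGNAT rather than a lone VPN or proxy endpoint.
    """
    per_block = Counter(slash24(ip) for ip in flagged_ips)
    return {ip: per_block[slash24(ip)] - 1 for ip in flagged_ips}

flagged = ["203.0.113.1", "203.0.113.2", "203.0.113.9", "198.51.100.7"]
features = contiguity_feature(flagged)
# Each 203.0.113.x address has two flagged neighbours in its /24;
# the lone 198.51.100.7 has none.
```

A feature like this encodes the domain knowledge directly, which is exactly the kind of input that helped the model separate CGNATs from other multiuser IPs.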
    <div>
      <h2>Validation</h2>
      <a href="#validation">
        
      </a>
    </div>
    <p>The validation phase almost singularly determines the value of the whole measurement exercise, by testing the output of the model against data. If the model makes predictions that are reflected in the data, then the model has validity. Predictions that contrast or conflict with the validation data indicate that either the model is flawed or it is biased by the curated data.</p><p>Validation is where great measurement can fall apart — primarily in one of two ways. First, just like in the initial data curation phase, validation data must be representative of the population. For example, it would be a mistake to curate data about traffic during the day, build a model about that data, and then validate using data about traffic at night. There is also no point in using QUIC data to validate measurements about, say, TCP (unless the measurement’s hypothesis is that they have attributes in common). Care must always be taken to ensure that the measurement cannot be corrupted by the differences between validation and initial data.</p><p>Validation also risks being misleading when it reuses the curated data directly. Certainly this approach mitigates differences between datasets. However, the only conclusion that can be drawn when validating with the same data is that the model reasonably describes the data — not whatever the data represents. Consider, for example, machine learning. At its core, machine learning is a measurement insofar as it follows the lifecycle: curate data, (feed it into a machine learning algorithm to) build a model, then validate the output against data. An early common practice in the machine learning community was to partition a single dataset into 70% for training and 30% for validation. This setup makes an unwarranted, and potentially misleading, positive evaluation of the model more likely. 
The best case for an ML model trained on a dataset that amplifies or omits important characteristics is a model that reflects those biases — which becomes a potential source of <a href="https://en.wikipedia.org/wiki/Algorithmic_bias"><u>algorithmic bias</u></a>. </p><p>Naturally we have greater confidence in models that prove valid with unrelated data. The validation dataset can describe the same attributes from a different source, for example, models constructed <a href="https://blog.cloudflare.com/cdn-latency-passive-measurement/"><u>from passive RTT log data and validated against active pings</u></a>. Alternatively, models may be validated using entirely different data or signals, such as confirming <a href="https://blog.cloudflare.com/connection-tampering/"><u>connection tampering with distributions and header values</u></a> that were ignored in the model’s construction. </p>
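The pitfall can be shown with a toy sketch on synthetic data: a "model" that merely echoes a spuriously correlated feature looks excellent on a 70/30 split of the biased dataset, then falls to coin-flip accuracy against independent validation data.

```python
import random

random.seed(0)

def make_dataset(n: int, agreement: float) -> list[tuple[int, int]]:
    """(feature, label) pairs; the label agrees with the feature at the given rate."""
    data = []
    for _ in range(n):
        feature = random.randint(0, 1)
        label = feature if random.random() < agreement else 1 - feature
        data.append((feature, label))
    return data

def accuracy(model, data) -> float:
    return sum(model(f) == y for f, y in data) / len(data)

def model(feature: int) -> int:
    # A "model" that has latched onto the spurious feature.
    return feature

biased = make_dataset(10_000, agreement=0.95)       # curated (biased) dataset
train, holdout = biased[:7_000], biased[7_000:]     # the classic 70/30 split
independent = make_dataset(10_000, agreement=0.50)  # truly independent data

print(f"holdout from same source: {accuracy(model, holdout):.2f}")    # ~0.95
print(f"independent validation:   {accuracy(model, independent):.2f}")  # ~0.50
```

The holdout shares whatever bias the curation introduced, so it can only confirm that the model describes the dataset; only the independent data reveals that it describes nothing else.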
    <div>
      <h2>The ethics of network measurement</h2>
      <a href="#the-ethics-of-network-measurement">
        
      </a>
    </div>
    <p>The importance of ethics in network measurement is hard to overstate. It’s easy to perceive network measurement as risk-free, removed from and having little effect on humans—a perception far from the truth. Recall the speed tests and the packet pair technique for bandwidth estimation described above. In a speed test, an actor estimates bandwidth by consuming all the available bottleneck capacity, which may or may not be within the actor’s network. The cost of resource consumption might be borne by others, and certainly reduces the potential performance of the network for its users. The risks of that type of bandwidth measurement prompted the packet pair technique and its use of only a few pairs of packets and a little math to infer bandwidth—albeit with some orchestration between a sender and receiver.</p><p>Best practice in network measurement scrutinizes risks and effects <i>before</i> the measurement exercise. This might seem like a burden, but the ethical considerations often spark creativity and are the reason that novel methodologies emerge. Looking for alternatives to JavaScript injection is what prompted Cloudflare’s own efforts to <a href="https://blog.cloudflare.com/cdn-latency-passive-measurement/"><u>estimate the performance</u></a> of other CDNs using passive data. For more information, see “<a href="https://dl.acm.org/doi/10.1145/2896816"><u>Ethical Considerations in Network Measurement Papers</u></a>” published in the Communications of the ACM (2016).</p>
    <div>
      <h2>Visualization and representation</h2>
      <a href="#visualization-and-representation">
        
      </a>
    </div>
    <p>Visualization and representation are invaluable <i>at every stage</i> of the measurement lifecycle. Representations should at least improve our understanding; ideally, they also make follow-up actions clear. Statements without context are poor representations. For example, “30% greater chance” sounds like a lot but has no value without a reference point—a 30% increase on a 0.5% baseline is likely less of a concern than a 30% increase on a 20% baseline.</p><p>One example of representation is Cloudflare’s “<a href="https://www.cloudflare.com/sv-se/network/"><u>closeness</u></a>” statement: Cloudflare is “<i>approximately 50 ms from 95% of the Internet-connected population globally</i>.” The statement encapsulates a “survey” of our logs: From among all connections from each IP address that connects to Cloudflare, half of the minimum-RTT is a “worst approximation” of the latency from the IP address to Cloudflare; in 95% of cases, the minRTT/2 is at or below 50ms.</p><p>Visualizations, meanwhile, can be so powerful as to lead to misleading conclusions — a notion that features prominently later this week in a blog post about routing resilience evaluations. One example on that subject appears below, with two bar charts that order individual US states by the number of interconnection facilities in each state, from most to least. On the left, states are ordered according to raw counts of facilities; the top-ranked state has more than 140 interconnection facilities. On the right, the raw counts are normalized by (in this case, divided by) the population of each state.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7c2LwnBXvucFPhWKwG0F7g/033d94a2a8e3be8844a6f958ced6d762/image9.png" />
          </figure><p>These representations demonstrate that our models are shaped, and can be misinformed, by how we evaluate data. In this case we have purposefully omitted the state names on the x-axis because they are a distraction. Instead, each bar is coloured to indicate whether it is above (green) or below (yellow) the median of facilities per person in the right-hand graph. What becomes immediately obvious is that the two states with the highest number of facilities fall below the median, i.e., they are in the bottom half of states when ordered by facilities per person.</p><p>Sometimes a visualization can be so powerful as to leave no doubt. The image below is a personal favourite, because it gives strong evidence that the data and models were correct. In this visualization, each column represents a single type of connection anomaly that we observed. Inside each column, the anomaly’s occurrences are divided proportionally among the countries where the connections were initiated. As an example, look at the left-most column for SYN→∅ anomalies (a type of timeout). It shows that connections from China, India, Iran, and the United States dominated this specific anomaly type. Organizing the visualization this way put the data <i>first</i>, which helped mitigate any bias we might have had about explanations, underlying mechanisms, or locations.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sUdADDxOjZzn5Bq6qCgs2/0cfd453013b83ff50924993bb38c6e9b/image1.png" />
          </figure><p>By organizing the anomalies this way, the visualization immediately answered one question: “Are the failures expected behaviour?” If they were expected, or normal across the Internet, then the anomalies would appear in roughly similar proportions across columns, not in such different ones. The visualization was a strong validation (but <a href="https://blog.cloudflare.com/connection-tampering/#signature-validation-letting-the-data-speak"><u>not the only one</u></a>) of our approach and intuition—and opened up further avenues of investigation as a result.</p>
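The per-capita normalization in the bar-chart example above is simple arithmetic, but it reorders the ranking. A sketch with hypothetical counts (not the real state data):

```python
# Hypothetical data: (facility count, population in millions) per state.
states = {
    "A": (140, 39.0),  # many facilities, very large population
    "B": (90, 30.0),
    "C": (25, 3.0),
    "D": (18, 1.2),    # few facilities, small population
}

by_raw = sorted(states, key=lambda s: states[s][0], reverse=True)
by_per_capita = sorted(states, key=lambda s: states[s][0] / states[s][1], reverse=True)

print("by raw count: ", by_raw)         # ['A', 'B', 'C', 'D']
print("per capita:   ", by_per_capita)  # ['D', 'C', 'A', 'B'] (the order flips)
```

The same data, normalized by a denominator that reflects who the facilities serve, leads the viewer to a very different conclusion about which states are well provisioned.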
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Cloudflare continues to think deeply about new and novel ways to use available (passive) data, and welcomes ideas. Measurement helps us understand the Internet we all depend on, value, and love, and is a community-wide endeavour.</p><p>We encourage new entrants into the measurement space, and hope this blog serves as both an introduction to its challenges, and a map with which to evaluate measurement work published at Cloudflare or anywhere else.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">20pf9BGcV10k0j9ASL8JtY</guid>
            <dc:creator>Marwan Fayed</dc:creator>
        </item>
        <item>
            <title><![CDATA[From .com to .anything: introducing Top-Level Domain (TLD) insights on Cloudflare Radar]]></title>
            <link>https://blog.cloudflare.com/introducing-tld-insights-on-cloudflare-radar/</link>
            <pubDate>Mon, 27 Oct 2025 12:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Radar has launched a new Top-Level Domain (TLD) page, providing insights into TLD popularity, traffic, and security. The top-ranking TLD may come as a surprise. ]]></description>
            <content:encoded><![CDATA[ <p>Readers of a certain age may remember the so-called "dot com boom" that took place in the early 2000's. The boom's "dot com" is what is known as a Top-Level Domain (TLD). <a href="https://www.rfc-editor.org/rfc/rfc920.html"><u>Originally</u></a> intended to organize domain names into a small set of categorical groupings, over the past 40+ years, the set of TLDs has expanded to include country code top-level domains (ccTLDs, like <a href="https://radar.cloudflare.com/tlds/us"><code><u>.us</u></code></a>, <a href="https://radar.cloudflare.com/tlds/pt"><code><u>.pt</u></code></a>, and <a href="https://radar.cloudflare.com/tlds/cn"><code><u>.cn</u></code></a>), as well as additional generic top-level domains (gTLDs) beyond the initial seven, such as <a href="https://radar.cloudflare.com/tlds/biz"><code><u>.biz</u></code></a>, <a href="https://radar.cloudflare.com/tlds/shop"><code><u>.shop</u></code></a>, and <a href="https://radar.cloudflare.com/tlds/nyc"><code><u>.nyc</u></code></a>. Internationalized TLDs, such as <a href="https://radar.cloudflare.com/tlds/xn--80aswg"><code><u>.сайт</u></code></a>, <a href="https://radar.cloudflare.com/tlds/xn--80asehdb"><code><u>.онлайн</u></code></a>,<code> </code><a href="https://radar.cloudflare.com/tlds/xn--ngbc5azd"><code><u>.شبكة</u></code></a>, <a href="https://radar.cloudflare.com/tlds/xn--unup4y"><code><u>.游戏</u></code></a>, and brand TLDs, like <a href="https://radar.cloudflare.com/tlds/google"><code><u>.google</u></code></a> and <a href="https://radar.cloudflare.com/tlds/nike"><code><u>.nike</u></code></a> have also been added. 
As of October 2025, <a href="https://data.iana.org/TLD/tlds-alpha-by-domain.txt"><u>over 1,400 entries</u></a> can be found in ICANN's list of all valid top-level domains, and a further expansion is <a href="https://newgtldprogram.icann.org/en/application-rounds/round2"><u>expected to begin in April 2026</u></a>.</p><p><a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a> has long published <a href="https://radar.cloudflare.com/domains"><u>domain ranking</u></a> information, providing insights into popular and trending domains. And in February 2025, we <a href="https://blog.cloudflare.com/new-dns-section-on-cloudflare-radar/"><u>added</u></a> a number of <a href="https://radar.cloudflare.com/dns"><u>DNS-related insights to Radar</u></a>, based on analysis of traffic to our <a href="https://one.one.one.one/"><u>1.1.1.1</u></a> Public DNS Resolver.</p><p>Building on this, today we are launching a <a href="https://radar.cloudflare.com/tlds"><u>new TLD page</u></a> on Radar that, based on aggregated data from multiple Cloudflare services, provides insights into TLD popularity, activity, and security, along with links directly into <a href="https://domains.cloudflare.com/"><u>Cloudflare Registrar</u></a> to enable users to register domain names in <a href="https://domains.cloudflare.com/tlds"><u>supported TLDs</u></a>.</p>
    <div>
      <h2>Initial security-related insights</h2>
      <a href="#initial-security-related-insights">
        
      </a>
    </div>
    <p>Before today, Radar already offered insights into TLDs, though these were distributed across a couple of different pages and datasets.</p><p>In March 2024, when we <a href="https://blog.cloudflare.com/email-security-insights-on-cloudflare-radar/"><u>launched</u></a> the <a href="https://radar.cloudflare.com/security/email"><u>Email Security page</u></a>, we introduced the <a href="https://radar.cloudflare.com/security/email#most-observed-tlds"><u>“Most abused TLDs”</u></a> metric. This chart highlights TLDs associated with the largest shares of malicious and spam email. The analysis is based on the sending domain’s TLD, extracted from the <code>From:</code> header in email messages, with data sourced from <a href="https://www.cloudflare.com/zero-trust/products/email-security/"><u>Cloudflare’s cloud email security service</u></a>.</p>
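Extracting the sending domain's TLD from a <code>From:</code> header can be sketched as below. This naive version simply takes the last dot-separated label; the actual pipeline is not described here, and a production system would consult the Public Suffix List to handle multi-label suffixes such as <code>co.uk</code>.

```python
from email.utils import parseaddr

def sending_tld(from_header: str) -> str:
    """Naively extract the sending domain's TLD from an email From: header."""
    _, addr = parseaddr(from_header)          # '"Alice" <a@b.shop>' -> 'a@b.shop'
    domain = addr.rsplit("@", 1)[-1]          # keep everything after the last @
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()  # last label = TLD

print(sending_tld('"Alice" <alice@mail.example.shop>'))  # shop
```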
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53HpBXjJBYPbDq72R1e5WG/8d56e5518b5f2aa7771af494a95a49a3/image10.png" />
          </figure><p>More recently, during 2025’s Birthday Week, we <a href="https://blog.cloudflare.com/new-regional-internet-traffic-and-certificate-transparency-insights-on-radar/#introducing-certificate-transparency-insights-on-radar"><u>introduced</u></a> <a href="https://radar.cloudflare.com/certificate-transparency"><u>Certificate Transparency (CT) insights</u></a> on Radar, leveraging data from <a href="https://developers.cloudflare.com/radar/glossary/#certificate-transparency"><u>CT logs</u></a> monitored by Cloudflare. One highlight is the <a href="https://radar.cloudflare.com/certificate-transparency#certificate-coverage"><u>Certificate Coverage</u></a> section, which visualizes the distribution of pre-certificates across the top 10 TLDs. These insights give a different perspective on TLD activity, complementing email-based metrics by showing which domains are actively securing web traffic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/595UGFz1v7EJN2iy7G09WT/60b65333882e612b0949a4299c6bb138/image6.png" />
          </figure>
    <div>
      <h2>A new aggregate overview based on DNS Magnitude</h2>
      <a href="#a-new-aggregate-overview-based-on-dns-magnitude">
        
      </a>
    </div>
    <p>Today, we’re excited to announce the new <a href="http://radar.cloudflare.com/tlds"><u>TLD page</u></a> on Radar. The landing page and the dedicated per-TLD pages provide TLD managers and site owners with a perspective on the relative popularity of TLDs they manage or may be considering domains in, as well as insights into TLD traffic volume and distribution.</p><p>Located under the DNS menu, the landing page introduces a ranking of top-level domains based on <a href="https://www.icann.org/en/system/files/files/dns-magnitude-05aug20-en.pdf"><u>DNS Magnitude</u></a> — a metric originally developed by <a href="https://www.nic.at/media/files/pdf/dns-magnitude-paper-20200601.pdf"><u>nic.at</u></a> to estimate a domain’s overall visibility on the Internet.</p><p>Instead of simply counting the total number of DNS queries, DNS Magnitude incorporates a sense of how many unique clients send queries to domains within the TLD. This approach gives a more accurate picture of a TLD’s reach, since a small number of sources can generate a large number of queries. Our ranking is based on queries observed at Cloudflare’s 1.1.1.1 resolver. We aggregate individual client IP addresses into subnets, referred to here as "networks".</p><p>The magnitude value ranges from 0 to 10, with higher values (closer to 10) indicating that the TLD is queried by a broader range of networks. This reflects greater global visibility and, in some cases, a higher likelihood of name collision across different systems. <a href="https://www.icann.org/resources/pages/name-collision-2013-12-06-en"><u>According to ICANN</u></a>, a name collision occurs when an attempt to resolve a name used in a private name space (such as under a non-delegated Top-Level Domain) results in a query to the public <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a>. 
When the administrative boundaries of private and public namespaces overlap, name resolution may yield unintended or harmful results. For example, if ICANN were to delegate <code>.home</code>, that could cause significant issues for hobbyists that use the (currently non-delegated) TLD within their local networks.</p><p>$Magnitude=\frac{ln(unique\ networks\ querying\ the\ TLD)}{ln(all\ unique\ networks)}*10$</p><p>The table displays a paginated ranking of the top 2,500 TLDs, along with several key attributes. Each entry includes the TLD itself — which links to a dedicated page for delegated TLDs — as well as its type:</p><ul><li><p><a href="http://radar.cloudflare.com/tlds?q=gTLD"><u>gTLD</u></a> (generic TLD): used for general purposes, such as <a href="https://radar.cloudflare.com/tlds/com"><code><u>.com</u></code></a> or<code> </code><a href="https://radar.cloudflare.com/tlds/info"><code><u>.info</u></code></a>.</p></li><li><p><a href="http://radar.cloudflare.com/tlds?q=grTLD"><u>grTLD</u></a> (generic restricted TLD): limited to specific communities or uses, such as<code> </code><a href="https://radar.cloudflare.com/tlds/name"><code><u>.name</u></code></a>.</p></li><li><p><a href="http://radar.cloudflare.com/tlds?q=ccTLD"><u>ccTLD</u></a> (country code TLD): assigned to individual countries or territories, such as<code> </code><a href="https://radar.cloudflare.com/tlds/uk"><code><u>.uk</u></code></a> or <a href="https://radar.cloudflare.com/tlds/jp"><code><u>.jp</u></code></a>.</p></li><li><p><a href="http://radar.cloudflare.com/tlds?q=iTLD"><u>iTLD</u></a> (infrastructure TLD): reserved for technical infrastructure, such as <a href="https://radar.cloudflare.com/tlds/arpa"><code><u>.arpa</u></code></a>.</p></li><li><p><a href="http://radar.cloudflare.com/tlds?q=sTLD"><u>sTLD</u></a> (sponsored TLD): operated by a sponsoring organization representing a defined community, such as <a href="https://radar.cloudflare.com/tlds/edu"><code><u>.edu</u></code></a> or <a 
href="https://radar.cloudflare.com/tlds/gov"><code><u>.gov</u></code></a>.</p></li></ul><p>The status column indicates whether the TLD is delegated, meaning it is officially assigned and active in the <a href="https://www.iana.org/domains/root/db"><u>root zone</u></a> of the DNS, or non-delegated, meaning it is not currently part of the public DNS. The table also shows the manager of each TLD — typically the organization or registry responsible for its operation — and the corresponding DNS magnitude value.</p><p>While the top 10 TLDs include stalwarts such as <a href="https://radar.cloudflare.com/tlds/com"><code><u>.com</u></code></a>/<a href="https://radar.cloudflare.com/tlds/net"><code><u>.net</u></code></a>/<a href="https://radar.cloudflare.com/tlds/org"><code><u>.org</u></code></a> and ccTLDs that have been commercially repurposed, such as <a href="https://radar.cloudflare.com/tlds/io"><code><u>.io</u></code></a>/<a href="https://radar.cloudflare.com/tlds/co"><code><u>.co</u></code></a>/<a href="https://radar.cloudflare.com/tlds/tv"><code><u>.tv</u></code></a>, the TLD at the top of the list may be a bit surprising: <a href="https://en.wikipedia.org/wiki/.su"><code><u>.su</u></code></a>.</p><p>This TLD was delegated for the Soviet Union back in 1990, but its use waned after the dissolution of the USSR, with constituent republics becoming independent and using their own dedicated ccTLDs. (ICANN reportedly <a href="https://domainnamewire.com/2025/03/11/icann-moves-to-retire-soviet-era-su-country-domain-name/"><u>plans to retire</u></a> <code>.su </code>in 2030.) Looking at a single day’s worth of data, the<code> .su</code> TLD does not rank #1 by unique networks. However, over a longer period of time, such as seven days, it sees queries from more unique networks than other TLDs, placing it atop the magnitude list. Further analysis of the top hostnames observed within this TLD suggests that they are mostly associated with a popular online world-building game. 
Interestingly, over half of the queries for <code>.su</code> domains <a href="https://radar.cloudflare.com/tlds/su#geographical-distribution"><u>come from</u></a> the United States, Germany, and Brazil.</p>
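<p>As a concrete illustration, the DNS magnitude formula above is simple to compute. Here is a minimal Python sketch; the network counts used below are made-up values for illustration, not real Radar data:</p>

```python
import math

def dns_magnitude(querying_networks: int, all_networks: int) -> float:
    """DNS magnitude: log of the unique networks querying a TLD, normalized
    by the log of all unique networks observed, scaled to a 0-10 range."""
    if all_networks < 2 or querying_networks < 1:
        raise ValueError("need at least one querying network and two total")
    return math.log(querying_networks) / math.log(all_networks) * 10

# The logarithms compress the scale: a TLD queried by half of 100,000
# observed networks still scores close to the maximum of 10.
print(round(dns_magnitude(50_000, 100_000), 2))  # 9.4
```

<p>Because the score grows logarithmically, small differences near the top of the ranking correspond to very large differences in the number of querying networks.</p>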
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3L7ya17Ef98tXD8oBnU8SG/e69c02bf749993a9e89d2e9ad7a6d037/image1.png" />
          </figure>
    <div>
      <h2>More detailed TLD insights</h2>
      <a href="#more-detailed-tld-insights">
        
      </a>
    </div>
    <p>The new TLD section also offers <a href="https://radar.cloudflare.com/tlds/com"><u>dedicated pages</u></a> for individual TLDs. By clicking on a TLD in the DNS Magnitude table or searching for a TLD in the top search bar, users can access a page with detailed insights and information about that TLD. It’s important to note that while non-delegated TLDs are included in the DNS Magnitude ranking, TLD-specific pages are only available for delegated TLDs. The list of delegated TLDs, along with their type and manager, is sourced from the <a href="https://www.iana.org/domains/root/db"><u>IANA’s Root Zone Database</u></a>.</p><p>When a user enters an individual TLD page, they see two main cards. The first card provides general information about the TLD, including its type, manager, DNS magnitude value, DNSSEC support, and RDAP support. DNSSEC support is determined by checking whether the TLD has a <a href="https://www.cloudflare.com/learning/dns/dns-records/dnskey-ds-records/"><u>Delegation Signer (DS) record</u></a> in the <a href="https://www.internic.net/domain/root.zone"><u>root zone</u></a>. We also parse the record to get the associated <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC algorithm</u></a>. <a href="https://developers.cloudflare.com/registrar/account-options/whois-redaction/#what-is-rdap"><u>RDAP</u></a> support is indicated if the TLD is listed in the <a href="https://data.iana.org/rdap/dns.json"><u>IANA RDAP bootstrap file</u></a>. RDAP (Registration Data Access Protocol) is a new standard for querying domain contact and nameserver information for all registered domains.</p><p>The second card contains <a href="https://www.cloudflare.com/learning/dns/what-is-domain-privacy/"><u>WHOIS</u></a> data for the TLD, including its creation date, the date of the last update, and the list of nameservers. 
If the TLD is supported by Cloudflare Registrar, an additional card appears, giving users direct access to registration options. As of today, Cloudflare Registrar supports <a href="https://domains.cloudflare.com/tlds"><u>over 400 TLDs</u></a>.</p>
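<p>As described above, a TLD is shown as supporting RDAP if it appears in the IANA RDAP bootstrap file. That file (defined in RFC 7484) maps groups of TLDs to their RDAP base URLs, so the check is a simple membership lookup. A Python sketch, using an abbreviated, made-up sample rather than the live IANA file:</p>

```python
import json

# Abbreviated sample in the RFC 7484 bootstrap format; the real file lives at
# https://data.iana.org/rdap/dns.json, and the URLs below are placeholders.
SAMPLE_BOOTSTRAP = json.loads("""
{
  "services": [
    [["com", "net"], ["https://rdap.example-registry.net/v1/"]],
    [["org"],        ["https://rdap.example-org-registry.org/rdap/"]]
  ]
}
""")

def rdap_base_urls(tld: str, bootstrap: dict) -> list:
    """Return the RDAP base URLs listed for a TLD, or [] if none are listed
    (i.e., the TLD would be shown as lacking RDAP support)."""
    label = tld.lower().lstrip(".")
    for tlds, urls in bootstrap["services"]:
        if label in tlds:
            return urls
    return []

print(rdap_base_urls(".com", SAMPLE_BOOTSTRAP))     # listed in the sample
print(rdap_base_urls("example", SAMPLE_BOOTSTRAP))  # not listed: []
```

<p>The same lookup against the real bootstrap file also yields the base URL to which RDAP queries for domains under that TLD should be sent.</p>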
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XoNlzH0pzDmwLay9O5123/44be6f897fea6e3cd94591192915e259/image5.png" />
          </figure><p>Below these cards, the page features the <a href="https://radar.cloudflare.com/tlds/com#dns-query-volume"><u>DNS query volume</u></a> section, which presents insights based on queries to Cloudflare’s 1.1.1.1 resolver for domains under the TLD. This section includes a chart showing DNS queries over the selected time period, along with a donut chart breaking down queries by type, response code, and DNSSEC support. A choropleth map further illustrates the percentage of DNS queries by country, highlighting which regions generate the most queries for domains under the TLD.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6dwNEKbnBrJLDpoIjvSnOf/d47321ed271115889551eaca6f882710/image4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/303ZsAaOZFihRHII7KCW27/c24567953d1949b9d2ef223a98bfa601/image8.png" />
          </figure><p>Each individual TLD page also includes a <a href="https://radar.cloudflare.com/tlds/com#certificate-issuance-volume"><u>Certificate Transparency</u></a> section, offering visibility into <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS/SSL certificate issuance</a> for the TLD. This section displays a line chart showing the total number of certificates issued over the selected period, as well as a donut chart depicting the distribution of certificate issuance among the top Certificate Authorities.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bohRgeA6ieFrAfkX1pMVx/c16be9eeb6da0372f4b251d69cb64e9e/image7.png" />
          </figure><p>When we <a href="https://blog.cloudflare.com/new-dns-section-on-cloudflare-radar/"><u>launched</u></a> the <a href="https://radar.cloudflare.com/dns"><u>DNS page</u></a> earlier in 2025, we provided query volumes by TLDs, but this was limited to ccTLDs. Today, we’re extending that dataset to include all delegated TLDs. With these new insights, we’ve added the <a href="https://radar.cloudflare.com/dns#top-level-domain-distribution"><u>“Top-level domain distribution”</u></a> section to the DNS page, featuring a line chart that shows the distribution of queries to 1.1.1.1 across the top 10 TLDs, alongside a table extending this ranking to the top 100. Not surprisingly, <a href="https://radar.cloudflare.com/tlds/com"><u>.com</u></a> tops the ranking with more than 60% of queries, followed by <a href="https://radar.cloudflare.com/tlds/net"><code><u>.net</u></code></a>, <a href="https://radar.cloudflare.com/tlds/arpa"><code><u>.arpa</u></code></a> (an infrastructure TLD), and <a href="https://radar.cloudflare.com/tlds/org"><code><u>.org</u></code></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/z5LgMRXqhqpMtPFSFlOZ5/331540312793d369b2aab7a88940830e/image3.png" />
          </figure><p>It is also worth noting that Radar search and the API both support <a href="https://en.wikipedia.org/wiki/Punycode"><u>punycode</u></a> (<a href="https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.2.1"><u>A-Label/ASCII-Label</u></a>) and <a href="https://en.wikipedia.org/wiki/Internationalized_domain_name"><u>internationalized domain name (IDN)</u></a> (<a href="https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.2.1"><u>U-Label/UNICODE-Label</u></a>) representations of non-ASCII TLDs. For example, the South Korean TLD <a href="https://www.iana.org/domains/root/db/xn--3e0b707e.html"><u>.한국</u></a> has the U-Label representation 한국 and the A-Label representation <a href="https://radar.cloudflare.com/tlds/xn--3e0b707e"><code><u>xn--3e0b707e</u></code></a>.</p>
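<p>The two representations are mechanically convertible. For instance, Python’s built-in <code>idna</code> codec converts between U-Labels and A-Labels (a quick sketch; the built-in codec implements the older IDNA 2003 rules, which suffice for this label):</p>

```python
# Convert the U-Label of the South Korean IDN TLD to its A-Label and back
# using Python's built-in "idna" codec (IDNA 2003; some newer labels require
# the third-party "idna" package, which implements IDNA 2008).
u_label = "한국"
a_label = u_label.encode("idna").decode("ascii")
print(a_label)  # xn--3e0b707e

print(b"xn--3e0b707e".decode("idna"))  # 한국
```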
    <div>
      <h2>Looking ahead</h2>
      <a href="#looking-ahead">
        
      </a>
    </div>
    <p>Because TLDs are a foundational component of the Domain Name System, it is critical that the associated name servers are highly performant. Based on billions of daily queries to these name servers, we plan to add insights into their performance to Radar’s TLD pages in 2026. These insights will provide TLD managers with an external perspective on query responsiveness, and will give developers and site owners a perspective on the potential impact of the performance of the associated TLD name servers as they look to register new domain names.</p><p>The underlying data for these new TLD pages is available via the <a href="https://developers.cloudflare.com/api/resources/radar/subresources/tlds/"><u>API</u></a> and can be interactively explored in more detail using Radar’s <a href="https://radar.cloudflare.com/explorer?dataSet=dns&amp;groupBy=tld"><u>Data Explorer and AI Assistant</u></a>. And as always, Radar and Data Assistant charts and graphs are downloadable for sharing, and embeddable for use in your own blog posts, websites, or dashboards.</p><p>If you share our TLD charts and graphs on social media, be sure to tag us: <a href="https://x.com/CloudflareRadar"><u>@CloudflareRadar</u></a> (X), <a href="https://noc.social/@cloudflareradar"><u>noc.social/@cloudflareradar</u></a> (Mastodon), and <a href="https://bsky.app/profile/radar.cloudflare.com"><u>radar.cloudflare.com</u></a> (Bluesky). If you have questions or comments, or suggestions for data that you’d like to see us add to Radar, you can reach out to us on social media, or contact us via <a href="#"><u>email</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Registrar]]></category>
            <guid isPermaLink="false">3ByKEmji9raNHTQ39Ui1Xr</guid>
            <dc:creator>André Jesus</dc:creator>
            <dc:creator>David Belson</dc:creator>
        </item>
        <item>
            <title><![CDATA[Data at Cloudflare scale: some insights on measurement for 1,111 interns]]></title>
            <link>https://blog.cloudflare.com/experience-of-data-at-scale/</link>
            <pubDate>Mon, 27 Oct 2025 12:00:00 GMT</pubDate>
            <description><![CDATA[ While large cloud providers hold vast troves of passive network data, analyzing them is complicated. The scale, noise, and absence of definitive ground truth all create major hurdles. Yet by carefully quantifying these constraints and finding alternative forms of evidence, meaningful insights can still emerge. ]]></description>
<content:encoded><![CDATA[ <p>Cloudflare recently announced our goal to hire <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"><u>1,111 interns</u></a> in 2026 — that’s equivalent to about 25% of our full-time workforce. This means countless opportunities to develop and ship working code into production. It also creates novel opportunities to measure aspects of the Internet that are otherwise hard to observe — and more difficult still to understand.</p><p>Measurement is hard, even at Cloudflare, despite the vast amount of data generated by our traffic (much of it published via <a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a>). A common misconception we often hear is, “Cloudflare has so much data that it must have all the answers.” Having a huge amount of data is great — but it also means much more noise to filter out, and lots of additional work to rule out alternative explanations.</p><p>Ram Sundara Raman was an intern at Cloudflare in 2022 as he pursued his PhD. He’s now an assistant professor at the University of California, Santa Cruz, and we’ve invited him back to share his insights about working with data at Cloudflare.</p><p>Ram’s project is a great example of how insights that researchers bring from their <a href="https://breakerspace.cs.umd.edu/"><u>university research lab</u></a> can lay the groundwork for a valuable project at Cloudflare — in this case, detecting and explaining connection failures to customers. One tip for prospective interns: If you’re applying and thinking about data and measurement ideas to work on at Cloudflare, a good question to ponder is if, how, or why <i>your</i> idea might matter to Cloudflare. We love hearing your ideas!</p><p>Without further ado, here’s Ram. We hope his insights are as informative and refreshing to future interns — and practitioners — as they are to us here at Cloudflare.</p>
    <div>
      <h2>Insights from data at large scale might just be a small miracle  </h2>
      <a href="#insights-from-data-at-large-scale-might-just-be-a-small-miracle">
        
      </a>
    </div>
    <p><i>by Ram Sundara Raman, Assistant Professor of Computer Science and Engineering, UC Santa Cruz</i></p><p>Before joining Cloudflare as a research intern in the summer of 2022, I’d worked on multiple network security and privacy research problems as a PhD student at the University of Michigan. My previous experience involved <i>active measurements</i>, in which probes were carefully crafted and transmitted to detect and quantify security issues such as <a href="https://dl.acm.org/doi/10.1145/3419394.3423665"><u>HTTPS interception</u></a> and <a href="https://dl.acm.org/doi/10.1145/3372297.3417883"><u>connection tampering</u></a>. These attacks, performed by powerful network middleboxes between users and Internet servers, can block Internet content and services for numerous users in various regions, and can also reduce their security. For example, <a href="https://dl.acm.org/doi/10.1145/3419394.3423665"><u>the HTTPS Interception Man-in-the-Middle Attack in Kazakhstan in 2019</u></a> was detected in 7-24% of all measurements we performed in the country. </p><p>Detecting such attacks is challenging. The underlying mechanisms are diverse, with both geographic and temporal variations — and they’re entirely opaque. Moreover, the Internet has no technical mechanisms to report to users when their traffic is being manipulated, and third party actors rarely, if ever, are transparent with affected users. </p><p>My active measurement work before Cloudflare helped resolve these challenges. Along with my PI and team at the University of Michigan, I helped develop <a href="https://censoredplanet.org/"><u>Censored Planet</u></a>, one of the largest active Internet censorship observatories, detecting connection tampering in more than 200 countries. However, active measurements face barriers on scale, resources, and real-world view. 
For instance, Censored Planet is only able to measure blocking and connection tampering for the 2,000 most popular websites, simply because of limits on time and resources. </p><p>While working on projects like Censored Planet, I’d often look at large network operators or cloud providers and think: “<i>If only I had my hands on the data they collect, I could solve this problem so easily. They have a global view of real-world traffic from nearly every network, and probably enough resources and telemetry to scale analysis to that level of data. How hard could it be to use this data, for example, to detect when middleboxes interfere?”</i> </p><p>As we learned through <a href="https://research.cloudflare.com/publications/SundaraRaman2023/"><u>our research</u></a> published at <a href="https://www.sigcomm.org/"><u>ACM SIGCOMM’23</u></a> — it can be <i>very</i> hard.</p><p>My perspectives on censorship evolved as a direct result of my experience at Cloudflare, which taught me that detection at scale is hard, even with large-scale data. The research I did during my internship helped reveal that network middleboxes block or otherwise interfere with certain connections not only in limited places, but also at <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>various scales around the world</u></a>. </p>
    <div>
      <h3>An internship project built on real insights, using production data</h3>
      <a href="#an-internship-project-built-on-real-insights-using-production-data">
        
      </a>
    </div>
    <p>In this research, we built upon insights gathered by the wider active measurement community, namely that middleboxes interfere with Internet TCP connections by dropping packets, or injecting RST packets to cause connections to abort. The same insights revealed that the patterns of packet drops and RSTs are deterministic  —  and, as a result, potentially detectable. Such is the flexibility of active measurement: craft a custom request, or ‘probe,’ that elicits a response from the environment. However, such a targeted approach would be difficult to scale and maintain, even for Cloudflare: What probes should be crafted? Where should they be sent? What motivation would Cloudflare have to even try, if the risk of missing so much is so high?  </p><p>The goal of my internship was to see if we could instead flip the approach: to be passive instead of active. Everything Cloudflare does must be both scalable and <i>sustainable</i>. However, it was entirely uncertain whether a system restricted to passive observation could be constructed, even if the tampering events could be detected. The requirement was clear: Only observe and use data that comes to Cloudflare naturally. No mixing in other datasets, no running our own active measurements. Either would have made life easier: we could have controlled the variables, maybe even obtained ground truth that would help us confirm our observations. But where’s the fun in that? Besides, Cloudflare has <i>all</i> the data anyway… right? </p><p>Yes, maybe — if it is sampled appropriately, can be teased out reliably, and correctly interpreted.</p><p>Here’s a useful insight: I’ve often heard people say that finding middleboxes that tamper with Internet connections using active measurements is like finding a needle in a haystack — rare, finicky, and hard to pin down. 
When we started looking at this problem from the lens of Cloudflare’s passive dataset, we quickly realized we were still looking for the same needle — and in some ways, it was now even harder to find.</p><p>That’s because as a passive observer we lose the ability to choose where to look. Also, the haystack now stretches across continents, millions of users, and — I’m not exaggerating here — thousands of ways connections can be made and broken. Not only did we have to identify tampering from millions of real-world data points, we had to do it with data that was full of obstacles and pitfalls. It felt a lot like working with unseen traps and their tripwires. </p>
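<p>To make that “needle” concrete: at its core, the detection approach compares each sampled flow’s early packet behavior against known tampering patterns. A highly simplified, hypothetical Python sketch follows; the signature names and flag sequences are illustrative, not Cloudflare’s actual signatures:</p>

```python
# Hypothetical sketch: compare a flow's inbound TCP flag sequence against
# tampering signatures. Real detection also weighs corroborating evidence,
# such as inconsistent IP TTLs; these patterns are illustrative only.
SIGNATURES = {
    # An RST arriving right after the client's first data packet, with no
    # further traffic, is one classic injection pattern.
    "rst_after_first_data": ["SYN", "ACK", "PSH-ACK", "RST"],
    # The connection simply goes silent after the first data packet.
    "timeout_after_first_data": ["SYN", "ACK", "PSH-ACK"],
}

def match_signatures(flow_flags):
    """Return the names of signatures exactly matching the flow's flags."""
    return [name for name, pattern in SIGNATURES.items() if flow_flags == pattern]

print(match_signatures(["SYN", "ACK", "PSH-ACK", "RST"]))  # ['rst_after_first_data']
```

<p>The hard part, as described below, is not the matching itself but deciding which patterns genuinely indicate tampering amid all the noise.</p>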
    <div>
      <h3>The traps and tripwires of large-scale passive data</h3>
      <a href="#the-traps-and-tripwires-of-large-scale-passive-data">
        
      </a>
    </div>
    <p>There were multiple challenges that I only truly understood once faced with them. Let’s start with the obvious one: <b>scale</b>.</p><p>First, there was a glut of large-scale datasets, primarily associated with incoming connections to Cloudflare. For example, at the time of my internship, Cloudflare was serving more than 45 million HTTP requests per second globally, across more than 285 data centers. Cloudflare also gets TCP connections to its 1.1.1.1 DNS server. We also explored <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Network_Error_Logging"><u>Network Error Logging</u><b><u> </u></b><u>(NEL)</u></a> data. Usually, in measurement research, we’re dealing with the issue of <i>too little scale. </i>Here, we had the opposite problem: too much of a good thing. In practice, each of these datasets had their own independent sampling methods, making it all but impossible to utilize them all together. Moreover, datasets like NEL are biased since only some clients support it, and because only some websites enable it. After evaluating these biases, NEL did not make the final cut. </p><p>To manage the scale, we constructed special <a href="https://blog.cloudflare.com/tcp-resets-timeouts/#first-sample-connections"><u>IPTABLES rules</u></a> to log and store incoming TCP connections across all of Cloudflare’s points of presence — every server in each of 285 datacenters. However, due to the extremely large scale of the data, we had to limit ourselves to work with a uniformly random sample of one in every 10,000 connections. For each sample, we only logged the first 10 inbound packets of each connection. That meant we could not detect certain infrequent types of tampering, or any tampering that occurs later in a flow, after the first 10 packets. </p><p>Still, within those constraints, we managed to develop tampering signatures — distinctive packet patterns that reveal when middleboxes interfere. 
However, developing these signatures was anything but straightforward, due to the second tripwire: <b>noisy data. </b></p><p>It’s difficult to imagine that we could have anticipated all the different sources of noise. For example, the resolution of time-keeping in event records was milliseconds, but many packets could arrive in a single millisecond, which meant we could not trust the ordering of logged packets. We eventually learned that some denial-of-service attack traffic, as well as port scans, can look eerily like tampering events, and certain “best practices” designed to help improve the Internet, such as <a href="https://datatracker.ietf.org/doc/html/rfc6555"><u>Happy Eyeballs</u></a>, became quirks that messed with our detection. We spent a lot of time analyzing these sources of noise and iterating on our signatures to understand them. We accepted events as tampering only if supported by other sources of evidence that we identified, including but not limited to inconsistent changes in the Time-To-Live (TTL) field in the IP header.</p><p>That brings me to our last tripwire: a <b>lack of ground truth.</b></p><p>Without active, controlled experiments, it would have been extremely difficult for us to confirm when something we detected was indeed tampering, and not one of the thousand other phenomena on the Internet. Fortunately, thanks to the <a href="https://censorbib.nymity.ch/"><u>amazing work of many researchers in the censorship measurement space</u></a>, we were able to recognize at least some known signals and patterns in the data, and these helped us confirm many cases of tampering. </p><p>There were plenty more tripwires. But the key realization for me was this: While providers have lots of data that can tell you <i>things</i>, it’s incredibly hard to know which thing, how much of it, and about what. Large infrastructure operators see a filtered, sampled, and often partial view of the Internet. 
For example,</p><ul><li><p>Services like Cloudflare can see only which connections were affected and where the connections were initiated, but <i>not who did the tampering;</i></p></li><li><p>It was sometimes possible to understand which domains were blocked, but not always, because the necessary packets can be dropped before they get to Cloudflare;</p></li><li><p>As a passive observer, it’s possible only to see users' activity that is affected, not what <i>could</i> be affected.</p></li></ul><p>For a company that handles a double-digit percentage of Internet websites and services, these were surprising, but understandable, limitations. 

It may seem like the exercise is impossible, but it’s not. It’s just more challenging than I expected it to be. Despite all that, we found ways to extract meaning from chaos. For example, we carefully and painstakingly enumerated all common packet sequences Cloudflare observed, and extracted from them those that might indicate tampering, based on prior work. Moreover, we used signals like the TTL field mentioned above as supporting evidence that these packet signatures did indeed show tampering. </p><p>All of this adds up to a simple but important conclusion: large infrastructure providers are not omniscient.<b> </b>Having a global view can be powerful, but doesn’t automatically translate into <i>easy</i> observations. You can have all the data in the world and still struggle to tell the difference between a middlebox, a security filter, a confused IoT device, and even regular users closing tabs and browsers. </p><p>But that dichotomy is also the beauty of the problem space. Working with imperfect data forces us to be creative, to find patterns in the noise, and to design methods that work despite what’s missing. And no, before you ask, you can’t just throw machine learning at the problem, nor do you need to — even with all the noise, the protocols are tightly specified, meaning patterns can be enumerated easily but must still be debated manually. </p>
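<p>One of the corroborating signals mentioned above, inconsistent TTLs, is cheap to check. A toy Python sketch; the tolerance threshold here is an assumption for illustration, and real traffic requires far more care:</p>

```python
def ttl_suspicious(observed_ttls, tolerance=3):
    """Flag a flow whose packets arrive with widely differing IP TTLs.

    Packets from a single sender normally traverse a similar number of hops,
    so their remaining TTLs cluster tightly. A large spread hints that some
    packets (e.g. injected RSTs) originated from a different point in the
    network. This is supporting evidence, not proof: route changes and load
    balancing can also shift TTLs.
    """
    return max(observed_ttls) - min(observed_ttls) > tolerance

print(ttl_suspicious([53, 54, 53, 37]))  # True: one packet took far fewer hops
print(ttl_suspicious([53, 54, 53, 52]))  # False
```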
    <div>
      <h3>From signatures to Radar: what the production data revealed</h3>
      <a href="#from-signatures-to-radar-what-the-production-data-revealed">
        
      </a>
    </div>
    <p>Using our packet-level samples and <b>19 tampering signatures</b>, we saw distinctive tampering behaviors across hundreds of networks, including being able to track large increases in tampering rates (Figure 1). And it worked because, despite the data’s limits, Cloudflare’s networks let us see the <i>real-world effects</i> of tampering. Also, thanks to the tireless efforts of <a href="https://research.cloudflare.com/about/people/luke-valenta/"><u>Luke Valenta</u></a> and the Cloudflare Radar team, the data from our project is continuously being <a href="https://radar.cloudflare.com/security/network-layer#tcp-resets-and-timeouts"><u>published on Cloudflare Radar</u></a> (Figure 2).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/306MKIUSWYPDewkUmckP4p/74227ea6d9a9f5750d6231e17aaabe0f/image1.png" />
          </figure><p><sup>Figure 1: Increase in match rates of our 19 tampering signatures during a period of nationwide protests in Iran in late-2022.</sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26qYosPoBquXSZrUACTbYp/a9adbfce9c04cb1831f4b8610fe69445/image2.png" />
          </figure><p><sup>Figure 2: Data from our connection tampering research is available live on Radar.</sup></p><p>In the future, though, I think solving challenges like these will require a <b>combination of passive and active probing</b>, using the scale of providers like Cloudflare together with targeted, controlled measurements to paint the full picture of Internet tampering. My team at  <a href="https://randlab.engineering.ucsc.edu/"><u>UCSC’s RANDLab</u></a> and the research group at <a href="https://censoredplanet.org"><u>Censored Planet</u></a> continue to work on this problem, especially asking how we can automatically identify tampering when attacks happen or networks change. </p><p>While collaborations between academia and industry aren’t always straightforward, they hold strong potential to help build a better Internet. If you’re interested in an internship adventure like the one I described, <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Early+Talent"><u>apply today</u></a>! </p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">5plcCyVVqbzFwO2FQx0uGN</guid>
            <dc:creator>Marwan Fayed</dc:creator>
            <dc:creator>Ram Sundara Raman (Guest author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making the Internet observable: the evolution of Cloudflare Radar]]></title>
            <link>https://blog.cloudflare.com/evolution-of-cloudflare-radar/</link>
            <pubDate>Mon, 27 Oct 2025 12:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Radar has evolved significantly since its 2020 launch, offering deeper insights into Internet security, routing, and traffic with new tools and data that help anyone understand what's happening online. ]]></description>
<content:encoded><![CDATA[ <p>The Internet is constantly changing in ways that are difficult to see. How do we measure its health, spot new threats, and track the adoption of new technologies? When we <a href="https://blog.cloudflare.com/introducing-cloudflare-radar/"><u>launched Cloudflare Radar in 2020</u></a>, our goal was to illuminate the Internet's patterns, helping anyone understand what was happening from a security, performance, and usage perspective, based on aggregated data from Cloudflare services. From the start, Internet measurement, transparency, and resilience have been at the core of our mission.</p><p>The launch blog post noted, “<i>There are three key components that we’re launching today: Radar </i><a href="https://blog.cloudflare.com/introducing-cloudflare-radar/#radar-internet-insights"><i><u>Internet Insights</u></i></a><i>, Radar </i><a href="https://blog.cloudflare.com/introducing-cloudflare-radar/#radar-domain-insights"><i><u>Domain Insights</u></i></a><i> and Radar </i><a href="https://blog.cloudflare.com/introducing-cloudflare-radar/#radar-ip-insights"><i><u>IP Insights</u></i></a><i>.</i>” These components have remained at the core of Radar, and they have been continuously expanded and complemented by other data sets and capabilities to support that mission. By shining a brighter light on Internet security, routing, traffic disruptions, protocol adoption, DNS, and now AI, Cloudflare Radar has become an increasingly comprehensive source of information and insights. And despite our expanding scope, we’ve focused on maintaining Radar’s “easy access” by evolving our information architecture, making our search capabilities more powerful, and building everything on top of a powerful, publicly-accessible API.</p><p>Now more than ever, Internet observability matters. New protocols and use cases compete with new security threats. 
Connectivity is threatened not only by errant construction equipment, but also by governments practicing targeted content blocking. Cloudflare Radar is uniquely positioned to provide actionable visibility into these trends, threats, and events with local, network, and global level insights, spanning multiple data sets. Below, we explore some highlights of Radar’s evolution over the five years since its launch, looking at how Cloudflare Radar is building one of the industry’s most comprehensive views of what is happening on the Internet.</p>
    <div>
      <h2>Making Internet security more transparent</h2>
      <a href="#making-internet-security-more-transparent">
        
      </a>
    </div>
    <p>The <a href="https://research.cloudflare.com/"><u>Cloudflare Research</u></a> team takes a practical <a href="https://research.cloudflare.com/about/approach/"><u>approach</u></a> to research, tackling projects that have the potential to make a big impact. A number of these projects have been in the security space, and for three of them, we’ve collaborated to bring associated data sets to Radar, highlighting the impact of these projects.</p><p>The 2025 <a href="https://blog.cloudflare.com/new-regional-internet-traffic-and-certificate-transparency-insights-on-radar/#introducing-certificate-transparency-insights-on-radar"><u>launch</u></a> of the <a href="https://radar.cloudflare.com/certificate-transparency"><u>Certificate Transparency (CT) section on Radar</u></a> was the culmination of several months of collaborative work to expand visibility into key metrics for the Certificate Transparency ecosystem, enabling us to deprecate the original <a href="https://blog.cloudflare.com/a-tour-through-merkle-town-cloudflares-ct-ecosystem-dashboard/"><u>Merkle Town CT dashboard</u></a>, which was launched in 2018. Digital certificates are the foundation of trust on the modern Internet, and Certificate Authorities (CAs) serve as trusted gatekeepers, issuing those certificates, with CT logs providing a public, auditable record of every certificate issued, making it possible to detect fraudulent or mis-issued certificates. The information available in the new CT section allows users to explore information about these certificates and CAs, as well as about the CT logs that capture information about every issued certificate.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7peWlbK1j0Da36jqjlD6rV/4fd7ef53247992078bbc89bd34f18fa9/image3.png" />
          </figure><p>In 2023, members of Cloudflare’s Research team collaborated with outside researchers to publish a paper titled “<a href="https://research.cloudflare.com/publications/SundaraRaman2023/"><u>Global, Passive Detection of Connection Tampering</u></a>”. Among the findings presented in the paper, it noted that globally, about 20% of all connections to Cloudflare close unexpectedly before any useful data exchange occurs. This unexpected closure is consistent with connection tampering by a third party, which may occur, for instance, when repressive governments seek to block access to websites or applications. Working with the Research team, we added visibility into <a href="https://blog.cloudflare.com/tcp-resets-timeouts/"><u>TCP resets and timeouts</u></a> to the <a href="https://radar.cloudflare.com/security/network-layer#tcp-resets-and-timeouts"><u>Network Layer Security page</u></a> on Radar. This graph, such as the example below for Turkmenistan, provides a perspective on potential connection tampering activity globally, and at a country level. Changes and trends visible in this graph can be used to corroborate reports of content blocking and other local restrictions on Internet connectivity.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lxyCbxlW0mUHP9cU0n3Dp/a27081a3926ac4b0917fef1870197fce/image6.png" />
          </figure><p>The research team has been working on post-quantum encryption <a href="https://blog.cloudflare.com/tag/post-quantum/page/2/"><u>since 2017</u></a>, racing improvements in quantum computing to help ensure that today’s encrypted data and communications are resistant to being decrypted in the future. They have led the drive to incorporate post-quantum encryption across Cloudflare’s infrastructure and services, and in 2023 <a href="https://blog.cloudflare.com/post-quantum-crypto-should-be-free/"><u>we announced that it would be included in our delivery services</u></a>, available to everyone and free of charge, forever. However, to take full advantage, support is needed on the client side as well, so to track that, we worked together to add a <a href="https://radar.cloudflare.com/adoption-and-usage#post-quantum-encryption-adoption"><u>graph</u></a> to Radar’s <a href="https://radar.cloudflare.com/adoption-and-usage"><u>Adoption &amp; Usage</u></a> page that tracks the post-quantum encrypted share of HTTPS request traffic. <a href="https://radar.cloudflare.com/adoption-and-usage?dateStart=2024-01-01&amp;dateEnd=2024-01-28#post-quantum-encryption-adoption"><u>Starting 2024 at under 3%</u></a>, it has <a href="https://radar.cloudflare.com/adoption-and-usage?dateStart=2025-10-10&amp;dateEnd=2025-10-16#post-quantum-encryption-adoption"><u>grown to just over 47%</u></a>, thanks to major browsers and code libraries <a href="https://developers.cloudflare.com/ssl/post-quantum-cryptography/pqc-support/"><u>activating post-quantum support by default</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3l2ceulOBO9S3Yytv7wIUr/c24b02ee132b7ced328993e2557cf765/image11.png" />
          </figure>
    <div>
      <h2>Measuring AI bot &amp; crawler activity</h2>
      <a href="#measuring-ai-bot-crawler-activity">
        
      </a>
    </div>
    <p>The rapid proliferation and growth of AI platforms since the launch of OpenAI’s ChatGPT in November 2022 have upended multiple industries. This is especially true for content creators. Over the last several decades, they generally allowed their sites to be crawled in exchange for the traffic that the search engines would send back to them — traffic that could be monetized in various ways. However, two developments have changed this dynamic. First, AI platforms began aggressively crawling these sites to vacuum up content to use for training their models (with no compensation to content creators). Second, search engines have evolved into answer engines, drastically reducing the amount of traffic they send back to sites. This has led content owners to demand <a href="https://blog.cloudflare.com/content-independence-day-no-ai-crawl-without-compensation/"><u>solutions</u></a>.</p><p>Among these solutions is providing customers with increased visibility into how frequently AI crawlers are <a href="https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/">scraping their content</a>, and Radar has built on that to provide aggregated perspectives on this activity. Radar’s <a href="https://radar.cloudflare.com/ai-insights"><u>AI Insights page</u></a> provides graphs based on crawling traffic, including <a href="https://radar.cloudflare.com/ai-insights?industrySet=Finance#http-traffic-by-bot"><u>traffic trends by bot</u></a> and <a href="https://radar.cloudflare.com/ai-insights?industrySet=Finance#crawl-purpose"><u>traffic trends by crawl purpose</u></a>, both of which can be broken out by industry set as well. Customers can compare the traffic trends we show on the dashboard with trends across their industry.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4wNMjFo5eR2gBV78u2ITuD/833d83029224095d22fa0ad96aff9356/image1.png" />
          </figure><p>One key insight is the crawl-to-refer ratio: a measure of how many HTML pages a crawler consumes in comparison to the number of page visits that it refers back to the crawled site. A view into these ratios by platform, and how they change over time, gives content creators insight into just how significant the reciprocal traffic imbalances are, and the impact of the ongoing transition of search engines into answer engines.</p>
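<p>As a back-of-the-envelope illustration, the ratio itself is simply pages crawled divided by visits referred. The sketch below uses invented counts, not Radar data:</p>

```python
# Illustrative crawl-to-refer ratio: HTML pages a platform crawled,
# divided by the page visits it referred back to the crawled site.
# The example counts are invented.
def crawl_to_refer_ratio(pages_crawled, visits_referred):
    if visits_referred == 0:
        # Crawling with nothing referred back: the imbalance is unbounded.
        return float("inf")
    return pages_crawled / visits_referred

# A platform that crawls 38,000 pages while referring only 1,000 visits:
print(crawl_to_refer_ratio(38_000, 1_000))  # 38.0
```

<p>A rising ratio over time means a platform is consuming ever more content for each visit it sends back.</p>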
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fGlbFfPnuhaizCNZ5Wlr5/4e75c7fbb317428bffa5ea915d2ca428/image5.png" />
          </figure><p>Over the past three decades, the humble <a href="https://www.robotstxt.org/robotstxt.html"><u>robots.txt file</u></a> has served as something of a gatekeeper for websites, letting crawlers know <a href="https://www.cloudflare.com/learning/ai/how-to-block-ai-crawlers/">if they are allowed to access content</a> on the site, and if so, which content. Well-behaved crawlers read and parse the file, and adjust their crawling activity accordingly. Based on the robots.txt files found across Radar’s top 10,000 domains, Radar’s AI Insights page shows <a href="https://radar.cloudflare.com/ai-insights#ai-user-agents-found-in-robotstxt"><u>how many of these sites explicitly allow or disallow these AI crawlers to access content</u></a>, and how complete that access/restriction is. With the ability to filter the data by domain category, this graph can provide site owners with visibility into how their peers may be dealing with these AI crawlers.</p>
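<p>To make the mechanics concrete, here is a minimal sketch of how a well-behaved crawler evaluates robots.txt directives, using Python’s standard-library parser. The bot name and URLs are illustrative, not a real crawler or site:</p>

```python
# Sketch: evaluating robots.txt directives the way a well-behaved crawler
# would, via Python's standard urllib.robotparser. "ExampleAIBot" and the
# URLs are illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: ExampleAIBot",   # single out one AI crawler...
    "Disallow: /",                # ...and bar it from all content
    "",
    "User-agent: *",              # every other crawler...
    "Allow: /",                   # ...may access everything
])

print(rp.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True
```

<p>Note that robots.txt is purely advisory: these directives express what a site requests, not what every crawler actually honors.</p>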
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r1iT7cCKr1OeCx3XlsFVq/78d101224e54673cfd84e513c73f6527/image8.png" />
          </figure>
    <div>
      <h2>Improving Internet resilience with routing visibility</h2>
      <a href="#improving-internet-resilience-with-routing-visibility">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/network-layer/what-is-routing/"><u>Routing</u></a> is the process of selecting a path across one or more networks, and in the context of the Internet, routing selects the paths for <a href="https://www.cloudflare.com/learning/ddos/glossary/internet-protocol/"><u>Internet Protocol (IP)</u></a> packets to travel from their origin to their destination. It is absolutely critical to the functioning of the Internet, but lots of things can go wrong, and when they do, they can take a whole network offline. (And depending on the network, a <a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/"><u>larger blast radius</u></a> of sites, applications, and other service providers may be impacted.)</p><p>Routing visibility provides insights into the health of a network, and its relationship to other networks. These insights can help identify or troubleshoot problems when they occur. Among the more significant things that can go wrong are route leaks and origin hijacks. <a href="https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/#about-bgp-and-route-leaks"><u>Route leaks</u></a> occur when a routing announcement propagates beyond its intended scope — that is, when the announcement reaches networks that it shouldn’t. 
An <a href="https://blog.cloudflare.com/bgp-hijack-detection/#what-is-bgp-origin-hijacking"><u>origin hijack</u></a> occurs when an attacker creates fake announcements for a targeted prefix, falsely identifying an <a href="https://developers.cloudflare.com/radar/glossary/#autonomous-systems"><u>autonomous system (AS)</u></a> under their control as the origin of the prefix — in other words, the attacker claims that their network is responsible for a given set of IP addresses, which would cause traffic to those addresses to be routed to them.</p><p>In 2022 and 2023 respectively, we added <a href="https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/"><u>route leak</u></a> and <a href="https://blog.cloudflare.com/bgp-hijack-detection/"><u>origin hijack</u></a> detection to <a href="https://radar.cloudflare.com/routing#routing-anomalies"><u>Radar</u></a>, providing network operators and other interested groups (such as researchers) with information to help identify which networks may be party to such events, whether as a leaker/hijacker, or a victim. And perhaps more importantly, in 2023 we also <a href="https://blog.cloudflare.com/traffic-anomalies-notifications-radar/#notifications-overview"><u>launched notifications</u></a> for route leaks and origin hijacks, automatically notifying subscribers via email or webhook when such an event is detected, enabling them to take immediate action.</p>
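<p>Conceptually, origin-hijack detection compares observed announcements against the origins expected for each prefix. The toy sketch below (using documentation-range prefixes and invented AS numbers, and far simpler than any production pipeline) shows the core comparison:</p>

```python
# Toy origin-hijack check: flag BGP announcements whose origin AS does
# not match the expected origin for a prefix. Real detection must also
# handle legitimate multi-origin prefixes, RPKI data, and much more; the
# prefixes and ASNs here are documentation examples, not real routing data.
EXPECTED_ORIGINS = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
}

def flag_possible_hijacks(announcements):
    """announcements: iterable of (prefix, origin_asn) pairs."""
    alerts = []
    for prefix, origin in announcements:
        expected = EXPECTED_ORIGINS.get(prefix)
        if expected is not None and origin != expected:
            alerts.append({"prefix": prefix, "observed": origin, "expected": expected})
    return alerts

observed = [("192.0.2.0/24", 64500), ("198.51.100.0/24", 64666)]
print(flag_possible_hijacks(observed))
# [{'prefix': '198.51.100.0/24', 'observed': 64666, 'expected': 64501}]
```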
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31Q0SVrOitlfKiw4jkT0Po/298ad31e36807c3ebc89aa1adfb149f8/image7.png" />
          </figure><p>In 2025, we further improved this visibility by adding two additional capabilities. The first was <a href="https://blog.cloudflare.com/bringing-connections-into-view-real-time-bgp-route-visibility-on-cloudflare/"><u>real-time BGP route visibility</u></a>, which illustrates how a given network prefix is connected to other networks — what is the route that packets take to get from that set of IP addresses to the large “tier 1” network providers? Network administrators can use this information when facing network outages, implementing new deployments, or investigating route leaks.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69wWKFXVxN91YMHydKj8P9/858e3d8d0b90fbf737bb3b0b195b4885/image4.png" />
          </figure><p>An <a href="https://www.apnic.net/manage-ip/using-whois/guide/as-set/"><u>AS-SET</u></a> is a grouping of related networks, historically used for multiple purposes such as grouping together a list of downstream customers of a particular network provider. Our recently announced <a href="https://blog.cloudflare.com/monitoring-as-sets-and-why-they-matter/"><u>AS-SET monitoring</u></a> enables network operators to monitor valid and invalid AS-SET memberships for their networks, which can help prevent misuse and issues like route leaks.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sydRT2tCT7VDJ84S87z7t/1386688e7fb96b477dcf56fbaae090ca/image10.png" />
          </figure>
    <div>
      <h2>Not just pretty pictures</h2>
      <a href="#not-just-pretty-pictures">
        
      </a>
    </div>
    <p>While Radar has been historically focused on providing clear, informative visualizations, we have also launched capabilities that let users get at the underlying data directly and use it in a more programmatic fashion. The most important one is the <a href="https://developers.cloudflare.com/api/resources/radar/"><u>Radar API</u></a>, <a href="https://blog.cloudflare.com/radar2/#sharing-insights"><u>launched in 2022</u></a>. With just an access token, users can access all the data shown on Radar, as well as some more advanced filters that provide more specific data, enabling them to incorporate Radar data into their own tools, websites, and applications. The example below shows a simple API call that returns the global distribution of human and bot traffic observed over the last 24 hours.</p>
            <pre><code>curl -X 'GET' \
'https://api.cloudflare.com/client/v4/radar/http/summary/bot_class?name=main&amp;dateRange=1d' \
-H 'accept: application/json' \
-H "Authorization: Bearer $TOKEN"</code></pre>
            
            <pre><code>{
  "success": true,
  "errors": [],
  "result": {
    "main": {
      "human": "72.520636",
      "bot": "27.479364"
    },
    "meta": {
      "dateRange": [
        {
          "startTime": "2025-10-19T19:00:00Z",
          "endTime": "2025-10-20T19:00:00Z"
        }
      ],
      "confidenceInfo": {
        "level": null,
        "annotations": []
      },
      "normalization": "PERCENTAGE",
      "lastUpdated": "2025-10-20T19:45:00Z",
      "units": [
        {
          "name": "*",
          "value": "requests"
        }
      ]
    }
  }
}</code></pre>
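<p>The same call is straightforward to make programmatically. The sketch below uses only Python’s standard library; it assumes your token is available in a <code>CLOUDFLARE_API_TOKEN</code> environment variable, and the parsing assumes the response shape shown above:</p>

```python
# Sketch: call the Radar bot_class summary endpoint and pull out the
# human/bot shares. Assumes a token in the CLOUDFLARE_API_TOKEN
# environment variable and the response shape shown above.
import json
import os
import urllib.request

def fetch_bot_class_summary(date_range="1d"):
    url = ("https://api.cloudflare.com/client/v4/radar/http/summary/"
           "bot_class?name=main&dateRange=" + date_range)
    req = urllib.request.Request(url, headers={
        "Accept": "application/json",
        "Authorization": "Bearer " + os.environ["CLOUDFLARE_API_TOKEN"],
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_shares(payload):
    """Return (human %, bot %) from a bot_class summary response."""
    main = payload["result"]["main"]
    return float(main["human"]), float(main["bot"])
```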
            <p>The <a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/"><u>Model Context Protocol</u></a> is a standard way to make information available to <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/"><u>large language models (LLMs)</u></a>. Somewhat similar to the way an application programming interface (API) works, MCP offers a documented, standardized way for a computer program to integrate services from an external source. It essentially allows <a href="https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/"><u>AI</u></a> programs to exceed their training, enabling them to incorporate new sources of information into their decision-making and content generation, and helps them connect to external tools. The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/radar#cloudflare-radar-mcp-server-"><u>Radar MCP server</u></a> allows MCP clients to gain access to Radar data and tools, enabling exploration using natural language queries.</p><p>Radar’s <a href="https://radar.cloudflare.com/scan"><u>URL Scanner</u></a> has proven to be one of its most popular tools, <a href="https://blog.cloudflare.com/building-urlscanner/"><u>scanning millions of sites</u></a> since launching in 2023. It allows users to safely determine whether a site may contain malicious content, as well as providing information on technologies used and insights into the site’s headers, cookies, and links. In addition to being available on Radar, it is also accessible through the API and MCP server.</p><p>Finally, Radar’s user interface has seen a number of improvements over the last several years, in service of improved usability and a better user experience. As new data sets and capabilities are launched, they are added to the search bar, allowing users to search not only for countries and ASNs, but also IP address prefixes, certificate authorities, bot names, IP addresses, and more. 
Initially launching with just a few default date ranges (such as last 24 hours and last 7 days), we’ve since expanded the default options and enabled users to select custom date ranges of up to one year in length. And because the Internet is global, Radar should be too. In 2024, we <a href="https://blog.cloudflare.com/cloudflare-radar-localization-journey/"><u>launched internationalized versions of Radar</u></a>, making the site available in 14 languages/dialects, including downloaded and embedded content.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zVcB5Wy98ekCJY0wAsx8e/77857f7fe3519a508c3db50a19432e08/image9.png" />
          </figure><p>This is a sampling of the updates and enhancements that we have made to Radar over the last five years in support of Internet measurement, transparency, and resilience. These individual data sets and tools combine to provide one of the most comprehensive views of the Internet available. And we’re not close to being done. We’ll continue to bring additional visibility to the unseen ways that the Internet is changing by adding more tools, data sets, and visualizations, to help users answer more questions in areas including AI, performance, adoption and usage, and security.</p><p>Visit <a href="http://radar.cloudflare.com"><u>radar.cloudflare.com</u></a> to explore all the great data sets, capabilities, and tools for yourself, and to use the Radar <a href="https://developers.cloudflare.com/api/resources/radar/"><u>API</u></a> or <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/radar#cloudflare-radar-mcp-server-"><u>MCP server</u></a> to incorporate Radar data into your own tools, sites, and applications. Keep an eye on the <a href="https://developers.cloudflare.com/changelog/?product=radar"><u>Radar changelog feed</u></a>, <a href="https://developers.cloudflare.com/radar/release-notes/"><u>Radar release notes</u></a>, and the <a href="https://blog.cloudflare.com/tag/cloudflare-radar/"><u>Cloudflare blog</u></a> for news about the latest changes and launches, and don’t hesitate to <a href="#"><u>reach out to us</u></a> with feedback, suggestions, and feature requests.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Routing]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">4hyomcz7ZJG76L799PaqhJ</guid>
            <dc:creator>David Belson</dc:creator>
        </item>
        <item>
            <title><![CDATA[Internet measurement, resilience, and transparency: blog takeover from Cloudflare Research and friends]]></title>
            <link>https://blog.cloudflare.com/internet-measurement-resilience-transparency-week/</link>
            <pubDate>Mon, 27 Oct 2025 12:00:00 GMT</pubDate>
            <description><![CDATA[ Coinciding with the ACM’s Internet Measurement Conference, the Cloudflare Research team is publishing a series of posts this week to share their research on building a more measurable, resilient, and transparent Internet. These posts will cover foundational concepts in Internet measurement, Internet resilience, cryptography, and networking.  ]]></description>
            <content:encoded><![CDATA[ <p>The Cloudflare Research team spends our time investigating how we can apply new technologies to continue to help build a better Internet. We don’t just <a href="https://research.cloudflare.com/publications/"><u>write papers</u></a> – we put ideas into practice, and test our hypotheses in real time.</p><p>Our work is deeply collaborative by nature, working closely with academia, standards bodies like the <a href="https://www.ietf.org/"><u>IETF</u></a>, the open-source community, and our own product and engineering teams. We believe in doing this research in the open so that others can learn from it, give us feedback, and work with us to make the next version of the Internet even better. That’s why this week we’re publishing a series of posts to make more of our research public – research that we think will help push forward a more measurable, resilient, and transparent Internet.</p><p>Internet Measurement will be one of the week’s major themes because our posts here coincide with the Association for Computing Machinery (ACM)’s annual <a href="https://conferences.sigcomm.org/imc/2025/"><u>Internet Measurement Conference</u></a>, a venue for new work that measures and analyzes the behavior, performance, and evolution of the Internet and networked systems. Internet measurement is hard to get right, so we’re taking the opportunity to dive deeper into some of the foundational concepts and products that define how we do measurement at Cloudflare scale.   </p><p>Each day this week we share new stories from our Research team and friends in our engineering groups elsewhere at Cloudflare. We will dive deep into Internet measurement data, establish new frameworks for Internet resilience, discuss cryptographic protocols for an increasingly automated web, and explore new advances in networking technologies.</p><p>We’re excited to showcase this work, so stay tuned this week for the posts to follow. Want a preview of what to expect? 
Read on for an outline of what we will cover this week.</p>
    <div>
      <h2>An ode to Internet measurement </h2>
      <a href="#an-ode-to-internet-measurement">
        
      </a>
    </div>
    <p>We’ll start the week with a foundational look at what Internet measurement actually consists of, explaining the jargon behind the science and some of the fundamental tradeoffs one has to make when trying to do measurement well. A former Cloudflare intern will share how working with Cloudflare-scale data completely changed his perspective on detecting connection tampering. We’ll also dig into how Cloudflare Radar has evolved in the past few years, and take a deeper look at how our <a href="https://speed.cloudflare.com/"><u>Internet speed test</u></a> works! </p>
    <div>
      <h2>A better Internet is a more resilient Internet </h2>
      <a href="#a-better-internet-is-a-more-resilient-internet">
        
      </a>
    </div>
    <p>Something that we take for granted, but notice when it fails: a network's ability not just to stay online, but to withstand, adapt to, and rapidly recover from breakdowns – otherwise known as Internet Resilience. There are many factors that can cause Internet disruption, from cyberattack to natural disaster to government-directed shutdowns. We’ll go deeper into these disruptions in our quarterly Internet Disruption Summary, which details the length and impact of each outage as observed from Cloudflare’s network. </p><p>It’s easy to say Internet Resilience is the goal, but it can be harder to define what that actually means. In our blog “A Framework for Internet Resilience,” we do exactly that – establish a framework for how governments, infrastructure providers, and researchers can assess how resilient their infrastructure is, from first principles.   </p><p>A resilient Internet is also immune to quantum compromise. Much has happened since we published our highly cited <a href="https://blog.cloudflare.com/pq-2024/"><u>State of the Post-Quantum Internet</u></a>, so we’ll share an updated view of progress of post-quantum deployment over the past year, as well as a deep dive into Merkle Tree Certificates, an experimental design with Chrome to make post-quantum certificates deployable at scale. </p>
    <div>
      <h2>A transparent look into Cloudflare’s network</h2>
      <a href="#a-transparent-look-into-cloudflares-network">
        
      </a>
    </div>
    <p>Cloudflare sees millions of connections and IP addresses per second – and characterizing them at scale isn’t easy. We’ll take a deeper look at what a connection actually <i>means </i>at Cloudflare: what server-side characteristics we observe and measure across our network, and what they tell us about the size and flow of data through the Internet.</p><p>Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities. That’s why we’re sharing a deep dive into how we bend the limits of our Linux networking stack to be economical with addressing space while maintaining performance.</p><p>All of this theory has real-world applications we’ll dive into: from detecting shared IP space (CGNAT), to defending against DDoS attacks, to improving the efficiency of our cache.   </p>
    <div>
      <h2>Cryptographic protocols for an agentic web</h2>
      <a href="#cryptographic-protocols-for-an-agentic-web">
        
      </a>
    </div>
    <p>The rise of AI agents and AI crawlers is a turning point for infrastructure providers. For instance, traffic from many users is condensed into a few beefy datacenters, and request patterns appear to be more automated as LLMs orchestrate web browsers. Measuring the impact of this shift has become an interesting and complex problem.</p><p>This week, we’ll dive into how honest agents and website operators can work together to stay safe, private, and resilient. We’ll discuss new work being done in the IETF that builds upon <a href="https://blog.cloudflare.com/web-bot-auth/"><u>Web Bot Auth</u></a> – a protocol that allows automated HTTP clients like bots and agents to identify themselves to the rest of the Internet. In addition, in order to empower honest users, we’ll propose new cryptographic protocols that allow them through while protecting websites from DDoS, fraud, or <a href="https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/">scraping attacks</a>. We will present real-world deployment considerations, as well as mechanisms to future-proof them in the face of the imminent post-quantum transition.</p>
    <div>
      <h2>Get your reading glasses on </h2>
      <a href="#get-your-reading-glasses-on">
        
      </a>
    </div>
    <p>Expect blog posts this week that push the boundaries of emerging research in their respective fields, establish new frameworks and ideas, and bridge the gap between academic theory and real-world applications. We couldn’t be more excited to share them with you!</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">10mTvbwbgtwvoeI31cIbnV</guid>
            <dc:creator>Mari Galicer</dc:creator>
        </item>
        <item>
            <title><![CDATA[How does Cloudflare’s Speed Test really work?]]></title>
            <link>https://blog.cloudflare.com/how-does-cloudflares-speed-test-really-work/</link>
            <pubDate>Mon, 27 Oct 2025 12:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post we’ll discuss how Cloudflare thinks about measuring Internet quality, how our own Cloudflare speed test works, and our future plans for providing Internet measurement tools that help everyone build a better Internet. ]]></description>
            <content:encoded><![CDATA[ <p>Anyone can <i>say </i>their Internet service is fast, but how do you really know if it is? Just as we check our temperature to see if a fever has gone down or test the air to know its quality, users of the Internet run speed tests to answer: “How fast is my connection?” Since it is common to talk about Internet connectivity in terms of “speed,” you might think this is a straightforward concept to measure, but there are actually many different ways to do so. For Cloudflare’s Speed Test, we set out to measure your connection’s quality and what it realistically provides, rather than focusing on peak bandwidth. In this blog post we’ll discuss how Cloudflare thinks about measuring Internet quality, how our own Cloudflare speed test works, and our future plans for providing Internet measurement tools that help everyone build a better Internet. </p>
    <div>
      <h2>What is a speed test? </h2>
      <a href="#what-is-a-speed-test">
        
      </a>
    </div>
    <p>Before diving into Cloudflare’s speed test, let’s take a moment to understand what a speed test actually is. There’s no <i>one</i> definition of what Internet “speed” means, but what people are typically referring to is the measurement of <i>throughput</i>, or the rate at which data is sent between sender and receiver within a network. Throughput is typically expressed in megabits or gigabits per second (Mbps or Gbps), which are units that end users are usually familiar with, due to how commercial Internet Service Providers (ISPs) often market their packages (500 Mbps, 1 Gbps, increasingly 10 Gbps and so on). In light of this popular association, speed tests are typically designed to send data until the maximum throughput of a connection is reached.</p><p>Most speed tests are run from end user devices such as laptops, mobile phones and sometimes routers, but where the test sends data <i>to</i>, meaning where the server is in the network, differs from test to test. These variances can impact results dramatically. For example, consider a user in New York City running one speed test that sends data to New Jersey, while another connects to a server in Singapore. Even if both tests use the exact same methodology, their results will differ noticeably due to the distance they have to travel and the network links they have to cross to get there. </p><p>Server locations are one of many ways speed tests vary from one another. They may also differ in how the test decides to send more data, the number of TCP/UDP streams it opens to send data, which congestion control algorithm it uses, how it aggregates the samples it collects, etc. Each of these decisions influences what the end user sees as their final “speed”. 
It is also common for speed tests to measure <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/"><u>latency</u></a>, packet loss and sometimes latency variation (jitter), though as important as they are, and as we’ll discuss in more detail below, these metrics are not always intuitive for end users to understand. </p><p>Speed tests gained popularity in the early days of the Internet, when bandwidth was the primary obstacle to a quality end user experience. But as the Internet has progressed and its use cases have expanded, bandwidth has become less of a limitation and, in some geographies, almost plentiful. Now, other challenges that can degrade your video calls or gaming sessions, such as latency under load (<a href="https://en.wikipedia.org/wiki/Bufferbloat"><u>bufferbloat</u></a>) and packet loss, have become the industry focus as key metrics to optimize when improving Internet connectivity. Nevertheless, speed tests remain a valuable tool for assessing Internet quality, in part because of their popularity with end users. Speed tests are by far the most well-known kind of Internet measurement and for that reason, Cloudflare is proud to provide one.</p>
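<p>The arithmetic behind a throughput figure is simple: bits moved divided by elapsed time, converted to the decimal megabits that ISPs advertise. The numbers below are invented:</p>

```python
# Throughput as speed tests report it: bits transferred per second,
# expressed in decimal megabits (the units ISPs market). The example
# numbers are invented.
def throughput_mbps(bytes_transferred, seconds):
    bits = bytes_transferred * 8
    return bits / seconds / 1_000_000

# Moving 125 MB of data in 2 seconds works out to 500 Mbps:
print(throughput_mbps(125_000_000, 2))  # 500.0
```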
    <div>
      <h2>How does Cloudflare’s Speed Test work? </h2>
      <a href="#how-does-cloudflares-speed-test-work">
        
      </a>
    </div>
    <p>When you visit <a href="https://speed.cloudflare.com/"><u>Cloudflare’s Speed Test</u></a>, results start appearing right away. That’s because as soon as the page loads, your browser begins sending data requests to Cloudflare’s Network Quality API and recording how long each exchange takes. The API runs on Cloudflare’s global network using <a href="https://workers.cloudflare.com/"><u>Workers</u></a>, leveraging our <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>anycast</u></a> architecture to automatically route you to the nearest data center.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PGrZTYBlsK0kW8H8EY97T/abc44cdad820c143756dc4056883b9ee/image4.png" />
          </figure><p>Unlike many other speed test methodologies that focus on absolute maximum throughput, Cloudflare’s Speed Test doesn’t try to saturate your connection. Instead, it sends a series of data payloads of predefined sizes—what we call data blocks—to assess your connection’s quality under more realistic usage patterns. Each data block is transmitted a fixed number of times, and once the sequence completes, the detailed results are displayed in box-and-whisker plots to show the observed ranges and percentiles.</p><p>To generate each individual result, we record the time it takes to establish the connection and the time required for the data transfer to finish, subtracting any server “thinking time”. Establishing a connection involves exchanging individual packets back and forth and happens as quickly as network latency permits, while the data transfer time is limited by network bandwidth, congestion, server limits, and even the amount of data transferred—perhaps surprisingly, smaller transfers also have their throughput limited by network latency.</p><p>As throughput measurements run, the test also sends empty requests at regular intervals to measure loaded latency: the round-trip time (RTT) it takes for data to travel to Cloudflare’s network and back while your connection is busy. Loaded latency differs from idle latency, which measures RTT to Cloudflare’s network when no data is being transferred. Idle latency is recorded first, as soon as the page loads, and reflects the lowest expected latency. The test also measures loaded and idle jitter, the average variation between consecutive RTT measurements—reflecting network stability—and packet loss, the percentage of packets that fail to reach their destination when relayed through a WebRTC TURN server over a period of time.</p><p>Throughout the test, you can watch the aggregate results for each metric update in real time, but the final result isn’t calculated until all test sequences are complete. 
Once they are, the full set of measurements is used to compute an <a href="https://developers.cloudflare.com/speed/aim/"><u>Aggregated Internet Measurement</u></a> (AIM) score—a metric designed to translate your connection’s performance into end-user-friendly terms, such as how well it supports streaming, gaming, or video conferencing. The AIM score provides a convenient summary of overall performance, but in this deep dive, we’ll focus on what the detailed Cloudflare Speed Test results actually tell you—and what they don’t—about your Internet connection.</p>
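<p>The jitter and packet-loss definitions above can be expressed in a few lines. The following is a purely illustrative Python sketch, not Cloudflare's actual implementation, and the RTT samples are invented for the example:</p>

```python
def jitter_ms(rtts):
    """Average variation between consecutive RTT samples, in milliseconds."""
    if len(rtts) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(diffs) / len(diffs)

def packet_loss_pct(sent, received):
    """Percentage of packets that never reached their destination."""
    return 100.0 * (sent - received) / sent

# Hypothetical idle vs. loaded RTT samples, in milliseconds.
idle_rtts   = [12.1, 12.4, 11.9, 12.2, 12.0]
loaded_rtts = [15.3, 22.8, 18.1, 30.5, 19.7]

print(f"idle jitter:   {jitter_ms(idle_rtts):.1f} ms")
print(f"loaded jitter: {jitter_ms(loaded_rtts):.1f} ms")
print(f"packet loss:   {packet_loss_pct(200, 198):.1f}%")
```

<p>Note how the same formula applied to loaded samples yields a much larger jitter value than for idle samples: a busy connection tends to have more variable round-trip times.</p>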
    <div>
      <h2>What do the Cloudflare Speed Test results represent? </h2>
      <a href="#what-do-the-cloudflare-speed-test-results-represent">
        
      </a>
    </div>
    <p>A defining feature of Cloudflare’s Speed Test is that it runs on Cloudflare’s own global network. Other speed test providers place their servers closer to end users or major exchange points to capture how the network performs under specific conditions. Cloudflare’s Speed Test, however—and any test built on our Network Quality API—measures performance in a context that mirrors what users actually do every day: accessing content delivered through Cloudflare’s network.</p><p>Additionally, since Cloudflare’s Speed Test does not strive to saturate a user’s connection, its download and upload tests do not technically measure maximum throughput, but rather the rate at which you can reliably expect to send various sizes of data. While this may seem like a small distinction, it means that Cloudflare’s Speed Test is not trying to show what your connection is capable of at its peak, but rather what it typically delivers—its <i>quality</i>.</p><p>Day to day, most users are not maximizing their available bandwidth. Video conferencing, streaming, web browsing, and even gaming all require relatively little bandwidth and are much more sensitive to latency, jitter, and packet loss. In other words, achieving a high score on a throughput-saturating speed test—one that mirrors the service level you purchased from your ISP—does not necessarily equate to a high-quality online experience. The finer details of which metrics matter most for evaluating network quality depend on individual use cases. For example, a gamer might benefit more from lower latency (lower lag), while a remote worker may benefit more from lower jitter (smoother video conferencing). For the majority of modern use cases, throughput is just one of many metrics that contribute to a quality Internet connection.</p><p>It’s also important to note that Cloudflare’s Speed Test runs primarily from an end-user device, within the browser. 
As a result, its measurements include potential bottlenecks beyond the access network—such as the browser itself, the local Wi-Fi network, and other factors. This means the results don’t solely reflect the performance of your ISP, but rather the combined performance of all components along the path to the content.</p><p>It’s common for end users to run speed tests to check whether they’re getting the Internet service they pay for. While that’s a perfectly reasonable question, there’s no standardized definition for how to answer it. This means that no speed test—including Cloudflare’s—is a definitive measure of ISP service. However, it is a helpful resource for assessing the quality of experience when accessing content delivered by Cloudflare’s vast global network.</p>
    <div>
      <h2>How do I interpret my Cloudflare Speed Test results?</h2>
      <a href="#how-do-i-interpret-my-cloudflare-speed-test-results">
        
      </a>
    </div>
    <p>In this section, we’ll interpret the results from two speed test examples: the first test scoring <b>“Great”</b> on all three <a href="https://developers.cloudflare.com/speed/aim/"><u>network quality rubrics</u></a>, and the second scoring a mere <b>“Average”</b>. In your own tests, you may get a consistent score, or you may get different scores for video streaming, online gaming and video chatting, depending on how well-balanced your Internet connection is over these three use cases.</p><p>From these scores we already get a high-level interpretation of the test results. You can expect consistently good quality from the <b>“Great”</b> connection and reasonable quality with occasional glitches from the <b>“Average”</b> connection – but to understand <i>why</i>, we must look at the numbers.</p>
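<p>Conceptually, a use-case rubric maps the measured metrics to a label. The sketch below is illustrative Python only: the thresholds are invented for this example and are <i>not</i> Cloudflare's actual AIM scoring rubric, which is documented separately:</p>

```python
# Illustrative only: these thresholds are invented for the example and are
# NOT Cloudflare's actual AIM rubric.
def rate_video_chatting(latency_ms, jitter_ms, loss_pct):
    """Toy rubric: real-time calls care most about latency, jitter, and loss."""
    if latency_ms < 30 and jitter_ms < 5 and loss_pct < 0.5:
        return "Great"
    if latency_ms < 80 and jitter_ms < 20 and loss_pct < 2:
        return "Good"
    return "Average"

print(rate_video_chatting(latency_ms=14, jitter_ms=2, loss_pct=0.0))
print(rate_video_chatting(latency_ms=95, jitter_ms=35, loss_pct=1.0))
```

<p>A connection can easily land in different buckets for different use cases: the same latency that is fine for streaming (which buffers ahead) may be disqualifying for gaming.</p>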
    <div>
      <h3>Example 1: Wi-Fi over a residential fiber connection</h3>
      <a href="#example-1-wi-fi-over-a-residential-fiber-connection">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42aa3c4PsRf5R9q3FxF2Af/e0125315a1417cf7040060fa7012a21a/image5.png" />
          </figure><p>This test ran from a laptop connected over Wi-Fi inside a single-family home served by a 500 Mbps residential fiber connection, and we can already see that we can’t quite reach the contracted download speed, topping out at 406 Mbps. The culprit here is Wi-Fi, which is usually the bottleneck on high-speed connections, and a common cause of observable instability.</p><p>But here we can see that we’re probably in an area of the house with good reception and without significant activity from neighboring Wi-Fi networks (the two most common causes of poor Wi-Fi). We can tell from the relatively consistent shape of the download and upload graphs, and from the low jitter.</p><p>The latency is well within what’s expected in an urban area (and could be 2 milliseconds lower by switching to a wired connection), and the difference between the numbers at idle and the numbers while loaded (downloading or uploading) is relatively small. This means you can expect to attend a video call while your files synchronize to and from your cloud drive of choice in the background, without any glitches. Large differences between the idle and loaded numbers are a common indicator of a poor connection—if you observe differences approaching 100 milliseconds or more over a wired connection, your ISP is likely at fault.</p><blockquote><p><i>Higher-bandwidth connections should display lower idle-to-loaded latency differences. The higher the bandwidth, the less likely it is to be fully utilized in practice. However, congestion further upstream in the network can drive these numbers up, especially if your ISP is oversubscribing its capacity.</i></p></blockquote><p>You might be wondering why the download and upload graphs start slow and ramp up. This happens because data transfers progressively send more packets at once for each required acknowledgment, starting with a single packet per acknowledgment. 
The consequence is that small data transfers are limited in speed by latency—the longer it takes for a packet to reach its destination, the longer it takes the acknowledgment packet to make its way back to the sender, and the longer it takes for the next data packet to be sent.</p><p>If you’re technically inclined, you may enjoy learning about <a href="https://en.wikipedia.org/wiki/TCP_congestion_control"><u>congestion control algorithms</u></a>, but that topic alone can fill entire books. For now, you can see this effect in the charts for each download size: transfers smaller than 10 MB can’t utilize the full bandwidth of this connection.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38crAmjPGHbeSQPhCrjUR9/4cb604373b23a92add26224fd22d710b/image2.png" />
          </figure><p>If you’re left wondering if this means that your normal day-to-day web browsing, composed primarily of relatively small data transfers, is mostly unable to fully utilize the available bandwidth above a certain level, then you have successfully grasped one of the reasons why pure speed is no longer the main indicator of quality of experience in modern broadband connections.</p>
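<p>The ramp-up effect can be approximated with a toy slow-start model. This Python sketch is illustrative only: real congestion control algorithms such as CUBIC or BBR behave differently, and it ignores packet loss, receive windows, and protocol overhead. The link parameters echo Example 1:</p>

```python
def transfer_time_s(size_bytes, rtt_s, bandwidth_bps, mss=1500, init_cwnd=10):
    """Toy slow-start model: the window of in-flight packets doubles once
    per round trip until the link is saturated, so small transfers finish
    before they can ever use the full bandwidth."""
    cwnd = init_cwnd                              # window, in packets
    max_cwnd = bandwidth_bps * rtt_s / 8 / mss    # window that fills the pipe
    sent, elapsed = 0, 0.0
    while sent < size_bytes:
        sent += min(cwnd, max_cwnd) * mss
        elapsed += rtt_s
        cwnd = min(cwnd * 2, max_cwnd)
    return elapsed

# Hypothetical 500 Mbps link with 14 ms RTT.
for size_mb in (0.1, 1, 10, 100):
    t = transfer_time_s(size_mb * 1e6, 0.014, 500e6)
    print(f"{size_mb:>5} MB -> {size_mb * 8 / t:.0f} Mbps effective")
```

<p>Even in this simplified model, the smallest transfer completes long before the window reaches link capacity, so its effective throughput is a small fraction of the 500 Mbps line rate, while the 100 MB transfer approaches it.</p>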
    <div>
      <h3>Example 2: Cellular 5G connection</h3>
      <a href="#example-2-cellular-5g-connection">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38bPiw8YyT0zxzUZsbhmcd/acc76a83c616ca7de10b3219b417eca7/image1.png" />
          </figure><p>The second test ran from the same laptop using a cellular 5G connection, and the results are very different. The speeds are much lower and inconsistent over time, the latency numbers are higher (especially under load), and the latency jitter is quite high.</p><p>From the download and upload speeds we can guess that we’re probably not in a densely populated area—in areas of dense 5G coverage you can expect higher speeds and lower latencies. On the other hand, in densely populated areas you can also expect more people to be using the network at the same time, driving speeds down and latencies up (due to congestion). From the detailed latency charts we can observe how irregular latencies are in this case, with some numbers above 100 milliseconds. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1BxebFqhpduYBbCrBeou4p/adebd47d72e8af8bcdb0351f25301baa/image3.png" />
          </figure><p>Connection quality and convenience are often at odds with each other. The convenience of being able to access the Internet from anywhere in your house, or from a park or the beach, comes with quality tradeoffs. The Cloudflare Speed Test report allows you to better understand those tradeoffs, <a href="https://radar.cloudflare.com/quality"><u>compare your results</u></a> against your peers or other available providers, and make more informed choices.</p>
    <div>
      <h2>Why does Cloudflare provide a speed test?</h2>
      <a href="#why-does-cloudflare-provide-a-speed-test">
        
      </a>
    </div>
    <p>Cloudflare provides its speed test to empower end users with greater insight into their connectivity and to help improve the Internet by offering transparency into how it performs. The engine that runs the test is <a href="https://github.com/cloudflare/speedtest"><u>open source</u></a>, which means that anyone can use our speed test to facilitate their own research and can always verify how the results are produced. To enable researchers, policymakers, network operators, and other stakeholders to analyze Internet connectivity, all results from Cloudflare’s Speed Tests are published to Measurement Lab’s <a href="https://www.measurementlab.net/blog/cloudflare-aimscoredata-announcement/"><u>public Internet measurement dataset</u></a> in BigQuery and are also accessible through Cloudflare’s <a href="https://radar.cloudflare.com/quality"><u>Radar API</u></a>. We share this data to advance open Internet research, but every result is anonymized to protect user privacy and is never used for commercial purposes.</p>
    <div>
      <h2>What’s next for Cloudflare’s Speed Test? </h2>
      <a href="#whats-next-for-cloudflares-speed-test">
        
      </a>
    </div>
    <p>Originally developed in 2020, Cloudflare’s speed test has become a go-to resource for measuring end user network quality. In particular, we receive a lot of positive feedback about its easy-to-understand user interface and the metrics that it reports alongside throughput.</p><p>But at Cloudflare, we are always improving – so here’s what we’re planning to make Cloudflare’s speed test even better.</p><p><b>Increased Measurement</b></p><p>We’re continuing to expand the reach and scalability of Cloudflare’s Network Quality API to make it easier for third parties to integrate and use. Our goal is to empower customers to measure their users' connectivity by utilizing Cloudflare's network. We’re already proud to partner with UNICEF, which uses Cloudflare’s Speed Test as part of its Giga project to connect every school in the world to the Internet, and with <a href="https://orb.net/docs/getting-started/what-is-orb"><u>Orb</u></a>, which enables end users to continuously monitor the quality of their Internet connections from any platform or device using Cloudflare’s Network Quality API as part of its diagnostic measurement suite. Throughout 2026, we plan to significantly increase the number of third parties using our Speed Test and Network Quality API to power their own measurement tools and initiatives.</p><p><b>Additional Capabilities</b></p><p>To make the Speed Test more valuable for third parties, we’re also developing new capabilities that enable more detailed performance analysis. This includes support for higher throughput measurements—which, while not the sole indicator of connection quality, remain important for diagnosing network performance, especially in enterprise or shared-office environments where multiple users share the same connection. 
These enhancements will help make our platform a more comprehensive tool for understanding and improving network health.</p><p><b>Improved Diagnostics</b></p><p>Many users turn to speed tests not only to verify that they’re getting the service they’ve paid for, but also to diagnose connectivity issues. We want to make that diagnostic process even more effective. Our goal is to expose richer metrics and more advanced functionality to help users answer key questions, such as: Where’s the bottleneck? Is it within my local network or my ISP’s? Does this issue occur only with specific applications? Is it unique to me, or are others in my region experiencing it too? By providing deeper insight into these questions, we aim to make Cloudflare’s Speed Test a more powerful tool for understanding and improving real-world Internet performance.</p>
    <div>
      <h2>Try It Now</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>Check your connectivity today by running a Cloudflare Speed Test at <a href="http://speed.cloudflare.com"><u>speed.cloudflare.com</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Speed]]></category>
            <guid isPermaLink="false">6JKtIeRYl1gOCq1Jf2yonN</guid>
            <dc:creator>Lai Yi Ohlsen</dc:creator>
            <dc:creator>Carlos Rodrigues</dc:creator>
        </item>
    </channel>
</rss>