
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 19:33:11 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers]]></title>
            <link>https://blog.cloudflare.com/programmable-flow-protection/</link>
            <pubDate>Tue, 31 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Magic Transit customers can now program their own DDoS mitigation logic and deploy it across Cloudflare’s global network. This enables precise, stateful mitigation for custom and proprietary UDP protocols. ]]></description>
            <content:encoded><![CDATA[ <p>We're proud to introduce <a href="https://developers.cloudflare.com/ddos-protection/advanced-ddos-systems/overview/programmable-flow-protection/"><u>Programmable Flow Protection</u></a>: a system designed to let <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a> customers implement their own custom DDoS mitigation logic and deploy it across Cloudflare’s global network. This enables precise, stateful mitigation for custom and proprietary protocols built on UDP. It is engineered to provide the highest possible level of customization and flexibility to mitigate DDoS attacks of any scale. </p><p>Programmable Flow Protection is currently in beta and available to all Magic Transit Enterprise customers for an additional cost. Contact your account team to join the beta or sign up at <a href="https://www.cloudflare.com/en-gb/lp/programmable-flow-protection/"><u>this page</u></a>.</p>
    <div>
      <h3>Programmable Flow Protection is customizable</h3>
      <a href="#programmable-flow-protection-is-customizable">
        
      </a>
    </div>
    <p>Our existing <a href="https://www.cloudflare.com/ddos/"><u>DDoS mitigation systems</u></a> have been designed to understand and protect popular, well-known protocols from DDoS attacks. For example, our <a href="https://developers.cloudflare.com/ddos-protection/advanced-ddos-systems/overview/advanced-tcp-protection/"><u>Advanced TCP Protection</u></a> system uses specific known characteristics about the TCP protocol to issue challenges and establish a client’s legitimacy. Similarly, our <a href="https://blog.cloudflare.com/advanced-dns-protection/"><u>Advanced DNS Protection</u></a> builds a per-customer profile of DNS queries to mitigate DNS attacks. Our generic DDoS mitigation platform also understands common patterns across a variety of other well known protocols, including NTP, RDP, SIP, and many others.</p><p>However, custom or proprietary UDP protocols have always been a challenge for Cloudflare’s DDoS mitigation systems because our systems do not have the relevant protocol knowledge to make intelligent decisions to pass or drop traffic. </p><p>Programmable Flow Protection addresses this gap. Now, customers can write their own <a href="https://ebpf.io/"><u>eBPF</u></a> program that defines what “good” and “bad” packets are and how to deal with them. Cloudflare then runs the program across our entire global network. The program can choose to either drop or challenge “bad” packets, preventing them from reaching the customer’s origin. </p>
    <div>
      <h3>The problem of UDP-based attacks</h3>
      <a href="#the-problem-of-udp-based-attacks">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/"><u>UDP</u></a> is a connectionless transport layer protocol. Unlike TCP, UDP has no handshake or stateful connections. It does not promise that packets will arrive in order or exactly once. UDP instead prioritizes speed and simplicity, and is therefore well-suited for online gaming, VoIP, video streaming, and any other use case where the application requires real-time communication between clients and servers.</p><p>Our DDoS mitigation systems have always been able to detect and mitigate attacks against well-known protocols built on top of UDP. For example, the standard DNS protocol is built on UDP, and each DNS packet has a well-known structure. If we see a DNS packet, we know how to interpret it. That makes it easier for us to detect and drop DNS-based attacks. </p><p>Unfortunately, if we don’t understand the protocol inside a UDP packet’s payload, our DDoS mitigation systems have limited options available at mitigation time. If an attacker <a href="https://www.cloudflare.com/learning/ddos/udp-flood-ddos-attack/"><u>sends a large flood of UDP traffic</u></a> that does not match any known patterns or protocols, Cloudflare can either entirely block or apply a rate limit to the destination IP and port combination. This is a crude “last line of defense” that is only intended to keep the rest of the customer’s network online, and it can be painful in a couple ways. </p><p>First, a block or a generic <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/"><u>rate limit</u></a> does not distinguish good traffic from bad, which means these mitigations will likely cause legitimate clients to experience lag or connection loss — doing the attacker’s job for them! Second, a generic rate limit can be too strict or too lax depending on the customer. 
For example, a customer who expects to receive 1Gbps of legitimate traffic probably needs more aggressive rate limiting compared to a customer who expects to receive 25Gbps of legitimate traffic.</p>
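<p>To make the trade-off concrete, here is a minimal token-bucket rate limiter in plain Python (an illustrative sketch, not Cloudflare's implementation). It caps packets per destination without any notion of which packets are legitimate, so a flood exhausts the budget for good and bad traffic alike:</p>

```python
class TokenBucket:
    """Generic per-destination rate limiter: refills at `rate` tokens per
    second up to `burst`, and spends one token per packet."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill based on elapsed time, then try to spend one token.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per destination IP:port. The limiter cannot tell which
# packets are legitimate: once the bucket is empty, everything drops.
bucket = TokenBucket(rate=5, burst=5)
verdicts = [bucket.allow(now=0.0) for _ in range(10)]  # 10 packets arrive at once
print(verdicts)
```

<p>Only the burst gets through; every later packet in the flood, legitimate or not, is dropped until the bucket refills.</p>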
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/L8PZ6eWn9nkpATaNcUinB/b6c12b4be815fbd4e71166b6f0c30329/BLOG-3182_2.png" />
          </figure><p><sup><i>An illustration of UDP packet contents. A user can define a valid payload and reject traffic that doesn’t match the defined pattern.</i></sup></p><p>The Programmable Flow Protection platform was built to address this problem by allowing our customers to dictate what “good” versus “bad” traffic actually looks like. Many of our customers use custom or proprietary UDP protocols that we do not understand — and now we don’t have to.</p>
    <div>
      <h3>How Programmable Flow Protection works</h3>
      <a href="#how-programmable-flow-protection-works">
        
      </a>
    </div>
    <p>In previous blog posts, we’ve described how “flowtrackd”, our <a href="https://blog.cloudflare.com/announcing-flowtrackd/"><u>stateful network layer DDoS mitigation system</u></a>, protects Magic Transit users from complex TCP and DNS attacks. We’ve also described how we use Linux technologies like <a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"><u>XDP</u></a> and <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a> to efficiently mitigate common types of large scale DDoS attacks. </p><p>Programmable Flow Protection combines these technologies in a novel way. With Programmable Flow Protection, a customer can write their own eBPF program that decides whether to pass, drop, or challenge individual packets based on arbitrary logic. A customer can upload the program to Cloudflare, and Cloudflare will execute it on every packet destined to their network. Programs are executed in userspace, not kernel space, which allows Cloudflare the flexibility to support a variety of customers and use cases on the platform without compromising security. Programmable Flow Protection programs run after all of Cloudflare’s existing DDoS mitigations, so users still benefit from our standard security protections. </p><p>There are many similarities between an XDP eBPF program loaded into the Linux kernel and an eBPF program running on the Programmable Flow Protection platform. Both types of programs are compiled down to BPF bytecode. They are both run through a “verifier” to ensure memory safety and verify program termination. They are also executed in a fast, lightweight VM to provide isolation and stability.</p><p>However, eBPF programs loaded into the Linux kernel make use of many Linux-specific “helper functions” to integrate with the network stack, maintain state between program executions, and emit packets to network devices. 
Programmable Flow Protection offers much of the same functionality, but with a different API tailored specifically to implementing DDoS mitigations. For example, we’ve built helper functions to store state about clients between program executions, perform cryptographic validation, and emit challenge packets to clients. With these helper functions, a developer can use the power of the Cloudflare platform to protect their own network.</p>
    <div>
      <h3>Combining customer knowledge with Cloudflare’s network</h3>
      <a href="#combining-customer-knowledge-with-cloudflares-network">
        
      </a>
    </div>
    <p>Let’s step through an example to illustrate how a customer’s protocol-specific knowledge can be combined with Cloudflare’s network to create powerful mitigations.</p><p>Say a customer hosts an online gaming server on UDP port 207. The game engine uses a proprietary application header that is specific to the game. Cloudflare has no knowledge of the structure or contents of the application header. The customer gets hit by DDoS attacks that overwhelm the game server and players report lag in gameplay. The attack traffic comes from highly randomized source IPs and ports, and the payload data appears to be random as well. </p><p>To mitigate the attack, the customer can use their knowledge of the application header and deploy a Programmable Flow Protection program to check a packet’s validity. In this example, the application header contains a token that is unique to the gaming protocol. The customer can therefore write a program to extract the last byte of the token. The program passes all packets with the correct value present and drops all other traffic:</p>
            <pre><code>#include &lt;linux/ip.h&gt;
#include &lt;linux/udp.h&gt;
#include &lt;arpa/inet.h&gt;

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

// Custom application header
struct apphdr {
    uint8_t  version;
    uint16_t length;   // Length of the variable-length token
    uint8_t  token[0]; // Variable-length token
} __attribute__((packed));

uint64_t
cf_ebpf_main(void *state)
{
    struct cf_ebpf_generic_ctx *ctx = state;
    struct cf_ebpf_parsed_headers headers;
    struct cf_ebpf_packet_data *p;

    // Parse the packet headers with provided helper function
    if (parse_packet_data(ctx, &amp;p, &amp;headers) != 0) {
        return CF_EBPF_DROP;
    }

    // Drop packets not destined to port 207
    struct udphdr *udp_hdr = (struct udphdr *)headers.udp;
    if (ntohs(udp_hdr-&gt;dest) != 207) {
        return CF_EBPF_DROP;
    }

    // Get application header from UDP payload
    struct apphdr *app = (struct apphdr *)(udp_hdr + 1);
    if ((uint8_t *)(app + 1) &gt; headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Read the variable-length token size from the header
    uint16_t token_len = ntohs(app-&gt;length);
    if (token_len == 0) {
        return CF_EBPF_DROP;
    }

    // Perform memory checks to satisfy the verifier
    // and access the token safely
    if ((uint8_t *)(app-&gt;token + token_len) &gt; headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Check the last byte of the token against expected value
    uint8_t *last_byte = app-&gt;token + token_len - 1;
    if (*last_byte != 0xCF) {
        return CF_EBPF_DROP;
    }

    return CF_EBPF_PASS;
}</code></pre>
            <p><sup><i>An eBPF program to filter packets according to a value in the application header.</i></sup></p><p>This program leverages application-specific information to create a more targeted mitigation than Cloudflare is capable of crafting on its own. <b>Customers can now combine their proprietary knowledge with the capacity of Cloudflare’s global network to absorb and mitigate massive attacks better than ever before.</b></p>
    <div>
      <h3>Going beyond firewalls: stateful tracking and challenges</h3>
      <a href="#going-beyond-firewalls-stateful-tracking-and-challenges">
        
      </a>
    </div>
    <p>Many pattern checks, like the one performed in the example above, can be accomplished with traditional firewalls. However, programs provide useful primitives that are not available in firewalls, including variables, conditional execution, loops, and procedure calls. But what really sets Programmable Flow Protection apart from other solutions is its ability to statefully track flows and challenge clients to prove they are real. A common type of attack that showcases these abilities is a <i>replay attack</i>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Pgo9uUQDY1GTrxAOAOgiK/52c6d6a329cce05ff11ba3e4694313b2/BLOG-3182_3.png" />
          </figure><p>In a replay attack, an attacker repeatedly sends packets that were valid at <i>some</i> point, and therefore conform to expected patterns of the traffic, but are no longer valid in the application’s current context. For example, the attacker could record some of their valid gameplay traffic and use a script to duplicate and transmit the same traffic at a very high rate.</p><p>With Programmable Flow Protection, a user can deploy a program that challenges suspicious clients and drops scripted traffic. We can extend our original example as follows:</p>
            <pre><code>#include &lt;linux/ip.h&gt;
#include &lt;linux/udp.h&gt;
#include &lt;arpa/inet.h&gt;

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

uint64_t
cf_ebpf_main(void *state)
{
    // ...
 
    // Get the status of this source IP (statefully tracked)
    uint8_t status;
    if (cf_ebpf_get_source_ip_status(&amp;status) != 0) {
        return CF_EBPF_DROP;
    }

    switch (status) {
        case NONE:
            // Issue a custom challenge to this source IP
            issue_challenge();
            cf_ebpf_set_source_ip_status(CHALLENGED);
            return CF_EBPF_DROP;

        case CHALLENGED:
            // Check if this packet passes the challenge
            // with custom logic
            if (verify_challenge()) {
                cf_ebpf_set_source_ip_status(VERIFIED);
                return CF_EBPF_PASS;
            } else {
                cf_ebpf_set_source_ip_status(BLOCKED);
                return CF_EBPF_DROP;
            }

        case VERIFIED:
            // This source IP has passed the challenge
            return CF_EBPF_PASS;

        case BLOCKED:
            // This source IP has been blocked
            return CF_EBPF_DROP;

        default:
            return CF_EBPF_PASS;
    }
}</code></pre>
            <p><sup><i>An eBPF program to challenge UDP connections and statefully manage connections. This example has been simplified for illustration purposes.</i></sup></p><p>The program statefully tracks the source IP addresses it has seen and emits a packet with a cryptographic challenge back to unknown clients. A legitimate client running a valid gaming client is able to correctly solve the challenge and respond with proof, but the attacker’s script is not. Traffic from the attacker is marked as “blocked” and subsequent packets are dropped.</p><p>With these new abilities, customers can statefully track flows and make sure only real, verified clients can send traffic to their origin servers. Although we have focused the example on gaming, the potential use cases for this technology extend to any UDP-based protocol.</p>
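<p>The challenge flow described above can be sketched with an HMAC-based cookie, in the spirit of SYN cookies. This is plain Python for illustration; the function names and the exact scheme are assumptions, not Cloudflare's actual challenge format:</p>

```python
import hmac
import hashlib
import os

SECRET = os.urandom(16)  # per-deployment secret key (assumed scheme)

def issue_challenge(src_ip: str, src_port: int) -> bytes:
    """Derive a cookie bound to the client's source address."""
    msg = f"{src_ip}:{src_port}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).digest()[:8]

def verify_challenge(src_ip: str, src_port: int, proof: bytes) -> bool:
    """A legitimate client echoes the cookie back; a script replaying
    captured traffic cannot produce it for its own source address."""
    return hmac.compare_digest(issue_challenge(src_ip, src_port), proof)

cookie = issue_challenge("203.0.113.7", 49152)
print(verify_challenge("203.0.113.7", 49152, cookie))   # True
print(verify_challenge("198.51.100.9", 49152, cookie))  # False
```

<p>Because the cookie is bound to the source address, a replayed capture presented from a different address fails verification.</p>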
    <div>
      <h3>Get started today</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>We’re excited to offer the Programmable Flow Protection feature to Magic Transit Enterprise customers. Talk to your account manager to learn more about how you can enable Programmable Flow Protection to help keep your infrastructure safe.</p><p>We’re still in active development of the platform, and we’re excited to see what our users build next. If you are not yet a Cloudflare customer, let us know if you’d like to protect your network with Cloudflare and Programmable Flow Protection by signing up at this page: <a href="https://www.cloudflare.com/lp/programmable-flow-protection/"><u>https://www.cloudflare.com/lp/programmable-flow-protection/</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Beta]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">64lPEfE3ML34AycHER46Tz</guid>
            <dc:creator>Anita Tenjarla</dc:creator>
            <dc:creator>Alex Forster</dc:creator>
            <dc:creator>Cody Doucette</dc:creator>
            <dc:creator>Venus Xeon-Blonde</dc:creator>
        </item>
        <item>
            <title><![CDATA[QUIC restarts, slow problems: udpgrm to the rescue]]></title>
            <link>https://blog.cloudflare.com/quic-restarts-slow-problems-udpgrm-to-the-rescue/</link>
            <pubDate>Wed, 07 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ udpgrm is a lightweight daemon for graceful restarts of UDP servers. It leverages SO_REUSEPORT and eBPF to route new and existing flows to the correct server instance. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.</p><p>We've <a href="https://blog.cloudflare.com/graceful-upgrades-in-go/"><u>previously</u></a> <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>written</u></a> about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently — when protocols like HTTP/3 (QUIC) became critical. This blog post introduces <b><i>udpgrm</i></b>, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.</p><p><a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>Here's the </u><i><u>udpgrm</u></i><u> GitHub repo</u></a>.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.</p><p>The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.</p><p>In the past, we <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>described</u></a> the <i>established-over-unconnected</i> method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.</p><p>Now we have found a better method, leveraging Linux’s <code>SO_REUSEPORT</code> API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. 
This is how <i>udpgrm</i> works.</p>
    <div>
      <h2>REUSEPORT group</h2>
      <a href="#reuseport-group">
        
      </a>
    </div>
    <p>Before diving deeper, let's quickly review the basics. Linux provides the <code>SO_REUSEPORT</code> socket option, typically set after <code>socket()</code> but before <code>bind()</code>. Please note that this has a separate purpose from the better known <code>SO_REUSEADDR</code> socket option.</p><p><code>SO_REUSEPORT</code> allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a <i>reuseport group </i>— a term we'll refer to frequently throughout this post.</p>
            <pre><code>┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443             │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
</code></pre>
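            <p>The same mechanics can be demonstrated from Python on Linux: with <code>SO_REUSEPORT</code> set before <code>bind()</code>, a second socket can bind the exact same IP:port, and the kernel spreads inbound datagrams across the group. A minimal Linux-only sketch (unrelated to <i>udpgrm</i> itself):</p>

```python
import select
import socket

def reuseport_socket(addr):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set before bind() to join (or form) a reuseport group.
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sd.bind(addr)
    return sd

# Two members of one reuseport group on the same IP:port.
a = reuseport_socket(("127.0.0.1", 0))
port = a.getsockname()[1]
b = reuseport_socket(("127.0.0.1", port))

# Each client socket has a distinct 4-tuple, so the kernel's default
# hash may steer different clients to different members of the group.
clients = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(16)]
for c in clients:
    c.sendto(b"ping", ("127.0.0.1", port))

received = 0
while True:
    ready, _, _ = select.select([a, b], [], [], 0.5)
    if not ready:
        break
    for sd in ready:
        sd.recvfrom(2048)
        received += 1
print(received)
```

            <p>All 16 datagrams are delivered, split between the two sockets according to the kernel's flow hash.</p>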
            <p>Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet's 4-tuple to select a target socket. Another method is <code>SO_INCOMING_CPU</code>, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.</p><p>To provide more control, Linux introduced the <code>SO_ATTACH_REUSEPORT_CBPF</code> option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with <code>SO_ATTACH_REUSEPORT_EBPF</code>, enabling the use of modern eBPF programs. With <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a>, developers can implement arbitrary custom logic. A boilerplate program would look like this:</p>
            <pre><code>SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
    uint64_t socket_identifier = xxxx;
    bpf_sk_select_reuseport(md, &amp;sockhash, &amp;socket_identifier, 0);
    return SK_PASS;
}</code></pre>
            <p>To select a specific socket, the eBPF program calls <code>bpf_sk_select_reuseport</code>, using a reference to a map with sockets (<code>SOCKHASH</code>, <code>SOCKMAP</code>, or the older, mostly obsolete <code>SOCKARRAY</code>), along with a key or index. For example, a declaration of a <code>SOCKHASH</code> might look like this:</p>
            <pre><code>struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, MAX_SOCKETS);
	__uint(key_size, sizeof(uint64_t));
	__uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");</code></pre>
            <p>This <code>SOCKHASH</code> is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it's indexed by a <code>uint64_t</code> key. This is pretty neat, as it allows for a simple number-to-socket mapping!</p><p>However, there's a catch: <b>the </b><code><b>SOCKHASH</b></code><b> must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself</b>. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right — especially under dynamic conditions like restarts, crashes, or scaling events. The point of <i>udpgrm</i> is to take care of this stuff, so that server processes don’t have to.</p>
    <div>
      <h2>Socket generation and working generation</h2>
      <a href="#socket-generation-and-working-generation">
        
      </a>
    </div>
    <p>Let’s look at how graceful restarts for UDP flows are achieved in <i>udpgrm</i>. To reason about this setup, we’ll need a bit of terminology: A <b>socket generation</b> is a set of sockets within a reuseport group that belong to the same logical application instance:</p>
            <pre><code>┌───────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 0                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #1 │ │ socket #2 │ │ socket #3 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 1                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #4 │ │ socket #5 │ │ socket #6 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘</code></pre>
            <p>When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.</p><p>Reuseport eBPF routing boils down to two problems:</p><ul><li><p>For new flows, we should choose a socket from the socket generation that belongs to the active server instance.</p></li><li><p>For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.</p></li></ul><p>Easy, right?</p><p>Of course not! The devil is in the details. Let's take it one step at a time.</p><p>Routing new flows is relatively easy. <i>udpgrm</i> simply maintains a reference to the socket generation that should handle new connections. We call this reference the <b>working generation</b>. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.</p>
            <pre><code>┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                │
│   ...                                        │
│   Working generation ────┐                   │
│                          V                   │
│           ┌───────────────────────────────┐  │
│           │ socket generation 1           │  │
│           │  ┌───────────┐ ┌──────────┐   │  │
│           │  │ socket #4 │ │ ...      │   │  │
│           │  └───────────┘ └──────────┘   │  │
│           └───────────────────────────────┘  │
│   ...                                        │
└──────────────────────────────────────────────┘</code></pre>
            <p>For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an <a href="https://datatracker.ietf.org/doc/html/rfc9000#name-initial-packet"><i><u>initial packet</u></i></a> concept, similar to a TCP SYN, but other protocols might not.</p><p>This calls for some flexibility, so <i>udpgrm</i> makes it configurable. Each reuseport group sets a specific <b>flow dissector</b>.</p><p>The flow dissector has two tasks:</p><ul><li><p>It distinguishes new packets from packets belonging to old, already established flows.</p></li><li><p>For recognized flows, it tells <i>udpgrm</i> which specific socket the flow belongs to.</p></li></ul><p>These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a "connection ID" field in the QUIC packet header to survive <a href="https://www.rfc-editor.org/rfc/rfc9308.html#section-3.2"><u>NAT rebinding</u></a>.</p><p><i>udpgrm</i> supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.</p>
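            <p>As a mental model, this routing can be sketched in a few lines of Python (a toy model with assumed names, not <i>udpgrm</i>'s actual code): flows the dissector recognizes stay pinned to the generation that owns them, while unrecognized packets follow the working-generation pointer:</p>

```python
class ReuseportGroup:
    """Toy model of reuseport routing: sticky flows + a working generation."""

    def __init__(self):
        self.generations = {}    # generation number -> socket (a name here)
        self.working_gen = None  # where new flows should land
        self.flows = {}          # flow key -> generation that owns it

    def add_generation(self, gen, socket_name):
        self.generations[gen] = socket_name

    def set_working_gen(self, gen):
        self.working_gen = gen

    def route(self, flow_key):
        # Established flows stay sticky to their original generation;
        # unrecognized flows follow the working-generation pointer.
        gen = self.flows.setdefault(flow_key, self.working_gen)
        return self.generations[gen]

grp = ReuseportGroup()
grp.add_generation(0, "old server")
grp.set_working_gen(0)
print(grp.route("conn-A"))   # "old server"

# Restart: generation 1 appears and the working generation is bumped.
grp.add_generation(1, "new server")
grp.set_working_gen(1)
print(grp.route("conn-A"))   # still "old server" (sticky until it drains)
print(grp.route("conn-B"))   # "new server"
```

            <p>Once all flows pinned to generation 0 drain away, the old instance can shut down without dropping a packet.</p>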
    <div>
      <h2>Welcome udpgrm!</h2>
      <a href="#welcome-udpgrm">
        
      </a>
    </div>
    <p>Now that we covered the theory, we're ready for the business: please welcome <b>udpgrm</b> — UDP Graceful Restart Marshal! <i>udpgrm</i> is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.</p><p>We can describe <i>udpgrm</i> from two perspectives: for administrators and for programmers.</p>
    <div>
      <h2>udpgrm daemon for the system administrator</h2>
      <a href="#udpgrm-daemon-for-the-system-administrator">
        
      </a>
    </div>
    <p><i>udpgrm</i> is a stateful daemon. To run it:</p>
            <pre><code>$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer  map_id 936146</code></pre>
            <p>This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service — loaded after networking. However, this is not enough to fully use <i>udpgrm</i>. <i>udpgrm</i> needs to hook into <code>getsockopt</code>, <code>setsockopt</code>, <code>bind</code>, and <code>sendmsg</code> syscalls, which are scoped to a cgroup. To install the <i>udpgrm</i> hooks for a specific cgroup:</p>
            <pre><code>$ sudo udpgrm --install=/sys/fs/cgroup/system.slice</code></pre>
            <p>But a more common pattern is to install it within the <i>current</i> cgroup:</p>
            <pre><code>$ sudo udpgrm --install --self</code></pre>
            <p>Better yet, use it as part of the systemd "service" config:</p>
            <pre><code>[Service]
...
ExecStartPre=/usr/local/bin/udpgrm --install --self</code></pre>
            <p>Once <i>udpgrm</i> is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:</p>
            <pre><code>$ sudo udpgrm list
[ ] Retrieving BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
	netns 0x1  dissector bespoke  digest 0xdead
	socket generations:
		gen  3  0x17a0da  &lt;=  app 0  gen 3
	metrics:
		rx_processed_total 13777528077
...</code></pre>
            <p>Now, with both the <i>udpgrm</i> daemon running, and cgroup hooks set up, we can focus on the server part.</p>
    <div>
      <h2>udpgrm for the programmer</h2>
      <a href="#udpgrm-for-the-programmer">
        
      </a>
    </div>
    <p>We expect the server to create the appropriate UDP sockets by itself. We depend on <code>SO_REUSEPORT</code>, so that each server instance can have a dedicated socket or a set of sockets:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))</code></pre>
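To see why <code>SO_REUSEPORT</code> matters here, note that it lets two independent server instances bind the same address and port at the same time; absent any eBPF program, the kernel simply hashes incoming flows across the group. A minimal sketch (the loopback address and port are arbitrary examples, not part of udpgrm's API):

```python
import socket

def make_sd(addr=("127.0.0.1", 5201)):
    # Each server instance creates its own socket; SO_REUSEPORT must be
    # set before bind() for the sockets to share the address.
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, 0)
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sd.bind(addr)
    return sd

old_instance = make_sd()  # socket held by the old server instance
new_instance = make_sd()  # second bind succeeds thanks to SO_REUSEPORT

# Without udpgrm, the kernel hashes incoming datagrams across this
# reuseport group; a REUSEPORT eBPF program can override that choice.
assert old_instance.getsockname() == new_instance.getsockname()
```

This is exactly the hole udpgrm fills: the default hashing has no notion of "old" versus "new" instances, so a restart would scatter established flows between them.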
            <p>With a socket descriptor handy, we can pursue the <i>udpgrm</i> magic dance. The server communicates with the <i>udpgrm</i> daemon using <code>setsockopt</code> calls. Behind the scenes, udpgrm provides eBPF <code>setsockopt</code> and <code>getsockopt</code> hooks and hijacks specific calls. It's not easy to set up on the kernel side, but when it works, it’s truly awesome. A typical socket setup looks like this:</p>
            <pre><code>try:
    work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
    raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')
    
sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)
for i in range(10):
    v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8)
    sk_gen, sk_idx = struct.unpack('II', v)
    if sk_idx != 0xffffffff:
        break
    time.sleep(0.01 * (2 ** i))
else:
    raise OSError("Communicating with udpgrm daemon failed.")

sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)</code></pre>
            <p>You can see three blocks here:</p><ul><li><p>First, we retrieve the working generation number and, by doing so, check for <i>udpgrm</i> presence. Typically, <i>udpgrm</i> absence is fine for non-production workloads.</p></li><li><p>Then we register the socket to an arbitrary socket generation. We choose <code>work_gen + 1</code> as the value and verify that the registration went through correctly.</p></li><li><p>Finally, we bump the working generation pointer.</p></li></ul><p>That's it! Hopefully, the API presented here is clear and reasonable. Under the hood, the <i>udpgrm</i> daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a <code>SOCKHASH</code>.</p>
    <div>
      <h2>Advanced socket creation with udpgrm_activate.py</h2>
      <a href="#advanced-socket-creation-with-udpgrm_activate-py">
        
      </a>
    </div>
    <p>In practice, we often need sockets bound to low ports like <code>:443</code>, which requires elevated privileges like <code>CAP_NET_BIND_SERVICE</code>. It's usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using <a href="https://0pointer.de/blog/projects/socket-activation.html"><u>socket activation</u></a>.</p><p>Sadly, systemd cannot create a new set of UDP <code>SO_REUSEPORT</code> sockets for each server instance. To overcome this limitation, <i>udpgrm</i> provides a script called <code>udpgrm_activate.py</code>, which can be used like this:</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201</code></pre>
            <p>Here, <code>udpgrm_activate.py</code> binds to <code>0.0.0.0:5201</code> and stores the created socket in the systemd FD store under the name <code>test-port</code>. The server <code>echoserver.py</code> will inherit this socket and receive the appropriate <code>LISTEN_FDS</code> and <code>LISTEN_FDNAMES</code> environment variables, following the typical systemd socket activation pattern.</p>
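For illustration, here is roughly how a server can recover sockets passed this way, following the <code>sd_listen_fds(3)</code> convention (inherited descriptors start at fd 3, with names in <code>LISTEN_FDNAMES</code>). This is a hedged sketch of the general systemd pattern, not code from <i>udpgrm</i>:

```python
import os

SD_LISTEN_FDS_START = 3  # first inherited fd, per sd_listen_fds(3)

def inherited_fds(environ=None):
    """Map FD-store names to inherited file descriptor numbers.

    Simplified sketch: a production version should also unset the
    LISTEN_* variables and set close-on-exec on the descriptors.
    """
    environ = os.environ if environ is None else environ
    # If LISTEN_PID is present, it must match our own pid.
    if environ.get("LISTEN_PID") not in (None, str(os.getpid())):
        return {}
    nfds = int(environ.get("LISTEN_FDS", 0))
    names = environ.get("LISTEN_FDNAMES", "").split(":") if nfds else []
    fds = range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + nfds)
    return dict(zip(names, fds))
```

With the unit above, the server would see <code>{"test-port": 3}</code> and could wrap fd 3 with <code>socket.socket(fileno=3)</code>.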
    <div>
      <h2>Systemd service lifetime</h2>
      <a href="#systemd-service-lifetime">
        
      </a>
    </div>
    <p>Systemd typically can't handle more than one server instance running at the same time. It prefers to kill the old instance quickly. It supports the "at most one" server instance model, not the "at least one" model that we want. To work around this, <i>udpgrm</i> provides a <b>decoy</b> script that will exit when systemd asks it to, while the actual old instance of the server can stay active in the background.</p>
            <pre><code>[Service]
...
ExecStart=/usr/local/bin/mmdecoy examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop.
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            <p>At this point, we have shown a full template for a <i>udpgrm</i>-enabled server that contains all three elements: <code>udpgrm --install --self</code> for cgroup hooks, <code>udpgrm_activate.py</code> for socket creation, and <code>mmdecoy</code> for fooling systemd service lifetime checks.</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm --install --self
ExecStartPre=/usr/local/bin/udpgrm_activate.py --no-register test-port 0.0.0.0:5201
ExecStart=/usr/local/bin/mmdecoy PWD/examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop. 
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            
    <div>
      <h2>Dissector modes</h2>
      <a href="#dissector-modes">
        
      </a>
    </div>
    <p>We've discussed the <i>udpgrm</i> daemon, the <i>udpgrm</i> setsockopt API, and systemd integration, but we haven't yet covered the details of routing logic for old flows. To handle arbitrary protocols, <i>udpgrm</i> supports three <b>dissector modes</b> out of the box:</p><p><b>DISSECTOR_FLOW</b>: <i>udpgrm</i> maintains a flow table indexed by a flow hash computed from a typical 4-tuple. It stores a target socket identifier for each flow. The flow table size is fixed, so there is a limit to the number of concurrent flows supported by this mode. To mark a flow as "assured," <i>udpgrm</i> hooks into the <code>sendmsg</code> syscall and saves the flow in the table only when a message is sent.</p><p><b>DISSECTOR_CBPF</b>: A cookie-based model where the target socket identifier — called a udpgrm cookie — is encoded in each incoming UDP packet. For example, in QUIC, this identifier can be stored as part of the connection ID. The dissection logic is expressed as cBPF code. This model does not require a flow table in <i>udpgrm</i> but is harder to integrate because it needs protocol and server support.</p><p><b>DISSECTOR_NOOP</b>: A no-op mode with no state tracking at all. It is useful for traditional UDP services like DNS, where we want to avoid losing even a single packet during an upgrade.</p><p>Finally, <i>udpgrm</i> provides a template for a more advanced dissector called <b>DISSECTOR_BESPOKE</b>. Currently, it includes a QUIC dissector that can decode the QUIC TLS SNI and direct specific TLS hostnames to specific socket generations.</p><p>For more details, <a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>please consult the </u><i><u>udpgrm</u></i><u> README</u></a>. In short: the FLOW dissector is the simplest one, useful for old protocols. 
The CBPF dissector is good for experimentation when the protocol allows storing a custom connection ID (cookie) — we used it to develop our own QUIC Connection ID schema (also named DCID) — but it's slow, because it interprets cBPF inside eBPF (yes, really!). NOOP is useful, but only for very specific niche servers. The real magic is in the BESPOKE type, where users can create arbitrary, fast, and powerful dissector logic.</p>
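As a mental model for the FLOW mode, consider a fixed-size table indexed by a hash of the 4-tuple. This is an illustrative Python sketch, not udpgrm's actual eBPF <code>SOCKHASH</code> logic, and all names here are made up:

```python
import socket
import struct

FLOW_TABLE_SIZE = 1024   # fixed size: bounds the number of tracked flows
flow_table = {}          # slot number mapped to (4-tuple, socket id)

def flow_slot(saddr, sport, daddr, dport):
    # Hash the classic 4-tuple into a bounded number of slots.
    key = struct.pack("!4sH4sH",
                      socket.inet_aton(saddr), sport,
                      socket.inet_aton(daddr), dport)
    return hash(key) % FLOW_TABLE_SIZE

def record_flow(tup, sock_id):
    # udpgrm marks a flow "assured" only on sendmsg; we mimic that by
    # recording the flow when the server sends a reply.
    flow_table[flow_slot(*tup)] = (tup, sock_id)

def route(tup, default_sock):
    entry = flow_table.get(flow_slot(*tup))
    if entry and entry[0] == tup:
        return entry[1]   # established flow: stays on its old socket
    return default_sock   # new flow: goes to the working generation

record_flow(("192.0.2.6", 49756, "192.0.2.1", 5201), sock_id=7)
assert route(("192.0.2.6", 49756, "192.0.2.1", 5201), default_sock=9) == 7
assert route(("198.51.100.9", 1234, "192.0.2.1", 5201), default_sock=9) == 9
```

The fixed table size is the key trade-off: once it fills up, older flows must be evicted, which is why the FLOW mode caps the number of concurrent flows.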
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>The adoption of QUIC and other UDP-based protocols means that gracefully restarting UDP servers is becoming an increasingly important problem. To our knowledge, a reusable, configurable and easy to use solution didn't exist yet. The <i>udpgrm</i> project brings together several novel ideas: a clean API using <code>setsockopt()</code>, careful socket-stealing logic hidden under the hood, powerful and expressive configurable dissectors, and well-thought-out integration with systemd.</p><p>While <i>udpgrm</i> is intended to be easy to use, it hides a lot of complexity and solves a genuinely hard problem. The core issue is that the Linux Sockets API has not kept up with the modern needs of UDP.</p><p>Ideally, most of this should really be a feature of systemd. That includes supporting the "at least one" server instance mode, UDP <code>SO_REUSEPORT</code> socket creation, installing a <code>REUSEPORT_EBPF</code> program, and managing the "working generation" pointer. We hope that <i>udpgrm</i> helps create the space and vocabulary for these long-term improvements.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2baeaA3qbgFISPMjlZ74a4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Lost in transit: debugging dropped packets from negative header lengths]]></title>
            <link>https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/</link>
            <pubDate>Mon, 26 Jun 2023 13:00:56 GMT</pubDate>
            <description><![CDATA[ In this post, we'll provide some insight into the process of investigating networking issues and how to begin debugging issues in the kernel using pwru and kprobe tracepoints ]]></description>
            <content:encoded><![CDATA[ <p>Previously, I wrote about building <a href="/high-availability-load-balancers-with-maglev/">network load balancers with the maglev scheduler</a>, which we use for ingress into our Kubernetes clusters. At the time of that post we were using Foo-over-UDP encapsulation with virtual interfaces, one for each Internet Protocol version for each worker node.</p><p>To reduce operational toil managing the traffic director nodes, we've recently switched to using IP Virtual Server's (IPVS) native support for encapsulation. Much to our surprise, instead of a smooth change, we observed significant drops in bandwidth and failing API requests. In this post I'll discuss the impact observed, the multi-week search for the root cause, and the ultimate fix.</p>
    <div>
      <h3>Recap and the change</h3>
      <a href="#recap-and-the-change">
        
      </a>
    </div>
    <p>To support our requirements we had been creating virtual interfaces on our traffic directors, configured to encapsulate traffic with Foo-over-UDP (FOU). In this encapsulation, new UDP and IP headers are added to the original packet. When the worker node receives this packet, the kernel removes the outer headers and injects the inner packet back into the network stack. Each virtual interface would be assigned a private IP, and IPVS would be configured to send traffic to these private IPs in "direct" mode.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/166RG9mgcpaP80xGDieN46/a25fe61233073af4b3132591bb379353/image4-24.png" />
            
            </figure><p>This configuration presents several problems for our operations teams.</p><p>For example, each Kubernetes worker node needs a separate virtual interface on the traffic director, and each of the interfaces requires their own private IP. The pairs of virtual interfaces and private IPs were only used by this system, but still needed to be tracked in our configuration management system. To ensure the interfaces were created and configured properly on each director we had to run complex health checks, which added to the lag between provisioning a new worker node and the director being ready to send it traffic. Finally, the header for FOU also lacks a way to signal the "next protocol" of the inner packet, requiring a separate virtual interface for IPv4 and IPv6. Each of these problems individually contributed a small amount of toil, but taken together, gave us impetus to find a better alternative.</p><p>In the time since we had originally implemented this system, IPVS has added native support for encapsulation. This would allow us to eliminate provisioning virtual interfaces (and their corresponding private IPs), and be able to use newly provisioned workers without delay.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CLprcnKtcCYzpCCnDkjl2/97ba996da2c9ee81c95ec2e97ed08ff9/image1-40.png" />
            
            </figure><p>IPVS doesn't support Foo-over-UDP as an encapsulation type. As part of this project we switched to a similar option, <a href="https://datatracker.ietf.org/doc/html/draft-ietf-intarea-gue-09">Generic UDP Encapsulation</a> (GUE). This encapsulation option does include the "next protocol", allowing one listener on the worker nodes to support both IPv4 and IPv6.</p>
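To make the "next protocol" point concrete, here is a sketch of the 4-byte GUE variant-0 base header as we read draft-ietf-intarea-gue-09 (2-bit version, a C bit, a 5-bit Hlen counting optional fields in 32-bit words, an 8-bit proto/ctype, and 16 bits of flags); treat the field layout as our assumption from the draft, not normative:

```python
import struct

def gue_base_header(inner_proto, hlen_words=0, control=False):
    # First byte: version (0) in the top two bits, then the C bit,
    # then the 5-bit Hlen for optional fields in 32-bit words.
    first = (0x20 if control else 0) | (hlen_words & 0x1F)
    # proto/ctype carries an IP protocol number for data messages; this
    # is the field FOU lacks, and why one GUE listener can carry both
    # IPv4 (protocol 4) and IPv6 (protocol 41) inner packets.
    return struct.pack("!BBH", first, inner_proto, 0)  # flags left zero

hdr_v4 = gue_base_header(4)    # IPv4-in-GUE
hdr_v6 = gue_base_header(41)   # IPv6-in-GUE
assert len(hdr_v4) == 4 and hdr_v4[1] == 4 and hdr_v6[1] == 41
```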
    <div>
      <h3>What went wrong?</h3>
      <a href="#what-went-wrong">
        
      </a>
    </div>
    <p>When we make changes to our Kubernetes clusters, we go through several layers of testing. This change had been deployed to the traffic directors in front of our staging environments, where it had been running for several weeks. However, due to the nature of this bug, the type of traffic to our staging environment did not trigger the underlying bug.</p><p>We began a slow rollout of this change to one production cluster, and after a few hours we began observing issues reaching services behind Kubernetes Load Balancers. The behavior observed was very interesting: the vast majority of requests had no issues, but a small percentage of requests corresponding to large HTTP request payloads or gRPC had significant latency. However, large responses had no corresponding latency increase. There was no corresponding increase in latency seen to any requests to our ingress controllers, though we could observe a small drop in overall requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LhkHSdh6ronJadcny3OGI/6a5d6ab9108296646e70932763e423af/image6-9.png" />
            
            </figure><p>Through debugging after the incident, we discovered that the traffic directors were dropping packets that exceeded the Internet maximum transmission unit (MTU) of 1500 bytes, despite the packets being smaller than the actual MTU configured in our internal networks. Once dropped, the original host would fragment and resend packets until they were small enough to pass through the traffic directors. Dropping one packet is inconvenient, but unlikely to be noticed. However, when making a request with large payloads it is very likely that all packets would be dropped and need to be individually fragmented and resent.</p><p>When every packet is dropped and has to be resent, the network latency can add up to several seconds, exceeding the request timeouts configured by applications. This would cause the request to fail, necessitating retries by applications. As more traffic directors were reconfigured, these retries were less likely to succeed, leading to slower processing and causing the backlog of queued work to increase.</p><p>As you can see, this small, but consistent, number of dropped packets could cause a domino effect into much larger problems. Once it became clear there was a problem, we reverted traffic directors to their previous configuration, and this quickly restored traffic to previous levels. From this we knew that something about the change had caused the problem.</p>
    <div>
      <h3>Finding the culprit</h3>
      <a href="#finding-the-culprit">
        
      </a>
    </div>
    <p>With the symptoms of the issue identified, we set out to understand the root cause; only once the root cause was understood could we come up with a satisfactory solution.</p><p>Knowing the packets were larger than the Internet MTU, our first thought was that this was a misconfiguration of the machine in our configuration management tool. However, we found the interface MTUs were all set as expected, and there were no overriding MTUs in the routing table. We also found that sending packets from the director that exceeded the Internet MTU worked fine with no drops.</p><p>Cilium has developed a debugging tool, <a href="https://github.com/cilium/pwru">pwru</a> (short for "packet, where are you?"), that uses eBPF to aid in debugging the kernel networking state. We used pwru in our staging environment and found the location where the packet had been dropped.</p>
            <pre><code>sudo pwru --output-skb --output-tuple --per-cpu-buffer 2097152 --output-file pwru.log</code></pre>
            <p>This captures tracing data for all packets that reach the traffic director, and saves the trace data to "pwru.log". There are filters built into <code>pwru</code> to select matching packets, but unfortunately they could not be used as the packet was being modified by the encapsulation. Instead, we used grep afterwards to find a matching packet, and then filtered further based on the pointer in the first column.</p>
            <pre><code>0xffff947712f34400      9        [&lt;empty&gt;]               packet_rcv netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
0xffff947712f34400      9        [&lt;empty&gt;]                 skb_push netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
0xffff947712f34400      9        [&lt;empty&gt;]              consume_skb netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
[ ... snip ... ]
0xffff947712f34400      9        [&lt;empty&gt;]         inet_gso_segment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1544 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;]        udp4_ufo_fragment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1524 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;]   skb_udp_tunnel_segment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1524 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;] kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED) netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1558 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)</code></pre>
            <p>Looking at the lines logged for this packet, we can observe how it was transformed. We originally received a TCP packet for the load balancer IP. However, when the packet is dropped, it is a UDP packet destined for the worker's IP on the port we use for GUE. We can surmise that the packet was being processed and encapsulated by IPVS, and form a theory that it was being dropped on the egress path from the director node. We could also see that when the packet was dropped, it was still smaller than the calculated MTU.</p><p>We can visualize this change by applying this information to our GUE encapsulation diagram shown earlier. The byte total of the encapsulated packet is 1544, matching the length <code>pwru</code> logged entering <code>inet_gso_segment</code> above.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oBm7taQSl2U4UH3reulvJ/c0a00891b2e1e7c3228e3c13ab7546ac/image2-38.png" />
            
            </figure><p>The trace above tells us which kernel functions are entered, but not if or how control flow left them. This means we don't know in which function <code>kfree_skb_reason</code> was called. Fortunately, <code>pwru</code> can print a stack trace when functions are entered.</p>
            <pre><code>0xffff9868b7232e00 	19       	[pwru] kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED) netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1558 172.70.4.34:63336-&gt;172.70.72.206:5501(udp)
kfree_skb_reason
validate_xmit_skb
__dev_queue_xmit
ip_finish_output2
ip_vs_tunnel_xmit   	[ip_vs]
ip_vs_in_hook   [ip_vs]
nf_hook_slow
ip_local_deliver
ip_sublist_rcv_finish
ip_sublist_rcv
ip_list_rcv
__netif_receive_skb_list_core
netif_receive_skb_list_internal
napi_complete_done
mlx5e_napi_poll [mlx5_core]
__napi_poll
net_rx_action
__softirqentry_text_start
__irq_exit_rcu
common_interrupt
asm_common_interrupt
audit_filter_syscall
__audit_syscall_exit
syscall_exit_work
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe</code></pre>
            <p>This stack trace shows that <code>kfree_skb_reason</code> was called from within the <code>validate_xmit_skb</code> function, which in turn was called from <code>ip_vs_tunnel_xmit</code>. However, when looking at the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.1.32&amp;id=76ba310227d2490018c271f1ecabb6c0a3212eb0#n3660">implementation of validate_xmit_skb</a>, we find there are three paths to <code>kfree_skb</code>. How can we determine which path is taken?</p><p>In addition to the eBPF support used by pwru, Linux has support for attaching dynamic tracepoints using <code>perf-probe</code>. After installing the kernel source code and debugging symbols, we can ask the <code>kprobe</code> mechanism which lines of <code>validate_xmit_skb</code> can host a dynamic tracepoint; it prints the line numbers we can attach a tracepoint to.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ddSe06sv8S2w4lAXfZGFi/729a4c2bfb990122bc8287005117c97a/image5-16.png" />
            
            </figure><p>Unfortunately, we can't create a tracepoint on the goto lines, but we can attach tracepoints around them, using the context to determine how control flowed through this function. In addition to specifying a line number, additional arguments can be specified to include local variables. The skb variable is a pointer to a structure that represents this packet, which can be logged to separate packets in case more than one is being processed at a time.</p>
            <pre><code>sudo perf probe --add 'validate_xmit_skb:17 skb'
sudo perf probe --add 'validate_xmit_skb:20 skb'
sudo perf probe --add 'validate_xmit_skb:24 skb'
sudo perf probe --add 'validate_xmit_skb:32 skb'</code></pre>
            <p>Access to these tracepoints could be recorded by using <code>perf-record</code> and providing the tracepoint names given when they were added.</p>
            <pre><code>sudo perf record -a -e 'probe:validate_xmit_skb_L17,probe:validate_xmit_skb_L20,probe:validate_xmit_skb_L24,probe:validate_xmit_skb_L32'</code></pre>
            <p>The tests can be rerun so some packets are logged, before using <code>perf-script</code> to read the generated "perf.data" file. When inspecting the output file, we found that all the packets that were dropped were coming from the control flow of <code>netif_needs_gso</code> succeeding (that is, from the goto on line 18). We continued to create and record tracepoints, following the failing control flow, until execution reached <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/udp_offload.c?h=v6.1.32&amp;id=76ba310227d2490018c271f1ecabb6c0a3212eb0#n15"><code>__skb_udp_tunnel_segment</code></a>.</p><p>When <code>netif_needs_gso</code> returns false, we do not see packet drops and no problems are reported. It returns true when the packet is larger than the maximum segment size (MSS) advertised by our peer in the connection handshake. For IPv4, the MSS is usually 40 bytes smaller than the MTU (though this can be adjusted by the application or system configuration). For most packets the traffic directors see, the MSS is 40 bytes less than the Internet MTU of 1500, or in this case 1460.</p><p>The tracepoints in this function showed that control flow was leaving through the error case on line 33: that the kernel was unable to allocate memory for the tunnel header. GUE was designed to have a minimal tunnel header, so this seemed suspicious. We updated the probe to also record the calculated <code>tnl_hlen</code>, and reran the tests. To our surprise the value recorded by the probes was "-2". Huh, somehow the encapsulated packet got smaller?</p>
            <pre><code>static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
    netdev_features_t features,
    struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
   				  	netdev_features_t features),
    __be16 new_protocol, bool is_ipv6)
{
    int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb); // tunnel header length computed here
    bool remcsum, need_csum, offload_csum, gso_partial;
    struct sk_buff *segs = ERR_PTR(-EINVAL);
    struct udphdr *uh = udp_hdr(skb);
    u16 mac_offset = skb-&gt;mac_header;
    __be16 protocol = skb-&gt;protocol;
    u16 mac_len = skb-&gt;mac_len;
    int udp_offset, outer_hlen;
    __wsum partial;
    bool need_ipsec;

    if (unlikely(!pskb_may_pull(skb, tnl_hlen))) // allocation failed due to negative number.
   	 goto out;</code></pre>
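A quick sanity check of the sizes involved, under the usual IPv4 assumption of 20-byte IP and TCP headers:

```python
internet_mtu = 1500
mss = internet_mtu - 20 - 20   # 20B IPv4 header + 20B TCP header
assert mss == 1460             # the MSS most peers advertise

# The encapsulated packet pwru logged entering inet_gso_segment:
encapsulated_len = 1544
overshoot = encapsulated_len - mss
assert overshoot == 84  # over the MSS, so netif_needs_gso() succeeds
                        # and the kernel attempts software segmentation
```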
            
    <div>
      <h3>Ultimate fix</h3>
      <a href="#ultimate-fix">
        
      </a>
    </div>
    <p>At this point the kernel's behavior was a bit baffling: why would the tunnel header be computed to be a negative number? To answer this question, we added two more probes. The first was added to <code>ip_vs_in_hook</code>, a hook function that is called as packets enter and leave IPVS code. The second probe was added to <code>__dev_queue_xmit</code>, which is called when preparing to ask the hardware device to transmit the packet. In both of these probes we also logged some of the fields of the sk_buff struct by using the <code>"pointer-&gt;field"</code> syntax. These fields are offsets into the packet's data for the packet's headers, as well as corresponding offsets for the encapsulated headers.</p><ul><li><p>The <code>mac_header</code> and <code>inner_mac_header</code> are offsets to the packet's layer two header. For Ethernet this includes the MAC addresses for the frame, but also other metadata such as the EtherType field giving the type of the next protocol.</p></li><li><p>The <code>network_header</code> and <code>inner_network_header</code> fields are offsets to the packet's layer three header. For our purposes, this would be the header for IPv4 or IPv6.</p></li><li><p>The <code>transport_header</code> and <code>inner_transport_header</code> fields are offsets to the packet's layer four header, such as TCP, UDP, or ICMP.</p></li></ul>
            <pre><code>sudo perf probe -m ip_vs --add 'ip_vs_in_hook dev=skb-&gt;dev-&gt;name:string skb skb-&gt;inner_mac_header skb-&gt;inner_network_header skb-&gt;inner_transport_header skb-&gt;mac_header skb-&gt;network_header skb-&gt;transport_header skb-&gt;ipvs_property skb-&gt;len:u skb-&gt;data_len:u'
sudo perf probe --add '__dev_queue_xmit dev=skb-&gt;dev-&gt;name:string skb skb-&gt;inner_mac_header skb-&gt;inner_network_header skb-&gt;inner_transport_header skb-&gt;mac_header skb-&gt;network_header skb-&gt;transport_header skb-&gt;len:u skb-&gt;data_len:u'</code></pre>
            <p>When we review the tracepoints using <code>perf-script</code>, we notice something interesting about these values.</p>
            <pre><code>swapper     0 [030] 79090.376151:    probe:ip_vs_in_hook: (ffffffffc0feebe0) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x0 inner_network_header=0x0 inner_transport_header=0x0 mac_header=0x44 network_header=0x52 transport_header=0x66 len=1512 data_len=12
swapper     0 [030] 79090.376153:    probe:ip_vs_in_hook: (ffffffffc0feebe0) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x44 inner_network_header=0x52 inner_transport_header=0x66 mac_header=0x44 network_header=0x32 transport_header=0x46 len=1544 data_len=12
swapper     0 [030] 79090.376155: probe:__dev_queue_xmit: (ffffffff85070e60) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x44 inner_network_header=0x52 inner_transport_header=0x66 mac_header=0x44 network_header=0x32 transport_header=0x46 len=1558 data_len=12</code></pre>
            <p>When the packet reaches <code>ip_vs_in_hook</code> on the way into IPVS, it only has outer packet headers. This makes sense, as the packet hasn't been encapsulated by IPVS yet. When the same hook is called again as the packet leaves IPVS, we see the encapsulation is already completed. We also find that the outer MAC header and the inner MAC header are at the same offset. Knowing how the tunnel header length is calculated in <code>__skb_udp_tunnel_segment</code>, we can also see where "-2" comes from. The <code>inner_mac_header</code> is at offset 0x44, while the <code>transport_header</code> is at offset 0x46.</p><p>When packets pass through network interfaces, the code for the interface reserves some space for the MAC header. For example, on an Ethernet interface some space is reserved for the Ethernet headers. FOU and GUE do not use link-layer addressing like Ethernet, so no space needs to be reserved. When we were using the virtual interfaces with FOU, the interface code correctly handled this case by setting the inner MAC header offset to the same as the inner network offset, effectively making it zero bytes long.</p><p>When using the encapsulation inside IPVS, this was not being accounted for, resulting in the inner MAC header offset being invalid. When packets did not need to be segmented, this was a harmless bug. When segmenting, however, the tunnel header size needs to be known so it can be duplicated onto all segmented packets.</p><p>I created a patch to correct the MAC header offset in IPVS's encapsulation code. The correction to the inner header offsets needed to be duplicated for IPv4 and IPv6.</p>
            <pre><code>diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index c7652da78c88..9193e109e6b3 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -1207,6 +1207,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	skb-&gt;transport_header = skb-&gt;network_header;
 
 	skb_set_inner_ipproto(skb, next_protocol);
+	skb_set_inner_mac_header(skb, skb_inner_network_offset(skb));
 
 	if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
 		bool check = false;
@@ -1349,6 +1350,7 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	skb-&gt;transport_header = skb-&gt;network_header;
 
 	skb_set_inner_ipproto(skb, next_protocol);
+	skb_set_inner_mac_header(skb, skb_inner_network_offset(skb));
 
 	if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
 		bool check = false;</code></pre>
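Plugging the offsets from the trace into the tunnel-header calculation shows the effect of the one-line fix:

```python
# Offsets from the second ip_vs_in_hook trace line above
transport_header     = 0x46   # outer UDP header
inner_mac_header     = 0x44   # bogus: left at the outer MAC offset
inner_network_header = 0x52   # inner IPv4 header

# __skb_udp_tunnel_segment computes tnl_hlen as the inner MAC offset
# minus the outer transport offset:
assert inner_mac_header - transport_header == -2      # before the patch

# skb_set_inner_mac_header(skb, skb_inner_network_offset(skb)) moves
# the inner MAC header to the inner network offset (a zero-length MAC
# header), yielding the real tunnel header size:
assert inner_network_header - transport_header == 12  # 8B UDP + 4B GUE
```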
            <p>When the patch was included in our kernel and deployed, the difference in end-to-end request latency was immediately noticeable. We also no longer observed packet drops for requests with large payloads.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SVrX3EiwZFhiRRkN1BMar/0f450c536cc0e2f3f6535dc23eb0b37e/image7-8.png" />
            
            </figure><p>I've <a href="https://lore.kernel.org/all/20230609205842.2333727-1-terin@cloudflare.com/T/#u">submitted the Linux kernel patch upstream</a>, which is currently queued for inclusion in future releases of the kernel.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Through this post, we hope to have provided some insight into the process of investigating networking issues and how to begin debugging issues in the kernel using <code>pwru</code> and <code>kprobe</code> tracepoints. This investigation led to an upstream Linux kernel patch, and allowed us to safely roll out IPVS's native encapsulation.</p>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4XFiBPDirEDvUVaOc0IHH8</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Argo Smart Routing for UDP: speeding up gaming, real-time communications and more]]></title>
            <link>https://blog.cloudflare.com/turbo-charge-gaming-and-streaming-with-argo-for-udp/</link>
            <pubDate>Tue, 20 Jun 2023 13:00:40 GMT</pubDate>
            <description><![CDATA[ Today, Cloudflare is super excited to announce that we’re bringing traffic acceleration to customers’ UDP traffic. Now, you can improve the latency of UDP-based applications like video games, voice calls, and video meetings by up to 17%. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64tixskqgONiSTACdvMbMX/3502932df801cd9691f432892495f379/image1-14.png" />
            
            </figure><p>Today, Cloudflare is super excited to announce that we’re bringing traffic acceleration to customers’ UDP traffic. Now, you can improve the latency of UDP-based applications like video games, voice calls, and video meetings by up to 17%. Combining the power of Argo Smart Routing (our traffic acceleration product) with UDP gives you the ability to supercharge your UDP-based traffic.</p>
    <div>
      <h3>When applications use TCP vs. UDP</h3>
      <a href="#when-applications-use-tcp-vs-udp">
        
      </a>
    </div>
    <p>Typically when people talk about the Internet, they think of websites they visit in their browsers, or apps that allow them to order food. This type of traffic is sent across the Internet via <a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">HTTP</a> which is built on top of the <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">Transmission Control Protocol</a> (TCP). However, there’s a lot more to the Internet than just browsing websites and using apps. Gaming, <a href="https://www.cloudflare.com/developer-platform/solutions/live-streaming/">live video</a>, or tunneling traffic to different networks via a VPN are all common applications that don’t use HTTP or TCP. These popular applications leverage the <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">User Datagram Protocol</a> (or UDP for short). To understand why these applications use UDP instead of TCP, we’ll need to dig into how these different applications work.</p><p>When you load a web page, you generally want to see the <i>entire</i> web page; the website would be confusing if parts of it are missing. For this reason, HTTP uses TCP as a method of transferring website data. TCP ensures that if a packet ever gets lost as it crosses the Internet, that packet will be resent. Having a reliable protocol like TCP is generally a good idea when 100% of the information sent needs to be loaded. It’s worth noting that later HTTP versions like <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a> actually deviated from TCP as a transmission protocol, but they still ensure packet delivery by handling packet retransmission using the <a href="/the-road-to-quic/">QUIC protocol</a>.</p><p>There are other applications that prioritize quickly sending real time data and are less concerned about perfectly delivering 100% of the data. 
Let’s explore Real-Time Communications (RTC) like video meetings as an example. If two people are streaming video live, all they care about is what is happening <i>now</i>. If a few packets are lost during the initial transmission, retransmission is usually too slow to render the lost packet data in the current video frame. TCP doesn’t really make sense in this scenario.</p><p>Instead, RTC protocols are built on top of UDP. TCP is like a formal back and forth conversation where every sentence matters. UDP is more like listening to your friend's stream of consciousness: you don’t care about every single bit as long as you get the gist of it. UDP transfers packet data with speed and efficiency without guaranteeing the delivery of those packets. This is perfect for applications like RTC where reducing latency is more important than occasionally losing a packet here or there. The same applies to gaming traffic; you generally want the most up-to-date information, and you don’t really care about retransmitting lost packets.</p><p>Gaming and RTC applications <i>really</i> care about <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a>. Latency is the length of time it takes a packet to be sent to a server plus the length of time to receive a response from the server (called <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">round-trip time or RTT</a>). In the case of video games, the higher the latency, the longer it will take for you to see other players move and the less time you’ll have to react to the game. With enough latency, games become unplayable: if the players on your screen are constantly blipping around it’s near impossible to interact with them. In RTC applications like video meetings, you’ll experience a delay between yourself and your counterpart. 
You may find yourselves accidentally talking over each other which isn’t a great experience.</p><p>Companies that host gaming or RTC infrastructure often try to reduce latency by spinning up servers that are geographically closer to their users. However, it’s common to have two users that are trying to have a video call between distant locations like Amsterdam and Los Angeles. No matter where you install your servers, that's still a long distance for that traffic to travel. The longer the path, the higher the chances are that you're going to run into congestion along the way. Congestion is just like a traffic jam on a highway, but for networks. Sometimes certain paths get overloaded with traffic. This causes delays and packets to get dropped. This is where Argo Smart Routing comes in.</p>
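<p>Before diving into Argo, the fire-and-forget behaviour described above is visible directly at the sockets API level. A minimal sketch (loopback addresses and an arbitrary payload, purely for illustration): the sender needs no handshake at all, it just emits a datagram and hopes it arrives.</p>

```python
import socket

# UDP is connectionless: the receiver just binds, the sender just sends.
# There is no handshake and no retransmission; a lost datagram is simply gone.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))           # port 0: let the OS pick one
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"video frame 42", addr)    # fire-and-forget, no connect() needed

data, peer = receiver.recvfrom(2048)
print(data)                               # b'video frame 42'

sender.close()
receiver.close()
```

<p>On loopback the datagram reliably arrives, but across a real network nothing in this code would notice or recover if it didn't; that is exactly the trade-off RTC and gaming protocols accept.</p>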
    <div>
      <h3>Argo Smart Routing</h3>
      <a href="#argo-smart-routing">
        
      </a>
    </div>
    <p>Cloudflare customers that want the best cross-Internet application performance rely on Argo Smart Routing’s traffic acceleration to reduce latency. Argo Smart Routing is like the GPS of the Internet. It uses real-time global network performance measurements to accelerate traffic, actively route around Internet congestion, and increase your traffic’s stability by reducing packet loss and jitter.</p><p>Argo Smart Routing was launched in <a href="/argo/">May 2017</a>, and its first iteration focused on reducing website traffic latency. Since then, we’ve <a href="/argo-v2/">improved Argo Smart Routing</a> and also <a href="/argo-spectrum/">launched Argo Smart Routing for Spectrum TCP traffic</a>, which reduces latency for any TCP-based protocol. Today, we’re excited to bring the same Argo Smart Routing technology to customers’ UDP traffic, which will reduce latency, packet loss, and jitter in gaming and live audio/video applications.</p><p>Argo Smart Routing accelerates Internet traffic by sending millions of synthetic probes from every Cloudflare data center to the origin of every Cloudflare customer. These probes measure the latency of all possible routes between Cloudflare’s data centers and a customer’s origin. We then combine that with probes running between Cloudflare’s data centers to calculate possible routes. When an Internet user makes a request to an origin, Cloudflare combines the results of our real-time global latency measurements, examines Internet congestion data, and calculates the optimal route for the customer’s traffic. To enable Argo Smart Routing for UDP traffic, Cloudflare extended the route computations typically used for HTTP and TCP traffic and applied them to UDP traffic.</p><p>We knew that Argo Smart Routing offered impressive benefits for HTTP traffic, reducing time to first byte by up to 30% on average for customers. 
But UDP can be treated differently by networks, so we were curious to see whether we would see a similar reduction in round-trip time for UDP. To validate, we ran a set of tests. We set up an origin in Iowa, USA and had a client connect to it from Tokyo, Japan. Compared to a regular Spectrum setup, we saw a 17.3% decrease in average round-trip time. For the standard setup, Spectrum was able to proxy packets to Iowa in 173.3 milliseconds on average. Comparatively, turning on Argo Smart Routing reduced the average round-trip time to 143.3 milliseconds. The distance between those two cities is 6,074 miles (9,776 kilometers), meaning we've effectively moved the two closer to each other by over a thousand miles (or 1,609 km) just by turning on this feature.</p><p>We're incredibly excited about Argo Smart Routing for UDP and what our customers will use it for. If you're in gaming or real-time communications, or even have a different use case that you think would benefit from speeding up UDP traffic, please contact your account team today. We are currently in closed beta but are excited about accepting applications.</p> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">5qKIhJCi7nIZIQudfOBtgh</guid>
            <dc:creator>Achiel van der Mandele</dc:creator>
            <dc:creator>Chris Draper</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to stop running out of ephemeral ports and start to love long-lived connections]]></title>
            <link>https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/</link>
            <pubDate>Wed, 02 Feb 2022 09:53:28 GMT</pubDate>
            <description><![CDATA[ Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way ]]></description>
            <content:encoded><![CDATA[ <p>Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.</p><p>It's particularly interesting when basic things used everywhere fail. Recently we've reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the <code>connect()</code> system call.</p><p>Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here's one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:</p>
            <pre><code>marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address</code></pre>
            <p>You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:</p>
            <pre><code>marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use</code></pre>
            <p>This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!</p><p>In both cases the problem was Linux running out of ephemeral ports. When this happens, the system is unable to establish any outgoing connections. This is a pretty serious failure. It's usually transient, and if you don't know what to look for, it might be hard to debug.</p><p>The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.</p><p>In this blog post I'll explain why we had these issues, how we worked around them, and present userspace code implementing an improved variant of the <code>connect()</code> syscall.</p>
    <div>
      <h3>Outgoing connections on Linux part 1 - TCP</h3>
      <a href="#outgoing-connections-on-linux-part-1-tcp">
        
      </a>
    </div>
    <p>Let's start with a bit of historical background.</p>
    <div>
      <h3>Long-lived connections</h3>
      <a href="#long-lived-connections">
        
      </a>
    </div>
    <p>Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:</p><ul><li><p><a href="/cloudflare-now-supports-websockets/">Cloudflare Now Supports WebSockets</a></p></li><li><p><a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">Bind before connect</a></p></li></ul><p>If you skim these blogs, you'll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling on the total number of outgoing connections imposed by the Linux networking stack.</p><p>In a pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.</p>
    <div>
      <h3>Basics - how port allocation works</h3>
      <a href="#basics-how-port-allocation-works">
        
      </a>
    </div>
    <p>When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve <code>cloudflare.com</code> to the '104.1.1.229' IPv4 address. A simple Python program can establish a connection to it with the following code:</p>
            <pre><code>cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))</code></pre>
            <p>The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zDTSzTRPl4JRrdWfjbzkP/63e0de7a453377f267b41ee0fa394a33/image4-1.png" />
            
            </figure><p>The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with <code>ip route get</code>:</p>
            <pre><code>$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache</code></pre>
            <p>The <code>src</code> parameter in the result shows the discovered source IP address that should be used when going towards that specific target.</p><p>The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:</p>
            <pre><code>$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =</code></pre>
            <p>The <code>ip_local_port_range</code> sets the low and high (inclusive) port range to be used for outgoing connections. The <code>ip_local_reserved_ports</code> is used to skip specific ports if the operator needs to reserve them for services.</p>
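<p>These values can also be read straight from procfs, which makes it easy to check the size of the usable range programmatically. A small sketch (assumes a Linux <code>/proc</code> filesystem; counting individual reserved ports is left out, since the file can contain ranges like <code>9000-9100</code>):</p>

```python
# Read the ephemeral port range directly from procfs (Linux only).
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

# Ports reserved by the operator are excluded from the usable range.
with open("/proc/sys/net/ipv4/ip_local_reserved_ports") as f:
    reserved = f.read().strip()

usable = high - low + 1   # both ends of the range are inclusive
print(f"ephemeral range {low}-{high}: {usable} ports "
      f"(reserved: {reserved or 'none'})")
```

<p>With the default range this prints 28,232 usable ports, matching the arithmetic later in the post.</p>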
    <div>
      <h3>Vanilla TCP is a happy case</h3>
      <a href="#vanilla-tcp-is-a-happy-case">
        
      </a>
    </div>
    <p>The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!</p><p>In TCP the connection is identified by a full 4-tuple, for example:</p>
<table>
<thead>
  <tr>
    <td><span>full 4-tuple</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>104.1.1.229</span></td>
    <td><span>80</span></td>
  </tr>
</thead>
</table><p>In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:</p>
<table>
<thead>
  <tr>
    <th><span>full 4-tuple #A</span></th>
    <th><span>192.168.1.8</span></th>
    <th><span>32768</span></th>
    <th><span>104.1.1.229</span></th>
    <th><span>80</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>full 4-tuple #B</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>151.101.1.57</span></td>
    <td><span>80</span></td>
  </tr>
</tbody>
</table><p>This "source two-tuple" sharing can happen in practice when establishing connections using the vanilla TCP code:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )</code></pre>
            <p>But slightly different code can prevent this sharing, as we’ll discuss.</p><p>In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections, showing:</p><ul><li><p>The technique’s description</p></li><li><p>The typical <code>errno</code> value in the case of port exhaustion</p></li><li><p>And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination</p></li></ul><p>The last column is the most important since it shows if there is a low limit on total concurrent connections. As we're going to see later, the limit is present more often than we'd expect.</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>In the case of generic TCP, things work as intended. Towards a single destination it's possible to have as many connections as the ephemeral range allows. When the range is exhausted (against a single destination), we'll see an EADDRNOTAVAIL error. The system is also able to correctly reuse the local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.</p>
    <div>
      <h3>Manually selecting source IP address</h3>
      <a href="#manually-selecting-source-ip-address">
        
      </a>
    </div>
    <p>Let's go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and <a href="/1111-warp-better-vpn">WARP</a>.</p><p>For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a>: it's appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.</p><p>To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it's technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.</p><p>Instead, before calling <code>connect()</code>, our applications select the source IP with the <code>bind()</code> syscall. A trick we call "bind-before-connect":</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRINUSE</span></td>
    <td><span>no </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>This code looks rather innocent, but it hides a considerable drawback. When calling <code>bind()</code>, the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can't know what we plan to do with the socket. It's totally possible we want to <code>listen()</code> on it, in which case sharing the source IP/port with a connected socket will be a disaster! That's why the source two-tuple selected when calling <code>bind()</code> must be unique.</p><p>Due to this API limitation, in this technique the source two-tuple can't be reused. Each connection effectively "locks" a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.</p><p>Fortunately, it's fixable.</p>
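<p>Before moving to the fix, the "locking" is easy to observe without sending a single packet: once plain <code>bind()</code> has claimed a two-tuple, a second <code>bind()</code> to the same two-tuple fails, no matter what destinations the two sockets were meant for. A small loopback demonstration (illustrative only):</p>

```python
import errno
import socket

# Plain bind() must assume the worst (e.g. a future listen()), so the
# two-tuple it picks is reserved exclusively, regardless of destination.
sd1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd1.bind(("127.0.0.1", 0))                # kernel picks a free ephemeral port
src_ip, src_port = sd1.getsockname()

err = None
sd2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sd2.bind((src_ip, src_port))          # same two-tuple: rejected outright
except OSError as e:
    err = errno.errorcode[e.errno]

print(err)                                # EADDRINUSE
sd2.close()
sd1.close()
```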
    <div>
      <h3>IP_BIND_ADDRESS_NO_PORT</h3>
      <a href="#ip_bind_address_no_port">
        
      </a>
    </div>
    <p>Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying <code>bind()</code>+ <code>connect()</code> a couple of times on error. This worked ok, but later in 2015 <a href="https://kernelnewbies.org/Linux_4.2#Networking">Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option</a>. This option tells the kernel to delay reserving the source port:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>IP_BIND_ADDRESS_NO_PORT<br />bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.</p>
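<p>The deferred reservation is directly observable with <code>getsockname()</code>: after a <code>bind()</code> with IP_BIND_ADDRESS_NO_PORT set, the source IP is fixed but the port stays 0 until <code>connect()</code> runs. A quick sketch (Linux-specific; the constant's value of 24 from <code>linux/in.h</code> is used as a fallback for Python builds that don't expose it):</p>

```python
import socket

# IP_BIND_ADDRESS_NO_PORT is 24 on Linux; fall back to that value if this
# Python build doesn't expose the constant.
NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.IPPROTO_IP, NO_PORT, 1)
sd.bind(("127.0.0.1", 0))

# The source IP is recorded, but no ephemeral port has been reserved yet:
ip, port = sd.getsockname()
print(ip, port)       # 127.0.0.1 0
sd.close()
```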
    <div>
      <h3>Explicitly selecting a source port</h3>
      <a href="#explicitly-selecting-a-source-port">
        
      </a>
    </div>
    <p>Sometimes an application needs to select a specific source port. For example: the operator wants to control the full 4-tuple in order to debug ECMP routing issues.</p><p>Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the <code>--local-port</code> option to do this¹:</p>
            <pre><code>$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...</code></pre>
            <p>In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.</p><p>But setting the source port manually is not easy. We're back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling <code>bind()</code> with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            <p>Our summary table:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR<br />bind(src_IP, src_port)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>Here, the user takes responsibility for handling conflicts when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case <code>connect()</code> will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.</p>
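<p>A retry loop for this scheme can be sketched as follows: walk a list of acceptable source ports and, on a conflict, move on to the next candidate. (The loopback listener below exists only so that <code>connect()</code> has something to reach, and the candidate port range is arbitrary.)</p>

```python
import errno
import socket

# A throwaway listener so connect() has a destination (illustration only).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
dst = listener.getsockname()

def connect_from(src_ip, candidate_ports, dst):
    """Try source ports in order until one yields a usable 4-tuple."""
    for port in candidate_ports:
        sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sd.bind((src_ip, port))
            sd.connect(dst)
            return sd
        except OSError as e:
            sd.close()
            if e.errno in (errno.EADDRINUSE, errno.EADDRNOTAVAIL):
                continue          # 2- or 4-tuple conflict: try the next port
            raise
    raise OSError(errno.EADDRNOTAVAIL, "no acceptable source port left")

sd = connect_from("127.0.0.1", range(40000, 40010), dst)
src = sd.getsockname()
print(src)
sd.close()
listener.close()
```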
    <div>
      <h3>Userspace connectx implementation</h3>
      <a href="#userspace-connectx-implementation">
        
      </a>
    </div>
    <p>With these tricks, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L93-L110">we can implement a common function and call it <code>connectx</code></a>. It will do what <code>bind()</code>+<code>connect()</code> should, but won't have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:</p>
            <pre><code>def connectx(source, destination):
    """source = (source_IP, source_port), destination = (destination_IP, destination_port)"""</code></pre>
            <p>We have three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>technique</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>vanilla connect()</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>IP_BIND_ADDRESS_NO_PORT</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>The name we chose isn't an accident. MacOS (specifically the underlying Darwin OS) has exactly that function implemented <a href="https://www.manpagez.com/man/2/connectx">as a <code>connectx()</code> system call</a> (<a href="https://github.com/apple/darwin-xnu/blob/a1babec6b135d1f35b2590a1990af3c5c5393479/bsd/netinet/tcp_usrreq.c#L517">implementation</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ixMd6STGDjs1IO4DhAaFQ/3cbca6a9ec28010fb15f4004e2450587/image2.png" />
            
            </figure><p>It's more powerful than our <code>connectx</code> code, since it supports TCP Fast Open.</p><p>Should we, Linux users, be envious? For TCP, it's possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.</p><p>But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.</p>
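<p>To wrap up the TCP story, the three cases from the table above can be folded into a single userspace <code>connectx()</code> sketch. This is a simplified Python variant of the code linked earlier, not the real implementation: IPv4 only, minimal error handling, and the Linux value 24 for IP_BIND_ADDRESS_NO_PORT as a fallback.</p>

```python
import socket

# Linux value from <linux/in.h>; fall back if this Python doesn't expose it.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def connectx(source, destination):
    """source = (src_IP or None, src_port); destination = (dst_IP, dst_port)."""
    src_ip, src_port = source
    sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if src_ip is None:
        # Case 1: vanilla connect(); the OS picks source IP and port.
        pass
    elif src_port == 0:
        # Case 2: pin the source IP, defer port selection to connect().
        sd.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        sd.bind((src_ip, 0))
    else:
        # Case 3: full 4-tuple; connect() raises EADDRNOTAVAIL on a
        # conflict and the caller should retry with another source port.
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sd.bind((src_ip, src_port))
    sd.connect(destination)
    return sd
```

<p>Called as <code>connectx((None, 0), dst)</code>, <code>connectx((src_IP, 0), dst)</code> or with a full 4-tuple, the three branches map onto the three rows of the table above.</p>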
    <div>
      <h3>Outgoing connections on Linux - part 2 - UDP</h3>
      <a href="#outgoing-connections-on-linux-part-2-udp">
        
      </a>
    </div>
    <p>In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:</p><ul><li><p>Vanilla egress: operating system chooses the outgoing IP and port</p></li><li><p>Source IP selection: user selects outgoing IP but the OS chooses port</p></li><li><p>Full 4-tuple: user selects full 4-tuple for the connection</p></li></ul><p>We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.</p><p>It's time to extend our implementation to UDP. This is going to be harder.</p><p>For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It's totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will "overshadow" the older one. It's surprisingly hard to detect such a situation. To get UDP <code>connectx()</code> right, we will need to work around this "overshadowing" problem.</p>
    <div>
      <h3>Vanilla UDP is limited</h3>
      <a href="#vanilla-udp-is-limited">
        
      </a>
    </div>
    <p>It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can't have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.</p><p>Ok, let's start with the simplest and most common way of establishing outgoing UDP connections:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>no </span><span>(bad!)</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.</p><p>On TCP we were able to slap on a <code>setsockopt</code> and move on. No such easy workaround is available for UDP.</p><p>For UDP, without REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During <code>connect()</code> it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason that it looks for a unique 2-tuple as opposed to a unique 4-tuple during <code>connect()</code>. This suboptimal behavior might be fixable.</p>
    <div>
      <h3>SO_REUSEADDR is hard</h3>
      <a href="#so_reuseaddr-is-hard">
        
      </a>
    </div>
    <p>To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>yes </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>In other words, we can't just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won't be notified by any error. This is unacceptably bad.</p>
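<p>The failure mode is easy to reproduce on loopback: with SO_REUSEADDR set, a second socket can bind to an already-used two-tuple and then <code>connect()</code> to the same destination, silently creating a duplicate 4-tuple. A short demonstration (the destination address is arbitrary; for UDP, <code>connect()</code> succeeds even with nothing listening there):</p>

```python
import socket

DST = ("127.0.0.1", 5353)    # arbitrary destination, nothing needs to listen

def connected_udp(local_addr):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sd.bind(local_addr)
    sd.connect(DST)
    return sd

# First socket: the OS picks a source port.
a = connected_udp(("127.0.0.1", 0))

# Second socket: reuse the exact same source 2-tuple and destination.
# The kernel raises no error, yet the 4-tuples are now identical and
# the newer socket overshadows the older one.
b = connected_udp(a.getsockname())

print(a.getsockname() == b.getsockname())   # True
print(a.getpeername() == b.getpeername())   # True
```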
    <div>
      <h3>Detecting socket conflicts with eBPF</h3>
      <a href="#detecting-socket-conflicts-with-ebpf">
        
      </a>
    </div>
    <p>We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to attach code to the <code>connect()</code> syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF program is called every time a process under a given cgroup runs the <code>connect()</code> syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from the UNCONNECTED to CONNECTED state.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-connectx/ebpf_connect4">Here is how to load and attach our eBPF program</a>:</p>
            <pre><code>bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4</code></pre>
            <p>With this hook in place, we greatly reduce the probability of overshadowing:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>INET4_CONNECT hook</span><br /><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>manual port discovery, EPERM on conflict</span></td>
    <td><span>yes</span></td>
    <td><span>yes, but small</span></td>
  </tr>
</tbody>
</table><p>However, this solution is limited. First, it doesn't work for sockets with an automatically assigned source IP or source port; it only helps when the user manually specifies the full 4-tuple from userspace. Second, there is a classic race condition: we don't grab any lock, so a conflicting socket can be created on another CPU between our eBPF conflict check and the completion of the real <code>connect()</code> syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.</p>
    <div>
      <h3>Socket traversal - SOCK_DIAG ss way</h3>
      <a href="#socket-traversal-sock_diag-ss-way">
        
      </a>
    </div>
    <p>There is another way to verify whether a conflicting socket already exists: we can check for connected sockets in userspace. It's possible to do this quite effectively, without any privileges, using the SOCK_DIAG_BY_FAMILY feature of the <code>netlink</code> interface. This is the same technique the <code>ss</code> tool uses to print out the sockets available on the system.</p><p>The netlink code is not even all that complicated. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L23">Take a look at the code</a>. Inside the kernel, it goes <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/udp_diag.c#L28">quickly into a fast <code>__udp_lookup()</code> routine</a>. This is great: we can avoid iterating over all sockets on the system.</p><p>With that function handy, we can draft our UDP code:</p>
            <pre><code>sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )</code></pre>
            <p>This code has the same race condition as the inet connect eBPF hook above, but it's a good starting point. We need some locking to avoid the race, and perhaps it's possible to do that in userspace.</p>
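<p>One wrinkle: SO_COOKIE isn't exported by Python's <code>socket</code> module, so its value (57 on Linux) has to be supplied by hand. The cookie is a stable, unique 64-bit socket identifier, which is what lets us confirm that the netlink lookup returned <em>our</em> socket:</p>

```python
import socket
import struct

SO_COOKIE = 57  # Linux value; not exposed by Python's socket module

def socket_cookie(sd):
    # Every socket gets a distinct, stable 64-bit cookie from the kernel.
    return struct.unpack('Q', sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8))[0]

a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
```

<p>Two different sockets always yield two different cookies, while repeated reads of the same socket yield the same one.</p>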
    <div>
      <h3>SO_REUSEADDR as a lock</h3>
      <a href="#so_reuseaddr-as-a-lock">
        
      </a>
    </div>
    <p>Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:</p>
            <pre><code>sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)</code></pre>
            <p>The idea here is:</p><ul><li><p>We need REUSEADDR around bind, otherwise it wouldn't be possible to reuse a local port. It's possible to clear REUSEADDR after bind. Doing so leaves the kernel socket state technically inconsistent, but it doesn't hurt anything in practice.</p></li><li><p>By clearing REUSEADDR, we're locking new sockets out of using that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.</p></li><li><p>At this point, if the verification succeeds, we can perform <code>connect()</code> and have a guarantee that the 4-tuple won't be reused by another socket at any point in the process.</p></li></ul><p>This is rather convoluted and hacky, but it satisfies our requirements:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>REUSEADDR as a lock</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>Sadly, this scheme only works when we know the full 4-tuple up front, so we can't rely on the kernel's automatic source IP or source port assignment.</p>
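<p>The locking effect itself is easy to demonstrate: once a bound socket clears SO_REUSEADDR, no other socket can bind the same 2-tuple, even with SO_REUSEADDR set (a sketch of the critical section only):</p>

```python
import errno
import socket

def bind_udp(addr, reuse_after_bind):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(addr)
    # Clearing REUSEADDR after bind() "locks" the local 2-tuple: reuse is
    # only allowed when *all* owners of the port have REUSEADDR set.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
                 1 if reuse_after_bind else 0)
    return s

holder = bind_udp(('127.0.0.1', 0), reuse_after_bind=False)
try:
    bind_udp(holder.getsockname(), reuse_after_bind=True)
    conflict = None
except OSError as e:
    conflict = e.errno  # EADDRINUSE: the "lock" is held
```

<p>Any other cooperating socket attempting the same bind during the critical section gets EADDRINUSE, which is exactly the mutual exclusion the verification step needs.</p>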
    <div>
      <h3>Faking source IP and port discovery</h3>
      <a href="#faking-source-ip-and-port-discovery">
        
      </a>
    </div>
    <p>When the user calls <code>connect()</code> with only the target 2-tuple (destination IP and port), the kernel must fill in the missing bits: the source IP and source port. Unfortunately, the algorithm described above expects the full 4-tuple to be known in advance.</p><p>One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L204">here's a snippet of our code</a>:</p>
            <pre><code>def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo is None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...</code></pre>
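<p>A self-contained version of the same idea might look like this (a sketch with a hard-coded ephemeral range, instead of reading <code>/proc/sys/net/ipv4/ip_local_port_range</code> as our real code does):</p>

```python
import random
import socket

def get_udp_port(src_ip, lo=32768, hi=60999):
    # Walk the ephemeral range from a random starting point and return
    # the first port we can bind without SO_REUSEADDR.
    span = hi - lo + 1
    start = random.randint(lo, hi)
    for i in range(span):
        port = lo + (start - lo + i) % span
        sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sd.bind((src_ip, port))
        except OSError:
            sd.close()
            continue
        return sd, port
    raise OSError('ephemeral port range exhausted')

sd, port = get_udp_port('127.0.0.1')
```

<p>The random starting offset keeps concurrent callers from piling onto the same ports, mimicking the kernel's own ephemeral port selection.</p>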
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Combining manual source IP and port discovery with the REUSEADDR locking dance, we get a decent userspace implementation of <code>connectx()</code> for UDP.</p><p>We have covered all three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>comments</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>manual source IP and source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>manual source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>just our "REUSEADDR as lock" technique</span></td>
  </tr>
</tbody>
</table><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L116-L166">Take a look at the full code</a>.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.</p><p>We explained how to allow source port reuse and avoid the ephemeral-port-range limit. We showed a userspace <code>connectx()</code> function, which is a better way of creating outgoing TCP and UDP connections on Linux.</p><p>Our UDP code is more complex: it is based on little-known low-level features, assumes cooperation between tenants, and relies on undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.</p><p>The <code>connectx()</code> functionality is valuable and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.</p><p>___</p><p>¹ On a side note, on the second cURL run it fails due to TIME_WAIT sockets: "bind failed with errno 98: Address already in use".</p><p>One option is to wait for the TIME_WAIT socket to die, or work around this with the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/killtw.py">time-wait sockets kill script</a>. Killing TIME_WAIT sockets is generally a bad idea: it violates the protocol, is usually unnecessary, and sometimes doesn't work. But hey, in some extreme cases it's good to know what's possible. Just saying.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">319tj39kXPyzuiPbC755uC</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Extending Cloudflare’s Zero Trust platform to support UDP and Internal DNS]]></title>
            <link>https://blog.cloudflare.com/extending-cloudflares-zero-trust-platform-to-support-udp-and-internal-dns/</link>
            <pubDate>Wed, 08 Dec 2021 13:59:15 GMT</pubDate>
            <description><![CDATA[ Last year, we launched a new feature which empowered users to begin building a private network on Cloudflare. Today, we’re excited to announce even more features which make your Zero Trust migration easier than ever.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>At the end of 2020, Cloudflare empowered organizations to start <a href="/build-your-own-private-network-on-cloudflare/">building a private network</a> on top of our network. Using Cloudflare Tunnel on the server side, and Cloudflare WARP on the client side, the need for a legacy VPN was eliminated. Fast-forward to today, and thousands of organizations have gone on this journey with us — unplugging their legacy VPN concentrators, internal firewalls, and load balancers. They’ve eliminated the need to maintain all this legacy hardware; they’ve dramatically improved speeds for end users; and they’re able to maintain Zero Trust rules organization-wide.</p><p>We started with TCP, which is powerful because it enables an important range of use cases. However, to truly replace a VPN, you need to be able to cover UDP, too. Starting today, we’re excited to provide early access to UDP on Cloudflare’s Zero Trust platform. And even better: as a result of supporting UDP, we can offer Internal DNS — so there’s no need to migrate thousands of private hostnames by hand to override DNS rules. You can get started with Cloudflare for Teams for free today by signing up <a href="https://dash.cloudflare.com/sign-up/teams">here</a>; and if you’d like to join the waitlist to gain early access to UDP and Internal DNS, please visit <a href="https://cloudflare.com/zero-trust/lp/private-dns-waitlist">here</a>.</p>
    <div>
      <h2>The topology of a private network on Cloudflare</h2>
      <a href="#the-topology-of-a-private-network-on-cloudflare">
        
      </a>
    </div>
    <p>Building out a private network has two primary components: the infrastructure side, and the client side.</p><p>The infrastructure side of the equation is powered by Cloudflare Tunnel, which simply connects your infrastructure (whether that be a singular application, many applications, or an entire <a href="https://www.cloudflare.com/learning/access-management/what-is-network-segmentation/">network segment</a>) to Cloudflare. This is made possible by running a simple command-line daemon in your environment to establish multiple secure, outbound-only, load-balanced links to Cloudflare. Simply put, Tunnel is what connects your network to Cloudflare.</p><p>On the other side of this equation, we need your end users to be able to easily connect to Cloudflare and, more importantly, your network. This connection is handled by our robust device client, <a href="/warp-for-desktop/">Cloudflare WARP</a>. This client can be rolled out to your entire organization in just a few minutes using your in-house MDM tooling, and it establishes a secure, WireGuard-based connection from your users’ devices to the Cloudflare network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2m1z6HnFMDxGRFpphnS5kB/6a2cf02939d4a1e9b3f8829c9fcc656f/image1-36.png" />
            
            </figure><p>Now that we have your infrastructure and your users connected to Cloudflare, it becomes easy to tag your applications and layer on <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust security controls</a> to verify both identity and device-centric rules for each and every request on your network.</p><p>Up until now though, only TCP was supported.</p>
    <div>
      <h2>Extending Cloudflare Zero Trust to support UDP</h2>
      <a href="#extending-cloudflare-zero-trust-to-support-udp">
        
      </a>
    </div>
    <p>Over the past year, with more and more users adopting Cloudflare’s Zero Trust platform, we have gathered data surrounding all the use cases that are keeping VPNs plugged in. Of those, the most common need has been blanket support for UDP-based traffic. Modern protocols like QUIC take advantage of UDP’s lightweight architecture — and at Cloudflare, we believe it is part of our mission to advance these new standards to help build a better Internet.</p><p>Today, we’re excited to open an official waitlist for those who would like early access to Cloudflare for Teams with UDP support.</p>
    <div>
      <h3>What is UDP and why does it matter?</h3>
      <a href="#what-is-udp-and-why-does-it-matter">
        
      </a>
    </div>
    <p>UDP is a vital component of the Internet. Without it, many applications would be rendered woefully inadequate for modern use. Applications which depend on near real-time communication such as <a href="https://www.cloudflare.com/developer-platform/solutions/live-streaming/">video streaming</a> or VoIP services are prime examples of why we need UDP and the role it fills for the Internet. At their core, however, TCP and UDP achieve the same results — just through vastly different means. Each has its own unique benefits and drawbacks, which are always felt downstream by the applications that utilize them.</p><p>Here’s a quick metaphor for how they both work: imagine asking somebody a question. TCP should look pretty familiar: you would typically say hi, wait for them to say hi back, ask how they are, wait for their response, and then ask them what you want.</p><p>UDP, on the other hand, is the equivalent of just walking up to someone and asking what you want without checking to make sure that they're listening. With this approach, some of your question may be missed, but that's fine as long as you get an answer.</p><p>Like the conversation above, with UDP many applications actually don’t care if some data gets lost; video streaming or game servers are good examples here. If you were to lose a packet in transit while streaming, you wouldn’t want the entire stream to be interrupted until this packet is received — you’d rather just drop the packet and move on. Another reason application developers may utilize UDP is because they’d prefer to develop their own controls around connection, transmission, and quality control rather than use TCP’s standardized ones.</p><p>For Cloudflare, end-to-end support for UDP-based traffic will unlock a number of new use cases. Here are a few we think you’ll agree are pretty exciting.</p>
    <div>
      <h3>Internal DNS Resolvers</h3>
      <a href="#internal-dns-resolvers">
        
      </a>
    </div>
    <p>Most corporate networks require an internal DNS resolver to disseminate access to resources made available over their Intranet. Your Intranet needs an internal DNS resolver for many of the same reasons the Internet needs public DNS resolvers. In short, humans are good at many things, but remembering long strings of numbers (in this case IP addresses) is not one of them. Both public and internal DNS resolvers were designed to solve this problem (and <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">much more</a>) for us.</p><p>In the corporate world, it would be needlessly painful to ask internal users to navigate to, say, 192.168.0.1 to simply reach Sharepoint or OneDrive. Instead, it’s much easier to create DNS entries for each resource and let your internal resolver handle all the mapping for your users as this is something humans are actually quite good at.</p><p>Under the hood, DNS queries generally consist of a single UDP request from the client. The server can then return a single reply to the client. Since DNS requests are not very large, they can often be sent and received in a single packet. This makes support for UDP across our Zero Trust platform a key enabler to pulling the plug on your VPN.</p>
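<p>That single-packet shape is easy to see: a complete DNS query, header and question included, comfortably fits in one datagram. A hand-rolled sketch (real clients would use a resolver library; the hostname is illustrative):</p>

```python
import struct

def build_dns_query(hostname, qid=0x1234):
    """Build a minimal single-packet DNS query for an A record."""
    # 12-byte header: id, flags (RD bit set), 1 question, 0 other records.
    header = struct.pack('!HHHHHH', qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b''.join(bytes([len(l)]) + l.encode() for l in hostname.split('.'))
    question = qname + b'\x00' + struct.pack('!HH', 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

pkt = build_dns_query('sharepoint.corp.internal')
```

<p>The whole query is a few dozen bytes, far under the classic 512-byte UDP DNS limit, so one <code>sendto()</code> and one reply complete the exchange.</p>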
    <div>
      <h3>Thick Client Applications</h3>
      <a href="#thick-client-applications">
        
      </a>
    </div>
    <p>Another common use case for UDP is thick client applications. One benefit of UDP we have discussed so far is that it is a lean protocol. It’s lean because the <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">three-way handshake</a> of TCP and other measures for reliability have been stripped out by design. In many cases, application developers still want these reliability controls, but are intimately familiar with their applications and know these controls could be better handled by tailoring them to their application. These thick client applications often perform critical business functions and must be supported end-to-end to migrate. As an example, legacy versions of Outlook may be implemented through thick clients where most of the operations are performed by the local machine, and only the sync interactions with Exchange servers occur over UDP.</p><p>Again, UDP support on our Zero Trust platform now means these types of applications are no reason to remain on your legacy VPN.</p>
    <div>
      <h3>And more…</h3>
      <a href="#and-more">
        
      </a>
    </div>
    <p>A huge portion of the world's Internet traffic is transported over UDP. Often, people equate time-sensitive applications with UDP, where occasionally dropping packets would be better than waiting — but there are a number of other use cases, and we’re excited to be able to provide sweeping support.</p>
    <div>
      <h2>How can I get started today?</h2>
      <a href="#how-can-i-get-started-today">
        
      </a>
    </div>
    <p>You can already get started building your private network on Cloudflare with our tutorials and guides in our developer documentation. Below is the critical path. And if you’re already a customer, and you’re interested in joining the waitlist for UDP and Internal DNS access, please skip ahead to the end of this post!</p>
    <div>
      <h3>Connecting your network to Cloudflare</h3>
      <a href="#connecting-your-network-to-cloudflare">
        
      </a>
    </div>
    <p>First, you need to <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/installation">install cloudflared</a> on your network and authenticate it with the command below:</p>
            <pre><code>cloudflared tunnel login</code></pre>
            <p>Next, you’ll create a tunnel with a user-friendly name to identify your network or environment.</p>
            <pre><code>cloudflared tunnel create acme-network</code></pre>
            <p>Finally, you’ll want to configure your tunnel with the IP/CIDR range of your private network. By doing this, you’re making the Cloudflare WARP agent aware that any requests to this IP range need to be routed to our new tunnel.</p>
            <pre><code>cloudflared tunnel route ip add 192.168.0.1/32</code></pre>
            <p>Then, all you need to do is run your tunnel!</p>
    <div>
      <h3>Connecting your users to your network</h3>
      <a href="#connecting-your-users-to-your-network">
        
      </a>
    </div>
    <p>To connect your first user, start by downloading the Cloudflare WARP agent on the device they’ll be connecting from, then follow the steps in our installer.</p><p>Next, you’ll visit the <a href="https://dash.teams.cloudflare.com">Teams Dashboard</a> and define who is allowed to access our network by creating an enrollment policy. This policy can be created under Settings &gt; Devices &gt; Device Enrollment. In the example below, you can see that we’re requiring users to be located in Canada and have an email address ending @cloudflare.com.</p><p>Once you’ve created this policy, you can enroll your first device by clicking the WARP desktop icon on your machine and navigating to preferences &gt; Account &gt; Login with Teams.</p><p>Last, we’ll remove the IP range we added to our Tunnel from the Exclude list in Settings &gt; Network &gt; Split Tunnels. This will ensure this traffic is, in fact, routed to Cloudflare and then sent to our private network Tunnel as intended.</p><p>In addition to the tutorial above, we also have in-product guides in the Teams Dashboard which go into more detail about each step and provide validation along the way.</p><p>To create your first Tunnel, navigate to the <a href="https://dash.teams.cloudflare.com/access/tunnels">Access &gt; Tunnels</a>.</p><p>To enroll your first device into WARP, navigate to <a href="https://dash.teams.cloudflare.com/team/devices">My Team &gt; Devices</a>.</p>
    <div>
      <h2>What’s Next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re incredibly excited to release our <a href="https://cloudflare.com/zero-trust/lp/private-dns-waitlist">waitlist</a> today and even more excited to launch this feature in the coming weeks. We’re just getting started with private network Tunnels and plan to continue adding more support for Zero Trust access rules for each request to each internal DNS hostname after launch. We’re also working on a number of efforts to measure performance and to ensure we remain the fastest Zero Trust platform — making using us a delight for your users, compared to the pain of using a legacy VPN.</p> ]]></content:encoded>
            <category><![CDATA[CIO Week]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">1lvtfumva5EYQOVoDyU1fm</guid>
            <dc:creator>Abe Carryl</dc:creator>
        </item>
        <item>
            <title><![CDATA[Everything you ever wanted to know about UDP sockets but were afraid to ask, part 1]]></title>
            <link>https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/</link>
            <pubDate>Thu, 25 Nov 2021 17:27:37 GMT</pubDate>
            <description><![CDATA[ Historically Cloudflare's core competency was operating an HTTP reverse proxy. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. ]]></description>
            <content:encoded><![CDATA[ <p>Snippet from internal presentation about UDP inner workings in Spectrum. Who said UDP is simple!</p><p>Historically Cloudflare's core competency was operating an <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">HTTP reverse proxy</a>. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. Recently though, we started operating large-scale stateful <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> services.</p><p>Stateful UDP is gaining popularity for a number of reasons:</p><p>— <a href="/quic-version-1-is-live-on-cloudflare/">QUIC</a> is a new transport protocol based on UDP; it powers HTTP/3. We see the adoption accelerating.</p><p>— <a href="/1111-warp-better-vpn/">We operate WARP</a> — our WireGuard-based tunneling service — which uses UDP under the hood.</p><p>— We have a lot of generic UDP traffic going through <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">our Spectrum service</a>.</p><p>Although UDP is simple in principle, there is a lot of domain knowledge needed to run things at scale. In this blog post we'll cover the basics: all you need to know about UDP servers to get started.</p>
    <div>
      <h3>Connected vs unconnected</h3>
      <a href="#connected-vs-unconnected">
        
      </a>
    </div>
    <p>How do you "accept" connections on a UDP server? If you are using unconnected sockets, you generally don't.</p><p>But let's start with the basics. UDP sockets can be "connected" (or "established") or "unconnected". Connected sockets have a full 4-tuple associated {source ip, source port, destination ip, destination port}, unconnected sockets have 2-tuple {bind ip, bind port}.</p><p>Traditionally the connected sockets were mostly used for outgoing flows, while unconnected for inbound "server" side connections.</p>
    <div>
      <h3>UDP client</h3>
      <a href="#udp-client">
        
      </a>
    </div>
    <p>As we'll learn today, these can be mixed. It is possible to use connected sockets for ingress handling, and unconnected ones for egress. To illustrate the latter, consider these two snippets. They do the same thing — send a packet to the DNS resolver. The first snippet uses a connected socket:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nLimg1UkSX4TIIYTaxASJ/630c7f32c10868d6cccee7736d5c713c/image4-20.png" />
            
            </figure><p>The second uses an unconnected one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Iq69Z2Nrk00zNeTvJyrVU/6cb272253a0906191cb6aabf6b2e7b3d/image7-14.png" />
            
            </figure><p>Which one is better? In the second case, when receiving, the programmer should verify the source IP of the packet. Otherwise, the program can get confused by some random inbound internet junk — like port scanning. It is tempting to reuse the socket descriptor and query another DNS server afterwards, but this would be a bad idea, particularly when dealing with DNS. For security, DNS assumes the client source port is unpredictable and short-lived.</p><p>Generally speaking, for outbound traffic it's preferable to use connected UDP sockets.</p><p>Connected sockets can also save a route lookup on each packet by employing a clever optimization — Linux can cache the route lookup result on <a href="https://elixir.bootlin.com/linux/v5.15.4/source/include/net/sock.h#L434">a connection struct</a>. Depending on the specifics of the setup this might save some CPU cycles.</p><p>For completeness, it is possible to roll a new source port and reuse a socket descriptor with an obscure trick called "dissolving of the socket association". It can be done with <code>connect(AF_UNSPEC)</code>, but this is rather advanced Linux magic.</p>
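<p>Since the pictured snippets are images, here is roughly what the two client styles boil down to (a loopback socket stands in for the resolver so the sketch is self-contained, and the query bytes are placeholders):</p>

```python
import socket

# A stand-in "resolver" on loopback, so the example needs no network.
resolver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
resolver.bind(('127.0.0.1', 0))
RESOLVER_ADDR = resolver.getsockname()

# Connected style: the kernel remembers the peer and filters replies for us.
cd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cd.connect(RESOLVER_ADDR)
cd.send(b'query')

# Unconnected style: we name the destination on every send, and on receive
# we would have to verify the source address against RESOLVER_ADDR ourselves.
ud = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ud.sendto(b'query', RESOLVER_ADDR)
```

<p>Both datagrams land on the resolver socket; the difference is entirely in who does the peer bookkeeping, the kernel or the application.</p>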
    <div>
      <h3>UDP server</h3>
      <a href="#udp-server">
        
      </a>
    </div>
    <p>Traditionally on the server side UDP requires unconnected sockets. Using them requires a bit of finesse. To illustrate this, let's write a UDP echo server. In practice, you probably shouldn't write such a server, due to the risk of becoming a DoS reflection vector. <a href="/how-to-receive-a-million-packets/">Among other protections</a>, like rate limiting, UDP services should always respond with a strictly smaller amount of data than was sent in the initial packet. But let's not digress: a naive UDP echo server might look like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/se9nCWduF9hcK8K1XwiWW/9c9625acfd065d3a66a2b98b99e15708/image6-14.png" />
            
            </figure><p>This code raises questions:</p><p>— Received packets can be longer than 2048 bytes. This can happen over loopback, when using jumbo frames, or with the help of IP fragmentation.</p><p>— It's totally possible for the received packet to have an empty payload.</p><p>— What about inbound ICMP errors?</p><p>These problems are specific to UDP; they don't happen in the TCP world, where the stack transparently deals with MTU / fragmentation and ICMP errors. Depending on the specific protocol, a UDP service might need to be more complex and take extra care with such corner cases.</p>
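<p>The pictured server is an image; a rough reconstruction of the same naive loop, exercised once over loopback (for illustration only, per the reflection caveat above):</p>

```python
import socket

def serve_one(sock):
    # Naive echo: reflect one datagram back to whoever sent it.
    # (2048 bytes is an arbitrary cap -- see the caveats in the text.)
    buf, addr = sock.recvfrom(2048)
    sock.sendto(buf, addr)

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(('127.0.0.1', 0))

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.settimeout(2)
cli.sendto(b'ping', srv.getsockname())
serve_one(srv)
```

<p>The client gets its own bytes back, sourced from the server's bound address, which works here only because we bound to a specific IP rather than a wildcard.</p>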
    <div>
      <h3>Sourcing packets from a wildcard socket</h3>
      <a href="#sourcing-packets-from-a-wildcard-socket">
        
      </a>
    </div>
    <p>There is a bigger problem with this code. It only works correctly when binding to a specific IP address, like <code>::1</code> or <code>127.0.0.1</code>. It won't always work when we bind to a wildcard. The issue lies in the <code>sendto()</code> line — we didn't explicitly set the outbound IP address! Linux doesn't know where we'd like to source the packet from, and it will choose a default egress IP address. It might not be the IP the client communicated to. For example, let's say we added the <code>::2</code> address to the loopback interface and sent a packet to it, with the source IP set to a valid <code>::1</code>:</p>
            <pre><code>marek@mrprec:~$ sudo tcpdump -ni lo port 1234 -t
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
IP6 ::1.41879 &gt; ::2.1234: UDP, length 2
IP6 ::1.1234 &gt; ::1.41879: UDP, length 2</code></pre>
            <p>Here we can see the packet correctly flying from <code>::1</code> to <code>::2</code>, to our server. But then when the server responds, it sources the response from the <code>::1</code> IP, which in this case is wrong.</p><p>On the server side, when binding to a wildcard:</p><p>— we might receive packets destined to a number of IP addresses</p><p>— we must be very careful when responding and use the appropriate source IP address</p><p>The BSD sockets API doesn't make it easy to tell which address a received packet was destined to. On Linux and <a href="https://www.freebsd.org/cgi/man.cgi?query=ip6&amp;sektion=4">BSD</a> it is possible to request useful CMSG metadata with IP_PKTINFO and IPV6_RECVPKTINFO.</p><p>An improved server loop might look like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/52toeTH5ZY1aD3bOSry6Do/88b7b0e90c7802ae513ef2f5f2ade5f5/image2-30.png" />
            
            </figure><p>The <code>recvmsg</code> and <code>sendmsg</code> syscalls, as opposed to <code>recvfrom</code> / <code>sendto</code>, allow the programmer to request and set extra CMSG metadata, which is very handy when dealing with UDP.</p><p>The IPV6_PKTINFO CMSG contains this data structure:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JvR29xOL5YVQj4GvhudPL/c92277cd5b15270a2053663dc7412057/image1-71.png" />
            
            </figure><p>We can find here the IP address and interface number of the packet target. Notice, there's no place for a port number.</p>
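<p>Python's <code>recvmsg()</code> exposes the same CMSG machinery. A sketch of reading IPV6_PKTINFO from a wildcard-bound socket (the constants get Linux fallback values in case the <code>socket</code> module build doesn't export them):</p>

```python
import socket
import struct

IPV6_RECVPKTINFO = getattr(socket, 'IPV6_RECVPKTINFO', 49)  # Linux value
IPV6_PKTINFO = getattr(socket, 'IPV6_PKTINFO', 50)          # Linux value

srv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
srv.setsockopt(socket.IPPROTO_IPV6, IPV6_RECVPKTINFO, 1)
srv.bind(('::', 0))  # wildcard bind
srv.settimeout(2)

cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cli.sendto(b'x', ('::1', srv.getsockname()[1]))

# in6_pktinfo is 20 bytes: a 16-byte destination address + 4-byte ifindex.
data, ancdata, flags, addr = srv.recvmsg(2048, socket.CMSG_SPACE(20))
dst_ip, ifindex = None, None
for level, ctype, cdata in ancdata:
    if level == socket.IPPROTO_IPV6 and ctype == IPV6_PKTINFO:
        dst_ip = socket.inet_ntop(socket.AF_INET6, cdata[:16])
        ifindex = struct.unpack('I', cdata[16:20])[0]
```

<p>The recovered <code>dst_ip</code> is exactly the address the client targeted, which is what a wildcard-bound server must echo back as its source via a matching sendmsg() CMSG.</p>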
    <div>
      <h3>Graceful server restart</h3>
      <a href="#graceful-server-restart">
        
      </a>
    </div>
    <p>Many traditional UDP protocols, like DNS, are request-response based. Since there is no state associated with a higher level "connection", the server can restart, to upgrade or change configuration, without any problems. Ideally, sockets should be managed with the usual <a href="http://0pointer.de/blog/projects/socket-activation.html">systemd socket activation</a> to avoid the short time window where the socket is down.</p><p>Modern protocols are often connection-based. For such servers, on restart, it's beneficial to keep the old connections directed to the old server process, while the new server instance is available for handling the new connections. The old connections will eventually die off, and the old server process will be able to terminate. This is a common and easy practice in the TCP world where each connection has its own file descriptor. The old server process stops accept()-ing new connections and just waits for the old connections to gradually go away. <a href="http://nginx.org/en/docs/control.html#upgrade">NGINX has a good documentation</a> on the subject.</p><p>Sadly, in UDP you can't <code>accept()</code> new connections. Doing graceful server restarts for UDP is surprisingly hard.</p>
    <div>
      <h3>Established-over-unconnected technique</h3>
      <a href="#established-over-unconnected-technique">
        
      </a>
    </div>
    <p>For some services we are using a technique which we call "established-over-unconnected". This comes from a realization that on Linux it's possible to create a connected socket <i>over</i> an unconnected one. Consider this code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78wXvRtNNnQFwlqfS7EYd5/bf3224997d0169e26c402d909b1c2012/image3-37.png" />
            
            </figure><p>Does this look hacky? Well, it should. What we do here is:</p><ul><li><p>We start an unconnected UDP socket.</p></li><li><p>We wait for a client to come in.</p></li><li><p>As soon as we receive the first packet from the client, we immediately create a new, fully connected socket <i>over</i> the unconnected socket! It shares the same local port and local IP.</p></li></ul><p>This is how it might look in ss:</p>
            <pre><code>marek@mrprec:~$ ss -panu sport = :1234 or dport = :1234 | cat
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port    Process                                                                         
ESTAB     0         0                    [::1]:1234               [::1]:44592    python3
UNCONN    0         0                        *:1234                   *:*        python3
ESTAB     0         0                    [::1]:44592              [::1]:1234     nc</code></pre>
            <p>Here you can see the two sockets managed in our python test server. Notice the established socket is sharing the unconnected socket's port.</p><p>This trick essentially reproduces the <code>accept()</code> behaviour in UDP, where each ingress connection gets its own dedicated socket descriptor.</p><p>While this trick is nice, it's not without drawbacks: it's racy in two places. First, it's possible that the client will send more than one packet to the unconnected socket before the connected socket is created. The application code should work around this: if a packet received from the server socket belongs to an already existing connected flow, it should be handed over to the right place. Then, during the creation of the connected socket, in the short window after <code>bind()</code> but before <code>connect()</code>, we might receive unexpected packets belonging to the unconnected socket! We don't want these packets here. It is necessary to filter the source IP/port when receiving early packets on the connected socket.</p><p>Is this approach worth the extra complexity? It depends on the use case. For a relatively small number of long-lived flows, it might be fine. For a high number of short-lived flows (especially DNS or NTP) it's overkill.</p><p>Keeping old flows stable during service restarts is particularly hard in UDP. The established-over-unconnected technique is just one of the simpler ways of handling it. We'll leave another technique, based on SO_REUSEPORT eBPF, for a future blog post.</p>
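<p>Since the post shows the code as an image, here is a hedged Python reconstruction of the technique (the function names are ours). Both sockets set SO_REUSEADDR so that the connected socket can bind the same local address and port as the unconnected server socket:</p>

```python
import socket

def udp_listen(host="::1", port=0):
    # The unconnected server socket; SO_REUSEADDR allows a second bind later.
    sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sd.bind((host, port))
    return sd

def accept_flow(server):
    """Wait for a new flow and promote it to a connected socket."""
    data, peer = server.recvfrom(2048)          # first packet of the flow
    conn = socket.socket(server.family, socket.SOCK_DGRAM)
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    conn.bind(server.getsockname())             # same local IP and port
    conn.connect(peer)                          # only this 4-tuple lands here
    return conn, data, peer
```

<p>Subsequent packets from that peer match the connected socket's full 4-tuple and are delivered to it rather than to the unconnected socket; the races described above still apply, so early packets on <code>conn</code> must be checked against the expected source.</p>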
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post we started by highlighting connected and unconnected UDP sockets. Then we discussed why binding UDP servers to a wildcard is hard, and how the IP_PKTINFO CMSG can help solve it. We discussed the UDP graceful restart problem, and hinted at the established-over-unconnected technique.</p><table><tr><td><p><b>Socket type</b></p></td><td><p><b>Created with</b></p></td><td><p><b>Appropriate syscalls</b></p></td></tr><tr><td><p>established</p></td><td><p>connect()</p></td><td><p>recv()/send()</p></td></tr><tr><td><p>established</p></td><td><p>bind() + connect()</p></td><td><p>recvfrom()/send(), watch out for the race after bind(), verify source of the packet</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(specific IP)</p></td><td><p>recvfrom()/sendto()</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(wildcard)</p></td><td><p>recvmsg()/sendmsg() with IP_PKTINFO CMSG</p></td></tr></table><p>Stay tuned: in future blog posts we might go even deeper into the curious world of production UDP servers.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">4Is8w5do12KTSyIAjAtpud</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare blocks an almost 2 Tbps multi-vector DDoS attack]]></title>
            <link>https://blog.cloudflare.com/cloudflare-blocks-an-almost-2-tbps-multi-vector-ddos-attack/</link>
            <pubDate>Sat, 13 Nov 2021 14:33:49 GMT</pubDate>
            <description><![CDATA[ Earlier this week, Cloudflare automatically detected and mitigated a DDoS attack that peaked just below 2 Tbps — the largest we’ve seen to date. ]]></description>
            <content:encoded><![CDATA[ <p>Earlier this week, Cloudflare automatically detected and mitigated a <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS attack</a> that peaked just below 2 Tbps — the largest we’ve seen to date. This was a multi-vector attack combining <a href="https://www.cloudflare.com/learning/ddos/dns-amplification-ddos-attack/">DNS amplification</a> attacks and <a href="https://www.cloudflare.com/learning/ddos/udp-flood-ddos-attack/">UDP floods</a>. The entire attack lasted just one minute. The attack was launched from approximately 15,000 bots running a variant of the original Mirai code on IoT devices and <a href="https://www.rapid7.com/blog/post/2021/11/01/gitlab-unauthenticated-remote-code-execution-cve-2021-22205-exploited-in-the-wild/">unpatched GitLab instances</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69O75xM8BJ3AQ4xboaaJc8/05b4efd41d8beb38aabe8f5df3b1e48c/image4-11.png" />
            
            </figure><p>DDoS attack peaking just below 2 Tbps</p>
    <div>
      <h3>Network-layer DDoS attacks increased by 44%</h3>
      <a href="#network-layer-ddos-attacks-increased-by-44">
        
      </a>
    </div>
    <p>Last quarter, we saw multiple terabit-strong DDoS attacks and this attack continues this trend of increased attack intensity. Another key finding from our <a href="/ddos-attack-trends-for-2021-q3/">Q3 DDoS Trends report</a> was that network-layer DDoS attacks actually increased by 44% quarter-over-quarter. While the fourth quarter is not over yet, we have, again, seen multiple terabit-strong attacks that targeted Cloudflare customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/16L0jRaxzWrkoprzjXWbEU/ebb62af9ba073ea608920f1b14c34a12/image1-21.png" />
            
            </figure><p>DDoS attacks peaking at 1-1.4 Tbps</p>
    <div>
      <h3>How did Cloudflare mitigate this attack?</h3>
      <a href="#how-did-cloudflare-mitigate-this-attack">
        
      </a>
    </div>
    <p>To begin with, our systems constantly analyze traffic samples “out-of-path”, which allows us to asynchronously detect DDoS attacks without causing latency or impacting performance. Once the attack traffic was detected (within sub-seconds), our systems generated a real-time signature that surgically matched against the attack patterns to mitigate the attack without impacting legitimate traffic.</p><p>Once generated, the fingerprint is propagated as an ephemeral mitigation rule to the most optimal location in the Cloudflare edge for cost-efficient mitigation. In this specific case, as with most L3/4 DDoS attacks, the rule was pushed in-line into the Linux kernel <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">eXpress Data Path</a> (XDP) to drop the attack packets at wire speed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aDR60Hnanrkb1TJ5cNgMO/e173eb50b8433d027d2e5ab834e4ee52/image3-17.png" />
            
            </figure><p>A conceptual diagram of Cloudflare’s DDoS protection systems</p><p>Read more about <a href="https://developers.cloudflare.com/ddos-protection/">Cloudflare’s DDoS Protection systems</a>.</p>
    <div>
      <h3>Helping build a better Internet</h3>
      <a href="#helping-build-a-better-internet">
        
      </a>
    </div>
    <p>Cloudflare’s mission is to help build a better Internet — one that is secure, faster, and more reliable for everyone. The DDoS team’s vision is derived from this mission: our goal is to make the impact of DDoS attacks a thing of the past. Whether it's the <a href="/meris-botnet/">Meris botnet</a> that launched some of the <a href="/cloudflare-thwarts-17-2m-rps-ddos-attack-the-largest-ever-reported/">largest HTTP DDoS attacks on record</a>, the recent <a href="/update-on-voip-attacks/">attacks on VoIP providers</a> or this Mirai-variant that’s DDoSing Internet properties, Cloudflare’s network automatically detects and mitigates DDoS attacks. Cloudflare provides a secure, reliable, performant, and <a href="/http-ddos-managed-rules/">customizable</a> platform for Internet properties of all types.</p><p>For more information about Cloudflare’s DDoS protection, <a href="http://www.cloudflare.com/enterprise">reach out to us</a> or have a go with a hands-on evaluation of <a href="https://www.cloudflare.com/plans/free/">Cloudflare’s Free plan</a>.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Mirai]]></category>
            <category><![CDATA[Botnet]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">22mxDvugzkq2hvpQyg4tig</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
        </item>
        <item>
            <title><![CDATA[Update on recent VoIP attacks: What should I do if I’m attacked?]]></title>
            <link>https://blog.cloudflare.com/update-on-voip-attacks/</link>
            <pubDate>Thu, 07 Oct 2021 02:20:59 GMT</pubDate>
            <description><![CDATA[ Because of the sustained attacks we are observing, we are sharing details on recent attack patterns, what steps organizations should take before an attack, and what to do after an attack has taken place. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34Ko9laUln2ejkX7zQx97s/a3d664659f98cf96aca8a6d7a9942606/image-2-1.png" />
            
            </figure><p>Attackers continue targeting VoIP infrastructure around the world. In our blog post from last week, <a href="/attacks-on-voip-providers/">May I ask who’s calling, please? A recent rise in VoIP DDoS attacks</a>, we reviewed how the SIP protocol works, ways it can be abused, and how Cloudflare can help protect against attacks on VoIP infrastructure without impacting performance.</p><p>Cloudflare’s network stands in front of some of the largest, most performance-sensitive voice and video providers in the world, and is uniquely well suited to mitigating attacks on VoIP providers.</p><p>Because of the sustained attacks we are observing, we are sharing details on recent attack patterns, what steps organizations should take before an attack, and what to do after an attack has taken place.</p><p>Below are three of the most common questions we’ve received from companies concerned about attacks on their VoIP systems, and Cloudflare’s answers.</p>
    <div>
      <h3>Question #1: How is VoIP infrastructure being attacked?</h3>
      <a href="#question-1-how-is-voip-infrastructure-being-attacked">
        
      </a>
    </div>
    <p>The attackers primarily use off-the-shelf <a href="https://www.cloudflare.com/learning/ddos/ddos-attack-tools/ddos-booter-ip-stresser">booter</a> services to launch attacks against VoIP infrastructure. The attack methods being used are not novel, <b>but the persistence of the attacker and their attempts to understand the target’s infrastructure are.</b></p><p>Attackers have used various attack vectors to probe targets’ defenses and try to break through them to disrupt VoIP services offered by certain providers. In some cases, they have been successful. HTTP attacks against <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/">API gateways</a> and the corporate websites of the providers have been combined with network-layer and transport-layer attacks against VoIP infrastructure. Examples:</p><ol><li><p><b><b><b>TCP floods targeting stateful firewalls</b></b></b>These are being used in “trial-and-error” type attacks. They are not very effective against telephony infrastructure specifically (because it’s mostly UDP) but very effective at overwhelming stateful firewalls.</p></li><li><p><b><b><b>UDP floods targeting SIP infrastructure</b></b></b>Floods of UDP traffic that have no well-known fingerprint, aimed at critical VoIP services. Generic floods like this may look like legitimate traffic to unsophisticated filtering systems.</p></li><li><p><b><b><b>UDP reflection targeting SIP infrastructure</b></b></b>These methods, when targeted at SIP or RTP services, can easily overwhelm <a href="https://en.wikipedia.org/wiki/Session_border_controller">Session Border Controllers</a> (SBCs) and other telephony infrastructure. 
The attacker seems to learn enough about the target’s infrastructure to target such services with high precision.</p></li><li><p><b><b><b>SIP protocol-specific attacks</b></b></b>Attacks at the application layer are of particular concern because of the higher resource cost of generating application errors vs filtering on network devices.</p></li></ol>
    <div>
      <h3>Question #2: How should I prepare my organization in case our VoIP infrastructure is targeted?</h3>
      <a href="#question-2-how-should-i-prepare-my-organization-in-case-our-voip-infrastructure-is-targeted">
        
      </a>
    </div>
    <ol><li><p><b><b><b>Deploy an always-on DDoS mitigation service</b></b></b>Cloudflare recommends the deployment of always-on network level protection, like <a href="https://www.cloudflare.com/magic-transit/">Cloudflare Magic Transit</a>, prior to your organization being attacked.</p><p>Do not rely on reactive on-demand SOC-based DDoS Protection services that require humans to analyze attack traffic — they take too long to respond. Instead, onboard to a cloud service that has sufficient network capacity and automated DDoS mitigation systems.</p><p><b>Cloudflare has effective mitigations in place for the attacks seen against VoIP infrastructure</b>, including for <a href="/announcing-flowtrackd/">sophisticated TCP floods</a> and SIP specific attacks.</p></li><li><p><b><b><b>Enforce a positive security model</b></b></b>Block TCP on IP/port ranges that are not expected to receive TCP, instead of relying on on-premise firewalls that can be overwhelmed. Block network probing attempts (e.g. ICMP) and other packets that you don't normally expect to see.</p></li><li><p><b><b><b>Build custom mitigation strategies</b></b></b>Work together with your DDoS protection vendor to tailor mitigation strategies to your workload. Every network is different, and each poses unique challenges when integrating with DDoS mitigation systems.</p></li><li><p><b><b><b>Educate your employees</b></b></b>Train all of your employees to be on the lookout for ransom demands. Check email, support tickets, form submissions, and even server access logs. Ensure employees know to immediately report ransom demands to your Security Incident Response team.</p></li></ol>
    <div>
      <h3>Question #3: What should I do if I receive a ransom/threat?</h3>
      <a href="#question-3-what-should-i-do-if-i-receive-a-ransom-threat">
        
      </a>
    </div>
    <ol><li><p><b><b><b>Do not pay the ransom</b></b></b>Paying the ransom only encourages bad actors—and there’s no guarantee that they won’t attack your network now or later.</p></li><li><p><b><b><b>Notify Cloudflare</b></b></b>We can help ensure your website and network infrastructure are safeguarded against these attacks.</p></li><li><p><b><b><b>Notify local law enforcement</b></b></b>They will also likely request a copy of the ransom letter that you received.</p></li></ol>
    <div>
      <h3>Cloudflare is here to help</h3>
      <a href="#cloudflare-is-here-to-help">
        
      </a>
    </div>
    <p>With over 100 Tbps of network capacity, a network architecture that <a href="/magic-transit-network-functions/">efficiently filters traffic close to the source</a>, and a physical presence in over 250 cities, Cloudflare can help protect critical VoIP infrastructure without impacting latency, jitter, or call quality. Test results demonstrate a performance improvement of 36% on average across the globe for a real customer network using Cloudflare Magic Transit.</p><p>Some of the largest voice and video providers in the world rely on Cloudflare to protect their networks and ensure their services remain online and fast. We stand ready to help.</p><p>Talk to a Cloudflare specialist to <a href="https://www.cloudflare.com/lp/voip-ddos-protection/">learn more</a>. Under attack? Contact our <a href="https://www.cloudflare.com/under-attack-hotline/">hotline</a> to speak with someone immediately.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Trends]]></category>
            <category><![CDATA[VoIP]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[REvil]]></category>
            <category><![CDATA[Ransom Attacks]]></category>
            <guid isPermaLink="false">2KaaFctdoCtSayt95YpQ48</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
            <dc:creator>Vivek Ganti</dc:creator>
            <dc:creator>Alex Forster</dc:creator>
        </item>
        <item>
            <title><![CDATA[Raking the floods: my intern project using eBPF]]></title>
            <link>https://blog.cloudflare.com/building-rakelimit/</link>
            <pubDate>Fri, 18 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ SYN cookies help mitigate SYN floods for TCP, but how can we protect services from similar attacks over UDP? We designed an algorithm and a library to fill this gap, and it’s open source! ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2A1vrFJotzsKPMJxmJdOON/81b2292df9631836d592585149fbef25/rakelimit.jpg" />
            
            </figure><p>Cloudflare has sophisticated DDoS attack mitigation systems with multiple layers to provide defense in depth. Some of these layers analyse large-scale traffic patterns to detect and mitigate attacks. Other layers are more protocol- and application-specific, in order to stop attacks that might be hard to detect from overall traffic patterns. In some cases, the best place to detect and stop an attack is in the service itself.</p><p>During <a href="/cloudflare-doubling-size-of-2020-summer-intern-class/">my internship at Cloudflare</a> this summer, I’ve developed a new open-source framework to help UDP services protect themselves from attacks. This framework incorporates Cloudflare’s experience in running UDP-based services like Spectrum and the 1.1.1.1 resolver.</p>
    <div>
      <h3>Goals of the framework</h3>
      <a href="#goals-of-the-framework">
        
      </a>
    </div>
    <p>First of all, let's discuss what it actually means to protect a UDP service. We want to ensure that an attacker cannot drown out legitimate traffic. To achieve this we identify floods and limit them while leaving legitimate traffic untouched.</p><p>The idea to mitigate such attacks is straightforward: first identify a group of packets that is related to an attack, and then apply a rate limit to this group. Such groups are determined based on the attributes available to us in the packet, such as addresses and ports.</p><p>We then drop packets in the group. We only want to drop as much traffic as necessary to comply with our set rate limit. Completely ignoring a set of packets just because it is slightly above the rate limit is not an option, as it may contain legitimate traffic.</p><p>This ensures both that our service stays responsive and that legitimate packets experience as little impact as possible.</p><p>While rate limiting is a somewhat straightforward procedure, determining groups is a bit harder, for a number of reasons.</p>
    <div>
      <h3>Finding needles in the haystack</h3>
      <a href="#finding-needles-in-the-haystack">
        
      </a>
    </div>
    <p>The problem in determining groups in packets is that we have barely any context. We consider four attributes useful as attack signatures: the source address and port, as well as the destination address and port. While that is already not a lot, it gets worse: the source address and port may not even be accurate. Packets can be spoofed, in which case an attacker hides their own address. That means keeping a rate per source address alone may not provide much value.</p><p>But there is another problem: keeping one rate per address does not scale. When bringing IPv6 into the equation and its <a href="https://www.ripe.net/about-us/press-centre/understanding-ip-addressing#:~:text=For%20IPv4%2C%20this%20pool%20is,basic%20unit%20for%20storing%20information.">whopping address space</a>, it becomes clear this is not going to work.</p><p>To solve these issues we turned to the academic world and found what we were looking for: the problem of <i>Heavy Hitters</i>. <i>Heavy Hitters</i> are elements of a data stream that appear frequently, where frequency is expressed relative to the overall number of elements in the stream. For example, we can define that an element is considered a <i>Heavy Hitter</i> if its frequency exceeds, say, 10% of the overall count. We could naively maintain a counter per element, but due to space limitations this will not scale. Instead, probabilistic algorithms such as a <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf">CountMin sketch</a> or the <a href="https://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf">SpaceSaving algorithm</a> can be used. These provide an estimated count instead of a precise one, but can do so with constant memory requirements; in our case we save rates into the CountMin sketch instead of counts. So no matter how many unique elements we have to track, the memory consumption is the same.</p><p>We now have a way of finding the needle in the haystack, and it does have constant memory requirements, solving our problem. However, reality isn’t that simple. What if an attack is not just originating from a single port but many? Or what if a reflection attack is hitting our service, resulting in random source addresses but a single source port? Maybe a full /24 subnet is sending us a flood? We cannot just keep a rate per combination we see, as that would ignore all these patterns.</p>
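<p>To make the idea concrete, here is a toy CountMin sketch in Python. It tracks counts rather than rates for brevity, and is an illustration of the data structure only, not the rakelimit implementation: memory stays constant no matter how many distinct keys are seen, at the cost of occasionally overestimating.</p>

```python
import hashlib

class CountMin:
    def __init__(self, depth=4, width=2048):
        # depth independent hash rows, each `width` counters wide.
        self.tables = [[0] * width for _ in range(depth)]
        self.width = width

    def _slots(self, key):
        # One slot per row, derived from a salted hash of the key.
        for i, row in enumerate(self.tables):
            h = hashlib.blake2b(key, salt=bytes([i]) * 16).digest()
            yield row, int.from_bytes(h[:8], "little") % self.width

    def add(self, key, n=1):
        for row, j in self._slots(key):
            row[j] += n

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best bound.
        return min(row[j] for row, j in self._slots(key))
```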
    <div>
      <h3>Grouping the groups: How to organize packets</h3>
      <a href="#grouping-the-groups-how-to-organize-packets">
        
      </a>
    </div>
    <p>Luckily the academic world has us covered again, with the concept of <i>Hierarchical Heavy Hitters.</i> It extends the <i>Heavy Hitter</i> concept by using the underlying hierarchy in the elements of the stream. For example, an IP address can be naturally grouped into several subnets:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IPwA0tN0c6pOFMK5B2AT8/9101b971c3a69277a79b03a473e936ec/0B429C7C-869C-4517-9B95-C90600943486.png" />
            
            </figure><p>In this case we defined that we consider the fully-specified address, the /24 subnet and the /0 wildcard. We start at the left with the fully specified address, and each step walking towards the top we consider less information from it. We call these less-specific addresses generalisations, and measure how specific a generalisation is by assigning a level. In our example, the address 192.0.2.123 is at level 0, while 192.0.2.0/24 is at level 1, etc.</p><p>If we want to create a structure which can hold this information for every packet, it could look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5lqAfAxLDrXwIgtQJ1Hm0/b37c644442d0f0e91ef38725929cb5ef/F4B2AA4A-4868-4E56-891A-6C19E0CACBA4.png" />
            
            </figure><p>We maintain a CountMin-sketch per subnet and then apply Heavy Hitters. When a new packet arrives and we need to determine if it is allowed to pass we simply check the rates of the corresponding elements in every node. If no rate exceeds the rate limit that we set, e.g. 25 packets per second (<i>pps</i>), it is allowed to pass.</p><p>The structure could now keep track of a single attribute, but we would waste a lot of context around packets! So instead of letting it go to waste, we use the two-dimensional approach for addresses proposed in the paper <a href="https://arxiv.org/abs/1102.5540">Hierarchical Heavy Hitters with SpaceSaving algorithm</a>, and extend it further to also incorporate ports into our structure. Ports do not have a natural hierarchy such as addresses, so they can only be in two states: either <i>specified</i> (e.g. 8080) or <i>wildcard</i>.</p><p>Now our structure looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/EowltLQdppI0JmX9TtKFU/f20d98ae5469b518897e4d9c621f5ab9/5D94D469-B2F3-48CD-918C-202B8C426B19.png" />
            
            </figure><p>Now let’s talk about the algorithm we use to traverse the structure and determine if a packet should be allowed to pass. The paper <i>Hierarchical Heavy Hitters with SpaceSaving algorithm</i> provides two methods that can be used on the data structure: one that updates elements and increases their counters, and one that provides all elements that currently are <i>Heavy Hitters</i>. The second method is actually not necessary for our use case, as we are only interested in whether the element, or packet, we are looking at right now would be a <i>Heavy Hitter</i>, to decide if it can pass or not.</p><p>Secondly, our goal is to prevent any Heavy Hitters from passing, thus leaving the structure with no <i>Heavy Hitters</i> whatsoever. This is a great property, as it allows us to simplify the algorithm substantially, and it looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gXxGnRtWTU6jorXm6538i/4c9c62fc80edffcf2ac2a4b2e2918292/A41D5CB1-5C6B-4378-B496-7B38E22DE0F7.png" />
            
            </figure><p>As you may notice, we update every node of a level and maintain the maximum rate we see. After each level we calculate a probability that determines if a packet should be passed to the next level, based on the maximum rate we saw on that level and a set rate limit. Each node essentially filters the traffic for the following, less specific level.</p><p>I actually left out a small detail: a packet is not dropped if any rate exceeds the limit, but instead is kept with the probability <i>rate limit</i>/<i>maximum rate seen</i>. The reason is that if we just drop all packets if the rates exceed the limit, we would drop the whole traffic, not just a subset to make it comply with our set rate limit.</p><p>Since we now still update more specific nodes even if a node reaches a rate limit, the rate limit will converge towards the underlying pattern of the attack as much as possible. That means other traffic will be impacted as minimally as possible, and that with no manual intervention whatsoever!</p>
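<p>The per-level filtering step described above can be sketched as follows. This is a simplification of our own, not rakelimit's actual code: <code>rate_of</code> is a hypothetical stand-in for the per-level CountMin lookups, and <code>levels</code> lists the packet's generalisations from most to least specific.</p>

```python
import random

def allow_packet(levels, rate_limit, rate_of, rng=random.random):
    # Walk the hierarchy from most to least specific. At each level, take
    # the maximum estimated rate among the packet's generalisations; if it
    # exceeds the limit, keep the packet only with probability
    # rate_limit / max_rate, so traffic converges to the configured rate.
    for generalisations in levels:
        max_rate = max(rate_of(g) for g in generalisations)
        if max_rate > rate_limit and rng() >= rate_limit / max_rate:
            return False  # dropped at this level
    return True
```

<p>With a deterministic <code>rng</code>, a flow at 2,500 pps against a 25 pps limit passes with probability 0.01, i.e. roughly 25 pps survive, while flows under the limit pass untouched.</p>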
    <div>
      <h3>BPF to the rescue: building a Go library</h3>
      <a href="#bpf-to-the-rescue-building-a-go-library">
        
      </a>
    </div>
    <p>As we want to use this algorithm to mitigate floods, we need to add as little computational overhead as possible before we decide if a packet should be dropped or not. As so often, we looked into the BPF toolbox and found what we needed: <i>Socketfilters</i>. As our colleague Marek put it: <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">“It seems, no matter the question - BPF is the answer.”</a>.</p><p><i>Socketfilters</i> are pieces of code that can be attached to a single socket and get executed before a packet is passed from the kernel to userspace. This is ideal for a number of reasons. First, when the kernel runs the socket filter code, it gives it all the information from the packet we need, and other mitigations such as firewalls have already been executed. Second, the code is executed <i>per socket</i>, so every application can activate it as needed, and also set appropriate rate limits. It may even use different rate limits for different sockets. The third reason is privileges: we do not need to be root to attach the code to a socket. We can execute code in the kernel as a normal user!</p><p>BPF also has a number of limitations which have already been covered on this blog in the past, so we will focus on one that’s specific to our project: floating-point numbers.</p><p>To calculate rates we need floating-point numbers to provide an accurate estimate. BPF, and the whole kernel for that matter, does not support these. Instead we implemented a fixed-point representation, which uses a part of the available bits for the fractional part of a rational number and the remaining bits for the integer part. This allows us to represent floats within a certain range, but there is a catch when doing arithmetic: while subtraction and addition of two fixed-points work well, multiplication and division require double the number of bits to ensure there will not be any loss in precision. 
As we use 64 bits for our fixed-point values, there is no larger data type available to ensure this does not happen. Instead of calculating the result with exact precision, we convert one of the arguments into an integer. That results in the loss of the fractional part, but as we deal with large rates that does not pose any issue, and it helps us to work around the bit limitation, as intermediate results fit into the available 64 bits. Whenever fixed-point arithmetic is necessary, the precision of intermediate results has to be carefully considered.</p><p>There are many more details to the implementation, but instead of covering every single detail in this blog post let's just look at the code.</p><p>We open sourced rakelimit on GitHub at <a href="https://github.com/cloudflare/rakelimit">cloudflare/rakelimit</a>! It is a full-blown Go library that can be enabled on any UDP socket, and is easy to configure.</p><p>The development is still in its early stages and this is a first prototype, but we are excited to continue and push the development with the community! And if you still can’t get enough, look at our talk from this year's <a href="https://linuxplumbersconf.org/event/7/contributions/677/">Linux Plumbers Conference</a>.</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">5v73ZaARMTKhq3UGeDWoMn</guid>
            <dc:creator>Jonas Otten</dc:creator>
        </item>
        <item>
            <title><![CDATA[Accelerating UDP packet transmission for QUIC]]></title>
            <link>https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/</link>
            <pubDate>Wed, 08 Jan 2020 17:08:00 GMT</pubDate>
            <description><![CDATA[ While significant work has gone into optimizing TCP, UDP hasn't received as much attention, putting QUIC at a disadvantage. Let's explore a few tricks that help mitigate this. ]]></description>
            <content:encoded><![CDATA[ <p><i>This was originally published on </i><a href="https://calendar.perfplanet.com/2019/accelerating-udp-packet-transmission-for-quic/"><i>Perf Planet's 2019 Web Performance Calendar</i></a><i>.</i></p><p><a href="/the-road-to-quic/">QUIC</a>, the new Internet transport protocol designed to accelerate HTTP traffic, is delivered on top of <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP datagrams</a>, to ease deployment and avoid interference from network appliances that drop packets from unknown protocols. This also allows QUIC implementations to live in user-space, so that, for example, browsers will be able to implement new protocol features and ship them to their users without having to wait for operating system updates.</p><p>But while a lot of work has gone into optimizing TCP implementations as much as possible over the years, including building offloading capabilities in both software (like in operating systems) and hardware (like in network interfaces), UDP hasn't received quite as much attention as TCP, which puts QUIC at a disadvantage. In this post we'll look at a few tricks that help mitigate this disadvantage for UDP, and by association QUIC.</p><p>For the purpose of this blog post we will concentrate only on measuring the throughput of QUIC connections, which, while necessary, is not enough to paint an accurate overall picture of the performance of the QUIC protocol (or its implementations) as a whole.</p>
    <div>
      <h3>Test Environment</h3>
      <a href="#test-environment">
        
      </a>
    </div>
    <p>The client used in the measurements is h2load, <a href="https://github.com/nghttp2/nghttp2/tree/quic">built with QUIC and HTTP/3 support</a>, while the server is NGINX, built with <a href="/experiment-with-http-3-using-nginx-and-quiche/">the open-source QUIC and HTTP/3 module provided by Cloudflare</a> which is based on quiche (<a href="https://github.com/cloudflare/quiche">github.com/cloudflare/quiche</a>), Cloudflare's own <a href="/enjoy-a-slice-of-quic-and-rust/">open-source implementation of QUIC and HTTP/3</a>.</p><p>The client and server are run on the same host (my laptop) running Linux 5.3, so the numbers don’t necessarily reflect what one would see in a production environment over a real network, but it should still be interesting to see how much of an impact each of the techniques have.</p>
    <div>
      <h3>Baseline</h3>
      <a href="#baseline">
        
      </a>
    </div>
    <p>Currently the code that implements QUIC in NGINX uses the <code>sendmsg()</code> system call to send a single UDP packet at a time.</p>
            <pre><code>ssize_t sendmsg(int sockfd, const struct msghdr *msg,
    int flags);</code></pre>
            <p>The <code>struct msghdr</code> carries a <code>struct iovec</code> which can in turn carry multiple buffers. However, all of the buffers within a single iovec will be merged together into a single UDP datagram during transmission. The kernel will then take care of encapsulating the buffer in a UDP packet and sending it over the wire.</p>
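<p>This merging behavior is easy to observe from userspace. Below is a minimal Python sketch (the loopback sockets and buffer contents are ours, purely illustrative, not from the NGINX code): three buffers passed to <code>sendmsg()</code> in one iovec arrive as a single UDP datagram.</p>

```python
import socket

# Receiver and sender on loopback; port 0 lets the kernel pick one.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(1)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

# Three separate buffers in one iovec: the kernel merges them
# into a single UDP datagram during transmission.
sent = tx.sendmsg([b"one ", b"single ", b"datagram"])
data, _ = rx.recvfrom(64)
print(sent, data)  # 19 b'one single datagram'
```

<p>One <code>recvfrom()</code> call returns the whole payload, confirming that the buffers were coalesced rather than sent as three packets.</p>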
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/774r0FpU47qMQ5bbxIOPL5/3560c09b55949e3c406ad958498e7fd4/sendmsg.png" />
            
            </figure><p>The throughput of this particular implementation tops out at around 80-90 MB/s, as measured by h2load when performing 10 sequential requests for a 100 MB resource.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V8JtNxw1ogFryQ7xBVcgk/74bf3551df4837143a4cb2dbc53e3f84/sendmsg-chart.png" />
            
            </figure>
    <div>
      <h3>sendmmsg()</h3>
      <a href="#sendmmsg">
        
      </a>
    </div>
    <p>Because <code>sendmsg()</code> only sends a single UDP packet at a time, it needs to be invoked quite a lot in order to transmit all of the QUIC packets required to deliver the requested resources, as illustrated by the following bpftrace command:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 904539</code></pre>
            <p>Each of those system calls causes an expensive context switch between the application and the kernel, thus impacting throughput.</p><p>But while <code>sendmsg()</code> only transmits a single UDP packet per invocation, its close cousin <code>sendmmsg()</code> (note the additional “m” in the name) is able to batch multiple packets per system call:</p>
            <pre><code>int sendmmsg(int sockfd, struct mmsghdr *msgvec,
    unsigned int vlen, int flags);</code></pre>
            <p>Multiple <code>struct mmsghdr</code> structures can be passed to the kernel as an array, each in turn carrying a single <code>struct msghdr</code> with its own <code>struct iovec</code>, with each element in the <code>msgvec</code> array representing a single UDP datagram.</p>
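<p>Python's standard library does not wrap <code>sendmmsg()</code>, but the batching can be sketched with <code>ctypes</code>. The struct layouts below follow the x86-64 Linux ABI, and the <code>send_batch</code> helper plus the loopback sockets are our own illustrative scaffolding, not part of any real API:</p>

```python
import ctypes
import socket

libc = ctypes.CDLL(None, use_errno=True)

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p),
                ("iov_len", ctypes.c_size_t)]

class msghdr(ctypes.Structure):
    _fields_ = [("msg_name", ctypes.c_void_p),
                ("msg_namelen", ctypes.c_uint),
                ("msg_iov", ctypes.POINTER(iovec)),
                ("msg_iovlen", ctypes.c_size_t),
                ("msg_control", ctypes.c_void_p),
                ("msg_controllen", ctypes.c_size_t),
                ("msg_flags", ctypes.c_int)]

class mmsghdr(ctypes.Structure):
    _fields_ = [("msg_hdr", msghdr),
                ("msg_len", ctypes.c_uint)]

def send_batch(sock, payloads):
    """Transmit each payload as its own datagram in a single syscall."""
    bufs = [ctypes.create_string_buffer(p, len(p)) for p in payloads]
    iovs = (iovec * len(payloads))()
    msgs = (mmsghdr * len(payloads))()
    for i, buf in enumerate(bufs):
        iovs[i].iov_base = ctypes.addressof(buf)
        iovs[i].iov_len = len(payloads[i])
        # msg_name stays NULL: the socket is already connected.
        msgs[i].msg_hdr.msg_iov = ctypes.pointer(iovs[i])
        msgs[i].msg_hdr.msg_iovlen = 1
    n = libc.sendmmsg(sock.fileno(), msgs, len(payloads), 0)
    if n < 0:
        raise OSError(ctypes.get_errno(), "sendmmsg failed")
    return n

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(1)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

n = send_batch(tx, [b"pkt0", b"pkt1", b"pkt2"])
received = [rx.recv(16) for _ in range(n)]
print(n, received)
```

<p>Unlike the <code>sendmsg()</code> example, the receiver here needs three <code>recv()</code> calls: each element of the batch became its own datagram.</p>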
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QO4EdUwmU9MJhSCY2wwC1/ac6bc505e3b1f3b4d571910f13905131/sendmmsg.png" />
            
            </figure><p>Let's see what happens when NGINX is updated to use <code>sendmmsg()</code> to send QUIC packets:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 2437
@[tracepoint:syscalls:sys_enter_sendmmsg]: 15676</code></pre>
            <p>The number of system calls went down dramatically, which translates into an increase in throughput, though not quite as big as the decrease in syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fv3ZaxYLZmteczFchjfiO/e0056c40dacea0422ed53edaf8158869/sendmmsg-chart.png" />
            
            </figure>
    <div>
      <h3>UDP segmentation offload</h3>
      <a href="#udp-segmentation-offload">
        
      </a>
    </div>
    <p>With <code>sendmsg()</code> as well as <code>sendmmsg()</code>, the application is responsible for separating each QUIC packet into its own buffer in order for the kernel to be able to transmit it. The implementation in NGINX uses static buffers, so there is no allocation overhead, but all of these buffers still need to be traversed by the kernel during transmission, which can add significant overhead.</p><p>Linux supports a feature, Generic Segmentation Offload (GSO), which allows the application to pass a single "super buffer" to the kernel, which will then take care of segmenting it into smaller packets. The kernel will try to postpone the segmentation as much as possible to reduce the overhead of traversing outgoing buffers (some NICs even support hardware segmentation, but it was not tested in this experiment due to lack of capable hardware). Originally GSO was only supported for TCP, but support for UDP GSO was added as well in Linux 4.18.</p><p>This feature can be controlled using the <code>UDP_SEGMENT</code> socket option:</p>
            <pre><code>setsockopt(fd, SOL_UDP, UDP_SEGMENT, &amp;gso_size, sizeof(gso_size)))</code></pre>
            <p>It can also be set via ancillary data, to control segmentation for each <code>sendmsg()</code> call:</p>
            <pre><code>cm = CMSG_FIRSTHDR(&amp;msg);
cm-&gt;cmsg_level = SOL_UDP;
cm-&gt;cmsg_type = UDP_SEGMENT;
cm-&gt;cmsg_len = CMSG_LEN(sizeof(uint16_t));
*((uint16_t *) CMSG_DATA(cm)) = gso_size;</code></pre>
            <p>Here <code>gso_size</code> is the size of each segment that forms the "super buffer" passed to the kernel from the application. Once configured, the application can provide one contiguous large buffer containing a number of packets of <code>gso_size</code> length (as well as a final smaller packet) that will then be segmented by the kernel (or the NIC, if hardware segmentation offloading is supported and enabled).</p>
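<p>A small Python sketch of UDP GSO over loopback makes this concrete. The <code>UDP_SEGMENT</code> constant is hard-coded from the Linux UAPI headers since the <code>socket</code> module does not export it, the buffer sizes are arbitrary, and the code assumes a kernel of 4.18 or later:</p>

```python
import socket

UDP_SEGMENT = 103  # from linux/udp.h; not exported by Python's socket module

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(1)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

# Ask the kernel to segment outgoing buffers into 1200-byte datagrams.
tx.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1200)

# One 3000-byte "super buffer" should become three datagrams on the
# wire: two of gso_size (1200) plus a final smaller one (600).
tx.send(b"x" * 3000)
sizes = [len(rx.recv(2048)) for _ in range(3)]
print(sizes)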
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3vQo11I0RupCQ4msUqj0Ve/dcec6ac6c0bea7c737aa9fa822e69d0a/sendmsg-gso.png" />
            
            </figure><p><a href="https://github.com/torvalds/linux/blob/80a0c2e511a97e11d82e0ec11564e2c3fe624b0d/include/linux/udp.h#L94">Up to 64 segments</a> can be batched with the <code>UDP_SEGMENT</code> option.</p><p>GSO with plain <code>sendmsg()</code> already delivers a significant improvement:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4q2OxEsgZcsw2JXc8JcAfk/a64c9b48cad41378122e7d7c5a88e67a/gso-chart.png" />
            
            </figure><p>And indeed the number of syscalls also went down significantly, compared to plain <code>sendmsg()</code>:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 18824</code></pre>
            <p>GSO can also be combined with <code>sendmmsg()</code> to deliver an even bigger improvement. The idea being that each <code>struct msghdr</code> can be segmented in the kernel by setting the <code>UDP_SEGMENT</code> option using ancillary data, allowing an application to pass multiple “super buffers”, each carrying up to 64 segments, to the kernel in a single system call.</p><p>The improvement is again fairly significant:</p>
    <div>
      <h3>Evolving from AFAP</h3>
      <a href="#evolving-from-afap">
        
      </a>
    </div>
    <p>Transmitting packets as fast as possible is easy to reason about, and there's much fun to be had in optimizing applications for that, but in practice this is not always the best strategy when optimizing protocols for the Internet.</p><p>Bursty traffic is more likely to cause or be affected by congestion on any given network path, which will inevitably defeat any optimization implemented to increase transmission rates.</p><p>Packet pacing is an effective technique to squeeze more performance out of a network flow. The idea is that adding a short delay between outgoing packets smooths out bursty traffic and reduces the chance of congestion and packet loss. For TCP this was originally implemented in Linux via the fq packet scheduler, and later by the BBR congestion control algorithm implementation, which implements its own pacer.</p>
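<p>As a taste of kernel-side pacing, the sketch below sets a per-socket maximum pacing rate. The <code>SO_MAX_PACING_RATE</code> value is hard-coded from the Linux <code>asm-generic/socket.h</code> headers, as Python does not export it, and the chosen rate is arbitrary; the limit is only actually enforced when the fq qdisc sits on the egress path:</p>

```python
import socket

SO_MAX_PACING_RATE = 47  # from asm-generic/socket.h; rate in bytes per second

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Cap this socket's transmission rate at ~12.5 MB/s (100 Mbit/s).
# The fq packet scheduler enforces the limit; without fq, setting
# the option succeeds but has no effect on the wire.
s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 12_500_000)
rate = s.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)
print(rate)
```

<p>Reading the option back confirms the kernel stored the requested rate, which is useful when a service wants to verify its pacing configuration at startup.</p>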
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PkaZcKDkzzjUDhFLRT1jw/a4247010827f1763bd7560894e30938e/afap.png" />
            
            </figure><p>Due to the nature of current QUIC implementations, which reside entirely in user-space, pacing of QUIC packets conflicts with any of the techniques explored in this post, because pacing each packet separately during transmission will prevent any batching on the application side, and in turn batching will prevent pacing, as batched packets will be transmitted as fast as possible once received by the kernel.</p><p>However Linux provides some facilities to offload the pacing to the kernel and give back some control to the application:</p><ul><li><p><b>SO_MAX_PACING_RATE</b>: an application can define this socket option to instruct the fq packet scheduler to pace outgoing packets up to the given rate. This works for UDP sockets as well, but it is yet to be seen how this can be integrated with QUIC, as a single UDP socket can be used for multiple QUIC connections (unlike TCP, where each connection has its own socket). In addition, this is not very flexible, and might not be ideal when implementing the BBR pacer.</p></li><li><p><b>SO_TXTIME / SCM_TXTIME</b>: an application can use these options to schedule transmission of specific packets at specific times, essentially instructing fq to delay packets until the provided timestamp is reached. This gives the application a lot more control, and can be easily integrated into sendmsg() as well as sendmmsg(). But it does not yet support specifying different times for each packet when GSO is used, as there is no way to define multiple timestamps for packets that need to be segmented (each segmented packet essentially ends up being sent at the same time anyway).</p></li></ul><p>While the performance gains achieved by using the techniques illustrated here are fairly significant, there are still open questions around how any of this will work with pacing, so more experimentation is required.</p> ]]></content:encoded>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[HTTP3]]></category>
            <guid isPermaLink="false">3pwKBhG2s8cT4COiXLHTyT</guid>
            <dc:creator>Alessandro Ghedini</dc:creator>
        </item>
        <item>
            <title><![CDATA[It's crowded in here!]]></title>
            <link>https://blog.cloudflare.com/its-crowded-in-here/</link>
            <pubDate>Sat, 12 Oct 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ We recently gave a presentation on Programming socket lookup with BPF at the Linux Plumbers Conference 2019 in Lisbon, Portugal. ]]></description>
            <content:encoded><![CDATA[ <p>We recently gave a presentation on <a href="https://linuxplumbersconf.org/event/4/contributions/487/">Programming socket lookup with BPF</a> at the Linux Plumbers Conference 2019 in Lisbon, Portugal. This blog post is a recap of the problem statement and proposed solution we presented.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ODIBQvumtXQYscbSqjLFn/24257f7186eae5e62fcaf20a166490ea/birds_cable_wire.jpg" />
          </figure><p>CC0 Public Domain, <a href="https://pxhere.com/en/photo/1526517">PxHere</a></p><p>Our edge servers are crowded. We run more than a dozen public-facing services, leaving aside all the internal ones that do the work behind the scenes.</p><p>Quick Quiz #1: How many can you name? We blogged about them! <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1">Jump to answer</a>.</p><p>These services are exposed on more than a million Anycast <a href="https://www.cloudflare.com/ips/">public IPv4 addresses</a> partitioned into 100+ network prefixes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MHwv0fusFTiEylSaWQeg3/dbbaa25e7093b0be4fa8ec6dbcd88bca/edge_data_center-1.png" />
          </figure><p>Granted not all services work on all the addresses but rather on a subset of them, covering one or several network prefixes.</p><p>So how do you set up your network services to listen on hundreds of IP addresses without driving the network stack over the edge? Cloudflare engineers have had to ask themselves this question more than once over the years, and the answer has changed as our edge evolved. This evolution forced us to look for creative ways to work with the <a href="https://en.wikipedia.org/wiki/Berkeley_sockets">Berkeley sockets API</a>, a POSIX standard for assigning a network address and a port number to your application. It has been quite a journey, and we are not done yet.</p>
    <div>
      <h2>When life is simple - one address, one socket</h2>
      <a href="#when-life-is-simple-one-address-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aVlT1BbslJwQRzC0HnF0p/eb5b0651ba36da1010e1bd72b7c57a55/mapping_1_to_1-1.png" />
          </figure><p>The simplest kind of association between an (IP address, port number) and a service that we can imagine is one-to-one. A server responds to client requests on a single address, on a well-known port. To set it up, the application has to open one socket for each transport protocol (be it TCP or UDP) it wants to support. A network server like our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a> would open up two sockets (one for UDP, one for TCP):</p>
            <pre><code>(192.0.2.1, 53/tcp) ⇨ ("auth-dns", pid=1001, fd=3)
(192.0.2.1, 53/udp) ⇨ ("auth-dns", pid=1001, fd=4)</code></pre>
            <p>To take it to Cloudflare scale, the service is likely to have to receive on at least a /20 network prefix, which is a range of 4096 IP addresses.</p><p>This translates to opening 4096 sockets for each transport protocol, something that is not likely to go unnoticed in the <a href="http://man7.org/linux/man-pages/man8/ss.8.html">ss tool</a> output:</p>
            <pre><code>$ sudo ss -ulpn 'sport = 53'
State  Recv-Q Send-Q  Local Address:Port Peer Address:Port
…
UNCONN 0      0           192.0.2.40:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11076))
UNCONN 0      0           192.0.2.39:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11075))
UNCONN 0      0           192.0.2.38:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11074))
UNCONN 0      0           192.0.2.37:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11073))
UNCONN 0      0           192.0.2.36:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11072))
UNCONN 0      0           192.0.2.31:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11071))
…</code></pre>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2v9WUgAYqxc0JyRM6llZCa/5c29ebd8a06ab9ded6dc0c483431b88d/lots_of_socks.jpg" />
          </figure><p>CC BY 2.0, Luca Nebuloni, <a href="https://flickr.com/photos/7897906@N06/20655224708">Flickr</a></p><p>The approach, while naive, has an advantage: when an IP from the range gets attacked with a UDP flood, the receive queues of sockets bound to the remaining IP addresses are not affected.</p>
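<p>The naive scheme is simple to reproduce on loopback, where any address in <code>127.0.0.0/8</code> is local. A minimal sketch (the three addresses stand in for a real prefix, and the UDP-only setup is our simplification):</p>

```python
import socket

addrs = ["127.0.0.1", "127.0.0.2", "127.0.0.3"]  # stand-ins for a prefix

# Bind the first socket to a kernel-chosen port, then reuse that port
# for the remaining addresses: the IPs differ, so there is no conflict.
socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]
socks[0].bind((addrs[0], 0))
port = socks[0].getsockname()[1]
for addr in addrs[1:]:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((addr, port))
    socks.append(s)

# A datagram to 127.0.0.2 lands only in that socket's receive queue,
# which is exactly the flood-isolation property described above.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hi", ("127.0.0.2", port))
socks[1].settimeout(1)
data, _ = socks[1].recvfrom(16)
print(len(socks), data)  # 3 b'hi'
```
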
    <div>
      <h2>Life can be easier - all addresses, one socket</h2>
      <a href="#life-can-be-easier-all-addresses-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nfFyQt9C1FLIKO6wnmV4Q/90405ed4e16846e092ecd7e0d88d5110/mapping_inaddr_any-1.png" />
          </figure><p>It seems rather silly to create so many sockets for one service to receive traffic on a range of addresses. Not only that, the more listening sockets there are, the longer the chains in the socket lookup hash table. We have learned the hard way that going in this direction <a href="https://blog.cloudflare.com/revenge-listening-sockets/">can hurt packet processing latency</a>.</p><p>The sockets API comes with a big hammer that can make our life easier - the <code>INADDR_ANY</code> aka <code>0.0.0.0</code> wildcard address. With <code>INADDR_ANY</code> we can make a single socket receive on all addresses assigned to our host, specifying just the port.</p>
            <pre><code>s = socket(AF_INET, SOCK_STREAM, 0)
s.bind(('0.0.0.0', 12345))
s.listen(16)</code></pre>
            <p>Quick Quiz #2: Is there another way to bind a socket to all local addresses? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2">Jump to answer</a>.</p><p>In other words, compared to the naive “one address, one socket” approach, <code>INADDR_ANY</code> allows us to have a single catch-all listening socket for the whole IP range on which we accept incoming connections.</p><p>On Linux this is possible thanks to a two-phase listening socket lookup: the kernel falls back to searching for an <code>INADDR_ANY</code> socket if a more specific match has not been found.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1B0Td9aV8dFNmmOZ7Z5Ell/51b479c1ed0abae1d71038ab030dd98b/tcp_socket_lookup-1.png" />
          </figure><p>Another upside of binding to <code>0.0.0.0</code> is that our application doesn’t need to be aware of what addresses we have assigned to our host. We are also free to assign or remove the addresses after binding the listening socket. No need to reconfigure the service when its listening IP range changes.</p><p>On the other hand, if our service should be listening on just the <code>A.B.C.0/20</code> prefix, binding to all local addresses is more than we need. We might unintentionally expose an otherwise internal-only service to external traffic without a proper firewall or a socket filter in place.</p><p>Then there is the security angle. Since we now only have one socket, attacks attempting to flood any of the IPs assigned to our host on our service’s port will hit the catch-all socket and its receive queue. While in such circumstances the Linux <a href="https://blog.cloudflare.com/syn-packet-handling-in-the-wild/">TCP stack has your back</a>, UDP needs special care or legitimate traffic might drown in the flood of dropped packets.</p><p>Possibly the biggest downside, though, is that a service listening on the wildcard <code>INADDR_ANY</code> address claims the port number exclusively for itself. Binding over the wildcard-listening socket with a specific IP and port fails miserably due to the address already being taken (<code>EADDRINUSE</code>).</p>
            <pre><code>bind(3, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
</code></pre>
            <p>Unless your service is UDP-only, setting the <code>SO_REUSEADDR</code> socket option will not help you overcome this restriction. The only way out is to turn to <code>SO_REUSEPORT</code>, normally used to construct a load-balancing socket group. And that is only if you are lucky enough to run the port-conflicting services as the same user (UID). That is a story for another post.</p><p>Quick Quiz #3: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3">Jump to answer</a>.</p>
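<p>For UDP, the interplay between the wildcard and <code>SO_REUSEADDR</code> can be demonstrated in a few lines. This is a hedged sketch on loopback (socket names are ours): with the option set on both sockets, a specific bind over the wildcard succeeds, and the two-phase lookup then prefers the more specific socket.</p>

```python
import socket

def udp_listener(addr, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # For UDP sockets, SO_REUSEADDR lifts the bind conflict between
    # a wildcard-bound and a specifically-bound socket.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((addr, port))
    return s

wildcard = udp_listener("0.0.0.0", 0)
port = wildcard.getsockname()[1]
specific = udp_listener("127.0.0.1", port)  # no EADDRINUSE here

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"ping", ("127.0.0.1", port))

# The lookup scores an exact local-address match above the wildcard,
# so the datagram should land in the specific socket's queue.
specific.settimeout(1)
data, _ = specific.recvfrom(16)
print(data)
```
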
    <div>
      <h2>Life gets real - one port, two services</h2>
      <a href="#life-gets-real-one-port-two-services">
        
      </a>
    </div>
    <p>As it happens, at the Cloudflare edge we do host services that share the same port number but otherwise respond to requests on non-overlapping IP ranges. A prominent example of such port-sharing is our <a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">1.1.1.1</a> recursive DNS resolver running side-by-side with the <a href="https://www.cloudflare.com/dns/">authoritative DNS service</a> that we offer to all customers.</p><p>Sadly the <a href="http://man7.org/linux/man-pages/man2/bind.2.html">sockets API</a> doesn’t allow us to express a setup in which two services share a port and accept requests on disjoint IP ranges.</p><p>However, as Linux development history shows, any networking API limitation can be overcome by introducing a new <a href="https://github.com/torvalds/linux/blame/master/include/uapi/asm-generic/socket.h">socket option</a>, with sixty-something options available (and counting!).</p><p>Enter <code>SO_BINDTOPREFIX</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VW8uJH7XkJKqUt6HZXWiU/6995812a081001a31238f45c0455960b/mapping_bindtoprefix-1.png" />
          </figure><p>Back in 2016 we proposed <a href="https://lore.kernel.org/netdev/1458699966-3752-1-git-send-email-gilberto.bertin@gmail.com/">an extension to the Linux network stack</a>. It allowed services to constrain a wildcard-bound socket to an IP range belonging to a network prefix.</p>
            <pre><code># Service 1, 127.0.0.0/20, 1234/tcp
net1, plen1 = '127.0.0.0', 20
bindprefix1 = struct.pack('BBBBBxxx', *inet_aton(net1), plen1)

s1 = socket(AF_INET, SOCK_STREAM, 0)
s1.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix1)
s1.bind(('0.0.0.0', 1234))
s1.listen(1)

# Service 2, 127.0.16.0/20, 1234/tcp
net2, plen2 = '127.0.16.0', 20
bindprefix2 = struct.pack('BBBBBxxx', *inet_aton(net2), plen2)

s2 = socket(AF_INET, SOCK_STREAM, 0)
s2.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix2)
s2.bind(('0.0.0.0', 1234))
s2.listen(1)
</code></pre>
            <p>This mechanism has served us well since then. Unfortunately, it didn’t get accepted upstream, being too specific to our use case. Having no better alternative, we have ended up maintaining the patches in our kernel to this day.</p>
    <div>
      <h2>Life gets complicated - all ports, one service</h2>
      <a href="#life-gets-complicated-all-ports-one-service">
        
      </a>
    </div>
    <p>Just when we thought we had things figured out, we were faced with a new challenge. How to build a service that accepts connections on any of the 65,535 ports? The ultimate reverse proxy, if you will, code-named <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>.</p><p>The <code>bind</code> syscall offers very little flexibility when it comes to mapping a socket to a port number. You can either specify the number you want or let the network stack pick an unused one for you. There is no counterpart of <code>INADDR_ANY</code>, a wildcard value to select all ports (<code>INPORT_ANY</code>?).</p><p>To achieve what we wanted, we had to turn to <a href="https://blog.cloudflare.com/how-we-built-spectrum/">TPROXY</a>, a <a href="https://www.kernel.org/doc/Documentation/networking/tproxy.txt">Netfilter / <code>iptables</code> extension</a> designed for intercepting remote-destined traffic on the forward path. However, we use it to steer local-destined packets, that is, ones targeted at our host, to a catch-all-ports socket.</p>
            <pre><code>iptables -t mangle -I PREROUTING \
         -d 192.0.2.0/24 -p tcp \
         -j TPROXY --on-ip=127.0.0.1 --on-port=1234</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n0QTqrm2m4ZTiqPtiiF9a/fbbd5a464721df6ae8017805e2d540dd/mapping_tproxy-1.png" />
          </figure><p>A TPROXY-based setup comes at a price. For starters, your service needs elevated privileges to create a special catch-all socket (see the <a href="http://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_TRANSPARENT</code> socket option</a>). Then you also have to understand and consider the subtle interactions between TPROXY and the receive path for your traffic profile, for example:</p><ul><li><p>does connection tracking register the flows redirected with TPROXY?</p></li><li><p>is listening socket contention during a SYN flood a concern when using TPROXY?</p></li><li><p>do other parts of the network stack, like XDP programs, need to know about TPROXY redirecting packets?</p></li></ul><p>These are some of the questions we needed to answer, and after running it in production for a while now, we have a good idea of what the consequences of using TPROXY are.</p><p>That said, it would not come as a shock if tomorrow we discovered something new about TPROXY. Due to its complexity we’ve always considered using it to steer local-destined traffic a <a href="https://blog.cloudflare.com/how-we-built-spectrum/">hack</a>, a use case outside its intended application. No matter how well understood, a hack remains a hack.</p>
    <div>
      <h2>Can BPF make life easier?</h2>
      <a href="#can-bpf-make-life-easier">
        
      </a>
    </div>
    <p>Despite its complex nature, TPROXY shows us something important. No matter what IP or port the listening socket is bound to, with a bit of support from the network stack we can steer any connection to it. As long as the application is ready to handle this situation, things work.</p><p>Quick Quiz #4: Are there really no problems with accepting any connection on any socket? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4">Jump to answer</a>.</p><p>This is a really powerful concept. With a bunch of TPROXY rules, we can configure any mapping between (address, port) tuples and listening sockets.</p><p><b>Idea #1:</b> A local-destined connection can be accepted by any listening socket.</p><p>We didn’t tell you the whole story before. When we published <code>SO_BINDTOPREFIX</code> patches, they did not just get rejected. <a href="https://meta.wikimedia.org/wiki/Cunningham%27s_Law">As sometimes happens</a>, by posting the wrong answer we got <a href="https://lore.kernel.org/netdev/1459261895.6473.176.camel@edumazet-glaptop3.roam.corp.google.com/">the right answer</a> to our problem:</p><blockquote><p>❝BPF is absolutely the way to go here, as it allows for whatever user specified tweaks, like a list of destination subnetwork, or/and a list of source network, or the date/time of the day, or port knocking without netfilter, or … you name it.❞</p></blockquote><p><b>Idea #2:</b> How we pick a listening socket can be tweaked with BPF.</p><p>Combine the two ideas, and we arrive at an exciting concept: let’s run BPF code to match an incoming packet with a listening socket, ignoring the address the socket is bound to.</p><p>Here’s an example to illustrate it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zhLPX1wssfv3c4QvObgz9/66812e76990de528517d020dbca70638/idea_program_socket_lookup_with_bpf-2.png" />
          </figure><p>All packets arriving on the <code>192.0.2.0/24</code> prefix, port <code>53</code>, are steered to socket <code>sk:2</code>, while traffic targeted at <code>203.0.113.1</code>, on any port number, lands in socket <code>sk:4</code>.</p>
    <div>
      <h2>Welcome BPF inet_lookup</h2>
      <a href="#welcome-bpf-inet_lookup">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Pg5Lou5IFLuZcFHbaY46O/590e5267f496e9ce2e1eb6d1460bbd45/bpf_inet_lookup_hook-1.png" />
          </figure><p>To make this concept a reality we are proposing a new mechanism to program the socket lookup with BPF. What is socket lookup? It’s a stage on the receive path where the transport layer searches for a socket to dispatch the packet to. The last possible moment to steer packets before they land in the selected socket receive queue. In there we attach a new type of BPF program called <code>inet_lookup</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yATE46s8d1no0gYCw6Asw/6cd9e449e7d6b0571a8c69ce9e4d9f33/tcp_socket_lookup_with_bpf-1.png" />
          </figure><p>If you recall, socket lookup in the Linux TCP stack is a <a href="https://elixir.bootlin.com/linux/v5.4-rc2/source/include/net/inet_hashtables.h#L329">two phase process</a>. First the kernel will try to find an established (connected) socket matching the packet 4-tuple. If there isn’t one, it will continue by looking for a listening socket using just the packet 2-tuple as key.</p><p>Our proposed extension allows users to program the second phase, the listening socket lookup. If present, a BPF program is allowed to choose a listening socket and terminate the lookup. Our program is also free to ignore the packet, in which case the kernel will continue to look for a listening socket as usual.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bjoj0UUcthGLLGOSTbida/3e139697c9d897cd11aabbdf80946032/bpf_inet_lookup_operation-1.png" />
          </figure><p>How does this new type of BPF program operate? On input, as context, it gets handed a subset of information extracted from packet headers, including the packet 4-tuple. Based on the input the program accesses a BPF map containing references to listening sockets, and selects one to yield as the socket lookup result.</p><p>If we take a look at the corresponding BPF code, the program structure resembles a firewall rule. We have some match statements followed by an action.</p>
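<p>In BPF C, such a match-then-act program could look roughly like the following. This is a sketch only: the identifiers follow the <code>sk_lookup</code> API that later landed upstream in Linux 5.9 (section name, context fields, <code>bpf_sk_assign</code>), and may differ from the names used in the original patch series:</p>

```c
/* Sketch of an inet_lookup-style BPF program: steer all packets
 * destined for port 53, on any local address, into one socket. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} dns_socket SEC(".maps");

SEC("sk_lookup")
int steer_dns(struct bpf_sk_lookup *ctx)
{
    __u32 key = 0;
    struct bpf_sock *sk;
    long err;

    /* Match: destination port 53. */
    if (ctx->local_port != 53)
        return SK_PASS;        /* not ours, regular lookup continues */

    /* Action: redirect to the socket stored in the map. */
    sk = bpf_map_lookup_elem(&dns_socket, &key);
    if (!sk)
        return SK_PASS;
    err = bpf_sk_assign(ctx, sk, 0);
    bpf_sk_release(sk);
    return err ? SK_DROP : SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```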
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lhm5L9NH0AN8tnMh2cMpl/14656d47d0e81440617893422ac2140b/bpf_inet_lookup_code_sample-2.png" />
          </figure><p>You may notice that we don’t access the BPF map with sockets directly. Instead, we follow an established pattern in BPF called “map based redirection”, where a dedicated BPF helper accesses the map and carries out any steps necessary to redirect the packet.</p><p>We’ve skipped over one thing. Where does the BPF map of sockets come from? We create it ourselves and populate it with sockets. This is most easily done if your service uses systemd <a href="http://0pointer.de/blog/projects/socket-activation.html">socket activation</a>. systemd will let you associate more than one service unit with a socket unit, and both of the services will receive a file descriptor for the same socket. From there it’s just a matter of inserting the socket into the BPF map.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ufZk0QnR4cOf061PvseTW/a7bdf0a77d655be88adc527cadc4a5bf/bpf_inet_lookup_socket_activation-1.png" />
          </figure>
    <div>
      <h2>Demo time!</h2>
      <a href="#demo-time">
        
      </a>
    </div>
    <p>This is not just a concept. We have already published a first working <a href="https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/">set of patches for the kernel</a> together with ancillary <a href="https://github.com/majek/inet-tool">user-space tooling</a> to configure the socket lookup to your needs.</p><p>If you would like to see it in action, you are in luck. We’ve put together a demo that shows just how easily you can bind a network service to (i) a single port, (ii) all ports, or (iii) a network prefix. On-the-fly, without having to restart the service! There is a port scan running to prove it.</p><p>You can also bind to all-addresses-all-ports (<code>0.0.0.0/0</code>) because why not? Take that <code>INADDR_ANY</code>. All thanks to BPF superpowers.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>We have gone over how the way we bind services to network addresses on the Cloudflare edge has evolved over time. Each approach has its pros (✓) and cons (✗), summarized below. We are currently working on a new BPF-based mechanism for binding services to addresses, which is intended to address the shortcomings of the existing solutions.</p><p><b>bind to one address and port
</b>✓ flood traffic on one address hits one socket, doesn’t affect the rest
✗ as many sockets as listening addresses, doesn’t scale</p><p><b>bind to all addresses with </b><b><code>INADDR_ANY
</code></b>✓ just one socket for all addresses, the kernel thanks you
✓ application doesn’t need to know about listening addresses
✗ flood scenario requires custom protection, at least for UDP
✗ port sharing is tricky or impossible</p><p><b>bind to a network prefix with </b><b><code>SO_BINDTOPREFIX
</code></b>✓ two services can share a port if their IP ranges are non-overlapping
✗ custom kernel API extension that never went upstream</p><p><b>bind to all ports with TPROXY
</b>✓ enables redirecting all ports to a listening socket and more
✗ meant for intercepting forwarded traffic early on the ingress path
✗ has subtle interactions with the network stack
✗ requires privileges from the application</p><p><b>bind to anything you want with BPF </b><b><code>inet_lookup
</code></b>✓ allows for the same flexibility as with TPROXY or <code>SO_BINDTOPREFIX</code>
✓ services don’t need extra capabilities, meant for local traffic only
✗ needs cooperation from services or PID 1 to build a socket map</p><hr /><p>Getting to this point has been a team effort. A special thank you to Lorenz Bauer and <a href="https://blog.cloudflare.com/author/marek-majkowski/">Marek Majkowski</a>, who have contributed in an essential way to the BPF <code>inet_lookup</code> implementation. The <code>SO_BINDTOPREFIX</code> patches were authored by <a href="https://blog.cloudflare.com/author/gilberto-bertin/">Gilberto Bertin</a>. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p>
    <div>
      <h2>Quiz Answers</h2>
      <a href="#quiz-answers">
        
      </a>
    </div>
    
    <div>
      <h3>Quiz 1</h3>
      <a href="#quiz-1">
        
      </a>
    </div>
    <p>Q: How many Cloudflare services can you name?</p><ol><li><p><a href="https://www.cloudflare.com/cdn/">HTTP CDN</a> (tcp/80)</p></li><li><p><a href="https://www.cloudflare.com/ssl/">HTTPS CDN</a> (tcp/443, <a href="https://cloudflare-quic.com/">udp/443</a>)</p></li><li><p><a href="https://www.cloudflare.com/dns/">authoritative DNS</a> (udp/53)</p></li><li><p><a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">recursive DNS</a> (udp/53, 853)</p></li><li><p><a href="https://blog.cloudflare.com/secure-time/">NTP with NTS</a> (udp/1234)</p></li><li><p><a href="https://blog.cloudflare.com/roughtime/">Roughtime time service</a> (udp/2002)</p></li><li><p><a href="https://blog.cloudflare.com/distributed-web-gateway/">IPFS Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/cloudflare-ethereum-gateway/">Ethereum Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/spectrum/">Spectrum proxy</a> (tcp/any, udp/any)</p></li><li><p><a href="https://blog.cloudflare.com/announcing-warp-plus/">WARP</a> (udp)</p></li></ol><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1-question">Go back</a></p>
    <div>
      <h3>Quiz 2</h3>
      <a href="#quiz-2">
        
      </a>
    </div>
    <p>Q: Is there another way to bind a socket to all local addresses?</p><p>Yes, there is: by not <code>bind()</code>’ing it at all. Calling <code>listen()</code> on an unbound socket is equivalent to binding it to <code>INADDR_ANY</code> and letting the kernel pick a free port.</p><pre><code>$ strace -e socket,bind,listen nc -l
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
listen(3, 1)                            = 0
^Z
[1]+  Stopped                 strace -e socket,bind,listen nc -l
$ ss -4tlnp
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
LISTEN     0      1            *:42669      </code></pre><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2-question">Go back</a></p>
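<p>The same behavior is easy to confirm programmatically: <code>listen(2)</code> on an unbound TCP socket implicitly binds it, and <code>getsockname(2)</code> then reports the wildcard address with a kernel-chosen port. A minimal sketch (the helper name is ours):</p>

```c
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Listen without bind(); report the implicitly assigned address.
 * Returns the chosen port in host byte order, or -1 on error. */
static int listen_unbound(uint32_t *addr_out)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    if (getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    *addr_out = ntohl(sin.sin_addr.s_addr);
    return ntohs(sin.sin_port);
}
```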
    <div>
      <h3>Quiz 3</h3>
      <a href="#quiz-3">
        
      </a>
    </div>
    <p>Q: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict?</p><p>Yes. If two processes are racing to <code>bind()</code> and <code>listen()</code> on the same TCP port, on an overlapping IP, setting <code>SO_REUSEADDR</code> changes which syscall will report an error (<code>EADDRINUSE</code>). Without <code>SO_REUSEADDR</code> it will always be the second <code>bind()</code>. With <code>SO_REUSEADDR</code> set, there is a window of opportunity for the second <code>bind()</code> to succeed but the subsequent <code>listen()</code> to fail.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3-question">Go back</a></p>
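<p>The window is easy to reproduce with two sockets in one process. With <code>SO_REUSEADDR</code> on both, the second <code>bind()</code> to the same address succeeds while neither socket is listening, and the error only surfaces at the second <code>listen()</code>. A minimal sketch of Linux behavior (the helper and its return codes are illustrative):</p>

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns which syscall failed with EADDRINUSE:
 * 0 = none, 2 = second bind(), 4 = second listen(). */
static int conflict_point(int use_reuseaddr)
{
    int one = use_reuseaddr, ret = 0;
    int a = socket(AF_INET, SOCK_STREAM, 0);
    int b = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(a, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    setsockopt(b, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in sin = { .sin_family = AF_INET };
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(a, (struct sockaddr *)&sin, sizeof(sin)); /* port 0: pick one */
    socklen_t len = sizeof(sin);
    getsockname(a, (struct sockaddr *)&sin, &len); /* learn the port  */

    if (bind(b, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        ret = 2;                                   /* bind conflict   */
    } else {
        listen(a, 1);                              /* first one wins  */
        if (listen(b, 1) < 0)
            ret = 4;                               /* listen conflict */
    }
    close(a);
    close(b);
    return ret;
}
```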
    <div>
      <h3>Quiz 4</h3>
      <a href="#quiz-4">
        
      </a>
    </div>
    <p>Q: Are there really no problems with accepting any connection on any socket?</p><p>If the connection is destined for an address assigned to our host, i.e. a local address, there are no problems. However, for remote-destined connections, sending return traffic from a non-local address (i.e., one not present on any interface) will not get past the Linux network stack. The <code>IP_TRANSPARENT</code> socket option lifts this restriction by bypassing a protection mechanism known as the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ef4a5a78e63420dd1dd770f1bd1dc198926b04">source address check</a>.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4-question">Go back</a></p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">2tVUhaeVSAohZJbbDgK6vy</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
    </channel>
</rss>