
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 19:40:08 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers]]></title>
            <link>https://blog.cloudflare.com/programmable-flow-protection/</link>
            <pubDate>Tue, 31 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Magic Transit customers can now program their own DDoS mitigation logic and deploy it across Cloudflare’s global network. This enables precise, stateful mitigation for custom and proprietary UDP protocols. ]]></description>
            <content:encoded><![CDATA[ <p>We're proud to introduce <a href="https://developers.cloudflare.com/ddos-protection/advanced-ddos-systems/overview/programmable-flow-protection/"><u>Programmable Flow Protection</u></a>: a system designed to let <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a> customers implement their own custom DDoS mitigation logic and deploy it across Cloudflare’s global network. This enables precise, stateful mitigation for custom and proprietary protocols built on UDP. It is engineered to provide the highest possible level of customization and flexibility to mitigate DDoS attacks of any scale. </p><p>Programmable Flow Protection is currently in beta and available to all Magic Transit Enterprise customers for an additional cost. Contact your account team to join the beta or sign up at <a href="https://www.cloudflare.com/en-gb/lp/programmable-flow-protection/"><u>this page</u></a>.</p>
    <div>
      <h3>Programmable Flow Protection is customizable</h3>
      <a href="#programmable-flow-protection-is-customizable">
        
      </a>
    </div>
    <p>Our existing <a href="https://www.cloudflare.com/ddos/"><u>DDoS mitigation systems</u></a> have been designed to understand and protect popular, well-known protocols from DDoS attacks. For example, our <a href="https://developers.cloudflare.com/ddos-protection/advanced-ddos-systems/overview/advanced-tcp-protection/"><u>Advanced TCP Protection</u></a> system uses specific known characteristics about the TCP protocol to issue challenges and establish a client’s legitimacy. Similarly, our <a href="https://blog.cloudflare.com/advanced-dns-protection/"><u>Advanced DNS Protection</u></a> builds a per-customer profile of DNS queries to mitigate DNS attacks. Our generic DDoS mitigation platform also understands common patterns across a variety of other well known protocols, including NTP, RDP, SIP, and many others.</p><p>However, custom or proprietary UDP protocols have always been a challenge for Cloudflare’s DDoS mitigation systems because our systems do not have the relevant protocol knowledge to make intelligent decisions to pass or drop traffic. </p><p>Programmable Flow Protection addresses this gap. Now, customers can write their own <a href="https://ebpf.io/"><u>eBPF</u></a> program that defines what “good” and “bad” packets are and how to deal with them. Cloudflare then runs the program across our entire global network. The program can choose to either drop or challenge “bad” packets, preventing them from reaching the customer’s origin. </p>
    <div>
      <h3>The problem of UDP-based attacks</h3>
      <a href="#the-problem-of-udp-based-attacks">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/"><u>UDP</u></a> is a connectionless transport layer protocol. Unlike TCP, UDP has no handshake or stateful connections. It does not promise that packets will arrive in order or exactly once. UDP instead prioritizes speed and simplicity, and is therefore well-suited for online gaming, VoIP, video streaming, and any other use case where the application requires real-time communication between clients and servers.</p><p>Our DDoS mitigation systems have always been able to detect and mitigate attacks against well-known protocols built on top of UDP. For example, the standard DNS protocol is built on UDP, and each DNS packet has a well-known structure. If we see a DNS packet, we know how to interpret it. That makes it easier for us to detect and drop DNS-based attacks. </p><p>Unfortunately, if we don’t understand the protocol inside a UDP packet’s payload, our DDoS mitigation systems have limited options available at mitigation time. If an attacker <a href="https://www.cloudflare.com/learning/ddos/udp-flood-ddos-attack/"><u>sends a large flood of UDP traffic</u></a> that does not match any known patterns or protocols, Cloudflare can either entirely block or apply a rate limit to the destination IP and port combination. This is a crude “last line of defense” that is only intended to keep the rest of the customer’s network online, and it can be painful in a couple ways. </p><p>First, a block or a generic <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/"><u>rate limit</u></a> does not distinguish good traffic from bad, which means these mitigations will likely cause legitimate clients to experience lag or connection loss — doing the attacker’s job for them! Second, a generic rate limit can be too strict or too lax depending on the customer. 
For example, a customer who expects to receive 1Gbps of legitimate traffic probably needs more aggressive rate limiting compared to a customer who expects to receive 25Gbps of legitimate traffic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/L8PZ6eWn9nkpATaNcUinB/b6c12b4be815fbd4e71166b6f0c30329/BLOG-3182_2.png" />
          </figure><p><sup><i>An illustration of UDP packet contents. A user can define a valid payload and reject traffic that doesn’t match the defined pattern.</i></sup></p><p>The Programmable Flow Protection platform was built to address this problem by allowing our customers to dictate what “good” versus “bad” traffic actually looks like. Many of our customers use custom or proprietary UDP protocols that we do not understand — and now we don’t have to.</p>
    <div>
      <h3>How Programmable Flow Protection works</h3>
      <a href="#how-programmable-flow-protection-works">
        
      </a>
    </div>
    <p>In previous blog posts, we’ve described how “flowtrackd”, our <a href="https://blog.cloudflare.com/announcing-flowtrackd/"><u>stateful network layer DDoS mitigation system</u></a>, protects Magic Transit users from complex TCP and DNS attacks. We’ve also described how we use Linux technologies like <a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"><u>XDP</u></a> and <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a> to efficiently mitigate common types of large scale DDoS attacks. </p><p>Programmable Flow Protection combines these technologies in a novel way. With Programmable Flow Protection, a customer can write their own eBPF program that decides whether to pass, drop, or challenge individual packets based on arbitrary logic. A customer can upload the program to Cloudflare, and Cloudflare will execute it on every packet destined to their network. Programs are executed in userspace, not kernel space, which allows Cloudflare the flexibility to support a variety of customers and use cases on the platform without compromising security. Programmable Flow Protection programs run after all of Cloudflare’s existing DDoS mitigations, so users still benefit from our standard security protections. </p><p>There are many similarities between an XDP eBPF program loaded into the Linux kernel and an eBPF program running on the Programmable Flow Protection platform. Both types of programs are compiled down to BPF bytecode. They are both run through a “verifier” to ensure memory safety and verify program termination. They are also executed in a fast, lightweight VM to provide isolation and stability.</p><p>However, eBPF programs loaded into the Linux kernel make use of many Linux-specific “helper functions” to integrate with the network stack, maintain state between program executions, and emit packets to network devices. 
Programmable Flow Protection offers equivalent functionality, but through a different API tailored specifically to implementing DDoS mitigations. For example, we’ve built helper functions to store state about clients between program executions, perform cryptographic validation, and emit challenge packets to clients. With these helper functions, a developer can use the power of the Cloudflare platform to protect their own network.</p>
    <div>
      <h3>Combining customer knowledge with Cloudflare’s network</h3>
      <a href="#combining-customer-knowledge-with-cloudflares-network">
        
      </a>
    </div>
    <p>Let’s step through an example to illustrate how a customer’s protocol-specific knowledge can be combined with Cloudflare’s network to create powerful mitigations.</p><p>Say a customer hosts an online gaming server on UDP port 207. The game engine uses a proprietary application header that is specific to the game. Cloudflare has no knowledge of the structure or contents of the application header. The customer gets hit by DDoS attacks that overwhelm the game server and players report lag in gameplay. The attack traffic comes from highly randomized source IPs and ports, and the payload data appears to be random as well. </p><p>To mitigate the attack, the customer can use their knowledge of the application header and deploy a Programmable Flow Protection program to check a packet’s validity. In this example, the application header contains a token that is unique to the gaming protocol. The customer can therefore write a program to extract the last byte of the token. The program passes all packets with the correct value present and drops all other traffic:</p>
            <pre><code>#include &lt;linux/ip.h&gt;
#include &lt;linux/udp.h&gt;
#include &lt;arpa/inet.h&gt;

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

// Custom application header
struct apphdr {
    uint8_t  version;
    uint16_t length;   // Length of the variable-length token
    uint8_t  token[0]; // Variable-length token
} __attribute__((packed));

uint64_t
cf_ebpf_main(void *state)
{
    struct cf_ebpf_generic_ctx *ctx = state;
    struct cf_ebpf_parsed_headers headers;
    struct cf_ebpf_packet_data *p;

    // Parse the packet headers with provided helper function
    if (parse_packet_data(ctx, &amp;p, &amp;headers) != 0) {
        return CF_EBPF_DROP;
    }

    // Drop packets not destined to port 207
    struct udphdr *udp_hdr = (struct udphdr *)headers.udp;
    if (ntohs(udp_hdr-&gt;dest) != 207) {
        return CF_EBPF_DROP;
    }

    // Get application header from UDP payload
    struct apphdr *app = (struct apphdr *)(udp_hdr + 1);
    if ((uint8_t *)(app + 1) &gt; headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Extract the token length declared in the application header
    uint16_t token_len = ntohs(app-&gt;length);

    // Perform memory checks to satisfy the verifier
    // and access the token safely
    if (token_len == 0 ||
        (uint8_t *)(app-&gt;token + token_len) &gt; headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Check the last byte of the token against expected value
    uint8_t *last_byte = app-&gt;token + token_len - 1;
    if (*last_byte != 0xCF) {
        return CF_EBPF_DROP;
    }

    return CF_EBPF_PASS;
}</code></pre>
            <p><sup><i>An eBPF program to filter packets according to a value in the application header.</i></sup></p><p>This program leverages application-specific information to create a more targeted mitigation than Cloudflare is capable of crafting on its own. <b>Customers can now combine their proprietary knowledge with the capacity of Cloudflare’s global network to absorb and mitigate massive attacks better than ever before.</b></p>
    <div>
      <h3>Going beyond firewalls: stateful tracking and challenges</h3>
      <a href="#going-beyond-firewalls-stateful-tracking-and-challenges">
        
      </a>
    </div>
    <p>Many pattern checks, like the one performed in the example above, can be accomplished with traditional firewalls. However, programs provide useful primitives that are not available in firewalls, including variables, conditional execution, loops, and procedure calls. But what really sets Programmable Flow Protection apart from other solutions is its ability to statefully track flows and challenge clients to prove they are real. A common type of attack that showcases these abilities is a <i>replay attack</i>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Pgo9uUQDY1GTrxAOAOgiK/52c6d6a329cce05ff11ba3e4694313b2/BLOG-3182_3.png" />
          </figure><p>In a replay attack, an attacker repeatedly sends packets that were valid at <i>some</i> point, and therefore conform to expected patterns of the traffic, but are no longer valid in the application’s current context. For example, the attacker could record some of their valid gameplay traffic and use a script to duplicate and transmit the same traffic at a very high rate.</p><p>With Programmable Flow Protection, a user can deploy a program that challenges suspicious clients and drops scripted traffic. We can extend our original example as follows:</p>
            <pre><code>#include &lt;linux/ip.h&gt;
#include &lt;linux/udp.h&gt;
#include &lt;arpa/inet.h&gt;

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

uint64_t
cf_ebpf_main(void *state)
{
    // ...
 
    // Get the status of this source IP (statefully tracked)
    uint8_t status;
    if (cf_ebpf_get_source_ip_status(&amp;status) != 0) {
        return CF_EBPF_DROP;
    }

    switch (status) {
        case NONE:
            // Issue a custom challenge to this source IP
            issue_challenge();
            cf_ebpf_set_source_ip_status(CHALLENGED);
            return CF_EBPF_DROP;

        case CHALLENGED:
            // Check if this packet passes the challenge
            // with custom logic
            if (verify_challenge()) {
                cf_ebpf_set_source_ip_status(VERIFIED);
                return CF_EBPF_PASS;
            } else {
                cf_ebpf_set_source_ip_status(BLOCKED);
                return CF_EBPF_DROP;
            }

        case VERIFIED:
            // This source IP has passed the challenge
            return CF_EBPF_PASS;

        case BLOCKED:
            // This source IP has been blocked
            return CF_EBPF_DROP;

        default:
            return CF_EBPF_PASS;
    }
}
</code></pre>
            <p><sup><i>An eBPF program to challenge UDP connections and statefully manage connections. This example has been simplified for illustration purposes.</i></sup></p><p>The program statefully tracks the source IP addresses it has seen and emits a packet with a cryptographic challenge back to unknown clients. A legitimate client running a valid gaming client is able to correctly solve the challenge and respond with proof, but the attacker’s script is not. Traffic from the attacker is marked as “blocked” and subsequent packets are dropped.</p><p>With these new abilities, customers can statefully track flows and make sure only real, verified clients can send traffic to their origin servers. Although we have focused the example on gaming, the potential use cases for this technology extend to any UDP-based protocol.</p>
    <div>
      <h3>Get started today</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>We’re excited to offer the Programmable Flow Protection feature to Magic Transit Enterprise customers. Talk to your account manager to learn more about how you can enable Programmable Flow Protection to help keep your infrastructure safe.</p><p>We’re still in active development of the platform, and we’re excited to see what our users build next. If you are not yet a Cloudflare customer, let us know if you’d like to protect your network with Cloudflare and Programmable Flow Protection by signing up at this page: <a href="https://www.cloudflare.com/lp/programmable-flow-protection/"><u>https://www.cloudflare.com/lp/programmable-flow-protection/</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Beta]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">64lPEfE3ML34AycHER46Tz</guid>
            <dc:creator>Anita Tenjarla</dc:creator>
            <dc:creator>Alex Forster</dc:creator>
            <dc:creator>Cody Doucette</dc:creator>
            <dc:creator>Venus Xeon-Blonde</dc:creator>
        </item>
        <item>
            <title><![CDATA[A deep dive into BPF LPM trie performance and optimization]]></title>
            <link>https://blog.cloudflare.com/a-deep-dive-into-bpf-lpm-trie-performance-and-optimization/</link>
            <pubDate>Tue, 21 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ This post explores the performance of BPF LPM tries, a critical data structure used for IP matching.  ]]></description>
            <content:encoded><![CDATA[ <p>It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.</p><p>BPF trie maps (<a href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LPM_TRIE/">BPF_MAP_TYPE_LPM_TRIE</a>) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s <a href="https://www.cloudflare.com/network-services/products/magic-firewall/"><u>Magic Firewall</u></a> rules and these bottlenecks have even led to traffic packet loss for some customers.</p><p>This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.</p>
    <div>
      <h2>A brief recap of tries</h2>
      <a href="#a-brief-recap-of-tries">
        
      </a>
    </div>
    <p>If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure (similar to a binary tree) that lets you store and search for data by key, where each node represents some number of the key’s bits.</p><p>Searches are performed by traversing a path from the root; the path itself reconstructs the key, meaning nodes do not need to store their full keys. This differs from a traditional binary search tree (BST), where the primary invariant is that the left child node has a key that is less than the current node’s and the right child has a key that is greater. BSTs require that each node store the full key so that a comparison can be made at each search step.</p><p>Here’s an example that shows how a BST might store values for the keys:</p><ul><li><p>ABC</p></li><li><p>ABCD</p></li><li><p>ABCDEFGH</p></li><li><p>DEF</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uXt5qwpyq7VzrqxXlHFLj/99677afd73a98b9ce04d30209065499f/image4.png" />
          </figure><p>In comparison, a trie for storing the same set of keys might look like this.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TfFZmwekNAF18yWlOIVWh/58396a19e053bd1c02734a6a54eea18e/image8.png" />
          </figure><p>This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).</p><p>Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.</p><p>This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.</p><p>If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a <b><i>multibit trie</i></b>. 
You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.</p><p>Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.</p><p>There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H”, “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LO6izFC5e06dRf9ra2roC/167ba5c4128fcebacc7b7a8eab199ea5/image5.png" />
          </figure><p>Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using <b><i>path compression</i></b>. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ADY3lNtF7NIgfUX7bX9vY/828a14e155d6530a4dc8cf3286ce8cc3/image13.png" />
          </figure><p>If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.</p><p>What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use <b><i>level compression</i></b><i> </i>and replace all the nodes in those levels with a single node that has 2**3 children. This is how Level-Compressed Tries work which are used for <a href="https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux">IP route lookup</a> in the Linux kernel (see <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a>).</p><p>There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.</p>
    <div>
      <h2>How fast are BPF LPM trie maps?</h2>
      <a href="#how-fast-are-bpf-lpm-trie-maps">
        
      </a>
    </div>
    <p>Here are some numbers from running <a href="https://lore.kernel.org/bpf/20250827140149.1001557-1-matt@readmodwrite.com/"><u>BPF selftests benchmark</u></a> on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).</p><table><tr><td><p>Operation</p></td><td><p>Throughput</p></td><td><p>Stddev</p></td><td><p>Latency</p></td></tr><tr><td><p>lookup</p></td><td><p>7.423M ops/s</p></td><td><p>0.023M ops/s</p></td><td><p>134.710 ns/op</p></td></tr><tr><td><p>update</p></td><td><p>2.643M ops/s</p></td><td><p>0.015M ops/s</p></td><td><p>378.310 ns/op</p></td></tr><tr><td><p>delete</p></td><td><p>0.712M ops/s</p></td><td><p>0.008M ops/s</p></td><td><p>1405.152 ns/op</p></td></tr><tr><td><p>free</p></td><td><p>0.573K ops/s</p></td><td><p>0.574K ops/s</p></td><td><p>1.743 ms/op</p></td></tr></table><p>The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused <a href="https://lore.kernel.org/lkml/20250616095532.47020-1-matt@readmodwrite.com/"><u>soft lockup messages</u></a> to spew in production.</p><p>This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.</p>
    <div>
      <h2>Why are BPF LPM tries slow?</h2>
      <a href="#why-are-bpf-lpm-tries-slow">
        
      </a>
    </div>
    <p>The LPM trie implementation in <a href="https://elixir.bootlin.com/linux/v6.12.43/source/kernel/bpf/lpm_trie.c"><u>kernel/bpf/lpm_trie.c</u></a> has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.</p><p>Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.</p><p>A diagram for this 2-child trie is given below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ciL2t6aMyJHR2FfX41rNk/365abe47cf384729408cf9b98c65c0be/image9.png" />
          </figure><p>The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17VoWl8OY6tzcARKDKuSjS/b9200dbeddf13f101b7085a549742f95/image3.png" />
          </figure><p>This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.</p><p>The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ecfKSeoqN3bfBXmC9KHw5/3be952edea34d6b2cc867ba31ce14805/image12.png" />
          </figure><p>And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.</p><p>Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33I92exrEZTcUWOjxaBOqY/fb1de551b06e3272c8670d0117d738fa/image2.png" />
          </figure><p>Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OhaAaI5Y2XJCofI9V39z/567a01b3335f29ef3b46ccdd74dc27e5/image1.png" />
          </figure><p>Why is this? Initially, this is because of the L1 dcache miss rate. All of those nodes that need to be traversed in the trie are potential cache miss opportunities.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Gx4fOLKmhUKHegybQU7sl/4936239213f0061d5cbc2f5d6b63fde6/image11.png" />
          </figure><p>As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Jy7aTN3Nyo2EsbSzw313n/d26871fa417ffe293adb47fe7f7dc56b/image7.png" />
          </figure><p>Because BPF LPM tries dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses. This means that traversing a path through a trie will almost certainly incur cache misses, and potentially dTLB misses. It gets worse as the number of entries, and the height of the trie, increase.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CB3MvSvSgH1T2eY7Xlei8/81ebe572592ca71529d79564a88993f0/image10.png" />
          </figure>
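<p>To make this concrete, here’s a sketch (illustrative C, not the kernel’s actual code) of the traversal pattern that causes these misses: every node is a separate allocation, so each step down the trie chases a pointer to a potentially cold cache line.</p>

```c
#include <stddef.h>

/* Illustrative only: each BPF LPM trie node is allocated individually
 * from kernel memory, so children can live at arbitrary addresses. */
struct lpm_node {
    struct lpm_node *child[2];
};

/* Walk the trie one level at a time; every child[] dereference is a
 * potential L1 dcache miss, and possibly a dTLB miss too. Returns the
 * depth reached. */
int walk_depth(const struct lpm_node *node)
{
    int depth = 0;
    while (node) {
        node = node->child[0]; /* illustrative: always take branch 0 */
        depth++;
    }
    return depth;
}
```

<p>The deeper the trie, the more of these dependent loads a single lookup performs, which is why throughput keeps falling as entries are added.</p>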
    <div>
      <h2>Where do we go from here?</h2>
      <a href="#where-do-we-go-from-here">
        
      </a>
    </div>
    <p>By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.</p><p>We’ve already contributed these benchmarks to the upstream Linux kernel — but that’s only the start. We have plans to improve the performance of BPF LPM tries, particularly the lookup function which is heavily used for our workloads. This post covered a number of optimisations that are already used by the <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a> code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.</p><p>If you’re interested in looking at more performance numbers, Jesper Brouer has recorded some here: <a href="https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org">https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org</a>.</p><h6><i>If the Linux kernel, performance, or optimising data structures excites you, </i><a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><i>our engineering teams are hiring</i></a><i>.</i></h6><p></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2A4WHjTqyxprwUMPaZ6tfj</guid>
            <dc:creator>Matt Fleming</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[A July 4 technical reading list]]></title>
            <link>https://blog.cloudflare.com/july-4-2022-reading-list/</link>
            <pubDate>Mon, 04 Jul 2022 12:55:08 GMT</pubDate>
            <description><![CDATA[ Here’s a short list of recent technical blog posts to give you something to read today ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2S9gHqjdCaiGCiCTBkGt0P/3a2a26f413cb9a908a9112a858495a7e/image1-61.png" />
            
            </figure><p>Here’s a short list of recent technical blog posts to give you something to read today.</p>
    <div>
      <h3>Internet Explorer, we hardly knew ye</h3>
      <a href="#internet-explorer-we-hardly-knew-ye">
        
      </a>
    </div>
    <p>Microsoft has announced the end-of-life for the venerable Internet Explorer browser. Here <a href="/internet-explorer-retired/">we take a look</a> at the demise of IE and the rise of the Edge browser. And we investigate how many bots on the Internet continue to impersonate Internet Explorer versions that have long since been replaced.</p>
    <div>
      <h3>Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module</h3>
      <a href="#live-patching-security-vulnerabilities-inside-the-linux-kernel-with-ebpf-linux-security-module">
        
      </a>
    </div>
    <p>Looking for something with a lot of technical detail? Look no further than <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">this blog about live-patching</a> the Linux kernel using eBPF. Code, Makefiles and more within!</p>
    <div>
      <h3>Hertzbleed explained</h3>
      <a href="#hertzbleed-explained">
        
      </a>
    </div>
    <p>Feeling mathematical? Or just need a dose of CPU-level antics? Look no further than this <a href="/hertzbleed-explained/">deep explainer</a> about how CPU frequency scaling leads to a nasty side channel affecting cryptographic algorithms.</p>
    <div>
      <h3>Early Hints update: How Cloudflare, Google, and Shopify are working together to build a faster Internet for everyone</h3>
      <a href="#early-hints-update-how-cloudflare-google-and-shopify-are-working-together-to-build-a-faster-internet-for-everyone">
        
      </a>
    </div>
    <p>The HTTP standard for Early Hints shows a lot of promise. How much? In this blog post, we <a href="/early-hints-performance/">dig into data</a> about Early Hints in the real world and show how much faster the web is with it.</p>
    <div>
      <h3>Private Access Tokens: eliminating CAPTCHAs on iPhones and Macs with open standards</h3>
      <a href="#private-access-tokens-eliminating-captchas-on-iphones-and-macs-with-open-standards">
        
      </a>
    </div>
    <p>Dislike CAPTCHAs? Yes, us too. As part of our program to eliminate CAPTCHAs there’s a new standard: Private Access Tokens. This blog shows <a href="/eliminating-captchas-on-iphones-and-macs-using-new-standard/">how they work</a> and how they can be used to prove you’re human without saying who you are.</p>
    <div>
      <h3>Optimizing TCP for high WAN throughput while preserving low latency</h3>
      <a href="#optimizing-tcp-for-high-wan-throughput-while-preserving-low-latency">
        
      </a>
    </div>
    <p>Network nerd? Yeah, me too. Here’s a very <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">in-depth look</a> at how we tune TCP parameters for low latency and high throughput.</p><p>...<i>We protect </i><a href="https://www.cloudflare.com/network-services/"><i>entire corporate networks</i></a><i>, help customers build </i><a href="https://workers.cloudflare.com/"><i>Internet-scale applications efficiently</i></a><i>, accelerate any </i><a href="https://www.cloudflare.com/performance/accelerate-internet-applications/"><i>website or Internet application</i></a><i>, ward off </i><a href="https://www.cloudflare.com/ddos/"><i>DDoS attacks</i></a><i>, keep </i><a href="https://www.cloudflare.com/application-security/"><i>hackers at bay</i></a><i>, and can help you on </i><a href="https://www.cloudflare.com/products/zero-trust/"><i>your journey to Zero Trust</i></a><i>.</i></p><p><i>Visit </i><a href="https://1.1.1.1/"><i>1.1.1.1</i></a><i> from any device to get started with our free app that makes your Internet faster and safer. To learn more about our mission to help build a better Internet, start </i><a href="https://www.cloudflare.com/learning/what-is-cloudflare/"><i>here</i></a><i>. If you’re looking for a new career direction, check out </i><a href="http://cloudflare.com/careers"><i>our open positions</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Reading List]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Hertzbleed]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">4ffQabh80U3V99Grzwc88g</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Production ready eBPF, or how we fixed the BSD socket API]]></title>
            <link>https://blog.cloudflare.com/tubular-fixing-the-socket-api-with-ebpf/</link>
            <pubDate>Thu, 17 Feb 2022 17:02:54 GMT</pubDate>
            <description><![CDATA[ We are open sourcing the production tooling we’ve built for the sk_lookup hook we contributed to the Linux kernel, called tubular ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qt80mUTCxJp6nLkenADAL/29eb329be3097752997f3ac4b9a00f25/tubular-1.png" />
            
            </figure><p>As we develop new products, we often push our operating system - Linux - beyond what is commonly possible. A common theme has been relying on <a href="https://ebpf.io/what-is-ebpf/">eBPF</a> to build technology that would otherwise have required modifying the kernel. For example, we’ve built <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">DDoS mitigation</a> and a <a href="/unimog-cloudflares-edge-load-balancer/">load balancer</a> and use it to <a href="/introducing-ebpf_exporter/">monitor our fleet of servers</a>.</p><p>This software usually consists of a small-ish eBPF program written in C, executed in the context of the kernel, and a larger user space component that loads the eBPF into the kernel and manages its lifecycle. We’ve found that the ratio of eBPF code to userspace code differs by an order of magnitude or more. We want to shed some light on the issues that a developer has to tackle when dealing with eBPF and present our solutions for building rock-solid production ready applications which contain eBPF.</p><p>For this purpose we are open sourcing the production tooling we’ve built for the <a href="https://www.kernel.org/doc/html/latest/bpf/prog_sk_lookup.html">sk_lookup hook</a> we contributed to the Linux kernel, called <b>tubular</b>. It exists because <a href="/its-crowded-in-here/">we’ve outgrown the BSD sockets API</a>. To deliver some products we need features that are just not possible using the standard API.</p><ul><li><p>Our services are available on millions of IPs.</p></li><li><p>Multiple services using the same port on different addresses have to coexist, e.g. 
<a href="https://1.1.1.1/">1.1.1.1</a> resolver and our authoritative DNS.</p></li><li><p>Our Spectrum product <a href="/how-we-built-spectrum/">needs to listen on all 2^16 ports</a>.</p></li></ul><p>The source code for tubular is at <a href="https://github.com/cloudflare/tubular">https://github.com/cloudflare/tubular</a>, and it allows you to do all the things mentioned above. Maybe the most interesting feature is that you can change the addresses of a service on the fly.</p>
    <div>
      <h2>How tubular works</h2>
      <a href="#how-tubular-works">
        
      </a>
    </div>
    <p><code>tubular</code> sits at a critical point in the Cloudflare stack, since it has to inspect every connection terminated by a server and decide which application should receive it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZFfhU9ui5dR4KpbOcofqr/c61fde7a87d189167a3b72ca90b61e20/unnamed.png" />
            
            </figure><p>Failure to do so will drop or misdirect connections hundreds of times per second. So it has to be incredibly robust during day to day operations. We had the following goals for tubular:</p><ul><li><p><b>Releases must be unattended and happen online.</b> tubular runs on thousands of machines, so we can’t babysit the process or take servers out of production.</p></li><li><p><b>Releases must fail safely.</b> A failure in the process must leave the previous version of tubular running, otherwise we may drop connections.</p></li><li><p><b>Reduce the impact of (userspace) crashes.</b> When the inevitable bug comes along we want to minimise the blast radius.</p></li></ul><p>In the past we had built a proof-of-concept control plane for sk_lookup called <a href="https://github.com/majek/inet-tool">inet-tool</a>, which proved that we could get away without a persistent service managing the eBPF. Similarly, tubular has <code>tubectl</code>: short-lived invocations make the necessary changes and persisting state is handled by the kernel in the form of <a href="https://www.kernel.org/doc/html/latest/bpf/maps.html">eBPF maps</a>. Following this design gave us crash resiliency by default, but left us with the task of mapping the user interface we wanted to the tools available in the eBPF ecosystem.</p>
    <div>
      <h2>The tubular user interface</h2>
      <a href="#the-tubular-user-interface">
        
      </a>
    </div>
    <p>tubular consists of a BPF program that attaches to the sk_lookup hook in the kernel and userspace Go code which manages the BPF program. The <code>tubectl</code> command wraps both in a way that is easy to distribute.</p><p><code>tubectl</code> manages two kinds of objects: bindings and sockets. A binding encodes a rule against which an incoming packet is matched. A socket is a reference to a TCP or UDP socket that can accept new connections or packets.</p><p>Bindings and sockets are "glued" together via arbitrary strings called labels. Conceptually, a binding assigns a label to some traffic. The label is then used to find the correct socket.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3G1ZM7iKvkGurKMBYqonW5/f6612068b8fb404a8cc0af1d02108c43/unnamed--2-.png" />
            
            </figure>
    <div>
      <h3>Adding bindings</h3>
      <a href="#adding-bindings">
        
      </a>
    </div>
    <p>To create a binding that steers port 80 (aka HTTP) traffic destined for 127.0.0.1 to the label “foo” we use <code>tubectl bind</code>:</p>
            <pre><code>$ sudo tubectl bind "foo" tcp 127.0.0.1 80</code></pre>
            <p>Due to the power of sk_lookup we can support much more powerful constructs than the BSD API allows. For example, we can redirect connections to all IPs in 127.0.0.0/24 to a single socket:</p>
            <pre><code>$ sudo tubectl bind "bar" tcp 127.0.0.0/24 80</code></pre>
            <p>A side effect of this power is that it's possible to create bindings that "overlap":</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding says that HTTP traffic to localhost should go to “foo”, while the second asserts that HTTP traffic in the localhost subnet should go to “bar”. This creates a contradiction: which binding should we choose? tubular resolves this by defining precedence rules for bindings:</p><ol><li><p>A prefix with a longer mask is more specific, e.g. 127.0.0.1/32 wins over 127.0.0.0/24.</p></li><li><p>A port is more specific than the port wildcard, e.g. port 80 wins over "all ports" (0).</p></li></ol><p>Applying this to our example, HTTP traffic to all IPs in 127.0.0.0/24 will be directed to “bar”, except for 127.0.0.1, which goes to “foo”.</p>
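<p>The precedence rules can be sketched as a small comparison function. This is an illustration of the rules, not tubular’s actual code:</p>

```c
/* A binding with a longer prefix mask is more specific; for equal
 * masks, a concrete port beats the port wildcard (0). */
struct binding {
    unsigned int prefix_len; /* bits in the IP prefix mask */
    unsigned short port;     /* 0 means "all ports" */
};

/* Returns the more specific of the two bindings. */
const struct binding *more_specific(const struct binding *a,
                                    const struct binding *b)
{
    if (a->prefix_len != b->prefix_len)
        return a->prefix_len > b->prefix_len ? a : b;
    if ((a->port != 0) != (b->port != 0))
        return a->port != 0 ? a : b;
    return a; /* equally specific */
}
```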
    <div>
      <h3>Getting ahold of sockets</h3>
      <a href="#getting-ahold-of-sockets">
        
      </a>
    </div>
    <p><code>sk_lookup</code> needs a reference to a TCP or a UDP socket to redirect traffic to it. However, a socket is usually accessible only by the process which created it with the socket syscall. For example, an HTTP server creates a TCP listening socket bound to port 80. How can we gain access to the listening socket?</p><p>A fairly well known solution is to make processes cooperate by passing socket file descriptors via <a href="/know-your-scm_rights/">SCM_RIGHTS</a> messages to a tubular daemon. That daemon can then take the necessary steps to hook up the socket with <code>sk_lookup</code>. This approach has several drawbacks:</p><ol><li><p>Requires modifying processes to send SCM_RIGHTS</p></li><li><p>Requires a tubular daemon, which may crash</p></li></ol><p>There is another way of getting at sockets by using systemd, provided <a href="https://www.freedesktop.org/software/systemd/man/systemd.socket.html">socket activation</a> is used. It works by creating an additional service unit with the correct <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#Sockets=">Sockets</a> setting. In other words: we can leverage systemd oneshot action executed on creation of a systemd socket service, registering the socket into tubular. For example:</p>
            <pre><code>[Unit]
Requisite=foo.socket

[Service]
Type=oneshot
Sockets=foo.socket
ExecStart=tubectl register "foo"</code></pre>
            <p>Since we can rely on systemd to execute <code>tubectl</code> at the correct times we don't need a daemon of any kind. However, the reality is that a lot of popular software doesn't use systemd socket activation. Dealing with systemd sockets is complicated and doesn't invite experimentation. Which brings us to the final trick: <a href="https://www.man7.org/linux/man-pages/man2/pidfd_getfd.2.html">pidfd_getfd</a>:</p><blockquote><p>The <code>pidfd_getfd()</code> system call allocates a new file descriptor in the calling process. This new file descriptor is a duplicate of an existing file descriptor, targetfd, in the process referred to by the PID file descriptor pidfd.</p></blockquote><p>We can use it to iterate all file descriptors of a foreign process, and pick the socket we are interested in. To return to our example, we can use the following command to find the TCP socket bound to 127.0.0.1 port 8080 in the httpd process and register it under the "foo" label:</p>
            <pre><code>$ sudo tubectl register-pid "foo" $(pidof httpd) tcp 127.0.0.1 8080</code></pre>
            <p>It's easy to wire this up using systemd's <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStartPre=">ExecStartPost</a> if the need arises.</p>
            <pre><code>[Service]
Type=forking # or notify
ExecStart=/path/to/some/command
ExecStartPost=tubectl register-pid $MAINPID foo tcp 127.0.0.1 8080</code></pre>
            
    <div>
      <h2>Storing state in eBPF maps</h2>
      <a href="#storing-state-in-ebpf-maps">
        
      </a>
    </div>
    <p>As mentioned previously, tubular relies on the kernel to store state, using <a href="https://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html">BPF key / value data structures also known as maps</a>. Using the <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html">BPF_OBJ_PIN syscall</a> we can persist them in /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── bindings
├── destination_metrics
├── destinations
├── sockets
└── ...</code></pre>
            <p>The way the state is structured differs from how the command line interface presents it to users. Labels like “foo” are convenient for humans, but they are of variable length. Dealing with variable length data in BPF is cumbersome and slow, so the BPF program never references labels at all. Instead, the user space code allocates numeric IDs, which are then used in the BPF. Each ID represents a (<code>label</code>, <code>domain</code>, <code>protocol</code>) tuple, internally called <code>destination</code>.</p><p>For example, adding a binding for "foo" <code>tcp 127.0.0.1</code> ... allocates an ID for ("<code>foo</code>", <code>AF_INET</code>, <code>TCP</code>). Including domain and protocol in the destination allows simpler data structures in the BPF. Each allocation also tracks how many bindings reference a destination so that we can recycle unused IDs. This data is persisted into the destinations hash table, which is keyed by (Label, Domain, Protocol) and contains (ID, Count). Metrics for each destination are tracked in destination_metrics in the form of per-CPU counters.</p>
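<p>The recycling scheme can be sketched with a simple reference-counted allocator. This is a hypothetical illustration of the idea, not tubular’s implementation:</p>

```c
#define MAX_IDS 16

/* One refcount per destination ID; 0 means the ID is free. */
static unsigned int refcount[MAX_IDS];

/* Allocate the lowest free ID, or -1 if none are left. */
int id_acquire(void)
{
    for (int id = 0; id < MAX_IDS; id++) {
        if (refcount[id] == 0) {
            refcount[id] = 1;
            return id;
        }
    }
    return -1;
}

/* Another binding references the same destination. */
void id_ref(int id) { refcount[id]++; }

/* Drop one reference; at zero the ID becomes reusable. */
void id_release(int id) { refcount[id]--; }
```

<p>Allocating the lowest free ID keeps the IDs dense, which is what makes them usable as array indices in the sockmap.</p>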
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5JI7BzZmFdOS5DO6n2cpjh/3bf1a320954e9de2e4b60e64e0a3b375/unnamed--1--5.png" />
            
            </figure><p><code>bindings</code> is a <a href="https://en.wikipedia.org/wiki/Trie">longest prefix match (LPM) trie</a> which stores a mapping from (<code>protocol</code>, <code>port</code>, <code>prefix</code>) to (<code>ID</code>, <code>prefix length</code>). The ID is used as a key to the sockets map which contains pointers to kernel socket structures. IDs are allocated in a way that makes them suitable as an array index, which allows using the simpler BPF sockmap (an array) instead of a socket hash table. The prefix length is duplicated in the value to work around shortcomings in the BPF API.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FJXwiooeaLRbRriCrETia/656cc5c0f78ca393627335cec1064755/unnamed--3-.png" />
            
            </figure>
    <div>
      <h2>Encoding the precedence of bindings</h2>
      <a href="#encoding-the-precedence-of-bindings">
        
      </a>
    </div>
    <p>As discussed, bindings have a precedence associated with them. To repeat the earlier example:</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding should be matched before the second one. We need to encode this in the BPF somehow. One idea is to generate some code that executes the bindings in order of specificity, a technique we’ve used to great effect in <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">l4drop</a>:</p>
            <pre><code>1: if (mask(ip, 32) == 127.0.0.1) return "foo"
2: if (mask(ip, 24) == 127.0.0.0) return "bar"
...</code></pre>
            <p>This has the downside that the program gets longer the more bindings are added, which slows down execution. It's also difficult to introspect and debug such long programs. Instead, we use a specialised BPF longest prefix match (LPM) map to do the hard work. This allows inspecting the contents from user space to figure out which bindings are active, which is very difficult if we had compiled bindings into BPF. The LPM map uses a trie behind the scenes, so <a href="https://en.wikipedia.org/wiki/Trie#Searching">lookup has complexity proportional to the length of the key</a> instead of linear complexity for the “naive” solution.</p><p>However, using a map requires a trick for encoding the precedence of bindings into a key that we can look up. Here is a simplified version of this encoding, which ignores IPv6 and uses labels instead of IDs. To insert the binding <code>tcp 127.0.0.0/24 80</code> into a trie we first convert the IP address into a number.</p>
            <pre><code>127.0.0.0    = 0x7f 00 00 00</code></pre>
            <p>Since we're only interested in the first 24 bits of the address, we can write the whole prefix as</p>
            <pre><code>127.0.0.0/24 = 0x7f 00 00 ??</code></pre>
            <p>where “?” means that the value is not specified. We choose the number 0x01 to represent TCP and prepend it and the port number (80 decimal is 0x50 hex) to create the full key:</p>
            <pre><code>tcp 127.0.0.0/24 80 = 0x01 50 7f 00 00 ??</code></pre>
            <p>Converting <code>tcp 127.0.0.1/32 80</code> happens in exactly the same way. Once the converted values are inserted into the trie, the LPM trie conceptually contains the following keys and values.</p>
            <pre><code>LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
        0x01 50 7f 00 00 01 = "foo"</code></pre>
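<p>A small sketch of this simplified encoding (IPv4 only and, as in the example, a single port byte; the real key uses IDs and handles IPv6):</p>

```c
#include <stdint.h>

struct lpm_key {
    uint8_t data[6]; /* proto (1) + port (1, simplified) + IPv4 (4) */
};

/* Build the key: protocol first, then port, then the address with the
 * most significant byte first, so that a byte-wise prefix comparison
 * lines up with the /n prefix of the address. */
struct lpm_key encode_key(uint8_t proto, uint8_t port, uint32_t addr)
{
    struct lpm_key k;
    k.data[0] = proto;
    k.data[1] = port;
    k.data[2] = (uint8_t)(addr >> 24);
    k.data[3] = (uint8_t)(addr >> 16);
    k.data[4] = (uint8_t)(addr >> 8);
    k.data[5] = (uint8_t)addr;
    return k;
}
```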
            <p>To find the binding for a TCP packet destined for 127.0.0.1:80, we again encode a key and perform a lookup.</p>
            <pre><code>input:  0x01 50 7f 00 00 01   TCP packet to 127.0.0.1:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  y
---------------------------
result: "foo"

y = byte matches</code></pre>
            <p>The trie returns “foo” since its key shares the longest prefix with the input. Note that we stop comparing keys once we reach unspecified “?” bytes, but conceptually “bar” is still a valid result. The distinction becomes clear when looking up the binding for a TCP packet to 127.0.0.255:80.</p>
            <pre><code>input:  0x01 50 7f 00 00 ff   TCP packet to 127.0.0.255:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  n
---------------------------
result: "bar"

n = byte doesn't match</code></pre>
            <p>In this case "foo" is discarded since the last byte doesn't match the input. However, "bar" is returned since its last byte is unspecified and therefore considered to be a valid match.</p>
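<p>The matching rule can be sketched in a few lines: an entry matches if all of its specified leading bytes equal the input, and among matching entries the one with the most specified bytes wins. A linear scan stands in for the real trie walk here:</p>

```c
#include <stdint.h>
#include <string.h>

struct lpm_entry {
    uint8_t key[6];
    int specified;      /* number of specified leading bytes */
    const char *label;
};

/* Return the label of the entry sharing the longest specified prefix
 * with the input, or NULL if nothing matches. */
const char *lpm_lookup(const struct lpm_entry *entries, int n,
                       const uint8_t input[6])
{
    const char *best = NULL;
    int best_len = -1;

    for (int i = 0; i < n; i++) {
        if (memcmp(entries[i].key, input, entries[i].specified) == 0 &&
            entries[i].specified > best_len) {
            best = entries[i].label;
            best_len = entries[i].specified;
        }
    }
    return best;
}
```

<p>The real trie stops comparing at the stored prefix length, which is exactly what the specified-byte count models here.</p>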
    <div>
      <h2>Observability with minimal privileges</h2>
      <a href="#observability-with-minimal-privileges">
        
      </a>
    </div>
    <p>Linux has the powerful ss tool (part of iproute2) available to inspect socket state:</p>
            <pre><code>$ ss -tl src 127.0.0.1
State      Recv-Q      Send-Q           Local Address:Port           Peer Address:Port
LISTEN     0           128                  127.0.0.1:ipp                 0.0.0.0:*</code></pre>
            <p>With tubular in the picture this output is not accurate anymore. <code>tubectl bindings</code> makes up for this shortcoming:</p>
            <pre><code>$ sudo tubectl bindings tcp 127.0.0.1
Bindings:
 protocol       prefix port label
      tcp 127.0.0.1/32   80   foo</code></pre>
            <p>Running this command requires super-user privileges, despite in theory being safe for any user to run. While this is acceptable for casual inspection by a human operator, it's a dealbreaker for observability via pull-based monitoring systems like Prometheus. The usual approach is to expose metrics via an HTTP server, which would have to run with elevated privileges and be accessible to the Prometheus server somehow. Instead, BPF gives us the tools to enable read-only access to tubular state with minimal privileges.</p><p>The key is to carefully set file ownership and mode for state in /sys/fs/bpf. Creating and opening files in /sys/fs/bpf uses <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html#bpf-subcommand-reference">BPF_OBJ_PIN and BPF_OBJ_GET</a>. Calling BPF_OBJ_GET with BPF_F_RDONLY is roughly equivalent to open(O_RDONLY) and allows accessing state in a read-only fashion, provided the file permissions are correct. tubular gives the owner full access but restricts read-only access to the group:</p>
            <pre><code>$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root root 0 Feb  2 13:19 bindings
-rw-r----- 1 root root 0 Feb  2 13:19 destination_metrics</code></pre>
            <p>It's easy to choose which user and group should own state when loading tubular:</p>
            <pre><code>$ sudo -u root -g tubular tubectl load
created dispatcher in /sys/fs/bpf/4026532024_dispatcher
loaded dispatcher into /proc/self/ns/net
$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root tubular 0 Feb  2 13:42 bindings
-rw-r----- 1 root tubular 0 Feb  2 13:42 destination_metrics</code></pre>
            <p>There is one more obstacle: <a href="https://github.com/systemd/systemd/blob/b049b48c4b6e60c3cbec9d2884f90fd4e7013219/src/shared/mount-setup.c#L111-L112">systemd mounts /sys/fs/bpf</a> in a way that makes it inaccessible to anyone but root. Adding the executable bit to the directory fixes this.</p>
            <pre><code>$ sudo chmod -v o+x /sys/fs/bpf
mode of '/sys/fs/bpf' changed from 0700 (rwx------) to 0701 (rwx-----x)</code></pre>
            <p>Finally, we can export metrics without privileges:</p>
            <pre><code>$ sudo -u nobody -g tubular tubectl metrics 127.0.0.1 8080
Listening on 127.0.0.1:8080
^C</code></pre>
            <p>There is a caveat, unfortunately: truly unprivileged access requires unprivileged BPF to be enabled. Many distros have taken to disabling it via the unprivileged_bpf_disabled sysctl, in which case scraping metrics does require CAP_BPF.</p>
    <div>
      <h2>Safe releases</h2>
      <a href="#safe-releases">
        
      </a>
    </div>
    <p>tubular is distributed as a single binary, but really consists of two pieces of code with widely differing lifetimes. The BPF program is loaded into the kernel once and then may be active for weeks or months, until it is explicitly replaced. In fact, a reference to the program (and link, see below) is persisted into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
└── ...</code></pre>
            <p>The user space code is executed for seconds at a time and is replaced whenever the binary on disk changes. This means that user space has to be able to deal with an "old" BPF program in the kernel somehow. The simplest way to achieve this is to compare what is loaded into the kernel with the BPF shipped as part of tubectl. If the two don't match we return an error:</p>
            <pre><code>$ sudo tubectl bind foo tcp 127.0.0.1 80
Error: bind: can't open dispatcher: loaded program #158 has differing tag: "938c70b5a8956ff2" doesn't match "e007bfbbf37171f0"</code></pre>
            <p><code>tag</code> is the truncated hash of the instructions making up a BPF program, which the kernel makes available for every loaded program:</p>
            <pre><code>$ sudo bpftool prog list id 158
158: sk_lookup  name dispatcher  tag 938c70b5a8956ff2
...</code></pre>
            <p>By comparing the tag tubular asserts that it is dealing with a supported version of the BPF program. Of course, just returning an error isn't enough. There needs to be a way to update the kernel program so that it's once again safe to make changes. This is where the persisted link in /sys/fs/bpf comes into play. <code>bpf_links</code> are used to attach programs to various BPF hooks. "Enabling" a BPF program is a two-step process: first, load the BPF program, next attach it to a hook using a bpf_link. Afterwards the program will execute the next time the hook is executed. By updating the link we can change the program on the fly, in an atomic manner.</p>
            <pre><code>$ sudo tubectl upgrade
Upgraded dispatcher to 2022.1.0-dev, program ID #159
$ sudo bpftool prog list id 159
159: sk_lookup  name dispatcher  tag e007bfbbf37171f0
…
$ sudo tubectl bind foo tcp 127.0.0.1 80
bound foo#tcp:[127.0.0.1/32]:80</code></pre>
            <p>Behind the scenes the upgrade procedure is slightly more complicated, since we have to update the pinned program reference in addition to the link. We pin the new program into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
├── program-upgrade
└── ...</code></pre>
            <p>Once the link is updated we <a href="https://www.man7.org/linux/man-pages/man2/rename.2.html">atomically rename</a> program-upgrade to replace program. In the future we may be able to <a href="https://lkml.kernel.org/netdev/20211028094724.59043-5-lmb@cloudflare.com/t/">use RENAME_EXCHANGE</a> to make upgrades even safer.</p>
    <div>
      <h2>Preventing state corruption</h2>
      <a href="#preventing-state-corruption">
        
      </a>
    </div>
    <p>So far we’ve completely neglected the fact that multiple invocations of <code>tubectl</code> could modify the state in /sys/fs/bpf at the same time. It’s very hard to reason about what would happen in this case, so in general it’s best to prevent this from ever occurring. A common solution to this is <a href="https://gavv.github.io/articles/file-locks/#differing-features">advisory file locks</a>. Unfortunately it seems like BPF maps don't support locking.</p>
            <pre><code>$ sudo flock /sys/fs/bpf/4026532024_dispatcher/bindings echo works!
flock: cannot open lock file /sys/fs/bpf/4026532024_dispatcher/bindings: Input/output error</code></pre>
            <p>This led to a bit of head scratching on our part. Luckily it is possible to flock the directory instead of individual maps:</p>
            <pre><code>$ sudo flock --exclusive /sys/fs/bpf/foo echo works!
works!</code></pre>
            <p>Each <code>tubectl</code> invocation likewise invokes <a href="https://www.man7.org/linux/man-pages//man2/flock.2.html"><code>flock()</code></a>, thereby guaranteeing that only a single process is ever making changes.</p>
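<p>The same trick is easy to reproduce in C (the path below is illustrative): take an exclusive advisory lock on the state directory and hold it for the duration of the changes.</p>

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open the pinned state directory and take an exclusive advisory lock
 * on it. A second invocation doing the same blocks here until the
 * first one closes the descriptor or exits. Returns the descriptor
 * holding the lock, or -1 on error. */
int lock_state_dir(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {
        close(fd);
        return -1;
    }
    return fd; /* close(fd) releases the lock */
}
```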
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>tubular is in production at Cloudflare today and has simplified the deployment of <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> and our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a>. It has allowed us to leave behind the limitations of the BSD socket API. However, its most powerful feature is that <a href="https://research.cloudflare.com/publications/Fayed2021/">the addresses a service is available on can be changed on the fly</a>. In fact, we have built tooling that automates this process across our global network. Need to listen on another million IPs on thousands of machines? No problem, it’s just an HTTP POST away.</p><p><i>Interested in working on tubular and our L4 load balancer</i> <a href="/unimog-cloudflares-edge-load-balancer/"><i>unimog</i></a><i>? We are</i> <a href="https://boards.greenhouse.io/cloudflare/jobs/3232234?gh_jid=3232234"><i>hiring in our European offices</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">7ofIShaWHxqlp4ZmHyNRs</guid>
            <dc:creator>Lorenz Bauer</dc:creator>
        </item>
        <item>
            <title><![CDATA[How We Used eBPF to Build Programmable Packet Filtering in Magic Firewall]]></title>
            <link>https://blog.cloudflare.com/programmable-packet-filtering-with-magic-firewall/</link>
            <pubDate>Mon, 06 Dec 2021 13:59:53 GMT</pubDate>
            <description><![CDATA[ By combining the power of eBPF and Nftables, Magic Firewall can mitigate sophisticated attacks on infrastructure by enforcing a positive security model. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare actively protects services from sophisticated attacks day after day. For users of Magic Transit, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> detects and drops attacks, while <a href="https://www.cloudflare.com/magic-firewall/">Magic Firewall</a> allows custom packet-level rules, enabling customers to deprecate hardware firewall appliances and block malicious traffic at Cloudflare’s network. The types and sophistication of attacks continue to evolve, as recent DDoS and reflection <a href="/attacks-on-voip-providers/">attacks</a> <a href="/update-on-voip-attacks/">against</a> VoIP services targeting protocols such as <a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">Session Initiation Protocol</a> (SIP) have shown. Fighting these attacks requires pushing the limits of packet filtering beyond what traditional firewalls are capable of. We did this by taking best-in-class technologies and combining them in new ways to turn Magic Firewall into a blazing fast, fully programmable firewall that can stand up to even the most sophisticated of attacks.</p>
    <div>
      <h3>Magical Walls of Fire</h3>
      <a href="#magical-walls-of-fire">
        
      </a>
    </div>
    <p><a href="/introducing-magic-firewall/">Magic Firewall</a> is a distributed stateless packet firewall built on Linux nftables. It runs on every server, in every Cloudflare data center around the world. To provide isolation and flexibility, each customer’s nftables rules are configured within their own Linux network namespace.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IjaJVGQWr5ssJPteOj4DR/2a0a310c33b86840a752386253f555b9/image1-20.png" />
            
            </figure><p>This diagram shows the life of an example packet when using <a href="/magic-transit-network-functions/">Magic Transit</a>, which has Magic Firewall built in. First, packets go into the server and DDoS protections are applied, which drops attacks as early as possible. Next, the packet is routed into a customer-specific network namespace, which applies the nftables rules to the packets. After this, packets are routed back to the origin via a GRE tunnel. Magic Firewall users can construct firewall statements from a <a href="https://developers.cloudflare.com/magic-firewall">single API</a>, using a flexible <a href="https://github.com/cloudflare/wirefilter">Wirefilter syntax</a>. In addition, rules can be configured via the Cloudflare dashboard, using friendly UI drag and drop elements.</p><p>Magic Firewall provides a very powerful syntax for matching on various packet parameters, but it is also limited to the matches provided by nftables. While this is more than sufficient for many use cases, it does not provide enough flexibility to implement the advanced packet parsing and content matching we want. We needed more power.</p>
    <div>
      <h3>Hello eBPF, meet Nftables!</h3>
      <a href="#hello-ebpf-meet-nftables">
        
      </a>
    </div>
    <p>When looking to add more power to your Linux networking needs, Extended Berkeley Packet Filter (<a href="https://ebpf.io/">eBPF</a>) is a natural choice. With eBPF, you can insert packet processing programs that execute <i>in the kernel</i>, giving you the flexibility of familiar programming paradigms with the speed of in-kernel execution. Cloudflare <a href="/tag/ebpf/">loves eBPF</a> and this technology has been transformative in enabling many of our products. Naturally, we wanted to find a way to use eBPF to extend our use of nftables in Magic Firewall. This means being able to use an eBPF program as a match within a rule, inside a table and chain. By doing this we can have our cake and eat it too, by keeping our existing infrastructure and code, and extending it further.</p><p>If nftables could leverage eBPF natively, this story would be much shorter; alas, we had to continue our quest. To get us started in our search, we know that iptables integrates with eBPF. For example, one can use iptables and a pinned eBPF program for dropping packets with the following command:</p>
            <pre><code>iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/match -j DROP</code></pre>
            <p>This clue helped to put us on the right path. Iptables uses the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/xt_bpf.c#n60">xt_bpf</a> extension to match on an eBPF program. This extension uses the BPF_PROG_TYPE_SOCKET_FILTER eBPF program type, which allows us to load the packet information from the socket buffer and return a value based on our code.</p><p>Since we know iptables can use eBPF, why not just use that? Magic Firewall currently leverages nftables, which is a great choice for our use case due to its flexibility in syntax and programmable interface. Thus, we need to find a way to use the xt_bpf extension with nftables.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EnxwNmNsjU68l44Mf0M3J/8d5f3b9cfc9f74f1d398c2955925e0c0/image2-13.png" />
            
            </figure><p>This <a href="https://developers.redhat.com/blog/2020/08/18/iptables-the-two-variants-and-their-relationship-with-nftables#using_iptables_nft">diagram</a> helps explain the relationship between iptables, nftables and the kernel. The nftables API can be used by both the iptables and nft userspace programs, and can configure both xtables matches (including xt_bpf) and normal nftables matches.</p><p>This means that given the right API calls (netlink/netfilter messages), we can embed an xt_bpf match within an nftables rule. In order to do this, we need to understand which netfilter messages we need to send. By using tools such as strace, Wireshark, and especially using the <a href="https://github.com/torvalds/linux/blob/master/net/netfilter/xt_bpf.c">source</a> we were able to construct a message that could append an eBPF rule given a table and chain.</p>
            <pre><code>NFTA_RULE_TABLE table
NFTA_RULE_CHAIN chain
NFTA_RULE_EXPRESSIONS | NFTA_MATCH_NAME
	NFTA_LIST_ELEM | NLA_F_NESTED
	NFTA_EXPR_NAME "match"
		NLA_F_NESTED | NFTA_EXPR_DATA
		NFTA_MATCH_NAME "bpf"
		NFTA_MATCH_REV 1
		NFTA_MATCH_INFO ebpf_bytes	</code></pre>
            <p>The structure of the netlink/netfilter message to add an eBPF match should look like the above example. Of course, this message needs to be properly embedded and include a conditional step, such as a verdict, when there is a match. The next step was decoding the format of <code>ebpf_bytes</code> as shown in the example below.</p>
            <pre><code> struct xt_bpf_info_v1 {
	__u16 mode;
	__u16 bpf_program_num_elem;
	__s32 fd;
	union {
		struct sock_filter bpf_program[XT_BPF_MAX_NUM_INSTR];
		char path[XT_BPF_PATH_MAX];
	};
};</code></pre>
            <p>The bytes format can be found in the kernel header definition of <a href="https://git.netfilter.org/iptables/tree/include/linux/netfilter/xt_bpf.h#n27">struct xt_bpf_info_v1</a>. The code example above shows the relevant parts of the structure.</p><p>The xt_bpf module supports both raw bytecode and a path to a pinned eBPF program. The latter mode is the technique we used to combine the eBPF program with nftables.</p><p>With this information we were able to write code that could create netlink messages and properly serialize any relevant data fields. This approach was just the first step; we are also looking into incorporating this into proper tooling instead of sending custom netfilter messages.</p>
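            <p>To make the serialization concrete, here is a sketch (in Python, not the language of our control plane) of packing the pinned-path variant of the structure. The constant values and the convention of leaving <code>fd</code> and the instruction count unused in pinned mode are assumptions mirrored from the kernel header; verify them against your own <code>xt_bpf.h</code>:</p>

```python
import struct

# Assumed values from include/uapi/linux/netfilter/xt_bpf.h; check your headers.
XT_BPF_PATH_MAX = 512          # XT_BPF_MAX_NUM_INSTR * sizeof(struct sock_filter)
XT_BPF_MODE_PATH_PINNED = 3    # enum xt_bpf_modes

def pack_xt_bpf_info_v1(pin_path: str) -> bytes:
    """Serialize struct xt_bpf_info_v1 for the pinned-path mode.

    Layout: __u16 mode, __u16 bpf_program_num_elem, __s32 fd,
    then a union whose 'path' member is a NUL-padded char array.
    """
    path = pin_path.encode()
    if len(path) >= XT_BPF_PATH_MAX:
        raise ValueError("pin path too long")
    return struct.pack(
        "=HHi%ds" % XT_BPF_PATH_MAX,  # native byte order, no implicit padding
        XT_BPF_MODE_PATH_PINNED,
        0,    # bpf_program_num_elem: unused in pinned mode (assumption)
        -1,   # fd: unused in pinned mode (assumption)
        path, # struct.pack NUL-pads to the full array length
    )

blob = pack_xt_bpf_info_v1("/sys/fs/bpf/match")
print(len(blob))  # 2 + 2 + 4 + 512 = 520
```

The blob would then be carried in the <code>NFTA_MATCH_INFO</code> attribute of the netlink message shown above.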
    <div>
      <h3>Just Add eBPF</h3>
      <a href="#just-add-ebpf">
        
      </a>
    </div>
    <p>Now we needed to construct an eBPF program and load it into an existing nftables table and chain. Starting to use eBPF can be a bit daunting. Which program type do we want to use? How do we compile and load our eBPF program? We started this process by doing some exploration and research.</p><p>First we constructed an example program to try it out.</p>
            <pre><code>SEC("socket")
int filter(struct __sk_buff *skb) {
  /* get header */
  struct iphdr iph;
  if (bpf_skb_load_bytes(skb, 0, &amp;iph, sizeof(iph))) {
    return BPF_DROP;
  }

  /* read last 5 bytes in payload of udp */
  __u16 pkt_len = bswap_16(iph.tot_len);
  char data[5];
  if (bpf_skb_load_bytes(skb, pkt_len - sizeof(data), &amp;data, sizeof(data))) {
    return BPF_DROP;
  }

  /* only packets with the magic word at the end of the payload are allowed */
  const char SECRET_TOKEN[5] = "xyzzy";
  for (int i = 0; i &lt; sizeof(SECRET_TOKEN); i++) {
    if (SECRET_TOKEN[i] != data[i]) {
      return BPF_DROP;
    }
  }

  return BPF_OK;
}</code></pre>
            <p>The excerpt above is an example of an eBPF program that only accepts packets that have a magic string at the end of the payload. This requires checking the total length of the packet to find where to start the search. For clarity, this example omits error checking and headers.</p><p>Once we had a program, the next step was integrating it into our tooling. We tried a few technologies to load the program, such as BCC and libbpf, and we even created a custom loader. Ultimately, we ended up using <a href="https://github.com/cilium/ebpf/">cilium’s ebpf library</a>, since we are using Golang for our control-plane program and the library makes it easy to generate, embed and load eBPF programs.</p>
            <pre><code># nft list ruleset
table ip mfw {
	chain input {
		#match bpf pinned /sys/fs/bpf/mfw/match drop
	}
}</code></pre>
            <p>Once the program is compiled and pinned, we can add matches into nftables using netlink commands. Listing the ruleset shows the match is present. This is incredible! We are now able to deploy custom C programs to provide advanced matching inside a Magic Firewall ruleset!</p>
    <div>
      <h3>More Magic</h3>
      <a href="#more-magic">
        
      </a>
    </div>
    <p>With the addition of eBPF to our toolkit, Magic Firewall is an even more flexible and powerful way to protect your network from bad actors. We are now able to look deeper into packets and implement more complex matching logic than nftables alone could provide. Since our firewall is running as software on all Cloudflare servers, we can quickly iterate and update features.</p><p>One outcome of this project is SIP protection, which is currently in beta. That’s only the beginning. We are currently exploring using eBPF for protocol validations, advanced field matching, looking into payloads, and supporting even larger sets of IP lists.</p><p>We welcome your help here, too! If you have other use cases and ideas, please talk to your account team. If you find this technology interesting, come <a href="https://www.cloudflare.com/careers/">join our team</a>!</p> ]]></content:encoded>
            <category><![CDATA[CIO Week]]></category>
            <category><![CDATA[Magic Firewall]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[VoIP]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">1RUM6TLPNlUYMBqyfPL81y</guid>
            <dc:creator>Chris J Arges</dc:creator>
        </item>
        <item>
            <title><![CDATA[Raking the floods: my intern project using eBPF]]></title>
            <link>https://blog.cloudflare.com/building-rakelimit/</link>
            <pubDate>Fri, 18 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ SYN cookies help mitigate SYN floods for TCP, but how can we protect services from similar attacks that use UDP? We designed an algorithm and a library to fill this gap, and it’s open source! ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2A1vrFJotzsKPMJxmJdOON/81b2292df9631836d592585149fbef25/rakelimit.jpg" />
            
            </figure><p>Cloudflare has sophisticated DDoS attack mitigation systems with multiple layers to provide defense in depth. Some of these layers analyse large-scale traffic patterns to detect and mitigate attacks. Other layers are more protocol- and application-specific, in order to stop attacks that might be hard to detect from overall traffic patterns. In some cases, the best place to detect and stop an attack is in the service itself.</p><p>During <a href="/cloudflare-doubling-size-of-2020-summer-intern-class/">my internship at Cloudflare</a> this summer, I’ve developed a new open-source framework to help UDP services protect themselves from attacks. This framework incorporates Cloudflare’s experience in running UDP-based services like Spectrum and the 1.1.1.1 resolver.</p>
    <div>
      <h3>Goals of the framework</h3>
      <a href="#goals-of-the-framework">
        
      </a>
    </div>
    <p>First of all, let's discuss what it actually means to protect a UDP service. We want to ensure that an attacker cannot drown out legitimate traffic. To achieve this we identify floods and limit them while leaving legitimate traffic untouched.</p><p>The idea to mitigate such attacks is straightforward: first identify a group of packets that is related to an attack, and then apply a rate limit on this group. Such groups are determined based on the attributes available to us in the packet, such as addresses and ports.</p><p>We then drop packets in the group. We only want to drop as much traffic as necessary to comply with our set rate limit. Completely ignoring a set of packets just because it is slightly above the rate limit is not an option, as it may contain legitimate traffic.</p><p>This ensures both that our service stays responsive and that legitimate packets experience as little impact as possible.</p><p>While rate limiting is a somewhat straightforward procedure, determining groups is a bit harder, for a number of reasons.</p>
    <div>
      <h3>Finding needles in the haystack</h3>
      <a href="#finding-needles-in-the-haystack">
        
      </a>
    </div>
    <p>The problem in determining groups in packets is that we have barely any context. We consider four attributes useful as attack signatures: the source address and port, as well as the destination address and port. While that already is not a lot, it gets worse: the source address and port may not even be accurate. Packets can be spoofed, in which case an attacker hides their own address. That means only keeping a rate per source address may not provide much value, as it could simply be spoofed.</p><p>But there is another problem: keeping one rate per address does not scale. When bringing IPv6 into the equation and its <a href="https://www.ripe.net/about-us/press-centre/understanding-ip-addressing#:~:text=For%20IPv4%2C%20this%20pool%20is,basic%20unit%20for%20storing%20information.">whopping address space</a> it becomes clear it’s not going to work.</p><p>To solve these issues we turned to the academic world and found what we were looking for, the problem of <i>Heavy Hitters.</i> <i>Heavy Hitters</i> are elements of a data stream that appear frequently, and can be expressed relative to the overall elements of the stream. We can define, for example, that an element is considered to be a <i>Heavy Hitter</i> if its frequency exceeds, say, 10% of the overall count. A naive approach would be to maintain a counter per element, but due to space limitations this will not scale. Instead, probabilistic algorithms such as a <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf">CountMin sketch</a> or the <a href="https://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf">SpaceSaving algorithm</a> can be used. These provide an estimated count instead of a precise one, but are capable of doing this with constant memory requirements, and in our case we will just save rates into the CountMin sketch instead of counts. So no matter how many unique elements we have to track, the memory consumption is the same.</p><p>We now have a way of finding the needle in the haystack, and it does have constant memory requirements, solving our problem. However, reality isn’t that simple. What if an attack is not just originating from a single port but many? Or what if a reflection attack is hitting our service, resulting in random source addresses but a single source port? Maybe a full /24 subnet is sending us a flood? We cannot just keep a rate per combination we see, as it would ignore all these patterns.</p>
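            <p>A CountMin sketch itself is only a few lines. The sketch below is illustrative Python, not the rakelimit implementation, and stores plain counts for simplicity; rakelimit stores rates in the same structure:</p>

```python
import hashlib

class CountMin:
    """Count-Min sketch: d rows of w counters, one hash function per row.

    Estimates never undercount; hash collisions can only inflate them,
    and taking the minimum over rows limits that inflation.
    """
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # A different salt per row gives us d independent-ish hashes.
        h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "little"))
        return int.from_bytes(h.digest()[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over rows is the least-inflated counter.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.depth))

cm = CountMin()
for _ in range(1000):
    cm.add("192.0.2.123:443")
cm.add("198.51.100.7:53", 5)

print(cm.estimate("192.0.2.123:443"))  # at least 1000
```

The memory used is width × depth counters, no matter how many distinct elements flow through, which is exactly the property we needed.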
    <div>
      <h3>Grouping the groups: How to organize packets</h3>
      <a href="#grouping-the-groups-how-to-organize-packets">
        
      </a>
    </div>
    <p>Luckily the academic world has us covered again, with the concept of <i>Hierarchical Heavy Hitters.</i> It extends the <i>Heavy Hitter</i> concept by using the underlying hierarchy in the elements of the stream. For example, an IP address can be naturally grouped into several subnets:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IPwA0tN0c6pOFMK5B2AT8/9101b971c3a69277a79b03a473e936ec/0B429C7C-869C-4517-9B95-C90600943486.png" />
            
            </figure><p>In this case we defined that we consider the fully specified address, the /24 subnet and the /0 wildcard. We start at the left with the fully specified address, and with each step towards the top we consider less information from it. We call these less-specific addresses generalisations, and measure how specific a generalisation is by assigning a level. In our example, the address 192.0.2.123 is at level 0, while 192.0.2.0/24 is at level 1, etc.</p><p>If we want to create a structure which can hold this information for every packet, it could look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5lqAfAxLDrXwIgtQJ1Hm0/b37c644442d0f0e91ef38725929cb5ef/F4B2AA4A-4868-4E56-891A-6C19E0CACBA4.png" />
            
            </figure><p>We maintain a CountMin-sketch per subnet and then apply Heavy Hitters. When a new packet arrives and we need to determine if it is allowed to pass we simply check the rates of the corresponding elements in every node. If no rate exceeds the rate limit that we set, e.g. 25 packets per second (<i>pps</i>), it is allowed to pass.</p><p>The structure could now keep track of a single attribute, but we would waste a lot of context around packets! So instead of letting it go to waste, we use the two-dimensional approach for addresses proposed in the paper <a href="https://arxiv.org/abs/1102.5540">Hierarchical Heavy Hitters with SpaceSaving algorithm</a>, and extend it further to also incorporate ports into our structure. Ports do not have a natural hierarchy such as addresses, so they can only be in two states: either <i>specified</i> (e.g. 8080) or <i>wildcard</i>.</p><p>Now our structure looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/EowltLQdppI0JmX9TtKFU/f20d98ae5469b518897e4d9c621f5ab9/5D94D469-B2F3-48CD-918C-202B8C426B19.png" />
            
            </figure><p>Now let’s talk about the algorithm we use to traverse the structure and determine if a packet should be allowed to pass. The paper <i>Hierarchical Heavy Hitters with SpaceSaving algorithm</i> provides two methods that can be used on the data structure: one that updates elements and increases their counters, and one that provides all elements that currently are <i>Heavy Hitters</i>. The second method is actually not necessary for our use case, as we are only interested in whether the element, or packet, we are looking at right now would be a <i>Heavy Hitter</i>, to decide if it can pass or not.</p><p>Secondly, our goal is to prevent any Heavy Hitters from passing, thus leaving the structure with no <i>Heavy Hitters</i> whatsoever. This is a great property, as it allows us to simplify the algorithm substantially, and it looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gXxGnRtWTU6jorXm6538i/4c9c62fc80edffcf2ac2a4b2e2918292/A41D5CB1-5C6B-4378-B496-7B38E22DE0F7.png" />
            
            </figure><p>As you may notice, we update every node of a level and maintain the maximum rate we see. After each level we calculate a probability that determines if a packet should be passed to the next level, based on the maximum rate we saw on that level and a set rate limit. Each node essentially filters the traffic for the following, less specific level.</p><p>I actually left out a small detail: a packet is not dropped if any rate exceeds the limit, but instead is kept with the probability <i>rate limit</i>/<i>maximum rate seen</i>. The reason is that if we dropped all packets whenever the rates exceed the limit, we would drop all of the traffic, not just a subset that brings it into compliance with our set rate limit.</p><p>Since we still update more specific nodes even if a node reaches a rate limit, the rate limit will converge towards the underlying pattern of the attack as much as possible. That means other traffic is impacted as little as possible, and with no manual intervention whatsoever!</p>
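            <p>The per-level logic can be sketched as follows. This is illustrative Python, with fixed per-level rates standing in for the estimates the real implementation reads from its sketches as packets arrive:</p>

```python
import random

RATE_LIMIT = 25.0  # packets per second

def pass_probability(max_rate_on_level: float) -> float:
    """Keep a packet with probability limit/rate once the level's
    maximum estimated rate exceeds the limit; below the limit,
    everything passes."""
    if max_rate_on_level <= RATE_LIMIT:
        return 1.0
    return RATE_LIMIT / max_rate_on_level

def should_pass(level_max_rates, rng=random.random):
    """Walk from most- to least-specific level; each level filters
    traffic for the next with an independent keep-probability."""
    for max_rate in level_max_rates:
        if rng() >= pass_probability(max_rate):
            return False
    return True

# A flood estimated at 250 pps on its most specific level is kept
# with probability 25/250 = 0.1 at that level alone.
print(pass_probability(250.0))  # 0.1
```

Dropping probabilistically rather than outright is what shaves the group down to the limit instead of silencing it entirely.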
    <div>
      <h3>BPF to the rescue: building a Go library</h3>
      <a href="#bpf-to-the-rescue-building-a-go-library">
        
      </a>
    </div>
    <p>As we want to use this algorithm to mitigate floods, we need to spend as little computation and overhead as possible before we decide if a packet should be dropped or not. As is so often the case, we looked into the BPF toolbox and found what we needed: <i>Socketfilters</i>. As our colleague Marek put it: <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">“It seems, no matter the question - BPF is the answer.”</a>.</p><p><i>Socketfilters</i> are pieces of code that can be attached to a single socket and get executed before a packet is passed from the kernel to userspace. This is ideal for a number of reasons. First, when the kernel runs the socket filter code, it gives it all the information from the packet we need, and other mitigations such as firewalls have already been executed. Second, the code is executed <i>per socket</i>, so every application can activate it as needed, and also set appropriate rate limits. It may even use different rate limits for different sockets. The third reason is privileges: we do not need to be root to attach the code to a socket. We can execute code in the kernel as a normal user!</p><p>BPF also has a number of limitations which have already been covered on this blog in the past, so we will focus on one that’s specific to our project: floating-point numbers.</p><p>To calculate rates we need floating-point numbers to provide an accurate estimate. BPF, and the whole kernel for that matter, does not support these. Instead we implemented a fixed-point representation, which uses a part of the available bits for the fractional part of a rational number and the remaining bits for the integer part. This allows us to represent floats within a certain range, but there is a catch when doing arithmetic: while subtraction and addition of two fixed-points work well, multiplication and division require double the number of bits to ensure there will not be any loss in precision. 
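The trick can be illustrated in Python. The real implementation is eBPF C inside rakelimit; the 32.32 bit split and the helper names below are assumptions for illustration only:

```python
FRAC_BITS = 32                  # assumed 32.32 fixed point in a 64-bit word
ONE = 1 << FRAC_BITS
MASK64 = (1 << 64) - 1

def to_fixed(x: float) -> int:
    return int(x * ONE) & MASK64

def to_float(f: int) -> float:
    return f / ONE

def mul_lossy(a: int, b: int) -> int:
    """Multiply two fixed-point values without a 128-bit intermediate:
    convert one operand to an integer first. The fractional part of b
    is lost, which is acceptable when the values involved are large."""
    return (a * (b >> FRAC_BITS)) & MASK64

rate = to_fixed(1234.5)
factor = to_fixed(2.0)
print(to_float(mul_lossy(rate, factor)))  # 2469.0
```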
As we use 64 bits for our fixed-point values, there is no larger data type available to ensure this does not happen. Instead of calculating the result with exact precision, we convert one of the arguments into an integer. That results in the loss of the fractional part, but as we deal with large rates that does not pose any issue, and helps us to work around the bit limitation as intermediate results fit into the available 64 bits. Whenever fixed-point arithmetic is necessary the precision of intermediate results has to be carefully considered.</p><p>There are many more details to the implementation, but instead of covering every single detail in this blog post let's just look at the code.</p><p>We open sourced rakelimit over on GitHub at <a href="https://github.com/cloudflare/rakelimit">cloudflare/rakelimit</a>! It is a full-blown Go library that can be enabled on any UDP socket, and is easy to configure.</p><p>The development is still in its early stages and this is a first prototype, but we are excited to continue and push the development with the community! And if you still can’t get enough, look at our talk from this year's <a href="https://linuxplumbersconf.org/event/7/contributions/677/">Linux Plumbers Conference</a>.</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">5v73ZaARMTKhq3UGeDWoMn</guid>
            <dc:creator>Jonas Otten</dc:creator>
        </item>
        <item>
            <title><![CDATA[It's crowded in here!]]></title>
            <link>https://blog.cloudflare.com/its-crowded-in-here/</link>
            <pubDate>Sat, 12 Oct 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ We recently gave a presentation on Programming socket lookup with BPF at the Linux Plumbers Conference 2019 in Lisbon, Portugal. ]]></description>
            <content:encoded><![CDATA[ <p>We recently gave a presentation on <a href="https://linuxplumbersconf.org/event/4/contributions/487/">Programming socket lookup with BPF</a> at the Linux Plumbers Conference 2019 in Lisbon, Portugal. This blog post is a recap of the problem statement and proposed solution we presented.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ODIBQvumtXQYscbSqjLFn/24257f7186eae5e62fcaf20a166490ea/birds_cable_wire.jpg" />
          </figure><p>CC0 Public Domain, <a href="https://pxhere.com/en/photo/1526517">PxHere</a></p><p>Our edge servers are crowded. We run more than a dozen public-facing services, leaving aside all the internal ones that do the work behind the scenes.</p><p>Quick Quiz #1: How many can you name? We blogged about them! <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1">Jump to answer</a>.</p><p>These services are exposed on more than a million Anycast <a href="https://www.cloudflare.com/ips/">public IPv4 addresses</a> partitioned into 100+ network prefixes.</p><p>To keep things uniform, every Cloudflare edge server runs all services and responds to every Anycast address. This allows us to make efficient use of the hardware by load-balancing traffic between all machines. We have shared the details of Cloudflare <a href="https://blog.cloudflare.com/no-scrubs-architecture-unmetered-mitigation/">edge</a><a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"> architecture</a> on the blog before.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MHwv0fusFTiEylSaWQeg3/dbbaa25e7093b0be4fa8ec6dbcd88bca/edge_data_center-1.png" />
          </figure><p>Granted, not all services work on all the addresses, but rather on a subset of them, covering one or several network prefixes.</p><p>So how do you set up your network services to listen on hundreds of IP addresses without driving the network stack over the edge? Cloudflare engineers have had to ask themselves this question more than once over the years, and the answer has changed as our edge evolved. This evolution forced us to look for creative ways to work with the <a href="https://en.wikipedia.org/wiki/Berkeley_sockets">Berkeley sockets API</a>, a POSIX standard for assigning a network address and a port number to your application. It has been quite a journey, and we are not done yet.</p>
    <div>
      <h2>When life is simple - one address, one socket</h2>
      <a href="#when-life-is-simple-one-address-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aVlT1BbslJwQRzC0HnF0p/eb5b0651ba36da1010e1bd72b7c57a55/mapping_1_to_1-1.png" />
          </figure><p>The simplest kind of association between an (IP address, port number) and a service that we can imagine is one-to-one. A server responds to client requests on a single address, on a well-known port. To set it up the application has to open one socket for each transport protocol (be it TCP or UDP) it wants to support. A network server like our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a> would open up two sockets (one for TCP, one for UDP):</p>
            <pre><code>(192.0.2.1, 53/tcp) ⇨ ("auth-dns", pid=1001, fd=3)
(192.0.2.1, 53/udp) ⇨ ("auth-dns", pid=1001, fd=4)</code></pre>
            <p>To take it to Cloudflare scale, the service is likely to have to receive on at least a /20 network prefix, which is a range of IPs with 4096 addresses in it.</p><p>This translates to opening 4096 sockets for each transport protocol, something that is not likely to go unnoticed when looking at <a href="http://man7.org/linux/man-pages/man8/ss.8.html">ss tool</a> output:</p>
            <pre><code>$ sudo ss -ulpn 'sport = 53'
State  Recv-Q Send-Q  Local Address:Port Peer Address:Port
…
UNCONN 0      0           192.0.2.40:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11076))
UNCONN 0      0           192.0.2.39:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11075))
UNCONN 0      0           192.0.2.38:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11074))
UNCONN 0      0           192.0.2.37:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11073))
UNCONN 0      0           192.0.2.36:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11072))
UNCONN 0      0           192.0.2.31:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11071))
…</code></pre>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2v9WUgAYqxc0JyRM6llZCa/5c29ebd8a06ab9ded6dc0c483431b88d/lots_of_socks.jpg" />
          </figure><p>CC BY 2.0, Luca Nebuloni, <a href="https://flickr.com/photos/7897906@N06/20655224708">Flickr</a></p><p>The approach, while naive, has an advantage: when an IP from the range gets attacked with a UDP flood, the receive queues of sockets bound to the remaining IP addresses are not affected.</p>
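<p>A minimal Python sketch of this naive layout. A hypothetical loopback /30 stands in for the /20, so the numbers are small and the example runs without any special address setup (the whole 127.0.0.0/8 block is local on Linux):</p>

```python
import ipaddress
import socket

# One UDP socket per (address, port) pair. A tiny loopback /30 stands in
# for the /20 from the text, so the sketch runs without extra setup.
PREFIX = ipaddress.ip_network('127.1.0.0/30')  # stand-in for a /20
PORT = 5353

sockets = {}
for addr in PREFIX:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((str(addr), PORT))
    sockets[str(addr)] = s

print(len(sockets))  # 4 sockets for a /30; a /20 would need 4096
```

<p>A flood aimed at one of these addresses fills only that socket’s receive queue, which is exactly the upside described above.</p>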
    <div>
      <h2>Life can be easier - all addresses, one socket</h2>
      <a href="#life-can-be-easier-all-addresses-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nfFyQt9C1FLIKO6wnmV4Q/90405ed4e16846e092ecd7e0d88d5110/mapping_inaddr_any-1.png" />
          </figure><p>It seems rather silly to create so many sockets for one service to receive traffic on a range of addresses. Not only that, the more listening sockets there are, the longer the chains in the socket lookup hash table. We have learned the hard way that going in this direction <a href="https://blog.cloudflare.com/revenge-listening-sockets/">can hurt packet processing latency</a>.</p><p>The sockets API comes with a big hammer that can make our life easier - the <code>INADDR_ANY</code> aka <code>0.0.0.0</code> wildcard address. With <code>INADDR_ANY</code> we can make a single socket receive on all addresses assigned to our host, specifying just the port.</p>
            <pre><code>s = socket(AF_INET, SOCK_STREAM, 0)
s.bind(('0.0.0.0', 12345))
s.listen(16)</code></pre>
            <p>Quick Quiz #2: Is there another way to bind a socket to all local addresses? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2">Jump to answer</a>.</p><p>In other words, compared to the naive “one address, one socket” approach, <code>INADDR_ANY</code> gives us a single catch-all listening socket for the whole IP range on which we accept incoming connections.</p><p>On Linux this works thanks to a two-phase listening socket lookup: the kernel falls back to searching for an <code>INADDR_ANY</code> socket if no more specific match is found.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1B0Td9aV8dFNmmOZ7Z5Ell/51b479c1ed0abae1d71038ab030dd98b/tcp_socket_lookup-1.png" />
          </figure><p>Another upside of binding to <code>0.0.0.0</code> is that our application doesn’t need to be aware of what addresses we have assigned to our host. We are also free to assign or remove the addresses after binding the listening socket. No need to reconfigure the service when its listening IP range changes.</p><p>On the other hand, if our service should listen on just the <code>A.B.C.0/20</code> prefix, binding to all local addresses is more than we need. We might unintentionally expose an otherwise internal-only service to external traffic without a proper firewall or a socket filter in place.</p><p>Then there is the security angle. Since we now have only one socket, attacks attempting to flood any of the IPs assigned to our host on our service’s port will hit the catch-all socket and its receive queue. While in such circumstances the Linux <a href="https://blog.cloudflare.com/syn-packet-handling-in-the-wild/">TCP stack has your back</a>, UDP needs special care or legitimate traffic might drown in the flood of dropped packets.</p><p>Possibly the biggest downside, though, is that a service listening on the wildcard <code>INADDR_ANY</code> address claims the port number exclusively for itself. Binding over the wildcard-listening socket with a specific IP and port fails miserably because the address is already taken (<code>EADDRINUSE</code>).</p>
            <pre><code>bind(3, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
</code></pre>
            <p>Unless your service is UDP-only, setting the <code>SO_REUSEADDR</code> socket option will not help you overcome this restriction. The only way out is to turn to <code>SO_REUSEPORT</code>, normally used to construct a load-balancing socket group, and that only works if you are lucky enough to run the port-conflicting services as the same user (UID). But that is a story for another post.</p><p>Quick Quiz #3: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3">Jump to answer</a>.</p>
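<p>Both behaviors are easy to reproduce in a short Python sketch (ports picked by the kernel; nothing here is Cloudflare-specific): a plain bind over a wildcard-bound socket fails with <code>EADDRINUSE</code>, while <code>SO_REUSEPORT</code> set on both sockets before <code>bind</code>, under the same UID, lets the specific bind through:</p>

```python
import errno
import socket

# Plain bind over a wildcard-bound socket: EADDRINUSE.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(('0.0.0.0', 0))              # kernel picks a free port
port = s1.getsockname()[1]

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(('127.0.0.1', port))
    conflict = None
except OSError as e:
    conflict = e.errno
assert conflict == errno.EADDRINUSE

# With SO_REUSEPORT set on both sockets before bind, it succeeds.
s3 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
for s in (s3, s4):
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s3.bind(('0.0.0.0', 0))
s4.bind(('127.0.0.1', s3.getsockname()[1]))  # no EADDRINUSE this time
```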
    <div>
      <h2>Life gets real - one port, two services</h2>
      <a href="#life-gets-real-one-port-two-services">
        
      </a>
    </div>
    <p>As it happens, at the Cloudflare edge we do host services that share the same port number but otherwise respond to requests on non-overlapping IP ranges. A prominent example of such port-sharing is our <a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">1.1.1.1</a> recursive DNS resolver running side-by-side with the <a href="https://www.cloudflare.com/dns/">authoritative DNS service</a> that we offer to all customers.</p><p>Sadly, the <a href="http://man7.org/linux/man-pages/man2/bind.2.html">sockets API</a> doesn’t allow us to express a setup in which two services share a port and accept requests on disjoint IP ranges.</p><p>However, as Linux development history shows, any networking API limitation can be overcome by introducing a new <a href="https://github.com/torvalds/linux/blame/master/include/uapi/asm-generic/socket.h">socket option</a>, with sixty-something options available (and counting!).</p><p>Enter <code>SO_BINDTOPREFIX</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VW8uJH7XkJKqUt6HZXWiU/6995812a081001a31238f45c0455960b/mapping_bindtoprefix-1.png" />
          </figure><p>Back in 2016 we proposed <a href="https://lore.kernel.org/netdev/1458699966-3752-1-git-send-email-gilberto.bertin@gmail.com/">an extension to the Linux network stack</a>. It allowed services to constrain a wildcard-bound socket to an IP range belonging to a network prefix.</p>
            <pre><code># Assumes a kernel with our out-of-tree SO_BINDTOPREFIX patch;
# the IP_BINDTOPREFIX constant comes from the patched headers.
import struct
from socket import socket, AF_INET, SOCK_STREAM, SOL_IP, inet_aton

# Service 1, 127.0.0.0/20, 1234/tcp
net1, plen1 = '127.0.0.0', 20
bindprefix1 = struct.pack('BBBBBxxx', *inet_aton(net1), plen1)

s1 = socket(AF_INET, SOCK_STREAM, 0)
s1.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix1)
s1.bind(('0.0.0.0', 1234))
s1.listen(1)

# Service 2, 127.0.16.0/20, 1234/tcp
net2, plen2 = '127.0.16.0', 20
bindprefix2 = struct.pack('BBBBBxxx', *inet_aton(net2), plen2)

s2 = socket(AF_INET, SOCK_STREAM, 0)
s2.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix2)
s2.bind(('0.0.0.0', 1234))
s2.listen(1)
</code></pre>
            <p>This mechanism has served us well since then. Unfortunately, it didn’t get accepted upstream due to being too specific to our use-case. Having no better alternative we ended up maintaining patches in our kernel to this day.</p>
    <div>
      <h2>Life gets complicated - all ports, one service</h2>
      <a href="#life-gets-complicated-all-ports-one-service">
        
      </a>
    </div>
    <p>Just when we thought we had things figured out, we were faced with a new challenge: how to build a service that accepts connections on any of the 65,535 ports? The ultimate reverse proxy, if you will, code-named <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>.</p><p>The <code>bind</code> syscall offers very little flexibility when it comes to mapping a socket to a port number. You can either specify the number you want or let the network stack pick an unused one for you. There is no counterpart of <code>INADDR_ANY</code>, a wildcard value to select all ports (<code>INPORT_ANY</code>?).</p><p>To achieve what we wanted, we had to turn to <a href="https://blog.cloudflare.com/how-we-built-spectrum/">TPROXY</a>, a <a href="https://www.kernel.org/doc/Documentation/networking/tproxy.txt">Netfilter / <code>iptables</code> extension</a> designed for intercepting remote-destined traffic on the forward path. We, however, use it to steer local-destined packets, that is, ones targeted at our host, to a catch-all-ports socket.</p>
            <pre><code>iptables -t mangle -I PREROUTING \
         -d 192.0.2.0/24 -p tcp \
         -j TPROXY --on-ip=127.0.0.1 --on-port=1234</code></pre>
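<p>The socket receiving the redirected connections is an otherwise ordinary socket with the <code>IP_TRANSPARENT</code> option set. A minimal sketch of creating one in Python (the option requires <code>CAP_NET_ADMIN</code>, hence the fallback branch; the constant’s value comes from <code>linux/in.h</code>, since the Python <code>socket</code> module does not export it):</p>

```python
import socket

IP_TRANSPARENT = 19  # from linux/in.h; not exported by Python's socket module

def make_transparent(sock):
    """Mark a socket for use with TPROXY-redirected traffic."""
    try:
        sock.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)
        return True              # we had CAP_NET_ADMIN
    except PermissionError:
        return False             # unprivileged: the option requires it

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(make_transparent(s))
```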
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n0QTqrm2m4ZTiqPtiiF9a/fbbd5a464721df6ae8017805e2d540dd/mapping_tproxy-1.png" />
          </figure><p>A TPROXY-based setup comes at a price. For starters, your service needs elevated privileges to create a special catch-all socket (see the <a href="http://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_TRANSPARENT</code> socket option</a>). You also have to understand and consider the subtle interactions between TPROXY and the receive path for your traffic profile, for example:</p><ul><li><p>does connection tracking register the flows redirected with TPROXY?</p></li><li><p>is listening socket contention during a SYN flood a concern when using TPROXY?</p></li><li><p>do other parts of the network stack, like XDP programs, need to know about TPROXY redirecting packets?</p></li></ul><p>These are some of the questions we needed to answer, and after running it in production for a while, we have a good idea of what the consequences of using TPROXY are. That said, it would not come as a shock if tomorrow we discovered something new about it. Due to its complexity, we have always considered using TPROXY to steer local-destined traffic a <a href="https://blog.cloudflare.com/how-we-built-spectrum/">hack</a>, a use-case outside its intended application. No matter how well understood, a hack remains a hack.</p>
    <div>
      <h2>Can BPF make life easier?</h2>
      <a href="#can-bpf-make-life-easier">
        
      </a>
    </div>
    <p>Despite its complex nature, TPROXY shows us something important: no matter what IP or port the listening socket is bound to, with a bit of support from the network stack we can steer any connection to it. As long as the application is ready to handle this situation, things work.</p><p>Quick Quiz #4: Are there really no problems with accepting any connection on any socket? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4">Jump to answer</a>.</p><p>This is a really powerful concept. With a bunch of TPROXY rules, we can configure any mapping between (address, port) tuples and listening sockets.</p><p><b>💡 Idea #1:</b> A local-destined connection can be accepted by any listening socket.</p><p>We didn’t tell you the whole story before. When we published the <code>SO_BINDTOPREFIX</code> patches, they did not just get rejected. <a href="https://meta.wikimedia.org/wiki/Cunningham%27s_Law">As sometimes happens</a>, by posting the wrong answer we got <a href="https://lore.kernel.org/netdev/1459261895.6473.176.camel@edumazet-glaptop3.roam.corp.google.com/">the right answer</a> to our problem:</p><blockquote><p>❝BPF is absolutely the way to go here, as it allows for whatever user specified tweaks, like a list of destination subnetwork, or/and a list of source network, or the date/time of the day, or port knocking without netfilter, or … you name it.❞</p></blockquote><p><b>💡 Idea #2:</b> How we pick a listening socket can be tweaked with BPF.</p><p>Combine the two ideas together, and we arrive at an exciting concept: let’s run BPF code to match an incoming packet with a listening socket, ignoring the address the socket is bound to.</p><p>Here’s an example to illustrate it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zhLPX1wssfv3c4QvObgz9/66812e76990de528517d020dbca70638/idea_program_socket_lookup_with_bpf-2.png" />
          </figure><p>All packets coming in on the <code>192.0.2.0/24</code> prefix, port <code>53</code>, are steered to socket <code>sk:2</code>, while traffic targeted at <code>203.0.113.1</code>, on any port number, lands in socket <code>sk:4</code>.</p>
    <div>
      <h2>Welcome BPF inet_lookup</h2>
      <a href="#welcome-bpf-inet_lookup">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Pg5Lou5IFLuZcFHbaY46O/590e5267f496e9ce2e1eb6d1460bbd45/bpf_inet_lookup_hook-1.png" />
          </figure><p>To make this concept a reality, we are proposing a new mechanism to program the socket lookup with BPF. What is socket lookup? It’s the stage on the receive path where the transport layer searches for a socket to dispatch the packet to: the last possible moment to steer a packet before it lands in the selected socket’s receive queue. It is there that we attach a new type of BPF program called <code>inet_lookup</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yATE46s8d1no0gYCw6Asw/6cd9e449e7d6b0571a8c69ce9e4d9f33/tcp_socket_lookup_with_bpf-1.png" />
          </figure><p>If you recall, socket lookup in the Linux TCP stack is a <a href="https://elixir.bootlin.com/linux/v5.4-rc2/source/include/net/inet_hashtables.h#L329">two-phase process</a>. First the kernel tries to find an established (connected) socket matching the packet 4-tuple. If there isn’t one, it continues by looking for a listening socket, using just the packet 2-tuple as the key.</p><p>Our proposed extension allows users to program the second phase, the listening socket lookup. If present, a BPF program may choose a listening socket and terminate the lookup. The program is also free to ignore the packet, in which case the kernel continues to look for a listening socket as usual.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bjoj0UUcthGLLGOSTbida/3e139697c9d897cd11aabbdf80946032/bpf_inet_lookup_operation-1.png" />
          </figure><p>How does this new type of BPF program operate? On input, as context, it gets handed a subset of information extracted from packet headers, including the packet 4-tuple. Based on the input the program accesses a BPF map containing references to listening sockets, and selects one to yield as the socket lookup result.</p><p>If we take a look at the corresponding BPF code, the program structure resembles a firewall rule. We have some match statements followed by an action.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lhm5L9NH0AN8tnMh2cMpl/14656d47d0e81440617893422ac2140b/bpf_inet_lookup_code_sample-2.png" />
          </figure><p>You may notice that we don’t access the BPF map with sockets directly. Instead, we follow an established pattern in BPF called “map based redirection”, where a dedicated BPF helper accesses the map and carries out any steps necessary to redirect the packet.</p><p>We’ve skipped over one thing. Where does the BPF map of sockets come from? We create it ourselves and populate it with sockets. This is most easily done if your service uses systemd <a href="http://0pointer.de/blog/projects/socket-activation.html">socket activation</a>. systemd will let you associate more than one service unit with a socket unit, and both of the services will receive a file descriptor for the same socket. From there it’s just a matter of inserting the socket into the BPF map.</p>
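<p>Collecting the inherited sockets on the service side is straightforward. Below is a hedged sketch of the fd-passing side of the socket activation protocol, roughly what libsystemd’s <code>sd_listen_fds</code> does (no Cloudflare tooling assumed); from here, each service would insert the socket into the BPF sockets map:</p>

```python
import os
import socket

# systemd socket activation: activated fds start at 3 (SD_LISTEN_FDS_START)
# and are described by the LISTEN_FDS / LISTEN_PID environment variables.
SD_LISTEN_FDS_START = 3

def activated_sockets():
    """Return the sockets systemd passed to this process, if any."""
    if os.environ.get('LISTEN_PID') != str(os.getpid()):
        return []                # not socket-activated (or fds not for us)
    nfds = int(os.environ.get('LISTEN_FDS', '0'))
    return [socket.socket(fileno=fd)
            for fd in range(SD_LISTEN_FDS_START,
                            SD_LISTEN_FDS_START + nfds)]

print(len(activated_sockets()))  # 0 when started outside systemd
```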
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ufZk0QnR4cOf061PvseTW/a7bdf0a77d655be88adc527cadc4a5bf/bpf_inet_lookup_socket_activation-1.png" />
          </figure>
    <div>
      <h2>Demo time!</h2>
      <a href="#demo-time">
        
      </a>
    </div>
    <p>This is not just a concept. We have already published a first working <a href="https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/">set of patches for the kernel</a> together with ancillary <a href="https://github.com/majek/inet-tool">user-space tooling</a> to configure the socket lookup to your needs.</p><p>If you would like to see it in action, you are in luck. We’ve put together a demo that shows just how easily you can bind a network service to (i) a single port, (ii) all ports, or (iii) a network prefix. On-the-fly, without having to restart the service! There is a port scan running to prove it.</p><p>You can also bind to all-addresses-all-ports (<code>0.0.0.0/0</code>) because why not? Take that <code>INADDR_ANY</code>. All thanks to BPF superpowers.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>We have gone over how the way we bind services to network addresses on the Cloudflare edge has evolved over time. Each approach has its pros and cons, summarized below. We are currently working on a new BPF-based mechanism for binding services to addresses, intended to address the shortcomings of the existing solutions.</p><p><b>bind to one address and port</b></p><ul><li><p>➕ flood traffic on one address hits one socket, doesn’t affect the rest</p></li><li><p>➖ as many sockets as listening addresses, doesn’t scale</p></li></ul><p><b>bind to all addresses with <code>INADDR_ANY</code></b></p><ul><li><p>➕ just one socket for all addresses, the kernel thanks you</p></li><li><p>➕ application doesn’t need to know about listening addresses</p></li><li><p>➖ flood scenario requires custom protection, at least for UDP</p></li><li><p>➖ port sharing is tricky or impossible</p></li></ul><p><b>bind to a network prefix with <code>SO_BINDTOPREFIX</code></b></p><ul><li><p>➕ two services can share a port if their IP ranges are non-overlapping</p></li><li><p>➖ custom kernel API extension that never went upstream</p></li></ul><p><b>bind to all ports with TPROXY</b></p><ul><li><p>➕ enables redirecting all ports to a listening socket and more</p></li><li><p>➖ meant for intercepting forwarded traffic early on the ingress path</p></li><li><p>➖ has subtle interactions with the network stack</p></li><li><p>➖ requires privileges from the application</p></li></ul><p><b>bind to anything you want with BPF <code>inet_lookup</code></b></p><ul><li><p>➕ allows for the same flexibility as TPROXY or <code>SO_BINDTOPREFIX</code></p></li><li><p>➕ services don’t need extra capabilities, meant for local traffic only</p></li><li><p>➖ needs cooperation from services or PID 1 to build a socket map</p></li></ul><hr /><p>Getting to this point has been a team effort. A special thank you to Lorenz Bauer and <a href="https://blog.cloudflare.com/author/marek-majkowski/">Marek Majkowski</a>, who contributed in an essential way to the BPF <code>inet_lookup</code> implementation. The <code>SO_BINDTOPREFIX</code> patches were authored by <a href="https://blog.cloudflare.com/author/gilberto-bertin/">Gilberto Bertin</a>. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p>
    <div>
      <h2>Quiz Answers</h2>
      <a href="#quiz-answers">
        
      </a>
    </div>
    
    <div>
      <h3>Quiz 1</h3>
      <a href="#quiz-1">
        
      </a>
    </div>
    <p>Q: How many Cloudflare services can you name?</p><ol><li><p><a href="https://www.cloudflare.com/cdn/">HTTP CDN</a> (tcp/80)</p></li><li><p><a href="https://www.cloudflare.com/ssl/">HTTPS CDN</a> (tcp/443, <a href="https://cloudflare-quic.com/">udp/443</a>)</p></li><li><p><a href="https://www.cloudflare.com/dns/">authoritative DNS</a> (udp/53)</p></li><li><p><a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">recursive DNS</a> (udp/53, 853)</p></li><li><p><a href="https://blog.cloudflare.com/secure-time/">NTP with NTS</a> (udp/1234)</p></li><li><p><a href="https://blog.cloudflare.com/roughtime/">Roughtime time service</a> (udp/2002)</p></li><li><p><a href="https://blog.cloudflare.com/distributed-web-gateway/">IPFS Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/cloudflare-ethereum-gateway/">Ethereum Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/spectrum/">Spectrum proxy</a> (tcp/any, udp/any)</p></li><li><p><a href="https://blog.cloudflare.com/announcing-warp-plus/">WARP</a> (udp)</p></li></ol><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1-question">Go back</a></p>
    <div>
      <h3>Quiz 2</h3>
      <a href="#quiz-2">
        
      </a>
    </div>
    <p>Q: Is there another way to bind a socket to all local addresses?</p><p>Yes, there is: by not <code>bind()</code>’ing it at all. Calling <code>listen()</code> on an unbound socket is equivalent to binding it to <code>INADDR_ANY</code> and letting the kernel pick a free port.</p>
            <pre><code>$ strace -e socket,bind,listen nc -l
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
listen(3, 1)                            = 0
^Z
[1]+  Stopped                 strace -e socket,bind,listen nc -l
$ ss -4tlnp
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
LISTEN     0      1            *:42669</code></pre>
            <p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2-question">Go back</a></p>
    <div>
      <h3>Quiz 3</h3>
      <a href="#quiz-3">
        
      </a>
    </div>
    <p>Q: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict?</p><p>Yes. If two processes race to <code>bind</code> and <code>listen</code> on the same TCP port on an overlapping IP, setting <code>SO_REUSEADDR</code> changes which syscall reports the error (<code>EADDRINUSE</code>). Without <code>SO_REUSEADDR</code> it is always the second <code>bind</code>. With <code>SO_REUSEADDR</code> set, there is a window of opportunity for the second <code>bind</code> to succeed but the subsequent <code>listen</code> to fail.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3-question">Go back</a></p>
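<p>The window is easy to reproduce deterministically in Python: with <code>SO_REUSEADDR</code> on both sockets, the second <code>bind</code> succeeds as long as the first socket is not yet listening, and the conflict only surfaces at <code>listen</code> time:</p>

```python
import errno
import socket

a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
for s in (a, b):
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

a.bind(('127.0.0.1', 0))
port = a.getsockname()[1]
b.bind(('127.0.0.1', port))  # succeeds: no listener on the port yet

a.listen(1)                  # first to listen wins
try:
    b.listen(1)
    failed = None
except OSError as e:
    failed = e.errno         # EADDRINUSE surfaces here, not at bind
```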
    <div>
      <h3><b>Quiz 4</b></h3>
      <a href="#quiz-4">
        
      </a>
    </div>
    <p>Q: Are there really no problems with accepting any connection on any socket?</p><p>If the connection is destined for an address assigned to our host, i.e. a local address, there are no problems. For remote-destined connections, however, sending return traffic from a non-local address (one not present on any interface) will not get past the Linux network stack. The <code>IP_TRANSPARENT</code> socket option lifts this restriction by bypassing a protection mechanism known as the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ef4a5a78e63420dd1dd770f1bd1dc198926b04">source address check</a>.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4-question">Go back</a></p>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">2tVUhaeVSAohZJbbDgK6vy</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare architecture and how BPF eats the world]]></title>
            <link>https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/</link>
            <pubDate>Sat, 18 May 2019 15:00:00 GMT</pubDate>
            <description><![CDATA[ Recently at Netdev 0x13 I gave a short talk titled "Linux at Cloudflare". The talk ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk. ]]></description>
            <content:encoded><![CDATA[ <p>Recently at <a href="https://www.netdevconf.org/0x13/schedule.html">Netdev 0x13</a>, the Conference on Linux Networking in Prague, I gave <a href="https://netdevconf.org/0x13/session.html?panel-industry-perspectives">a short talk titled "Linux at Cloudflare"</a>. The <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">talk</a> ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.</p><p>Here is a transcript of a slightly adjusted version of that talk.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jvISZgDDA9AnXaBAJzYys/13b124e1f305c9ab4594db66f9123789/01_edge-network-locations-100.jpg" />
            
            </figure><p>At Cloudflare we run Linux on our servers. We operate two categories of data centers: large "Core" data centers, processing logs, analyzing attacks, computing analytics, and the "Edge" server fleet, delivering customer content from 180 locations across the world.</p><p>In this talk, we will focus on the "Edge" servers. It's here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L3Wbz94VfdflLpsTuIp9D/8b11fa614a4cc1134e4d4df466464425/image-9.png" />
            
            </figure><p>Our edge service is special due to our network configuration - we extensively use anycast routing. Anycast means that the same set of IP addresses is announced by all our data centers.</p><p>This design has great advantages. First, it guarantees optimal speed for end users: no matter where you are located, you always reach the closest data center. Second, anycast helps us spread out DoS traffic: during attacks, each location receives only a small fraction of the total traffic, which makes it easier to ingest and filter out unwanted traffic.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xiYwbjJRCpt76iDITKowY/9800bb2637bb3f601ccbc7c6eedb6b36/03_edge-network-uniform-software-100-1.jpg" />
            
            </figure><p>Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers - our software stack is uniform across the edge servers. All software pieces are running on all the servers.</p><p>In principle, every machine can handle every task - and we run many diverse and demanding tasks. We have a full HTTP stack, the magical <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, two sets of DNS servers - authoritative and resolver, and many other publicly facing applications like <a href="https://www.cloudflare.com/application-services/products/cloudflare-spectrum/">Spectrum</a> and <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">Warp</a>.</p><p>Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/268deIgT4XLrgXfhRXxfx1/a5ced85738359d15ee16f9238ac759d8/image-23.png" />
            
            </figure><p>Let me walk you through the early stages of inbound packet processing:</p><p>(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.</p><p>(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DoS mitigations</a>, dropping packets belonging to very large layer 3 attacks.</p><p>(3) Then, still in XDP, we perform layer 4 <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancing</a>. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.</p><p>(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.</p><p>(5) Finally packets are received by an application. For example HTTP connections are handled by a "protocol" server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.</p><p>It's in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:</p><ul><li><p>DoS handling</p></li><li><p>Load balancing</p></li><li><p>Socket dispatch</p></li></ul><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IlhL5q8CplwN5H1BY0ZZo/7f73d7d4c662113fac18404bcfac96c4/image-25.png" />
            
            </figure><p>Let's discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux's XDP stack where, among other things, we run DoS mitigations.</p><p>Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:</p><ul><li><p><a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">L4Drop: XDP DDoS Mitigations</a></p></li><li><p><a href="/xdpcap/">xdpcap: XDP Packet Capture</a></p></li><li><p><a href="https://netdevconf.org/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP based DoS mitigation</a> talk by Arthur Fabre</p></li><li><p><a href="https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf">XDP in practice: integrating XDP into our DDoS mitigation pipeline</a> (PDF)</p></li></ul><p>During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that <a href="http://vger.kernel.org/lpc-bpf2018.html#session-9">Facebook engineer Julia Kartseva</a> had the same issues. In February this problem was addressed with the introduction of the <code>bpf_spin_lock</code> helper.</p><hr />
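<p>The token bucket logic itself is simple; the difficulty was updating its state atomically from concurrent eBPF program invocations. A plain-Python sketch of the algorithm, illustrative only and not our XDP code:</p>

```python
class TokenBucket:
    """Allow up to `rate` packets/sec with bursts of `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, then try to spend a token.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(rate=1, capacity=2)
print([tb.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

<p>In eBPF, the read-modify-write on <code>tokens</code> and <code>last</code> can interleave between CPUs; <code>bpf_spin_lock</code> makes that critical section atomic.</p>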
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LAXf4Ayxh2sRR3hIsbBja/f8be1df5b19b10954b6a4ad092a177f7/image-26.png" />
            
            </figure><p>While our modern volumetric DoS defenses are done in the XDP layer, we still rely on <code>iptables</code> for application-layer (layer 7) mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits, and ipsets. We also use the <code>xt_bpf</code> iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:</p><ul><li><p><a href="https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible">Lessons from defending the indefensible</a> (PPT)</p></li><li><p><a href="/introducing-the-bpf-tools/">Introducing the BPF tools</a></p></li></ul><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MhIUEVsoocz1OSd8FKeYZ/a9e4321a070cd2d49800e978eb50b2d7/image-34.png" />
            
            </figure><p>After XDP and iptables, we have one final kernel-side DoS defense layer.</p><p>Consider a situation where our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application's UDP socket. This might overflow the socket, causing packet loss. That is problematic: both good and bad packets are dropped indiscriminately. For applications like DNS it's catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn't affect the traffic to other server IP addresses.</p><p>Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it, using the <code>SO_ATTACH_BPF</code> socket option. We talked about running eBPF on network sockets in past blog posts:</p><ul><li><p><a href="/epbf_sockets_hop_distance/">eBPF, Sockets, Hop Distance and manually writing eBPF assembly</a></p></li><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>The eBPF filter rate-limits the packets, keeping its state - packet counts - in an eBPF map. We can be sure that a single flooded IP won't affect other traffic. This works well, though during this project we found a rather worrying bug in the eBPF verifier:</p><ul><li><p><a href="/ebpf-cant-count/">eBPF can't count?!</a></p></li></ul><p>I guess running eBPF on a UDP socket is not a common thing to do.</p>
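<p><code>SO_ATTACH_BPF</code> expects a file descriptor of an eBPF program already loaded via the <code>bpf()</code> syscall. Its classic-BPF cousin, <code>SO_ATTACH_FILTER</code>, needs no loader and demonstrates the same idea: the filter runs before the packet reaches the socket's receive queue. A sketch with a one-instruction "drop everything" filter (constants from the Linux uapi headers; assumes Linux):</p>

```python
import ctypes
import socket

SO_ATTACH_FILTER = 26  # from <asm-generic/socket.h>

class SockFilter(ctypes.Structure):      # struct sock_filter
    _fields_ = [('code', ctypes.c_uint16), ('jt', ctypes.c_uint8),
                ('jf', ctypes.c_uint8), ('k', ctypes.c_uint32)]

class SockFprog(ctypes.Structure):       # struct sock_fprog
    _fields_ = [('len', ctypes.c_uint16),
                ('filter', ctypes.POINTER(SockFilter))]

# One instruction: BPF_RET|BPF_K with k=0 -> truncate to 0 bytes, i.e. drop.
insns = (SockFilter * 1)(SockFilter(0x06, 0, 0, 0))
prog = SockFprog(1, insns)

r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
r.bind(('127.0.0.1', 0))
r.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, bytes(prog))
r.settimeout(0.2)

w = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
w.sendto(b'hello', r.getsockname())
try:
    r.recv(16)
    dropped = False
except socket.timeout:
    dropped = True   # the filter discarded the datagram
print(dropped)
```

<p>A real rate limiter would, of course, return the full packet length below a per-IP threshold instead of dropping unconditionally, with the counters living in an eBPF map.</p>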
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79bzCglYGH8Vb2hH6R9j39/39aa41a6e11b5cb0dc7f640cd7d7aa72/image-27.png" />
            
</figure><p>Apart from DoS, in XDP we also run a layer 4 load balancer. This is a new project, and we haven't talked much about it yet. Without getting into too many details: in certain situations we need to perform a socket lookup from XDP.</p><p>The problem is relatively simple - our code needs to look up the "socket" kernel structure for the 5-tuple extracted from a packet. This is generally easy - there is a <code>bpf_sk_lookup</code> helper available for it. Unsurprisingly, there were some complications. One was the inability to verify whether a received ACK packet was a valid part of a three-way handshake when SYN cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j7ypFSgorN0lWBNY6Nc2m/b0ca29a08e3a1623a5f8c06c213f8d97/image-28.png" />
            
</figure><p>After the DoS and load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do socket dispatch - for example, packets going to port 53 are passed to a socket belonging to our DNS server.</p><p>We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on your servers.</p><p>Convincing Linux to route packets correctly is relatively easy with <a href="/how-we-built-spectrum">the "AnyIP" trick</a>. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn't support this out of the box. You can call <code>bind()</code> either on a specific IP address or on all IPs (with 0.0.0.0).</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ERTm7jhXbKKfdo3CVmV81/2c8b5986539832c73971269a6bcf0964/image-29.png" />
            
</figure><p>In order to fix this, we developed a custom kernel patch which adds <a href="http://patchwork.ozlabs.org/patch/602916/">a <code>SO_BINDTOPREFIX</code> socket option</a>. As the name suggests, it allows us to call <code>bind()</code> on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.</p><p>Then we ran into another problem. For our Spectrum product we need to listen on all 65535 ports. Running that many listen sockets is not a good idea (see <a href="/revenge-listening-sockets/">our old war story blog</a>), so we had to find another way. After some experiments we learned to utilize an obscure iptables module - TPROXY - for this purpose. Read about it here:</p><ul><li><p><a href="/how-we-built-spectrum/">Abusing Linux's firewall: the hack that allowed us to build Spectrum</a></p></li></ul><p>This setup works, but we don't like the extra firewall rules. We are working on solving the problem correctly - by extending the kernel's socket dispatch logic itself, and yes, you guessed it, with eBPF. Expect some patches from us.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eEog6Kt512U7uVmD5DVt8/d8216cc24e674f5f86b1d385bf93dfc5/image-32.png" />
            
</figure><p>eBPF can also improve applications themselves. Recently we got excited about doing TCP splicing with SOCKMAP:</p><ul><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>This technique has great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.</p><p>Similarly, the new <a href="https://netdevconf.org/2.2/papers/brakmo-tcpbpf-talk.pdf">TCP-BPF aka BPF_SOCK_OPS</a> hooks provide a great way of inspecting the performance parameters of TCP flows. This functionality is super useful for our performance team.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MC8LLNDHLZ4wDYLcbbCHQ/3491b09f9b5eaf796457eb06781f8b98/12_prometheus-ebpf_exporter-100.jpg" />
            
</figure><p>Some Linux features didn't age well and we need to work around them. For example, we are hitting the limitations of Linux networking metrics. Don't get me wrong - the networking metrics are awesome, but sadly they are not granular enough. Counters like <code>TcpExtListenDrops</code> and <code>TcpExtListenOverflows</code> are reported globally, while we need to know them on a per-application basis.</p><p>Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called "ebpf_exporter" to facilitate this. Read on:</p><ul><li><p><a href="/introducing-ebpf_exporter/">Introducing ebpf_exporter</a></p></li><li><p><a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p></li></ul><p>With "ebpf_exporter" we can generate all manner of detailed metrics. It is very powerful and has saved us on many occasions.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1obSUWhzzUQxg8cmDqQ0Lr/15e6419a78802c4d19fc4d21bf161738/image-33.png" />
            
</figure><p>In this talk we discussed six layers of BPF running on our edge servers:</p><ul><li><p>Volumetric DoS mitigations, running in XDP (eBPF)</p></li><li><p>iptables <code>xt_bpf</code> cBPF for application-layer attacks</p></li><li><p><code>SO_ATTACH_BPF</code> for rate limits on UDP sockets</p></li><li><p>A load balancer, running in XDP</p></li><li><p>Application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements</p></li><li><p>"ebpf_exporter" for granular metrics</p></li></ul><p>And we're just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running in the <a href="https://linux.die.net/man/8/tc">Linux TC (Traffic Control)</a> layer, and more integration with cgroup eBPF hooks. On top of that, our SRE team maintains an ever-growing list of <a href="https://github.com/iovisor/bcc">BCC scripts</a> useful for debugging.</p><p>It feels like Linux has stopped developing new APIs, and all the new features are implemented as eBPF hooks and helpers. This is fine, and it has strong advantages: it's easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things, like TCP-BPF exposing high-volume performance tracing data, would probably be impossible without eBPF.</p><p>Some say "software is eating the world". I would say: "BPF is eating the software".</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5EXsVZKcFTNLXVbrgHx73a</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[eBPF can't count?!]]></title>
            <link>https://blog.cloudflare.com/ebpf-cant-count/</link>
            <pubDate>Fri, 03 May 2019 13:00:00 GMT</pubDate>
<description><![CDATA[ It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great man pages, docs, guides, and some of our blogs out there. But we can tell you a war story, and who doesn't like those? ]]></description>
<content:encoded><![CDATA[ <p>Grant mechanical calculating machine, public domain <a href="https://en.wikipedia.org/wiki/File:Grant_mechanical_calculating_machine_1877.jpg">image</a></p><p>It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great <a href="http://man7.org/linux/man-pages/man2/bpf.2.html">man pages</a>, <a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">docs</a>, <a href="https://cilium.readthedocs.io/en/latest/bpf/">guides</a>, and some of our <a href="https://blog.cloudflare.com/epbf_sockets_hop_distance/">blogs</a> out there.</p><p>But we can tell you a war story, and who doesn't like those? This one is about how eBPF lost its ability to count for a while<a href="#f1"><sup>1</sup></a>.</p><p>They say in our Austin, Texas office that all good stories start with "y'all ain't gonna believe this… tale." This one, though, starts with a <a href="https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/">post</a> to the Linux netdev mailing list from <a href="https://twitter.com/majek04">Marek Majkowski</a> after what I heard was a long night:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2z8UE9I87Kpw8ONzxBH3WW/96500c2ee678248af9c22936bb49cdee/ebpf_bug_email_netdev.png" />
            
</figure><p>Marek's findings were quite shocking - if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. As root, all works fine. Huh.</p><p>If you've seen Marek's <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">presentation</a> from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic.</p><p>Let's try to reproduce this bug with a simplified <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L63">eBPF socket filter</a> that subtracts two 64-bit unsigned integers passed to it from <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/run_bpf.go#L93">user-space</a> through a BPF map. The input for our BPF program comes from a <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L44">BPF array map</a>, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations.</p><p>Starting small, eBPF, what is 2 - 1? View the code <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/run_bpf.go#L93">on our GitHub</a>.</p>
            <pre><code>$ ./run-bpf 2 1
arg0                    2 0x0000000000000002
arg1                    1 0x0000000000000001
diff                    1 0x0000000000000001</code></pre>
            <p>OK, eBPF, what is 2^32 - 1?</p>
            <pre><code>$ ./run-bpf $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff</code></pre>
            <p>Wrong! But if we ask nicely with sudo:</p>
            <pre><code>$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff</code></pre>
            
    <div>
      <h3>Who is messing with my eBPF?</h3>
      <a href="#who-is-messing-with-my-ebpf">
        
      </a>
    </div>
    <p>When computers stop subtracting, you know something big is up. We called for reinforcements.</p><p>Our colleague Arthur Fabre <a href="https://lore.kernel.org/netdev/20190301113901.29448-1-afabre@cloudflare.com/">quickly noticed</a> that something was off when examining the eBPF code loaded into the kernel. It turns out the kernel doesn't always run the eBPF it's supplied - it sometimes rewrites it first.</p><p>Any sane programmer would expect 64-bit subtraction to be expressed as <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.s#L47">a single eBPF instruction</a></p>
            <pre><code>$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
      20:       1f 76 00 00 00 00 00 00         r6 -= r7
…</code></pre>
            <p>However, that's not what the kernel actually runs. Apparently, after the rewrite the subtraction becomes a complex, multi-step operation.</p><p>To see what the kernel is actually running we can use the little-known <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool?h=v5.0">bpftool utility</a>. First, we need to load our BPF program</p>
            <pre><code>$ ./run-bpf --stop-after-load 2 1
[2]+  Stopped                 ./run-bpf 2 1</code></pre>
            <p>Then list all BPF programs loaded into the kernel with <code>bpftool prog list</code></p>
            <pre><code>$ sudo bpftool prog list
…
5951: socket_filter  name filter_alu64  tag 11186be60c0d0c0f  gpl
        loaded_at 2019-04-05T13:01:24+0200  uid 1000
        xlated 424B  jited 262B  memlock 4096B  map_ids 28786</code></pre>
            <p>The most recently loaded <code>socket_filter</code> must be our program (<code>filter_alu64</code>). Now we know its id is 5951 and we can list its bytecode with</p>
            <pre><code>$ sudo bpftool prog dump xlated id 5951
…
  33: (79) r7 = *(u64 *)(r0 +0)
  34: (b4) (u32) r11 = (u32) -1
  35: (1f) r11 -= r6
  36: (4f) r11 |= r6
  37: (87) r11 = -r11
  38: (c7) r11 s&gt;&gt;= 63
  39: (5f) r6 &amp;= r11
  40: (1f) r6 -= r7
  41: (7b) *(u64 *)(r10 -16) = r6
…</code></pre>
            <p>bpftool can also display the JITed code with: <code>bpftool prog dump jited id 5951</code>.</p><p>As you can see, the subtraction is replaced with a series of opcodes. That is, unless you are root. When running as root all is good</p>
            <pre><code>$ sudo ./run-bpf --stop-after-load 0 0
[1]+  Stopped                 sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter  name filter_alu64  tag 9e7ffb08218476f3  gpl
$ sudo bpftool prog dump xlated id 659
…
  31: (79) r7 = *(u64 *)(r0 +0)
  32: (1f) r6 -= r7
  33: (7b) *(u64 *)(r10 -16) = r6
…</code></pre>
            <p>If you've spent any time using eBPF, you must have experienced first-hand the dreaded eBPF verifier. It's a merciless judge of all eBPF code that will reject any programs it deems not worthy of running in kernel-space.</p><p>What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also <a href="https://elixir.bootlin.com/linux/v4.20.13/source/kernel/bpf/verifier.c#L6421">rewrite and patch up your eBPF code</a> as needed to make it safe.</p><p>The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to the 4.20.6 stable and 4.19.19 LTS kernels. <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=979d63d50c0c0f7bc537bf821e056cc9fe5abd38">The over-2,000-word commit message</a> doesn't spare you any details on the attack vector it targets.</p><p>The mitigation stems from the <a href="https://nvd.nist.gov/vuln/detail/CVE-2019-7308">CVE-2019-7308</a> vulnerability <a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=1711">discovered by Jann Horn at Project Zero</a>, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a <a href="https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html">Spectre variant 1 attack</a>.</p><p>To mitigate it, the eBPF verifier rewrites any arithmetic operations on pointer values in such a way that the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten, and we can spot a familiar pattern there</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8irnzqpgIARBl3UdtS30S/5e885701ef2fe3a9426b1e960e48f0d1/bpf_commit.png" />
            
</figure><p>Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in?</p><p>It shouldn't. It's a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case that state was ignored.</p><p>Why is running BPF as root fine, you ask? If your program has <code>CAP_SYS_ADMIN</code> privileges, side-channel mitigations <a href="https://elixir.bootlin.com/linux/v5.0/source/kernel/bpf/verifier.c#L7218">don't</a> <a href="https://elixir.bootlin.com/linux/v5.0/source/kernel/bpf/verifier.c#L3109">apply</a>. As root you already have access to the kernel address space, so nothing new can leak through BPF.</p><p>After our report, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3612af783cf52c74a031a2f11b82247b2599d3cd">the fix</a> quickly landed in the v5.0 kernel and got backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard, and in the meantime we were left with code running in production that was not doing what it was supposed to.</p>
    <div>
      <h3>32-bit ALU to the rescue</h3>
      <a href="#32-bit-alu-to-the-rescue">
        
      </a>
    </div>
    <p>As one of the eBPF maintainers <a href="https://lore.kernel.org/netdev/af0643e0-08a1-6326-2a80-71892de1bf56@iogearbox.net/">has pointed out</a>, 32-bit arithmetic operations are not affected by the verifier bug. This opens a door for a work-around.</p><p>eBPF registers, <code>r0</code>..<code>r10</code>, are 64 bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters <code>w0</code>..<code>w10</code>. You can operate on the 32-bit subregisters using the BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset. Of course, you need to ask it nicely with the trivial <code>-Xclang -target-feature -Xclang +alu32</code> toggle:</p>
            <pre><code>$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
        return x - y;
}
$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
       0:       bc 10 00 00 00 00 00 00         w0 = w1
       1:       1c 20 00 00 00 00 00 00         w0 -= w2
       2:       95 00 00 00 00 00 00 00         exit</code></pre>
            <p>The <code>0x1c</code> <a href="https://elixir.bootlin.com/linux/v5.0/source/include/uapi/linux/bpf_common.h#L11">opcode</a> of the instruction #1, which can be broken down as <code>BPF_ALU | BPF_X | BPF_SUB</code> (read more in the <a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">kernel docs</a>), is the 32-bit subtraction between registers we are looking for, as opposed to regular 64-bit subtract operation <code>0x1f = BPF_ALU64 | BPF_X | BPF_SUB</code>, which will get rewritten.</p><p>Armed with this knowledge we can borrow a page from <a href="https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic">bignum arithmetic</a> and subtract 64-bit numbers using just 32-bit ops:</p>
            <pre><code>u64 sub64(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x &gt;&gt; 32;
        yh = y &gt;&gt; 32;
        hi = xh - yh - (lo &gt; xl); /* underflow? */

        return ((u64)hi &lt;&lt; 32) | (u64)lo;
}</code></pre>
            <p>This code compiles as expected on normal architectures, like x86-64 or ARM64, but BPF Clang target plays by its own rules:</p>
            <pre><code>$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…  
      13:       1f 40 00 00 00 00 00 00         r0 -= r4
      14:       1f 30 00 00 00 00 00 00         r0 -= r3
      15:       1f 21 00 00 00 00 00 00         r1 -= r2
      16:       67 00 00 00 20 00 00 00         r0 &lt;&lt;= 32
      17:       67 01 00 00 20 00 00 00         r1 &lt;&lt;= 32
      18:       77 01 00 00 20 00 00 00         r1 &gt;&gt;= 32
      19:       4f 10 00 00 00 00 00 00         r0 |= r1
      20:       95 00 00 00 00 00 00 00         exit</code></pre>
            <p>Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren't able to get rid of the problematic <code>0x1f</code> opcode. Annoying, back to square one.</p>
    <div>
      <h3>Surely a bit of IR will do?</h3>
      <a href="#surely-a-bit-of-ir-will-do">
        
      </a>
    </div>
    <p>The problem was in the Clang frontend - compiling C to IR. We know the BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler's output just a little, we can achieve what we want. This means getting our hands dirty with the LLVM Intermediate Representation (IR).</p><p>If you haven't heard of LLVM IR before, now is a good time to do some <a href="http://www.aosabook.org/en/llvm.html">reading</a><a href="#f2"><sup>2</sup></a>. In short, LLVM IR is what Clang produces and the LLVM BPF backend consumes.</p><p>Time to write IR by hand! Here's a hand-tweaked <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/sub64_ir.ll#L7">IR variant</a> of our <code>sub64()</code> function:</p>
            <pre><code>define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32      ; xl = (u32) x;
  %4 = trunc i64 %1 to i32      ; yl = (u32) y;
  %5 = sub i32 %3, %4           ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32          ; tmp1 = x &gt;&gt; 32;
  %8 = lshr i64 %1, 32          ; tmp2 = y &gt;&gt; 32;
  %9 = trunc i64 %7 to i32      ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32     ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10         ; hi = xh - yh
  %12 = icmp ult i32 %3, %5     ; tmp3 = xl &lt; lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13        ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32         ; tmp2 = hi &lt;&lt; 32
  %17 = or i64 %16, %6          ; res = tmp2 | (u64)lo
  ret i64 %17
}</code></pre>
            <p>It may not be pretty, but it does produce the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/sub64_ir.s#L5">desired BPF code</a> when compiled<a href="#f3"><sup>3</sup></a>. You will likely find the <a href="https://llvm.org/docs/LangRef.html">LLVM IR reference</a> helpful when deciphering it.</p><p>And voila! Our first working solution that produces correct results:</p>
            <pre><code>$ ./run-bpf -filter ir $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff</code></pre>
            <p>Actually using this hand-written IR function from C is tricky. See <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/build.ninja#L27">our code on GitHub</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6zQLSvXLH19lXCjYnstl2a/f8fb73710fc8e5eda96476006974eea5/LED_DISP-1.JPG.jpeg" />
            
            </figure><p>public domain <a href="https://en.wikipedia.org/wiki/File:LED_DISP.JPG">image</a> by <a href="https://commons.wikimedia.org/wiki/User:Sergei_Frolov">Sergei Frolov</a></p>
    <div>
      <h3>The final trick</h3>
      <a href="#the-final-trick">
        
      </a>
    </div>
    <p>Hand-written IR does the job. The downside is that linking IR modules to your C modules is hard. Fortunately there is a better way. You can persuade Clang to stick to 32-bit ALU ops in generated IR.</p><p>We've already seen the problem. To recap, if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32-bits. Putting C, IR, and eBPF side-by-side helps visualize this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33bbylbcFR7zuF7NwxHcTs/fa9f110a71ab6e5f5cf8f7d70aadd587/sub32_v1.png" />
            
</figure><p>The trick to get around it is to declare the 32-bit variable that holds the result as <code>volatile</code>. You might already know the <a href="https://en.wikipedia.org/wiki/Volatile_(computer_programming)"><code>volatile</code> keyword</a> if you've written Unix signal handlers. It basically tells the compiler that the value of the variable may change under its feet, so it must not reorder loads (reads) from it; and since stores (writes) to it might have side-effects, reordering or eliminating them - by skipping the write to memory - is not allowed either.</p><p>Using <code>volatile</code> makes Clang emit <a href="https://llvm.org/docs/LangRef.html#volatile-memory-accesses">special loads and/or stores</a> at the IR level, which on the eBPF level translates to writing/reading the value from memory (the stack) on every access. While this sounds unrelated to the problem at hand, there is a surprising side-effect:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fEaCu15LmmeauGv7Vh1Wb/354c4a8aa60da7a7e5797fe0f951508d/sub32_v2.png" />
            
</figure><p>With volatile access, the compiler doesn't promote the subtraction to 64 bits! Don't ask me why, although I would love to hear an explanation. For now, consider this a hack. One that does not come for free - there is the overhead of going through the stack on each read/write.</p><p>However, if we play our cards right we just might reduce it a little. We don't actually need the volatile load or store to happen, we just want the side effect. So instead of declaring the value as volatile, which implies that both reads and writes are volatile, let's try to make only the writes volatile with the help of a macro:</p>
            <pre><code>/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &amp;(rhs) = (lhs))</code></pre>
            <p>If this macro looks strangely familiar, it's because it does the same thing as <a href="https://elixir.bootlin.com/linux/v5.1-rc5/source/include/linux/compiler.h#L214"><code>WRITE_ONCE()</code> macro</a> in the Linux kernel. Applying it to our example:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qQ1CrUvUBsJJE09xVqzA1/bc84a1c3fdd5a855c505c56c2273e664/sub32_v3.png" />
            
            </figure><p>That's another <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L143">hacky but working solution</a>. Pick your poison.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/28GXw3wm8YlUgg9nvnBwEx/2939bf7b36758a66685385e16a48cc5f/poison_bottles-2.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://commons.wikimedia.org/wiki/File:D-BW-Kressbronn_aB_-_Kl%C3%A4ranlage_067.jpg">image</a> by ANKAWÜ</p><p>So there you have it - from C, to IR, and back to C to hack around a bug in eBPF verifier and be able to subtract 64-bit integers again. Usually you won't have to dive into LLVM IR or assembly to make use of eBPF. But it does help to know a little about it when things don't work as expected.</p><p>Did I mention that 64-bit addition is also broken? Have fun fixing it!</p><hr /><p><sup>1</sup> Okay, it was more like 3 months time until the bug was discovered and fixed.</p><p><sup>2</sup> Some even think that it is <a href="https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/">better than assembly</a>.</p><p><sup>3</sup> How do we know? The litmus test is to look for statements matching <code>r[0-9] [-+]= r[0-9]</code> in BPF assembly.</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2fXVvP0CJO4Bt4KuJthpbx</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing ebpf_exporter]]></title>
            <link>https://blog.cloudflare.com/introducing-ebpf_exporter/</link>
            <pubDate>Fri, 24 Aug 2018 15:11:53 GMT</pubDate>
            <description><![CDATA[ Here at Cloudflare we use Prometheus to collect operational metrics. We run it on hundreds of servers and ingest millions of metrics per second to get insight into our network and provide the best possible service to our customers. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is an adapted transcript of a talk I gave at Promcon 2018. You can find slides with additional information on our Prometheus deployment and presenter notes </i><a href="https://docs.google.com/presentation/d/1420m05QANTxrCPvwsLZSHWRTvqOsxCVRzJsVp9awoFg/edit?usp=sharing"><i>here</i></a><i>. There's also a </i><a href="https://www.youtube.com/watch?v=VvJx0WTiGcA&amp;feature=youtu.be&amp;t=1h15m40s"><i>video</i></a><i>.</i></p><p><i>Tip: you can click on the image to see the original large version.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18mND4203diqo8PdjA7mRW/3da65a7279fc7f0d8a172a7b59f9d7c1/1-1.jpeg.jpeg" />
            
</figure><p>Here at Cloudflare we use <a href="https://prometheus.io/">Prometheus</a> to collect operational metrics. We run it on hundreds of servers and ingest millions of metrics per second to get insight into our network and provide the best possible service to our customers.</p><p>The Prometheus metric format is popular enough that it's now being standardized as <a href="https://openmetrics.io/">OpenMetrics</a> under the Cloud Native Computing Foundation. It's exciting to see convergence in the long-fragmented metrics landscape.</p><p>In this blog post we'll talk about how we measure low-level metrics and share a tool that can help you get a similar understanding of your systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3T3BNATGvX6eNAUsUKkIPZ/284180e45e5a01a38ab1f0ba3724180d/2.jpg" />
            
</figure><p>There are two main exporters one can use to get some insight into Linux system performance.</p><p>The first one is <a href="https://github.com/prometheus/node_exporter">node_exporter</a>, which gives you information about basics like CPU usage breakdown by type, memory usage, disk IO stats, filesystem and network usage.</p><p>The second one is <a href="https://github.com/google/cadvisor">cAdvisor</a>, which gives similar metrics, but drills down to the container level. Instead of seeing total CPU usage, you can see how much of the global resources each container uses (systemd units also count as containers for <code>cAdvisor</code>).</p><p>This is the absolute minimum you should know about your systems. If you don’t have these two, you should start there.</p><p>Let’s look at the graphs you can get out of this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mGAxXcMQepPR40GK4F89O/a0eed601cd7899c5cd6ca16691676d30/3.jpg" />
            
</figure><p>I should mention that every screenshot in this post is from a real production machine doing something useful. We have different generations of hardware, so don't try to draw any conclusions.</p><p>Here you can see the basics you get from <code>node_exporter</code> for CPU and memory. You can see the utilization and how much slack you have.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MqpWuek8lQlFfLsT6pft9/f13a1444f393e1d3836539c3bbe49f05/4.jpg" />
            
            </figure><p>Some more metrics from <code>node_exporter</code>, this time for disk IO. There are similar panels for network as well.</p><p>At the basic level you can do some correlation to explain why CPU went up if you see higher network and disk activity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eZyn1IguxbhJje2vUCnnr/5b6b77671ffdeb8baba4ee9b196e32ca/5.jpg" />
            
</figure><p>With <code>cAdvisor</code> this gets more interesting, since you can now see how global resources like CPU are sliced between services. If CPU or memory usage grows, you can pinpoint the exact service responsible, and you can also see how it affects other services.</p><p>If the global CPU numbers do not change much, you can still see shifts between services.</p><p>All of this information comes from the simplest counters and first derivatives (rates) on them.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MQ3kdmLlcoNM7cmHewIAK/ce9f27bc965f08b46409d84a8b6e1eb8/6.jpg" />
            
</figure><p>Counters are great, but they lack detail about individual events. Let's take disk IO as an example. We get device IO time from <code>node_exporter</code>, and the derivative of that cannot exceed one second of busy time per one second of real time, which means we can draw a bold red line at 1s and see what kind of utilization we get from our disks.</p><p>We get one number that characterizes our workload, but that's simply not enough to understand it. Are we doing many fast IO operations? Are we doing a few slow ones? What kind of mix of slow vs. fast do we get? How are writes and reads different?</p>
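<p>To make the counter-to-utilization step concrete, here is a minimal sketch (with made-up sample values) of how a cumulative IO-time counter turns into the utilization fraction plotted against that 1s red line:</p>

```python
def io_utilization(io_time_ms_prev, io_time_ms_curr, interval_s):
    """Fraction of wall time the device spent doing IO.

    The inputs are two samples of a cumulative "milliseconds spent
    doing IO" counter, the kind node_exporter exposes per device.
    """
    busy_s = (io_time_ms_curr - io_time_ms_prev) / 1000.0
    return busy_s / interval_s

# A device busy for 450ms out of a 1s scrape interval is 45% utilized;
# the derivative can never exceed 1s of busy time per 1s of wall time.
print(io_utilization(12_000, 12_450, 1.0))  # -> 0.45
```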
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YIVj2gffdTs2JG1SSFX3R/719fd6f86f9b25305f19769add917873/7.jpg" />
            
</figure><p>These questions beg for a histogram. What we have is the counter above, but what we want is the histogram below.</p><p>Keep in mind that Prometheus histograms are cumulative: the <code>le</code> label counts all events lower than or equal to the value of the label.</p>
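<p>The cumulative convention is easy to see in code. A small sketch (latency values are invented) that counts events into Prometheus-style <code>le</code> buckets, where each bucket includes everything at or below its bound and <code>+Inf</code> equals the total:</p>

```python
def to_cumulative_buckets(latencies, bounds):
    """Count events into Prometheus-style cumulative `le` buckets."""
    buckets = {}
    for le in bounds:
        # cumulative: every event <= the bound lands in this bucket
        buckets[le] = sum(1 for v in latencies if v <= le)
    buckets[float("inf")] = len(latencies)  # +Inf bucket == total count
    return buckets

latencies = [0.002, 0.004, 0.009, 0.030, 0.200]  # seconds, made up
print(to_cumulative_buckets(latencies, [0.005, 0.01, 0.05, 0.1]))
```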
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1t0yc5ybRD1PQG0tjchlIA/73b00b5fc13ab7892c32fcc4ad0e605d/8.jpg" />
            
</figure><p>Okay, so imagine we have that histogram. To visualize the difference between the two types of metrics, here are two screenshots of the same event: an SSD replacement in production. We replaced a disk, and this is how it affected the line and the histogram. In the new Grafana 5 you can plot histograms as heatmaps, which gives you a lot more detail than just one line.</p><p>The next slide has a bigger screenshot; on this slide you should just see the shift in detail.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Zz5RAoWPkT1BrcHHuW01k/447effb0dbd21c298b3c590997778a84/9.jpg" />
            
</figure><p>Here you can see buckets color coded, and each timeslot has its own distribution in a tooltip. It could definitely be easier to understand with bar highlights and a double Y axis, but it's a big step forward from just one line nonetheless.</p><p>In addition to nice visualizations, you can also plot the number of events above Xms and have alerts and SLOs on that. For example, you can alert if your read latency for block devices exceeds 10ms.</p>
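<p>Because the buckets are cumulative, "events above Xms" is just the total count minus the matching <code>le</code> bucket, which is the same subtraction you would write in PromQL against the <code>_count</code> and <code>_bucket</code> series. A sketch with hypothetical bucket counts:</p>

```python
def events_above(buckets, threshold):
    """Events slower than `threshold`, given cumulative `le` buckets.

    `threshold` must be an existing bucket bound. Equivalent in spirit
    to PromQL: sum(x_count) - sum(x_bucket{le="0.01"}).
    """
    return buckets[float("inf")] - buckets[threshold]

# hypothetical counts: 1000 reads total, 970 completed within 10ms
buckets = {0.01: 970, 0.1: 998, float("inf"): 1000}
print(events_above(buckets, 0.01))  # -> 30 reads slower than 10ms
```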
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/45nmrTPwGnckajjF5H0aul/3fb2dee2217f4c4b10a79bc6723db492/10.jpg" />
            
</figure><p>And if you were looking closely at these histograms, you may have noticed that the values on the Y axis are kind of high. Before the replacement you can see values in the 0.5s to 1.0s bucket.</p><p>Tech specs for the left disk promise 50 microsecond read/write latency, and on the right you get a slight decrease to 36 microseconds. That's not what we see on the histogram at all. Sometimes you can spot this with fio in testing, but production workloads may have patterns that are difficult to replicate and that have very different characteristics. Histograms show things as they are.</p><p>Even a few slow requests can hurt overall numbers if you're not careful with IO. We recently blogged about <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">how this affected our cache latency and how we worked around it</a>.</p><p>By now you should be convinced that histograms are clearly superior for measurements where you care about the timing of individual events.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18YOVcmR0ijgFssnU0OkjL/5db42ab73a2fbced8042aa3fb5ab4caa/11.jpg" />
            
</figure><p>And if you wanted to switch all your storage latency measurements to histograms, I have tough news for you: the Linux kernel only provides counters for <code>node_exporter</code>.</p><p>You can try to mess with <code>blktrace</code> for this, but it doesn't seem practical or efficient.</p><p>Let's switch gears a little and see how summary stats from counters can be deceiving.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3CUg2pMqu8tqkMyAKZ7gcz/faad533c3b378a1b83b23fae6dc48065/12.gif" />
            
</figure><p>This is <a href="https://www.autodeskresearch.com/publications/samestats">research from Autodesk</a>. The creature on the left is called the Datasaurus. Then there are target figures in the middle and animations on the right.</p><p>The amazing part is that the shapes on the right, and all intermediate animation frames for them, have the same values for mean and standard deviation on both the X and Y axes.</p><p>From a summary statistics point of view, every box on the right is the same, but if you plot individual events, a very different picture emerges.</p>
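<p>You can reproduce a miniature of the same effect in a few lines: the two datasets below (numbers invented for illustration) have identical mean and population standard deviation but clearly different shapes, which only a histogram or raw plot would reveal:</p>

```python
import statistics

# Two different datasets with identical mean and population stddev,
# a tiny version of the Datasaurus effect
a = [0, 5, 5, 10]
b = [1, 2, 8, 9]

print(statistics.mean(a), statistics.mean(b))      # both 5
print(statistics.pstdev(a), statistics.pstdev(b))  # both ~3.536
```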
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CH5ajyCIJd50LBZcm3rbi/b66fcdace86472d4e19dd7a182c6f82b/13.gif" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZHKVCNSAcWE98AHBHyWiq/2f510c3ac23236bcb170576feafa88a2/14.gif" />
            
</figure><p>These animations are from the same research; here you can clearly see how box plots (mean + stddev) are unrepresentative of the raw events.</p><p>Histograms, on the contrary, give an accurate picture.</p><p>We established that histograms are what you want, but you need individual events to make those histograms. What are the requirements for a system that would handle this task, assuming that we want to measure things like IO operations in the Linux kernel?</p><ul><li><p>It has to be low overhead, otherwise we can't run it in production</p></li><li><p>It has to be universal, so we are not locked into just IO tracing</p></li><li><p>It has to be supported out of the box; third party kernel modules and patches are not very practical</p></li><li><p>And finally, it has to be safe: we don't want to crash a large chunk of the Internet we're responsible for just to get some metrics, even if they are interesting</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LXeOhEUMR6KeP1lBkP0vy/1ef1af93f93773f486b1fca187f05165/16.jpg" />
            
</figure><p>And it turns out there's a solution called eBPF: low overhead sandboxed user-defined bytecode running in the kernel. It can never crash, hang, or interfere with the kernel negatively. That sounds kind of vague, but here are <a href="http://www.brendangregg.com/ebpf.html">two</a> <a href="https://cilium.readthedocs.io/en/v1.1/bpf/">links</a> that dive into the details of how this works.</p><p>The main part is that it's already included with the Linux kernel. It's used in the networking subsystem and in things like seccomp rules, but as a general "run safe code in the kernel" mechanism it has many uses beyond that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IJuUL6fQtULz0BDkn8G9q/002a9dd408b3709441004a34682fa3ba/17.jpg" />
            
            </figure><p>We said it’s a bytecode and this is how it looks. The good part is that you never have to write this by hand.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fsvOUVCsMd6akmeoJXxQM/645d06bfd40b00a1e36f63f3f47e035d/18.jpg" />
            
</figure><p>To use eBPF you write small C programs that attach to kernel functions and run before or after them. Your C code is compiled into bytecode, verified, and loaded into the kernel, where a JIT compiler translates it into native opcodes. The constraints on the code are enforced by the verifier, and you are guaranteed not to be able to write an infinite loop or allocate tons of memory.</p><p>To share data with eBPF programs, you can use maps. For metric collection, maps are updated by eBPF programs in the kernel and are only accessed by userspace during scraping.</p><p>It is critical for performance that the eBPF VM runs in the kernel and does not cross the userspace boundary.</p><p>You can see the workflow in the image above.</p><p>Having to write C is not an eBPF constraint; you can produce the bytecode in any way you want. There are alternatives like Lua and ply, and sysdig is adding a backend to run via eBPF too. Maybe someday people will be writing <i>safe JavaScript</i> that runs in the kernel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sHoct70uBGUaPjVewdhbe/6e6caadf654056ff64fb32f4d4d69248/19.jpg" />
            
</figure><p>Just like GCC compiles C code into machine code, BCC compiles C code into eBPF opcodes. BCC is a rewriter plus an LLVM compiler, and you can use it as a library in your code. There are bindings for C, C++, Python and Go.</p><p>In this example we have a simple function that runs after the <code>d_lookup</code> kernel function, which is responsible for directory cache lookups. It doesn't look complicated, and the basics should be understandable for people familiar with C-like languages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1HWxAcgNhOgNZOHMHpfUnE/2bbf6966e5cd15a5d75a22a187cd535d/20.jpg" />
            
</figure><p>In addition to a compiler, BCC includes some tools that can be used for debugging production systems with the low overhead eBPF provides.</p><p>Let's quickly go over a few of them.</p><p>The one above is <code>biolatency</code>, which shows you a histogram of disk IO operations. That's exactly what we started with, and it's already available, just as a script rather than an exporter for Prometheus.</p>
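<p>The reason <code>biolatency</code> can afford a histogram in the kernel is that it buckets each latency with a log2 function, so the map stays tiny no matter the value range. A pure-Python sketch of that bucketing idea (this is not the eBPF code itself, just the arithmetic it performs):</p>

```python
def log2_slot(value_us):
    """Index of the power-of-two bucket a value falls into, mirroring
    the log2 histograms used by BCC tools like biolatency."""
    slot = 0
    while value_us >> (slot + 1):
        slot += 1
    return slot

def log2_hist(latencies_us):
    hist = {}
    for v in latencies_us:
        slot = log2_slot(v)
        # bucket `slot` covers [2**slot, 2**(slot+1)) microseconds
        hist[slot] = hist.get(slot, 0) + 1
    return hist

# made-up latencies in microseconds
print(log2_hist([3, 7, 120, 130, 4000]))  # -> {1: 1, 2: 1, 6: 1, 7: 1, 11: 1}
```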
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zNgKf8E4AYTGH037ukLtv/5e519a923f5b6dbb3d011cd8f3ed4420/21.jpg" />
            
</figure><p>Here's <code>execsnoop</code>, which lets you see which commands are being launched in the system. This is often useful if you want to catch quickly terminating programs that do not hang around long enough to be observed in <code>ps</code> output.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YYpWFjqFQ7VfdIz7QnD63/f4a8a5955b95ea121a2b644f57bcee1e/22.jpg" />
            
</figure><p>There's also <code>ext4slower</code>, which shows slow ext4 filesystem operations instead of slow disk IO operations. One might think these two map fairly closely, but one filesystem operation does not necessarily map to one disk IO operation:</p><ul><li><p>Writes can go into the writeback cache and not touch the disk until later</p></li><li><p>Reads can involve multiple IOs to the physical device</p></li><li><p>Reads can also be blocked behind async writes</p></li></ul><p>The more you know about disk IO, the more you want to run stateless, really. Sadly, RAM prices are not going down.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1aeArzz4a7mCh4YCzdFyLe/f7fbc1fcd278cd33367a3b69287df44d/23.jpg" />
            
</figure><p>Okay, now to the main idea of this post. We have all these primitives, so we should be able to tie them together into an <a href="https://github.com/cloudflare/ebpf_exporter">ebpf_exporter</a> and get metrics into Prometheus, where they belong. Many BCC tools already have the kernel side ready, reviewed, and battle tested, so the hardest part is covered.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/xt1nT5FC3EUK1vwQrd81d/6dcc656a8629685a5e38e8947c85f6e5/24.jpg" />
            
</figure><p>Let's look at a simple example: counters for timers fired in the kernel. This example is from the <code>ebpf_exporter</code> repo, where you can find a few more complex ones.</p><p>On the BCC code side we define a hash and attach to a tracepoint. When the tracepoint fires, we increment our hash, where the key is the address of the function that fired the tracepoint.</p><p>On the exporter side we say that we want to take the hash and transform the 8 byte keys with the <code>ksym</code> decoder. The kernel keeps a map of function addresses to their names in <code>/proc/kallsyms</code>, and we just use that. We also define that we want to attach our function to the <code>timer:timer_start</code> tracepoint.</p><p>This is objectively quite verbose, and perhaps some things could be inferred instead of being explicit.</p>
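<p>The <code>ksym</code> decoding step amounts to parsing <code>/proc/kallsyms</code> lines (address, type, name) and resolving an address to the nearest symbol at or below it. A sketch against a made-up excerpt rather than the live file (the addresses and symbol layout below are invented):</p>

```python
SAMPLE_KALLSYMS = """\
ffffffff81000000 T _text
ffffffff810b1000 T console_unlock
ffffffff810b2000 T vprintk_emit
"""

def load_ksyms(text):
    """Parse /proc/kallsyms-style lines into a sorted (addr, name) list."""
    syms = []
    for line in text.splitlines():
        addr, _type, name = line.split()[:3]
        syms.append((int(addr, 16), name))
    return sorted(syms)

def ksym(syms, addr):
    """Resolve an address to the nearest symbol at or below it."""
    best = None
    for start, name in syms:
        if start <= addr:
            best = name
        else:
            break
    return best

syms = load_ksyms(SAMPLE_KALLSYMS)
print(ksym(syms, 0xFFFFFFFF810B1193))  # prints "console_unlock"
```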
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26VdDuQstj0hI5fnTpGyJV/e76c62c263fa228444ff5f596d7c155d/25.jpg" />
            
</figure><p>This is an example graph we can get out of this exporter config.</p><p>Why can this be useful? You may remember our <a href="/tracing-system-cpu-on-debian-stretch/">blog post about tracing a weird bug during the OS upgrade from Debian Jessie to Stretch</a>.</p><p>TL;DR: a systemd bug broke TCP segmentation offload on a vlan interface, which increased CPU load 5x and introduced lots of interesting side effects, up to memory allocation stalls. You do not want to see a single page allocation take 12 seconds, but that's exactly what we were seeing.</p><p>If we had had timer metrics enabled, we would have seen the clear change in metrics much sooner.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1etzKv3o5BpyvcStqvPthZ/b84a5cf27397dbb82bbb98f861b971e8/27.jpg" />
            
</figure><p>Another bundled example can give you IPC, or instructions per cycle, metrics for each CPU. On this production system we can see a few interesting things:</p><ul><li><p>Not all cores are equal, with two cores being outliers (green on the top is zero, yellow on the bottom is one)</p></li><li><p>Something happened that dropped the IPC of what looks like half of the cores</p></li><li><p>There's some variation in daily load cycles that affects IPC</p></li></ul>
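<p>IPC itself is just the ratio of two cumulative hardware counters sampled over the same window: instructions retired divided by cycles. A sketch with hypothetical counter deltas:</p>

```python
def ipc(instr_prev, instr_curr, cycles_prev, cycles_curr):
    """Instructions per cycle over a scrape interval, computed from
    two samples of cumulative hardware counters."""
    return (instr_curr - instr_prev) / (cycles_curr - cycles_prev)

# e.g. 1.2e9 instructions retired over 1.5e9 cycles in one interval:
# an IPC of 0.8 suggests the core spends a fair share of cycles stalled
print(ipc(0, 1_200_000_000, 0, 1_500_000_000))  # -> 0.8
```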
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6KvhwqZATxvQzEmG4gL92j/1d6870183040fb7e23b280d1875f6fac/28.jpg" />
            
            </figure><p>Why can this be useful? Check out <a href="http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html">Brendan Gregg’s somewhat controversial blog post about CPU utilization</a>.</p><p>TL;DR is that CPU% does not include memory stalls that do not perform any useful compute work. IPC helps to understand that factor better.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Qr8BgK8JBbfpt0LaEjFHt/ecaaca96e411f8c32f0cd9bba3ec0dde/29.jpg" />
            
</figure><p>Another bundled example is the LLC, or L3 CPU cache, hit rate. This is from the same machine as the IPC metrics, and you can see some major event affecting not just IPC, but also the hit rate.</p><p>Why can this be useful? You can see how your CPU cache is doing, and how it may be affected by a bigger L3 cache or by more cores sharing the same cache.</p><p>LLC hit rate usually goes hand in hand with IPC patterns as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/QoFIgsgz0EfXBDqdQdK5E/7f5b29176f3271b693e9bf95f9f4d84a/31.jpg" />
            
</figure><p>Now to the fun part. We started with IO latency histograms, and they are bundled with the <code>ebpf_exporter</code> examples.</p><p>This is also a histogram, but now for run queue latency. When a process is woken up in the kernel, it's ready to run. If the CPU is idle, it can run immediately and the scheduling delay is zero. If the CPU is busy doing some other work, the process is put on a run queue and the CPU picks it up when it's available. That delay between being ready to run and actually running is the scheduling delay, and it affects latency for interactive applications.</p><p>In <code>cAdvisor</code> you have a counter with this delay; here you have a histogram of the actual values of that delay.</p><p>Understanding how you can be delayed is important for understanding the causes of externally visible latency. Check out <a href="http://www.brendangregg.com/blog/2017-03-16/perf-sched.html">another blog post by Brendan Gregg</a> to see how you can further trace and understand scheduling delays with the Linux perf tool. It's quite surprising how high the delay can be even on a lightly loaded machine.</p><p>It also helps to have a metric you can observe if you change any scheduling sysctls in the kernel. The law is that you can never trust the internet, or even your own judgement, about sysctl changes. If you can't measure the effects, you are lying to yourself.</p><p>There's also <a href="https://www.scylladb.com/2016/06/10/read-latency-and-scylla-jmx-process/">another great post from the Scylla people</a> about their findings. We were able to apply their sysctls and observe the results, which was quite satisfying.</p><p>Things like pinning processes to CPUs also affect scheduling, so this metric is invaluable for such experiments.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HUpT97jGZskdDVvit0v9/618108b4c883b925232342200d99d88d/33.jpg" />
            
</figure><p>The examples we gave are not the only ones possible. In addition to IPC and LLC metrics, there are around 500 hardware metrics you can get on a typical server.</p><p>There are around 2000 tracepoints with a stable ABI you can use across many kernel subsystems.</p><p>And you can always trace any non-inlined kernel function with kprobes and kretprobes, but nobody guarantees binary compatibility between releases for those. Some are stable, others not so much.</p><p>We don't have support for user statically defined tracepoints or uprobes, which means you cannot trace userspace applications. This is something we may reconsider in the future, but in the meantime you can always add metrics to your apps by regular means.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gux5cEXorn2uWkSHRVJNU/a85579d75fe935fb68b322d1d7984cf4/34.jpg" />
            
</figure><p>This image shows the tools that BCC provides; you can see they cover many aspects of a system.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BfGSsDTA2kyslIA1E6Nef/9b6092913c922a6427e86848a07c2491/35.jpg" />
            
</figure><p>Nothing in this life is free, and even low overhead still means some overhead. You should measure this in your own environment, but this is <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark">what we've seen</a>.</p><p>For a fast <code>getpid()</code> syscall you get around 34% overhead if you count calls by PID. 34% sounds like a lot, but it's the simplest syscall, and the 100ns overhead is the cost of one memory reference.</p><p>For a complex case where we key on the command name, copying some memory and mixing in some randomness, the number jumps to 330ns, or 105% overhead. We can still do 1.55M ops/s instead of 3.16M ops/s per logical core. If you measure something that doesn't happen as often on each core, you're probably not going to notice the overhead as much.</p><p>We've seen the run queue latency histogram add 300ms of system time on a machine with 40 logical cores and 250K scheduling events per second.</p><p>Your mileage may vary.</p>
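<p>The relationship between per-event overhead and throughput is simple arithmetic, which is worth internalizing before deciding where to run probes. A sketch using the numbers above:</p>

```python
def throughput_with_overhead(baseline_ops_per_s, overhead_ns_per_op):
    """Throughput after adding a fixed per-operation probe overhead."""
    base_ns = 1e9 / baseline_ops_per_s   # baseline cost of one op in ns
    return 1e9 / (base_ns + overhead_ns_per_op)

# 3.16M getpid() calls/s baseline with ~330ns of probe overhead per call
print(throughput_with_overhead(3.16e6, 330))  # roughly 1.55M ops/s
```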
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hrJTvorfU26Ee1sLmPByS/0b468fc0f031055d0871fae1f13e7e68/36.jpg" />
            
</figure><p>So where should you run the exporter? The answer is: anywhere you feel comfortable.</p><p>If you run simple programs, it doesn't hurt to run it anywhere. If you are concerned about overhead, you can run it only on canary instances and hope the results are representative. We do exactly that, except instead of canary instances we have the luxury of a couple of canary datacenters with live customers. They get updates after our offices do, so it's not as bad as it sounds.</p><p>For some upgrade events you may want to enable more extensive metrics. An example would be a major distro or kernel upgrade.</p><p>The last thing we wanted to mention is that <code>ebpf_exporter</code> is open source, and we encourage you to try it and maybe contribute interesting examples that may be useful to others.</p><p>GitHub: <a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p><p>Reading materials on eBPF:</p><ul><li><p><a href="https://iovisor.github.io/bcc/">https://iovisor.github.io/bcc/</a></p></li><li><p><a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md">https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md</a></p></li><li><p><a href="http://www.brendangregg.com/ebpf.html">http://www.brendangregg.com/ebpf.html</a></p></li><li><p><a href="http://docs.cilium.io/en/latest/bpf/">http://docs.cilium.io/en/latest/bpf/</a></p></li></ul><p><i>While this post was in drafts, we added another example for </i><a href="https://github.com/cloudflare/ebpf_exporter/pull/35"><i>tracing port exhaustion issues</i></a><i>, and that took under 10 minutes to write.</i></p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">4qKVSXkNajj8QJP6vnQdnc</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Tracing System CPU on Debian Stretch]]></title>
            <link>https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/</link>
            <pubDate>Sun, 13 May 2018 16:00:00 GMT</pubDate>
            <description><![CDATA[ How an innocent OS upgrade triggered a cascade of issues and forced us into tracing Linux networking internals. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a heavily truncated version of an internal blog post from August 2017. For more recent updates on Kafka, check out </i><a href="/squeezing-the-firehose/"><i>another blog post on compression</i></a><i>, where we optimized throughput 4.5x for both disks and network.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6o0r4Jk1oqG6ncMv8xWNb3/08bd0256a7509447b87aa08a7e7305f5/photo-1511971523672-53e6411f62b9" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@alex_povolyashko?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Alex Povolyashko</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p>
    <div>
      <h3>Upgrading our systems to Debian Stretch</h3>
      <a href="#upgrading-our-systems-to-debian-stretch">
        
      </a>
    </div>
    <p>For quite some time we've been rolling out Debian Stretch, to the point where we have reached ~10% adoption in our core datacenters. As part of upgrading the underlying OS, we also evaluate the higher level software stack, e.g. taking a look at our ClickHouse and Kafka clusters.</p><p>During our upgrade of Kafka, we successfully migrated two smaller clusters, <code>logs</code> and <code>dns</code>, but ran into issues when attempting to upgrade one of our larger clusters, <code>http</code>.</p><p>Thankfully, we were able to roll back the <code>http</code> cluster upgrade relatively easily, due to heavy versioning of both the OS and the higher level software stack. If there's one takeaway from this blog post, it's to take advantage of consistent versioning.</p>
    <div>
      <h3>High level differences</h3>
      <a href="#high-level-differences">
        
      </a>
    </div>
    <p>We upgraded one Kafka <code>http</code> node, and it did not go as planned:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hbPsT1ahBYgS806ztIpG3/070d402c9a5c1d38f257d65d87252f6c/1.png" />
            </figure><p>Having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared to a node where no upgrade happened, and an intermediary node that received a software stack upgrade, but not an OS upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.</p><p>For debugging CPU saturation issues, we call on <code>perf</code> to fish out details:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FsdAiwWMDh5t6VTTLuSV9/14c09b1b8fc3053a3dd4d49ff467f19a/2-3.png" />
            </figure><p><i>The command used was: </i><code><i>perf top -F 99</i></code><i>.</i></p>
    <div>
      <h3>RCU stalls</h3>
      <a href="#rcu-stalls">
        
      </a>
    </div>
    <p>In addition to higher system CPU usage, we found secondary slowdowns, including <a href="http://www.rdrop.com/~paulmck/RCU/whatisRCU.html">read-copy update (RCU)</a> stalls:</p>
            <pre><code>[ 4909.110009] logfwdr (26887) used greatest stack depth: 11544 bytes left
[ 4909.392659] oom_reaper: reaped process 26861 (logfwdr), now anon-rss:8kB, file-rss:0kB, shmem-rss:0kB
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843]  13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845]   (t=8403 jiffies g=110722 c=110721 q=6440)</code></pre>
            <p>We've seen RCU stalls before, and our (suboptimal) solution was to reboot the machine.</p><p>However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. During that deep dive, we noticed in <code>dmesg</code> that we had issues allocating memory while trying to write errors:</p>
            <pre><code>Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel:         26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel:         (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java            R  running task    13488  1714   1513 0x00080188
Aug 15 21:51:35 myhost kernel:  ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814ee977&gt;] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eea10&gt;] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eecb5&gt;] ? vt_console_print+0x295/0x3c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1193&gt;] ? call_console_drivers.isra.22.constprop.30+0xf3/0x100
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1f51&gt;] ? console_unlock+0x281/0x550
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b2498&gt;] ? vprintk_emit+0x278/0x430
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b27ef&gt;] ? vprintk_default+0x1f/0x30
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811588df&gt;] ? printk+0x48/0x50
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b30ee&gt;] ? dump_stack_print_info+0x7e/0xc0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8142d41f&gt;] ? dump_stack+0x44/0x65
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81162e64&gt;] ? warn_alloc+0x124/0x150
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163842&gt;] ? __alloc_pages_slowpath+0x932/0xb80
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163c92&gt;] ? __alloc_pages_nodemask+0x202/0x250
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811ae9c2&gt;] ? alloc_pages_current+0x92/0x120
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81159d2f&gt;] ? __page_cache_alloc+0xbf/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8115cdfa&gt;] ? filemap_fault+0x2ea/0x4d0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8136dc95&gt;] ? xfs_filemap_fault+0x45/0xa0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8118b3eb&gt;] ? __do_fault+0x6b/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81190028&gt;] ? handle_mm_fault+0xe98/0x12b0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8110756b&gt;] ? __seccomp_filter+0x1db/0x290
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fa5c&gt;] ? __do_page_fault+0x22c/0x4c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fd10&gt;] ? do_page_fault+0x20/0x70
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff819bea02&gt;] ? page_fault+0x22/0x30</code></pre>
            <p>This suggested that we were logging too many errors, and that the actual failure may have happened earlier in the process. Armed with this hypothesis, we looked at the very beginning of the error chain:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81171754&gt;] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff8117234b&gt;] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811725e2&gt;] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811728ef&gt;] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81172be5&gt;] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811632db&gt;] __alloc_pages_slowpath+0x31b/0xb80</code></pre>
            <p>As much as <code>shrink_node</code> may scream "NUMA issues", the line to focus on first is:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages</code></pre>
            <p>We also found memory allocation issues:</p>
            <pre><code>[78972.506644] Mem-Info:
[78972.506653] active_anon:3936889 inactive_anon:371971 isolated_anon:0
[78972.506653]  active_file:25778474 inactive_file:1214478 isolated_file:2208
[78972.506653]  unevictable:0 dirty:1760643 writeback:0 unstable:0
[78972.506653]  slab_reclaimable:1059804 slab_unreclaimable:141694
[78972.506653]  mapped:47285 shmem:535917 pagetables:10298 bounce:0
[78972.506653]  free:202928 free_pcp:3085 free_cma:0
[78972.506660] Node 0 active_anon:8333016kB inactive_anon:989808kB active_file:50622384kB inactive_file:2401416kB unevictable:0kB isolated(anon):0kB isolated(file):3072kB mapped:96624kB dirty:3422168kB writeback:0kB shmem:1261156kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:15744 all_unreclaimable? no
[78972.506666] Node 1 active_anon:7414540kB inactive_anon:498076kB active_file:52491512kB inactive_file:2456496kB unevictable:0kB isolated(anon):0kB isolated(file):5760kB mapped:92516kB dirty:3620404kB writeback:0kB shmem:882512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:9080974 all_unreclaimable? no
[78972.506671] Node 0 DMA free:15900kB min:100kB low:124kB high:148kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 9 printk messages dropped ** [78972.506716] Node 0 Normal: 15336*4kB (UMEH) 4584*8kB (MEH) 2119*16kB (UME) 775*32kB (MEH) 106*64kB (UM) 81*128kB (MH) 29*256kB (UM) 25*512kB (M) 19*1024kB (M) 7*2048kB (M) 2*4096kB (M) = 236080kB
[78972.506725] Node 1 Normal: 31740*4kB (UMEH) 3879*8kB (UMEH) 873*16kB (UME) 353*32kB (UM) 286*64kB (UMH) 62*128kB (UMH) 28*256kB (MH) 20*512kB (UMH) 15*1024kB (UM) 7*2048kB (UM) 12*4096kB (M) = 305752kB
[78972.506726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506727] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506728] 27531091 total pagecache pages
[78972.506729] 0 pages in swap cache
[78972.506730] Swap cache stats: add 0, delete 0, find 0/0
[78972.506730] Free swap  = 0kB
[78972.506731] Total swap = 0kB
[78972.506731] 33524975 pages RAM
[78972.506732] 0 pages HighMem/MovableOnly
[78972.506732] 546255 pages reserved
[78972.620129] ntpd: page allocation stalls for 272380ms, order:0, mode:0x24000c0(GFP_KERNEL)
[78972.620132] CPU: 16 PID: 13099 Comm: ntpd Tainted: G           O    4.9.43-cloudflare-2017.8.4 #1
[78972.620133] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A21 10/01/2015
[78972.620136]  ffffc90022f9b6f8 ffffffff8142d668 ffffffff81ca31b8 0000000000000001
[78972.620138]  ffffc90022f9b778 ffffffff81162f14 024000c022f9b740 ffffffff81ca31b8
[78972.620140]  ffffc90022f9b720 0000000000000010 ffffc90022f9b788 ffffc90022f9b738
[78972.620140] Call Trace:
[78972.620148]  [&lt;ffffffff8142d668&gt;] dump_stack+0x4d/0x65
[78972.620152]  [&lt;ffffffff81162f14&gt;] warn_alloc+0x124/0x150
[78972.620154]  [&lt;ffffffff811638f2&gt;] __alloc_pages_slowpath+0x932/0xb80
[78972.620157]  [&lt;ffffffff81163d42&gt;] __alloc_pages_nodemask+0x202/0x250
[78972.620160]  [&lt;ffffffff811aeae2&gt;] alloc_pages_current+0x92/0x120
[78972.620162]  [&lt;ffffffff8115f6ee&gt;] __get_free_pages+0xe/0x40
[78972.620165]  [&lt;ffffffff811e747a&gt;] __pollwait+0x9a/0xe0
[78972.620168]  [&lt;ffffffff817c9ec9&gt;] datagram_poll+0x29/0x100
[78972.620170]  [&lt;ffffffff817b9d48&gt;] sock_poll+0x48/0xa0
[78972.620172]  [&lt;ffffffff811e7c35&gt;] do_select+0x335/0x7b0</code></pre>
            <p>This specific error message did seem fun:</p>
            <pre><code>[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)</code></pre>
            <p>You don't want your page allocations to stall for nearly 5 minutes, especially when it's an order-0 allocation (the smallest possible allocation: a single 4 KiB page).</p><p>Comparing against our control nodes left only two possible explanations: the kernel upgrade, and the switch from Debian Jessie to Debian Stretch. We suspected the former, since the elevated CPU usage pointed at the kernel. However, just to be safe, we did both: rolled the kernel back to 4.4.55 and downgraded the affected nodes to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.</p>
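            <p>As a refresher, the "order" in these messages is the power-of-two page count the kernel's buddy allocator works with: an order-n request asks for 2^n contiguous 4 KiB pages. A quick sketch of the math (a generic illustration, not code from our stack):</p>

```python
PAGE_SIZE = 4096  # bytes; the default page size on x86-64

def alloc_bytes(order: int) -> int:
    """Size of a buddy-allocator allocation of the given order."""
    return PAGE_SIZE * (2 ** order)

# order-0 is the smallest possible allocation: one 4 KiB page
assert alloc_bytes(0) == 4 * 1024
# higher orders need physically contiguous pages, e.g. order-3 is 32 KiB
assert alloc_bytes(3) == 32 * 1024
```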
    <div>
      <h3>Digging a bit deeper</h3>
      <a href="#digging-a-bit-deeper">
        
      </a>
    </div>
    <p>Keeping servers running on an older kernel and distribution is not a viable long-term solution. Through bisection, we found that the issue lay in the Jessie-to-Stretch upgrade, contrary to our initial hypothesis.</p><p>Now that we knew what the problem was, we proceeded to investigate why. With help from existing automation around <code>perf</code> and Java, we generated the following flamegraphs:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/9.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3fhMCSmQj4IC8MLxPN2d1V/60a107967bdede0ba8c4465090fb6ec4/9.png" />
            </a>
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/10.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6b523AB48TNUF2jj6OYhxi/9cadde05d8cf89187f182b56c48b3c1b/10.png" />
            </a>
            </figure><p>At first it looked like Jessie was doing <code>writev</code> instead of <code>sendfile</code>, but the full flamegraphs revealed that Strech was executing <code>sendfile</code> a lot slower.</p><p>If you highlight <code>sendfile</code>:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/11.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6XXts4Hvy58nwa8FG3ZNfT/2e36ce3aa2b111059bcff6a21e3da712/11.png" />
            </a>
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/12.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/32BB3vFsnbl7ul6b6Aa5MP/10788cfa90962c2034e7be7fc6b76a1f/12.png" />
            </a>
            </figure><p>And zoomed in:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/13.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/75TU9Q58iCRcCKAt6eZxf3/4cdf2f2038bb4e7ef813ba7e21562121/13.png" />
            </a>
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/14.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fgUsgL2hUrhJHg3ns5HeE/3b733ecc2ed4ebee2a751394a81804fb/14.png" />
            </a>
            </figure><p>These two look very different.</p><p>Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.</p><p>We'll dive into the differences in the next section.</p>
    <div>
      <h3>And deeper</h3>
      <a href="#and-deeper">
        
      </a>
    </div>
    <p>To trace latency distributions of <code>sendfile</code> syscalls between Jessie and Stretch, we used <a href="https://github.com/iovisor/bcc/blob/master/tools/funclatency_example.txt"><code>funclatency</code></a> from <a href="https://iovisor.github.io/bcc/">bcc-tools</a>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
     usecs               : count     distribution
         0 -&gt; 1          : 9        |                                        |
         2 -&gt; 3          : 47       |****                                    |
         4 -&gt; 7          : 53       |*****                                   |
         8 -&gt; 15         : 379      |****************************************|
        16 -&gt; 31         : 329      |**********************************      |
        32 -&gt; 63         : 101      |**********                              |
        64 -&gt; 127        : 23       |**                                      |
       128 -&gt; 255        : 50       |*****                                   |
       256 -&gt; 511        : 7        |                                        |</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:28
     usecs               : count     distribution
         0 -&gt; 1          : 1        |                                        |
         2 -&gt; 3          : 20       |***                                     |
         4 -&gt; 7          : 46       |*******                                 |
         8 -&gt; 15         : 56       |********                                |
        16 -&gt; 31         : 65       |**********                              |
        32 -&gt; 63         : 75       |***********                             |
        64 -&gt; 127        : 75       |***********                             |
       128 -&gt; 255        : 258      |****************************************|
       256 -&gt; 511        : 144      |**********************                  |
       512 -&gt; 1023       : 24       |***                                     |
      1024 -&gt; 2047       : 27       |****                                    |
      2048 -&gt; 4095       : 28       |****                                    |
      4096 -&gt; 8191       : 35       |*****                                   |
      8192 -&gt; 16383      : 1        |                                        |</code></pre>
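            <p>These histograms are log2-scaled: each latency sample lands in a power-of-two bucket, which makes the shift of the Stretch distribution from the 8-15 usecs mode up toward 128 usecs and beyond easy to see. A minimal sketch of that bucketing (illustrative, not bcc's actual implementation):</p>

```python
from collections import Counter

def log2_bucket(usecs: int) -> tuple:
    """The (low, high) bucket a sample falls into, matching the
    `0 -> 1`, `2 -> 3`, `4 -> 7`, ... rows that funclatency prints."""
    if usecs <= 1:
        return (0, 1)
    low = 1 << (usecs.bit_length() - 1)  # largest power of two <= usecs
    return (low, 2 * low - 1)

samples = [3, 5, 9, 14, 200, 300, 4100]
hist = Counter(log2_bucket(s) for s in samples)

assert log2_bucket(12) == (8, 15)
assert log2_bucket(200) == (128, 255)
assert hist[(8, 15)] == 2
```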
            <p>In the flamegraphs, you can see timers being set at the tips (the <code>mod_timer</code> function), and these timers take locks. On Stretch we were installing roughly 3x more timers, which resulted in roughly 10x the lock contention:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC                                    COUNT
mod_timer                               60482
00:33:37
FUNC                                    COUNT
mod_timer                               58263
00:33:38
FUNC                                    COUNT
mod_timer                               54626</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC                                    COUNT
lock_timer_base                         15962
00:32:37
FUNC                                    COUNT
lock_timer_base                         16261
00:32:38
FUNC                                    COUNT
lock_timer_base                         15806</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC                                    COUNT
mod_timer                              149068
00:33:29
FUNC                                    COUNT
mod_timer                              155994
00:33:30
FUNC                                    COUNT
mod_timer                              160688</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC                                    COUNT
lock_timer_base                        119189
00:32:33
FUNC                                    COUNT
lock_timer_base                        196895
00:32:34
FUNC                                    COUNT
lock_timer_base                        140085</code></pre>
            <p>The Linux kernel includes debugging facilities for timers, which <a href="https://elixir.bootlin.com/linux/v4.9.43/source/kernel/time/timer.c#L1010">call</a> the <code>timer:timer_start</code> <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/trace/events/timer.h#L44">tracepoint</a> on every timer start. This allowed us to pull up timer names:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
      6 blk_rq_timed_out_timer
      2 clocksource_watchdog
      5 commit_timeout
      5 cursor_timer_handler
      2 dev_watchdog
     10 garp_join_timer
      2 ixgbe_service_timer
     36 reqsk_timer_handler
   4769 tcp_delack_timer
    171 tcp_keepalive_timer
 168512 tcp_write_timer</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
      6 clocksource_watchdog
      4 commit_timeout
     12 cursor_timer_handler
      2 dev_watchdog
     18 garp_join_timer
      4 ixgbe_service_timer
      1 neigh_timer_handler
      1 reqsk_timer_handler
   4622 tcp_delack_timer
      1 tcp_keepalive_timer
1983978 tcp_write_timer
      1 writeout_period</code></pre>
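            <p>The shell pipeline above just tallies the <code>function=</code> field of each <code>timer:timer_start</code> event. An equivalent sketch in Python, with made-up sample lines in the shape <code>perf script</code> emits:</p>

```python
import re
from collections import Counter

# Made-up lines in the shape `perf script` emits for timer:timer_start
# events; only the function= field matters for the tally.
perf_script_output = """\
swapper 0 [016] 245.159: timer:timer_start: timer=0xffff... function=tcp_write_timer [timeout=25]
swapper 0 [016] 245.160: timer:timer_start: timer=0xffff... function=tcp_write_timer [timeout=30]
swapper 0 [016] 245.161: timer:timer_start: timer=0xffff... function=tcp_delack_timer [timeout=35]
"""

counts = Counter(
    m.group(1)
    for line in perf_script_output.splitlines()
    if (m := re.search(r"function=(\S+)", line))
)

assert counts["tcp_write_timer"] == 2
assert counts["tcp_delack_timer"] == 1
```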
            <p>So on Stretch we were installing roughly 12x more <code>tcp_write_timer</code> timers, resulting in higher kernel CPU usage.</p><p>Flamegraphs focused on the timers themselves revealed the differences in their operation:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/15.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4PJjYK3FzgAeQxpbHPGn5i/06f546c8ea1cda3d58c4c54dd3618a15/15.png" />
            </a>
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/16.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7M8XyRvy7vHDytdpWJQXAr/784aa92acf4f92c8896d08e2fede9bcd/16.png" />
            </a>
            </figure><p>We then traced the functions that were different:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC                                    COUNT
tcp_sendmsg                             21166
03:33:34
FUNC                                    COUNT
tcp_sendmsg                             21768
03:33:35
FUNC                                    COUNT
tcp_sendmsg                             21712</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC                                    COUNT
tcp_push_one                              496
03:37:15
FUNC                                    COUNT
tcp_push_one                              432
03:37:16
FUNC                                    COUNT
tcp_push_one                              495</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 23485 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
   1583 4
   2043 54
   3546 18
   4016 59
   4423 50
   5349 8
   6154 40
   6620 38
  17121 51
  39528 44</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC                                    COUNT
tcp_sendmsg                             53834
03:33:31
FUNC                                    COUNT
tcp_sendmsg                             49472
03:33:32
FUNC                                    COUNT
tcp_sendmsg                             51221</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC                                    COUNT
tcp_push_one                            64483
03:37:11
FUNC                                    COUNT
tcp_push_one                            65058
03:37:12
FUNC                                    COUNT
tcp_push_one                            72394</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 3416 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
    396 46
    409 4
   1124 50
   1305 18
   1547 40
   1672 59
   1729 8
   2181 38
  19052 44
  64504 4096</code></pre>
            <p>The traces showed huge differences in how <code>tcp_sendmsg</code> and <code>tcp_push_one</code> behave within <code>sendfile</code>: on Stretch, the overwhelming majority of calls were page-sized (4096 bytes).</p><p>To introspect further, we leveraged a kernel feature available since 4.9: the ability to count stack traces in-kernel. We used it to measure which call paths hit <code>tcp_push_one</code>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    1
  tcp_push_one
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    4950</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    123
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  __vfs_write
  vfs_write
  SyS_write
  do_syscall_64
  return_from_SYSCALL_64
    172
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  kernel_sendmsg
  sock_no_sendpage
  tcp_sendpage
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    735110</code></pre>
            <p>If you diff the most popular stacks, you'll get:</p>
            <pre><code>--- jessie.txt  2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
 tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
 inet_sendpage
 kernel_sendpage
 sock_sendpage</code></pre>
            <p>Let's look closer at <a href="https://elixir.bootlin.com/linux/v4.9.43/source/net/ipv4/tcp.c#L1012"><code>tcp_sendpage</code></a>:</p>
            <pre><code>int tcp_sendpage(struct sock *sk, struct page *page, int offset,
         size_t size, int flags)
{
    ssize_t res;

    if (!(sk-&gt;sk_route_caps &amp; NETIF_F_SG) ||
        !sk_check_csum_caps(sk))
        return sock_no_sendpage(sk-&gt;sk_socket, page, offset, size,
                    flags);

    lock_sock(sk);

    tcp_rate_check_app_limited(sk);  /* is sending application-limited? */

    res = do_tcp_sendpages(sk, page, offset, size, flags);
    release_sock(sk);
    return res;
}</code></pre>
            <p>It looks like on Stretch we were taking the early return: the capability check failed, so <code>tcp_sendpage</code> fell back to <code>sock_no_sendpage</code>, which copies data instead of sending it zero-copy. We looked up what <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/linux/netdev_features.h#L115">NETIF_F_SG</a> means: scatter-gather, a prerequisite for <a href="https://en.wikipedia.org/wiki/Large_send_offload">segmentation offload</a>. This difference was peculiar, since both OSes should have these features enabled.</p>
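            <p>The branch can be modeled in a few lines. This is an illustrative sketch only: the flag values are made up, and <code>sk_check_csum_caps()</code> is reduced to a single checksum-capability bit:</p>

```python
# Illustrative model of the branch in tcp_sendpage(): without scatter-gather
# (NETIF_F_SG) among the route capabilities, the kernel falls back to
# sock_no_sendpage(), which copies data instead of sending pages zero-copy.
# Flag values here are made up for the sketch.
NETIF_F_SG = 1 << 0
CSUM_CAPS = 1 << 1

def sendpage_path(route_caps: int) -> str:
    if not (route_caps & NETIF_F_SG) or not (route_caps & CSUM_CAPS):
        return "sock_no_sendpage"   # slow path: copy into socket buffers
    return "do_tcp_sendpages"       # fast path: zero-copy

assert sendpage_path(CSUM_CAPS) == "sock_no_sendpage"               # Stretch VLANs
assert sendpage_path(NETIF_F_SG | CSUM_CAPS) == "do_tcp_sendpages"  # Jessie
```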
    <div>
      <h3>Even deeper, to the crux</h3>
      <a href="#even-deeper-to-the-crux">
        
      </a>
    </div>
    <p>It turned out that we had segmentation offload enabled for only a few of our NICs: <code>eth2</code>, <code>eth3</code>, and <code>bond0</code>. Our network setup can be described as follows:</p>
            <pre><code>eth2 --&gt;|              |--&gt; vlan10
        |---&gt; bond0 --&gt;|
eth3 --&gt;|              |--&gt; vlan100</code></pre>
            <p><b>The missing piece: segmentation offload was disabled on the VLAN interfaces, where the actual IPs live.</b></p><p>Here's the diff from <code>ethtool -k vlan10</code> between the two systems:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:21:12.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:21:12.000000000 -0700
@@ -1,21 +1,21 @@
 Features for vlan10:
 rx-checksumming: off [fixed]
-tx-checksumming: off
+tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
-       tx-checksum-ip-generic: off
+       tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off
        tx-checksum-sctp: off
-scatter-gather: off
-       tx-scatter-gather: off
+scatter-gather: on
+       tx-scatter-gather: on
        tx-scatter-gather-fraglist: off
-tcp-segmentation-offload: off
-       tx-tcp-segmentation: off [requested on]
-       tx-tcp-ecn-segmentation: off [requested on]
-       tx-tcp-mangleid-segmentation: off [requested on]
-       tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+       tx-tcp-segmentation: on
+       tx-tcp-ecn-segmentation: on
+       tx-tcp-mangleid-segmentation: on
+       tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
 generic-receive-offload: on
 large-receive-offload: off [fixed]
 rx-vlan-offload: off [fixed]</code></pre>
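            <p>The tell-tale entries are the ones marked <code>[requested on]</code>: features the stack asked for but that ended up off. A small sketch that pulls those out of <code>ethtool -k</code> output (the sample text is an excerpt of the diff above):</p>

```python
import re

# Excerpt of `ethtool -k vlan10` output in the broken Stretch state
ethtool_output = """\
Features for vlan10:
scatter-gather: off
        tx-scatter-gather: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]
"""

def stuck_features(output: str) -> list:
    """Features that are off even though enabling them was requested."""
    return [
        m.group(1)
        for line in output.splitlines()
        if (m := re.match(r"\s*([\w-]+): off \[requested on\]", line))
    ]

assert stuck_features(ethtool_output) == [
    "tx-tcp-segmentation",
    "generic-segmentation-offload",
]
```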
            <p>So we enthusiastically enabled segmentation offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 sg on</code></pre>
            <p>And it didn't help! Will the suffering ever end? Let's also enable TCP transmit checksum offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ip-generic: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: on</code></pre>
            <p>Nothing. The diff is essentially empty now:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:31:27.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:31:27.000000000 -0700
@@ -4,11 +4,11 @@ tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
-       tx-checksum-fcoe-crc: off [requested on]
-       tx-checksum-sctp: off [requested on]
+       tx-checksum-fcoe-crc: off
+       tx-checksum-sctp: off
 scatter-gather: on
        tx-scatter-gather: on
-       tx-scatter-gather-fraglist: off [requested on]
+       tx-scatter-gather-fraglist: off
 tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on</code></pre>
            <p>The last missing piece we found was that offload settings only take effect for connections established after the change, so we restarted Kafka and immediately saw a performance improvement (green line):</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/17.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73uyXt5y4F1L6AUULX8S9g/e80494d09daf7d0b87884c62fd5341e6/17.png" />
            </a>
            </figure><p>Not enabling offload features when possible seems like a pretty bad regression, so we filed a ticket for <code>systemd</code>:</p><ul><li><p><a href="https://github.com/systemd/systemd/issues/6629">https://github.com/systemd/systemd/issues/6629</a></p></li></ul><p>In the meantime, we worked around the upstream issue by automatically enabling offload features on boot if they are disabled on VLAN interfaces.</p><p>With the fix in place, we rebooted our <code>logs</code> Kafka cluster onto the latest kernel, and the 5-day CPU usage history showed a clear improvement:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/18.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VBuiRySNEfFiN8LQ9nUg5/a5a1881b229cb1e173663af52f3eb136/18.png" />
            </a>
            </figure><p>The DNS cluster showed the same improvement with just 2 nodes rebooted so far (purple line going down):</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/04/19.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CWuJQCmMt7QarvAdU0b3g/c35ad9f7a9ab6113614f736f0e682d64/19.png" />
            </a>
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>It was an error on our part to ship a performance regression without a good framework in place to catch it. Luckily, thanks to our heavy use of version control, we managed to bisect the issue rather quickly and keep a temporary rollback in place while root-causing the problem.</p><p>In the end, enabling offload also removed the RCU stalls. It's not entirely clear whether the missing offload was the cause or just a catalyst, but the end result speaks for itself.</p><p>On the bright side, we dug pretty deep into Linux kernel internals, and although there were fleeting moments of wanting to give up and move to the woods to become a park ranger, we persevered and came out of the forest successful.</p><hr /><p><i>If deep diving from high level symptoms to kernel/OS issues makes you excited, </i><a href="https://www.cloudflare.com/careers/"><i>drop us a line</i></a><i>.</i></p><hr /> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">29dWe9XJa54DvzHbTBAzEk</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[eBPF, Sockets, Hop Distance and manually writing eBPF assembly]]></title>
            <link>https://blog.cloudflare.com/epbf_sockets_hop_distance/</link>
            <pubDate>Thu, 29 Mar 2018 10:43:38 GMT</pubDate>
            <description><![CDATA[ A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack.  ]]></description>
            <content:encoded><![CDATA[ <p>A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack. The result code is grossly over engineered, but boy, did we learn plenty in the process!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UrWjrMBPW4l3ipvThL0sn/1e78cd221cc63a5fb9964b3bd55cd11d/3845353725_7d7c624f34_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/paulmiller/3845353725/">image</a> by <a href="https://www.flickr.com/photos/paulmiller">Paul Miller</a></p>
    <div>
      <h3>Context</h3>
      <a href="#context">
        
      </a>
    </div>
    <p>You may wonder why she wanted to inspect the TTL packet field (formally known as "IP Time To Live (TTL)" in IPv4, or "Hop Count" in IPv6)? The reason is simple - she wanted to ensure that the connections are routed outside of our datacenter. The "Hop Distance" - the difference between the TTL value set by the originating machine and the TTL value in the packet received at its destination - shows how many routers the packet crossed. If a packet crossed two or more routers, we know it indeed came from outside of our datacenter.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eYDA3vNOF8Cc9ytLOPo9N/c4862fac898725e1bd54b0284977f251/Screen-Shot-2018-03-29-at-10.52.49-AM-1.png" />
            
            </figure><p>It's uncommon to look at TTL values (except for their intended purpose of mitigating routing loops by checking when the TTL reaches zero). The normal way to deal with the problem we had would be to blocklist IP ranges of our servers. But it’s not that simple in our setup. Our IP numbering configuration is rather baroque, with plenty of Anycast, Unicast and Reserved IP ranges. Some belong to us, some don't. We wanted to avoid having to maintain a hard-coded blocklist of IP ranges.</p><p>The gist of the idea is: we want to note the TTL value from a returned SYN+ACK packet. Having this number, we can estimate the Hop Distance - the number of routers on the path. If the Hop Distance is:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1LCtnMJcT3dIXlGwBrWf69/0dd6441edca81092fe6c536203657e21/Screen-Shot-2018-03-29-at-10.50.38-AM.png" />
            
            </figure><ul><li><p><b>zero</b>: we know the connection went to localhost or a local network.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ugFXo390P0VMXOpsvqtVQ/04fc3c2133fbcfaa40db47655b131ab3/Screen-Shot-2018-03-29-at-10.49.42-AM.png" />
            
            </figure><ul><li><p><b>one</b>: connection went through our router, and was terminated just behind it.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dwa36qw1gHyNNLKJ24UiX/a3ee616930c1335d4e6f8dc3fa7da5bb/Screen-Shot-2018-03-29-at-10.49.48-AM.png" />
            
            </figure><ul><li><p><b>two</b>: the connection went through two routers - most likely our router, and one close to it.</p></li></ul><p>For our use case, we want to see if the Hop Distance was two or more - this would ensure the connection was routed outside the datacenter.</p>
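            <p>Since operating systems initialize the TTL to one of a few well-known values (usually 64, 128 or 255), the Hop Distance can be estimated from the received TTL alone. A sketch of that common heuristic (illustrative, not our production code):</p>

```python
# Common initial TTLs: 64 (Linux, macOS), 128 (Windows), 255 (some network
# gear). Assume the sender used the smallest one >= the observed TTL, since
# every router hop only ever decrements it.
COMMON_INITIAL_TTLS = (64, 128, 255)

def hop_distance(received_ttl: int) -> int:
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= received_ttl)
    return initial - received_ttl

assert hop_distance(64) == 0    # localhost or local network
assert hop_distance(62) == 2    # two routers away: outside the datacenter
assert hop_distance(117) == 11  # a Windows-ish sender 11 hops away
```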
    <div>
      <h3>Not so easy</h3>
      <a href="#not-so-easy">
        
      </a>
    </div>
    <p>It's easy to read TTL values from a userspace application, right? No. It turns out it's almost impossible. Here are the theoretical options we considered early on:</p><p>A) Run a libpcap/tcpdump-like raw socket, and catch the SYN+ACKs manually. We ruled out this design quickly - it requires elevated privileges. Also, raw sockets are pretty fragile: they can suffer packet loss if the userspace application can’t keep up.</p><p>B) Use the IP_RECVTTL socket option. IP_RECVTTL requests that the TTL be delivered as a control message ("cmsg") in the ancillary data of a <code>recvmsg()</code> syscall. This is a good choice for UDP sockets, but the option is not supported by TCP SOCK_STREAM sockets.</p><p>Extracting the TTL is not so easy.</p>
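    <p>For completeness, here's what option B looks like on a UDP socket, where it does work - a sketch assuming Linux, where the TTL arrives as a 32-bit integer control message (per <code>ip(7)</code>):</p>

```python
import socket
import struct

IP_RECVTTL = getattr(socket, "IP_RECVTTL", 12)  # 12 on Linux, see ip(7)

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2)
# Ask the kernel to attach each datagram's TTL as ancillary data
rx.setsockopt(socket.IPPROTO_IP, IP_RECVTTL, 1)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"ping", rx.getsockname())

data, ancdata, flags, addr = rx.recvmsg(64, socket.CMSG_SPACE(4))
ttl = None
for level, ctype, cdata in ancdata:
    if level == socket.IPPROTO_IP and ctype == socket.IP_TTL:
        ttl = struct.unpack("i", cdata[:4])[0]  # delivered as a 32-bit int

assert data == b"ping"
assert ttl is not None and 0 < ttl <= 255  # loopback default is usually 64
```

The same trick is a dead end on a TCP socket: as noted above, SOCK_STREAM sockets never deliver this control message.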
    <div>
      <h3>SO_ATTACH_FILTER to rule the world!</h3>
      <a href="#so_attach_filter-to-rule-the-world">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qX4zkQCTMRtJsnaqzysvp/a95be0f638ee764a86591c9576671281/315128991_d49c312fbc_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/leejordan/315128991/">image</a> by <a href="https://www.flickr.com/photos/leejordan/">Lee Jordan</a></p><p>Wait, there is a third way!</p><p>You see, for quite some time it has been possible to attach a BPF filtering program to a socket. See <a href="http://man7.org/linux/man-pages/man7/socket.7.html"><code>socket(7)</code></a></p>
            <pre><code>SO_ATTACH_FILTER (since Linux 2.2), SO_ATTACH_BPF (since Linux 3.19)
    Attach a classic BPF (SO_ATTACH_FILTER) or an extended BPF
    (SO_ATTACH_BPF) program to the socket for use as a filter of
    incoming packets.  A packet will be dropped if the filter pro‐
    gram returns zero.  If the filter program returns a nonzero
    value which is less than the packet's data length, the packet
    will be truncated to the length returned.  If the value
    returned by the filter is greater than or equal to the
    packet's data length, the packet is allowed to proceed unmodi‐
    fied.</code></pre>
            <p>You probably take advantage of SO_ATTACH_FILTER already: this is how tcpdump/wireshark do filtering when you're dumping packets off the wire.</p><p>How does it work? Depending on the result of a <a href="/bpf-the-forgotten-bytecode/">BPF program</a>, packets can be filtered, truncated or passed to the socket without modification. Normally SO_ATTACH_FILTER is used for RAW sockets, but surprisingly, BPF filters can also be attached to normal SOCK_STREAM and SOCK_DGRAM sockets!</p><p>We don't want to truncate packets though - we want to extract the TTL. Unfortunately with Classical BPF (cBPF) it's impossible to extract any data from a running BPF filter program.</p>
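<p>Attaching a classic BPF filter to an ordinary socket really is that simple. A minimal Go sketch, assuming Linux (<code>attachAcceptAll</code> is our own hypothetical helper; <code>syscall.AttachLsf</code> is deprecated in favour of golang.org/x/net/bpf, but keeps the snippet stdlib-only):</p>

```go
package main

import (
	"fmt"
	"syscall"
)

// attachAcceptAll opens an ordinary UDP socket - not a raw one - and attaches
// a one-instruction classic BPF program: BPF_RET|BPF_K with K=0xffffffff,
// i.e. "accept the whole packet". Returning 0 instead would drop everything.
func attachAcceptAll() error {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return err
	}
	defer syscall.Close(fd)

	prog := []syscall.SockFilter{
		{Code: 0x06 /* BPF_RET | BPF_K */, K: 0xffffffff},
	}
	// AttachLsf wraps setsockopt(SO_ATTACH_FILTER) under the hood.
	return syscall.AttachLsf(fd, prog)
}

func main() {
	if err := attachAcceptAll(); err != nil {
		fmt.Println("attach failed:", err)
		return
	}
	fmt.Println("classic BPF filter attached to a SOCK_DGRAM socket")
}
```

No elevated privileges needed - the filter only affects packets delivered to a socket you already own.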
    <div>
      <h3>eBPF and maps</h3>
      <a href="#ebpf-and-maps">
        
      </a>
    </div>
    <p>This changed with modern BPF machinery, which includes:</p><ul><li><p>modernised eBPF bytecode</p></li><li><p>eBPF maps</p></li><li><p><a href="https://man7.org/linux/man-pages/man2/bpf.2.html"><code>bpf()</code> syscall</a></p></li><li><p>SO_ATTACH_BPF socket option</p></li></ul><p>eBPF bytecode can be thought of as an extension to Classical BPF, but it's the extra features that really let it shine.</p><p>The gem is the "map" abstraction. An eBPF map is a thingy that allows an eBPF program to store data and share it with userspace code. Think of an eBPF map as a data structure (most often a hash table) shared between a userspace program and an eBPF program running in kernel space.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/51BMjhXrkIIAlAatG0qsMh/bdd6b955f048ae0a2cd5f5005a460bfe/Screen-Shot-2018-03-29-at-10.59.43-AM.png" />
            
            </figure><p>To solve our TTL problem, we can use an eBPF filter program. It will look at the TTL values of passing packets, and save them in an eBPF map. Later, we can inspect the eBPF map and analyze the recorded values from userspace.</p>
    <div>
      <h3>SO_ATTACH_BPF to rule the world!</h3>
      <a href="#so_attach_bpf-to-rule-the-world">
        
      </a>
    </div>
    <p>To use eBPF we need a number of things set up. First, we need to create an "eBPF map". There <a href="https://elixir.bootlin.com/linux/v4.15.13/source/include/uapi/linux/bpf.h#L99">are many specialized map types</a>, but for our purposes let's use the "hash" BPF_MAP_TYPE_HASH type.</p><p>We need to figure out the <i>"bpf(BPF_MAP_CREATE, map type, key size, value size, limit, flags)"</i> parameters. For our small TTL program, let's set a 4-byte key size and an 8-byte value size. The max element limit is set to 5; it doesn't matter much, since we expect all the packets in one connection to share a single coherent TTL value anyway.</p><p>This is how it would look in <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go#L57">Go code</a>:</p>
            <pre><code>bpfMapFd, err := ebpf.NewMap(ebpf.Hash, 4, 8, 5, 0)</code></pre>
            <p>A word of warning is needed here. BPF maps use the "locked memory" resource. With multiple BPF programs and maps, it's easy to exhaust the default tiny 64 KiB limit. Consider bumping this with <code>ulimit -l</code>, for example:</p>
            <pre><code>ulimit -l 10240</code></pre>
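<p>The same bump can be done programmatically with <code>setrlimit(2)</code> - the equivalent of <code>ulimit -l</code> from inside the process. A hedged Go sketch, assuming Linux (<code>raiseMemlock</code> is our own helper name, and RLIMIT_MEMLOCK is not exported by the syscall package, so its Linux value of 8 is hard-coded):</p>

```go
package main

import (
	"fmt"
	"syscall"
)

// RLIMIT_MEMLOCK is not exported by the syscall package; 8 is its value on Linux.
const rlimitMemlock = 8

// raiseMemlock bumps the soft locked-memory limit up to the hard limit.
// Raising the hard limit itself would additionally require CAP_SYS_RESOURCE.
func raiseMemlock() (syscall.Rlimit, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(rlimitMemlock, &lim); err != nil {
		return lim, err
	}
	lim.Cur = lim.Max // soft = hard; always allowed for an unprivileged process
	if err := syscall.Setrlimit(rlimitMemlock, &lim); err != nil {
		return lim, err
	}
	return lim, nil
}

func main() {
	lim, err := raiseMemlock()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("RLIMIT_MEMLOCK soft limit now %d bytes\n", lim.Cur)
}
```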
            <p>The <code>bpf()</code> syscall returns a file descriptor pointing to the kernel BPF map we just created. With it handy we can operate on a map. The possible operations are:</p><ul><li><p><code>bpf(BPF_MAP_LOOKUP_ELEM, &lt;key&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_UPDATE_ELEM, &lt;key&gt;, &lt;value&gt;, &lt;flags&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_DELETE_ELEM, &lt;key&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_GET_NEXT_KEY, &lt;key&gt;)</code></p></li></ul><p>More on this later.</p><p>With the map created, we need to create a BPF program. As opposed to classical BPF - where the bytecode was a parameter to SO_ATTACH_FILTER - the bytecode is now loaded by the <code>bpf()</code> syscall. Specifically: <code>bpf(BPF_PROG_LOAD)</code>.</p><p>In our <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go#L78-L131">Golang program the eBPF program setup</a> looks like:</p>
            <pre><code>ebpfInss := ebpf.Instructions{
	ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),
	ebpf.BPFIDstOffImm(ebpf.JEqImm, ebpf.Reg0, 3, int32(htons(ETH_P_IPV6))),
	ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),
	ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+8)),
...
	ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
	ebpf.BPFIOp(ebpf.Exit),
}

bpfProgram, err := ebpf.NewProgram(ebpf.SocketFilter, &amp;ebpfInss, "GPL", 0)</code></pre>
            <p>Writing eBPF by hand is rather controversial. Most people use <code>clang</code> (from version 3.7 onwards) to compile code written in a C dialect into eBPF bytecode. The resulting bytecode is saved in an ELF file, which can be loaded by most eBPF libraries. This ELF file also includes a description of the maps, so you don’t need to set them up manually.</p><p>I personally don't see the point in adding an ELF/clang dependency for simple SO_ATTACH_BPF snippets. Don't be afraid of the raw bytecode!</p>
    <div>
      <h3>BPF calling convention</h3>
      <a href="#bpf-calling-convention">
        
      </a>
    </div>
    <p>Before we go further we should highlight a couple of things about the eBPF environment. The official kernel documentation isn't too friendly:</p><ul><li><p><a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">Documentation/networking/filter.txt</a></p></li></ul><p>The first important bit to know is the calling convention:</p><ul><li><p>R0 - return value from in-kernel function, and exit value for eBPF program</p></li><li><p>R1-R5 - arguments from eBPF program to in-kernel function</p></li><li><p>R6-R9 - callee saved registers that in-kernel function will preserve</p></li><li><p>R10 - read-only frame pointer to access stack</p></li></ul><p>When the BPF program starts, R1 contains a pointer to <code>ctx</code>. This data structure <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L799">is defined as <code>struct __sk_buff</code></a>. For example, to access the <code>protocol</code> field you'd need to run:</p>
            <pre><code>r0 = *(u32 *)(r1 + 16)</code></pre>
            <p>Or in other words:</p>
            <pre><code>ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),</code></pre>
    <p>Which is exactly what we do in the first line of our program, since we need to choose between the IPv4 and IPv6 code branches.</p>
    <div>
      <h3>Accessing the BPF payload</h3>
      <a href="#accessing-the-bpf-payload">
        
      </a>
    </div>
    <p>Next, there are special instructions for loading packet payload. Most BPF programs (but not all!) run in the context of packet filtering, so it makes sense to accelerate data lookups by having magic opcodes for accessing packet data.</p><p>Instead of dereferencing the context, like <code>ctx-&gt;data[x]</code>, to load a byte, BPF supports the <code>BPF_LD</code> instruction that can do it in one operation. There are caveats though; the documentation says:</p>
            <pre><code>eBPF has two non-generic instructions: (BPF_ABS | &lt;size&gt; | BPF_LD) and
(BPF_IND | &lt;size&gt; | BPF_LD) which are used to access packet data.

They had to be carried over from classic BPF to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to 'struct sk_buff' and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.</code></pre>
            <p>In other words: before calling <code>BPF_LD</code> we must move <code>ctx</code> to R6, like this:</p>
            <pre><code>ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),</code></pre>
            <p>Then we can call the load:</p>
            <pre><code>ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+8)),</code></pre>
            <p>At this stage the result is in r0, but we must remember that r1-r5 are now considered dirty - in this respect the <code>BPF_LD</code> instruction behaves very much like a function call.</p>
    <div>
      <h3>Magical Layer 3 offset</h3>
      <a href="#magical-layer-3-offset">
        
      </a>
    </div>
    <p>Next, note the load offset - we loaded byte <code>-0x100000+8</code> of the packet. This magic offset is another BPF context curiosity. It turns out that a BPF script loaded under SO_ATTACH_BPF on a SOCK_STREAM (or SOCK_DGRAM) socket will only see Layer 4 and higher OSI layers by default. To extract the TTL we need access to the Layer 3 header (i.e. the IP header). To access L3 in the L4 context, we must offset the data lookups by the magical -0x100000.</p><p>This magic constant <a href="https://github.com/torvalds/linux/blob/ead751507de86d90fa250431e9990a8b881f713c/include/uapi/linux/filter.h#L84">is defined in the kernel</a>.</p><p>For completeness, the <code>+8</code> is, of course, the offset of the TTL field in an IPv4 header. Our small BPF program also supports IPv6, where the Hop Limit field is at offset <code>+7</code>.</p>
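<p>The offsets are easy to verify against raw headers. This throwaway Go sketch (the <code>ttlFromIPHeader</code> helper is our own, not from the BPF program) reads the TTL/Hop Limit from hand-built IPv4 and IPv6 headers at exactly those offsets:</p>

```go
package main

import "fmt"

// ttlFromIPHeader reads the TTL / Hop Limit from a raw IP header:
// byte 8 of an IPv4 header, byte 7 of an IPv6 header - the same offsets
// the BPF program adds to the magical -0x100000 base.
func ttlFromIPHeader(pkt []byte) (int, error) {
	if len(pkt) < 9 {
		return 0, fmt.Errorf("short packet")
	}
	switch pkt[0] >> 4 { // IP version lives in the top nibble of byte 0
	case 4:
		return int(pkt[8]), nil
	case 6:
		return int(pkt[7]), nil
	}
	return 0, fmt.Errorf("not an IP packet")
}

func main() {
	// Minimal fake IPv4 header: version 4, IHL 5, TTL 64 at offset 8.
	ipv4 := make([]byte, 20)
	ipv4[0] = 0x45
	ipv4[8] = 64

	// Minimal fake IPv6 header: version 6, Hop Limit 255 at offset 7.
	ipv6 := make([]byte, 40)
	ipv6[0] = 0x60
	ipv6[7] = 255

	t4, _ := ttlFromIPHeader(ipv4)
	t6, _ := ttlFromIPHeader(ipv6)
	fmt.Println(t4, t6) // prints: 64 255
}
```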
    <div>
      <h3>Return value</h3>
      <a href="#return-value">
        
      </a>
    </div>
    <p>Finally, the return value of the BPF program is meaningful. In the context of packet filtering it will be interpreted as a truncated packet length. Had we returned 0, the packet would have been dropped and wouldn't be seen by the userspace socket application. It's quite interesting that we can do packet-based data manipulation with eBPF on a stream-based socket. Anyway, our script returns -1, which when cast to unsigned will be interpreted as a very large number:</p>
            <pre><code>ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
ebpf.BPFIOp(ebpf.Exit),</code></pre>
            
    <div>
      <h3>Extracting data from map</h3>
      <a href="#extracting-data-from-map">
        
      </a>
    </div>
    <p>Our running BPF program will set a key on our map for every matched packet. The key is the recorded TTL value, and the value is a packet count. The value counter is subject to a tiny race condition, but it's ignorable for our purposes. Later on, to extract the data in userspace we use this Go loop:</p>
            <pre><code>var (
	value   MapU64
	k1, k2  MapU32
)

for {
	ok, err := bpfMap.Get(k1, &amp;value, 8)
	if ok {
		// k1 is TTL, value is counter
		...
	}

	ok, err = bpfMap.GetNextKey(k1, &amp;k2, 4)
	if err != nil || ok == false {
		break
	}
	k1 = k2
}</code></pre>
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Now with all the pieces ready we can make it a proper runnable program. There is little point in discussing it here, so allow me to refer to the source code. The BPF pieces are here:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go">ebpf.go</a></p></li></ul><p>We haven't discussed how to catch the inbound SYN+ACK in the BPF program. This is a matter of setting up BPF before calling <code>connect()</code>. Sadly, it's impossible to customize <code>net.Dial</code> in Golang. Instead we wrote a surprisingly painful and awful custom Dial implementation. The ugly custom dialer code is here:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/magic_conn.go">magic_conn.go</a></p></li></ul><p>To run all this you need a 4.4+ kernel with the <code>bpf()</code> syscall compiled in. BPF features of specific kernels are documented in this superb page from BCC:</p><ul><li><p><a href="https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md">docs/kernel-versions.md</a></p></li></ul><p>Run the code to observe the TTL Hop Counts:</p>
            <pre><code>$ ./ttl-ebpf tcp4://google.com:80 tcp6://google.com:80 \
             tcp4://cloudflare.com:80 tcp6://cloudflare.com:80
[+] TTL distance to tcp4://google.com:80 172.217.4.174 is 6
[+] TTL distance to tcp6://google.com:80 [2607:f8b0:4005:809::200e] is 4
[+] TTL distance to tcp4://cloudflare.com:80 198.41.215.162 is 3
[+] TTL distance to tcp6://cloudflare.com:80 [2400:cb00:2048:1::c629:d6a2] is 3</code></pre>
            
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>In this blog post we dived into the new eBPF machinery, including the <code>bpf()</code> syscall, maps and SO_ATTACH_BPF. This work allowed me to realize the potential of running SO_ATTACH_BPF on fully established TCP sockets. Undoubtedly, eBPF still requires plenty of love and documentation, but it seems to be a perfect bridge to expose low level toggles to userspace applications.</p><p>I highly recommend keeping the dependencies small. For small BPF programs, like the one shown, there is little need for complex clang compilation and ELF loading. Don't be afraid of the eBPF bytecode!</p><p>We only touched on SO_ATTACH_BPF, where we analyzed network packets with BPF running on network sockets. There is more! First, you can attach BPFs to a dozen "things", XDP being the most obvious example. <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L119">Full list</a>. Then, it's possible to actually affect kernel packet processing; here is a <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L318">full list of helper functions</a>, some of which can modify kernel data structures.</p><p>In February <a href="https://lwn.net/Articles/747551/">LWN jokingly wrote</a>:</p>
            <pre><code>Developers should be careful, though; this could
prove to be a slippery slope leading toward something 
that starts to look like a microkernel architecture.</code></pre>
            <p>There is a grain of truth here. Maybe the ability to run eBPF on a variety of subsystems feels like microkernel coding, but SO_ATTACH_BPF definitely smells like the <a href="https://en.wikipedia.org/wiki/STREAMS">STREAMS</a> programming model <a href="https://cseweb.ucsd.edu/classes/fa01/cse221/papers/ritchie-stream-io-belllabs84.pdf">from 1984</a>.</p><hr /><p>Thanks to <a href="https://twitter.com/akajibi">Gilberto Bertin</a> and <a href="https://twitter.com/dwragg">David Wragg</a> for helping out with the eBPF bytecode.</p><hr /><p><i>Does doing eBPF work sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland</i>.</p> ]]></content:encoded>
            <category><![CDATA[TTL]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">6LTjo1ZkNHiWvy46IuloEd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>