The Cloudflare Blog

Building the foundation for running extra-large language models

Michelle Chen — Thu, 16 Apr 2026 14:00:00 GMT

An agent needs to be powered by a large language model. A few weeks ago, we announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot’s Kimi K2.5. Since then, we’ve made Kimi K2.5 3x faster and have more model additions in-flight. These models have been the backbone of a lot of the agentic products, harnesses, and tools that we have been launching this week.

Hosting AI models is an interesting challenge: it requires a delicate balance between software and very, very expensive hardware. At Cloudflare, we’re good at squeezing every bit of efficiency out of our hardware through clever software engineering. This is a deep dive on how we’re laying the foundation to run extra-large language models.

Hardware configurations

As we mentioned in our previous Kimi K2.5 blog post, we’re using a variety of hardware configurations in order to best serve models. A lot of hardware configurations depend on the size of inputs and outputs that users are sending to the model. For example, if you are using a model to write fanfiction, you might give it a few small prompts (input tokens) while asking it to generate pages of content (output tokens).

Conversely, if you are running a summarization task, you might be sending in hundreds of thousands of input tokens, but only generating a small summary with a few thousand output tokens. Presented with these opposing use cases, you have to make a choice — should you tune your model configuration so it’s faster at processing input tokens, or faster at generating output tokens?

When we launched large language models on Workers AI, we knew that most of the use cases would be used for agents. With agents, you send in a large number of input tokens. It starts off with a large system prompt, all the tools, MCPs. With the first user prompt, that context keeps growing. Each new prompt from the user sends a request to the model, which consists of everything that was said before — all the previous user prompts, assistant messages, code generated, etc. For Workers AI, that means we had to focus on two things: fast input token processing and fast tool calling.

Prefill decode (PD) disaggregation

One hardware configuration that we use to improve performance and efficiency is disaggregated prefill. There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. This means that the parts of the GPU that are used in each stage are different, and since prefill is always done before decode, the stages block one another. Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine.

With prefill decode disaggregation, separate inference servers are run for each stage. First, a request is sent to the prefill stage which performs prefill and stores it in its KV cache. Then the same request is sent to the decode server, with information about how to transfer the KV cache from the prefill server and begin decoding. This has a number of advantages, because it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware.

This architecture requires a relatively complex load balancer to achieve. Beyond just routing the requests as described above, it must rewrite the responses (including streaming SSE) of the decode server to include information from the prefill server such as cached tokens. To complicate matters, different inference servers require different information to initiate the KV cache transfer. We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly.

After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customer’s use cases.

Here’s a graph of our p90 Time to First Token drop after shifting traffic to our new PD disaggregated architecture, whilst request volume increased, using the same quantity of GPUs. We see a significant improvement in the tail latency variance.

Similarly, p90 time per token went from ~100 ms with high variance to 20-30 ms, a 3x improvement in intertoken latency.

Prompt Caching

Since agentic use cases usually have long contexts, we optimize for efficient prompt caching in order to not recompute input tensors on every turn. We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors. We wrote about this in our original blog post about launching large LLMs on Workers AI. We added session affinity headers to popular agent harnesses like OpenCode, where we noticed a significant increase in total throughput. A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model. While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens. We highly encourage users to leverage prompt caching in order to have faster inference and cheaper pricing.

We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews.

KV-cache optimization

As we’re serving larger models now, one instance can span multiple GPUs. This means that we had to find an efficient way to share KV cache across GPUs. KV cache is where all the input tensors from prefill (result of prompts in a session) are stored, and initially lives in the VRAM of a GPU. Every GPU has a fixed VRAM size, but if your model instance requires multiple GPUs, there needs to be a way for the KV cache to live across GPUs and talk to each other. To achieve this for Kimi, we leveraged Moonshot AI’s Mooncake Transfer Engine and Mooncake Store.

Mooncake’s Transfer Engine is a high-performance data transfer framework. It works with different Remote Direct Memory Access (RDMA) protocols such as NVLink and NVMe over Fabric, which enables direct memory-to-memory data transfer without involving the CPU. It improves the speed of transferring data across multiple GPU machines, which is particularly important in multi-GPU and multi-node configurations for models.

When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly. Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users.

Speculative decoding

LLMs work by predicting the next token in a sequence, based on the tokens that came before it. With a naive implementation, models only predict the next n token, but we can actually make it predict the next n+1, n+2... tokens in a single forward pass of the model. This popular technique is known as speculative decoding, which we’ve written about in a previous post on Workers AI.

With speculative decoding, we leverage a smaller LLM (the draft model) to generate a few candidate tokens for the target model to choose from. The target model then just has to select from a small pool of candidate tokens in a single forward pass. Validating the tokens is faster and less computationally expensive than using the larger target model to generate the tokens. However, quality is still upheld as the target model ultimately has to accept or reject the draft tokens.

In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it’s wrapped in a JSON envelope.

To do this with Kimi K2.5, we leverage NVIDIA’s EAGLE-3 (Extrapolation Algorithm for Greater Language-model Efficiency) draft model. The levers for tuning speculative decoding include the number of future tokens to generate. As a result, we’re able to achieve high-quality inference while speeding up tokens per second throughput.

Infire: our proprietary inference engine

As we announced during Birthday Week in 2025, Cloudflare has a proprietary inference engine, Infire, that makes machine learning models faster. Infire is an inference engine written in Rust, designed to support Cloudflare’s unique challenges with inference given our distributed global network. We’ve extended Infire support for this new class of large language models we are planning to run, which meant we had to build a few new features to make it all work.

Multi-GPU support

Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that’s not even including the extra VRAM you would need for KV Cache, which includes your context window.

Since we initially launched Infire, we had to add support for multi-GPU, letting the inference engine run across multiple GPUs in either pipeline-parallel or tensor-parallel modes with expert-parallelism supported as well.

For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing. On the other hand, for tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible. For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency.

Even lower memory overhead

While already having much lower GPU memory overhead than vLLM, we optimized Infire even further, tightening the memory required for internal state like activations. Currently Infire is capable of running Llama 4 Scout on just two H200 GPUs with more than 56 GiB remaining for KV-cache, sufficient for more than 1.2m tokens. Infire is also capable of running Kimi K2.5 on 8 H100 GPUs (yes that is H100), with more than 30 GiB still available for KV-cache. In both cases you would have trouble even booting vLLM in the first place.

Faster cold-starts

While adding multi-GPU support, we identified additional opportunities to improve boot times. Even for the largest models, such as Kimi K2.5, Infire can begin serving requests in under 20 seconds. The load times are only bounded by the drive speed.

Maximizing our hardware for faster throughput

Investing in our proprietary inference engine enables us to maximize our hardware by getting up to 20% higher tokens per second throughput on unconstrained systems, and also enabling us to use lower-end hardware to run the latest models, where it was previously completely infeasible.

The journey doesn’t end

New technologies, research, and models come out on a weekly basis for the machine learning community. We’re continuously optimizing our technology stack in order to provide high-quality, performant inference for our customers while operating our GPUs efficiently. If these sound like interesting challenges for you – we’re hiring!

How we built the most efficient inference engine for Cloudflare’s network

Vlad Krasnov — Wed, 27 Aug 2025 14:00:00 GMT

Inference powers some of today’s most powerful AI products: chat bot replies, AI agents, autonomous vehicle decisions, and fraud detection. The problem is, if you’re building one of these products on top of a hyperscaler, you’ll likely need to rent expensive GPUs from large centralized data centers to run your inference tasks. That model doesn’t work for Cloudflare — there’s a mismatch between Cloudflare’s globally-distributed network and a typical centralized AI deployment using large multi-GPU nodes. As a company that operates our own compute on a lean, fast, and widely distributed network within 50ms of 95% of the world’s Internet-connected population, we need to be running inference tasks more efficiently than anywhere else.

This is further compounded by the fact that AI models are getting larger and more complex. As we started to support these models, like the Llama 4 herd and gpt-oss, we realized that we couldn’t just throw money at the scaling problems by buying more GPUs. We needed to utilize every bit of idle capacity and be agile with where each model is deployed.

After running most of our models on the widely used open source inference and serving engine vLLM, we figured out it didn’t allow us to fully utilize the GPUs at the edge. Although it can run on a very wide range of hardware, from personal devices to data centers, it is best optimized for large data centers. When run as a dedicated inference server on powerful hardware serving a specific model, vLLM truly shines. However, it is much less optimized for dynamic workloads, distributed networks, and for the unique security constraints of running inference at the edge alongside other services.

That’s why we decided to build something that will be able to meet the needs of Cloudflare inference workloads for years to come. Infire is an LLM inference engine, written in Rust, that employs a range of techniques to maximize memory, network I/O, and GPU utilization. It can serve more requests with fewer GPUs and significantly lower CPU overhead, saving time, resources, and energy across our network.

Our initial benchmarking has shown that Infire completes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded machines equipped with an H100 NVL GPU. On infrastructure under real load, it performs significantly better.

Currently, Infire is powering the Llama 3.1 8B model for Workers AI, and you can test it out today at @cf/meta/llama-3.1-8b-instruct!

The Architectural Challenge of LLM Inference at Cloudflare

Thanks to industry efforts, inference has improved a lot over the past few years. vLLM has led the way here with the recent release of the vLLM V1 engine with features like an optimized KV cache, improved batching, and the implementation of Flash Attention 3. vLLM is great for most inference workloads — we’re currently using it for several of the models in our Workers AI catalog — but as our AI workloads and catalog has grown, so has our need to optimize inference for the exact hardware and performance requirements we have.

Cloudflare is writing much of our new infrastructure in Rust, and vLLM is written in Python. Although Python has proven to be a great language for prototyping ML workloads, to maximize efficiency we need to control the low-level implementation details. Implementing low-level optimizations through multiple abstraction layers and Python libraries adds unnecessary complexity and leaves a lot of CPU performance on the table, simply due to the inefficiencies of Python as an interpreted language.

We love to contribute to open-source projects that we use, but in this case our priorities may not fit the goals of the vLLM project, so we chose to write a server for our needs. For example, vLLM does not support co-hosting multiple models on the same GPU without using Multi-Instance GPU (MIG), and we need to be able to dynamically schedule multiple models on the same GPU to minimize downtime. We also have an in-house AI Research team exploring unique features that are difficult, if not impossible, to upstream to vLLM.

Finally, running code securely is our top priority across our platform and Workers AI is no exception. We simply can’t trust a 3rd party Python process to run on our edge nodes alongside the rest of our services without strong sandboxing. We are therefore forced to run vLLM via gvisor. Having an extra virtualization layer adds another performance overhead to vLLM. More importantly, it also increases the startup and tear downtime for vLLM instances — which are already pretty long. Under full load on our edge nodes, vLLM running via gvisor consumes as much as 2.5 CPU cores, and is forced to compete for CPU time with other crucial services, that in turn slows vLLM down and lowers GPU utilization as a result.

While developing Infire, we’ve been incorporating the latest research in inference efficiency — let’s take a deeper look at what we actually built.

How Infire works under the hood

Infire is composed of three major components: an OpenAI compatible HTTP server, a batcher, and the Infire engine itself.

^{An overview of Infire’s architecture}

Platform startup

When a model is first scheduled to run on a specific node in one of our data centers by our auto-scaling service, the first thing that has to happen is for the model weights to be fetched from our R2 object storage. Once the weights are downloaded, they are cached on the edge node for future reuse.

As the weights become available either from cache or from R2, Infire can begin loading the model onto the GPU.

Model sizes vary greatly, but most of them are large, so transferring them into GPU memory can be a time-consuming part of Infire’s startup process. For example, most non-quantized models store their weights in the BF16 floating point format. This format has the same dynamic range as the 32-bit floating format, but with reduced accuracy. It is perfectly suited for inference providing the sweet spot of size, performance and accuracy. As the name suggests, the BF16 format requires 16 bits, or 2 bytes per weight. The approximate in-memory size of a given model is therefore double the size of its parameters. For example, LLama3.1 8B has approximately 8B parameters, and its memory footprint is about 16 GB. A larger model, like LLama4 Scout, has 109B parameters, and requires around 218 GB of memory. Infire utilizes a combination of Page Locked memory with CUDA asynchronous copy mechanism over multiple streams to speed up model transfer into GPU memory.

While loading the model weights, Infire begins just-in-time compiling the required kernels based on the model's parameters, and loads them onto the device. Parallelizing the compilation with model loading amortizes the latency of both processes. The startup time of Infire when loading the Llama-3-8B-Instruct model from disk is just under 4 seconds.

The HTTP server

The Infire server is built on top of hyper, a high performance HTTP crate, which makes it possible to handle hundreds of connections in parallel – while consuming a modest amount of CPU time. Because of ChatGPT’s ubiquity, vLLM and many other services offer OpenAI compatible endpoints out of the box. Infire is no different in that regard. The server is responsible for handling communication with the client: accepting connections, handling prompts and returning responses. A prompt will usually consist of some text, or a "transcript" of a chat session along with extra parameters that affect how the response is generated. Some parameters that come with a prompt include the temperature, which affects the randomness of the response, as well as other parameters that affect the randomness and length of a possible response.

After a request is deemed valid, Infire will pass it to the tokenizer, which transforms the raw text into a series of tokens, or numbers that the model can consume. Different models use different kinds of tokenizers, but the most popular ones use byte-pair encoding. For tokenization, we use HuggingFace's tokenizers crate. The tokenized prompts and params are then sent to the batcher, and scheduled for processing on the GPU, where they will be processed as vectors of numbers, called embeddings.

The batcher

The most important part of Infire is in how it does batching: by executing multiple requests in parallel. This makes it possible to better utilize memory bandwidth and caches.

In order to understand why batching is so important, we need to understand how the inference algorithm works. The weights of a model are essentially a bunch of two-dimensional matrices (also called tensors). The prompt represented as vectors is passed through a series of transformations that are largely dominated by one operation: vector-by-matrix multiplication. The model weights are so large, that the cost of the multiplication is dominated by the time it takes to fetch it from memory. In addition, modern GPUs have hardware units dedicated to matrix-by-matrix multiplications (called Tensor Cores on Nvidia GPUs). In order to amortize the cost of memory access and take advantage of the Tensor Cores, it is necessary to aggregate multiple operations into a larger matrix multiplication.

Infire utilizes two techniques to increase the size of those matrix operations. The first one is called prefill: this technique is applied to the prompt tokens. Because all the prompt tokens are available in advance and do not require decoding, they can all be processed in parallel. This is one reason why input tokens are often cheaper (and faster) than output tokens.

^{How Infire enables larger matrix multiplications via batching}

The other technique is called batching: this technique aggregates multiple prompts into a single decode operation.

Infire mixes both techniques. It attempts to process as many prompts as possible in parallel, and fills the remaining slots in a batch with prefill tokens from incoming prompts. This is also known as continuous batching with chunked prefill.

As tokens get decoded by the Infire engine, the batcher is also responsible for retiring prompts that reach an End of Stream token, and sending tokens back to the decoder to be converted into text.

Another job the batcher has is handling the KV cache. One demanding operation in the inference process is called attention. Attention requires going over the KV values computed for all the tokens up to the current one. If we had to recompute those previously encountered KV values for every new token we decode, the runtime of the process would explode for longer context sizes. However, using a cache, we can store all the previous values and re-read them for each consecutive token. Potentially the KV cache for a prompt can store KV values for as many tokens as the context window allows. In LLama 3, the maximal context window is 128K tokens. If we pre-allocated the KV cache for each prompt in advance, we would only have enough memory available to execute 4 prompts in parallel on H100 GPUs! The solution for this is paged KV cache. With paged KV caching, the cache is split into smaller chunks called pages. When the batcher detects that a prompt would exceed its KV cache, it simply assigns another page to that prompt. Since most prompts rarely hit the maximum context window, this technique allows for essentially unlimited parallelism under typical load.

Finally, the batcher drives the Infire forward pass by scheduling the needed kernels to run on the GPU.

CUDA kernels

Developing Infire gives us the luxury of focusing on the exact hardware we use, which is currently Nvidia Hopper GPUs. This allowed us to improve performance of specific compute kernels using low-level PTX instructions for this specific architecture.

Infire just-in-time compiles its kernel for the specific model it is running, optimizing for the model’s parameters, such as the hidden state size, dictionary size and the GPU it is running on. For some operations, such as large matrix multiplications, Infire will utilize the high performance cuBLASlt library, if it would deem it faster.

Infire also makes use of very fine-grained CUDA graphs, essentially creating a dedicated CUDA graph for every possible batch size on demand. It then stores it for future launch. Conceptually, a CUDA graph is another form of just-in-time compilation: the CUDA driver replaces a series of kernel launches with a single construct (the graph) that has a significantly lower amortized kernel launch cost, thus kernels executed back to back will execute faster when launched as a single graph as opposed to individual launches.

How Infire performs in the wild

We ran synthetic benchmarks on one of our edge nodes with an H100 NVL GPU.

The benchmark we ran was on the widely used ShareGPT v3 dataset. We ran the benchmark on a set of 4,000 prompts with a concurrency of 200. We then compared Infire and vLLM running on bare metal as well as vLLM running under gvisor, which is the way we currently run in production. In a production traffic scenario, an edge node would be competing for resources with other traffic. To simulate this, we benchmarked vLLM running in gvisor with only one CPU available.

	requests/s	tokens/s	CPU load
Infire	40.91	17224.21	25%
vLLM 0.10.0	38.38	16164.41	140%
vLLM under gvisor	37.13	15637.32	250%
vLLM under gvisor with CPU constraints	22.04	9279.25	100%

As evident from the benchmarks we achieved our initial goal of matching and even slightly surpassing vLLM performance, but more importantly, we’ve done so at a significantly lower CPU usage, in large part because we can run Infire as a trusted bare-metal process. Inference no longer takes away precious resources from our other services and we see GPU utilization upward of 80%, reducing our operational costs.

This is just the beginning. There are still multiple proven performance optimizations yet to be implemented in Infire – for example, we’re integrating Flash Attention 3, and most of our kernels don’t utilize kernel fusion. Those and other optimizations will allow us to unlock even faster inference in the near future.

What’s next

Running AI inference presents novel challenges and demands to our infrastructure. Infire is how we’re running AI efficiently — close to users around the world. By building upon techniques like continuous batching, a paged KV-cache, and low-level optimizations tailored to our hardware, Infire maximizes GPU utilization while minimizing overhead. Infire completes inference tasks faster and with a fraction of the CPU load of our previous vLLM-based setup, especially under the strict security constraints we require. This allows us to serve more requests with fewer resources, making requests served via Workers AI faster and more efficient.

However, this is just our first iteration — we’re excited to build in multi-GPU support for larger models, quantization, and true multi-tenancy into the next version of Infire. This is part of our goal to make Cloudflare the best possible platform for developers to build AI applications.

Want to see if your AI workloads are faster on Cloudflare? Get started with Workers AI today.

Sippy helps you avoid egress fees while incrementally migrating data from S3 to R2

Phillip Jones — Tue, 26 Sep 2023 13:00:44 GMT

Earlier in 2023, we announced Super Slurper, a data migration tool that makes it easy to copy large amounts of data to R2 from other cloud object storage providers. Since the announcement, developers have used Super Slurper to run thousands of successful migrations to R2!

While Super Slurper is perfect for cases where you want to move all of your data to R2 at once, there are scenarios where you may want to migrate your data incrementally over time. Maybe you want to avoid the one time upfront AWS data transfer bill? Or perhaps you have legacy data that may never be accessed, and you only want to migrate what’s required?

Today, we’re announcing the open beta of Sippy, an incremental migration service that copies data from S3 (other cloud providers coming soon!) to R2 as it’s requested, without paying unnecessary cloud egress fees typically associated with moving large amounts of data. On top of addressing vendor lock-in, Sippy makes stressful, time-consuming migrations a thing of the past. All you need to do is replace the S3 endpoint in your application or attach your domain to your new R2 bucket and data will start getting copied over.

How does it work?

Sippy is an incremental migration service built directly into your R2 bucket. Migration-specific egress fees are reduced by leveraging requests within the flow of your application where you’d already be paying egress fees to simultaneously copy objects to R2. Here is how it works:

When an object is requested from Workers, S3 API, or public bucket, it is served from your R2 bucket if it is found.

If the object is not found in R2, it will simultaneously be returned from your S3 bucket and copied to R2.

Note: Some large objects may take multiple requests to copy.

That means after objects are copied, subsequent requests will be served from R2, and you’ll begin saving on egress fees immediately.

Start incrementally migrating data from S3 to R2

Create an R2 bucket

To get started with incremental migration, you’ll first need to create an R2 bucket if you don’t already have one. To create a new R2 bucket from the Cloudflare dashboard:

Log in to the Cloudflare dashboard and select R2.
Select Create bucket.
Give your bucket a name and select Create bucket.

To learn more about other ways to create R2 buckets refer to the documentation on creating buckets.

Enable Sippy on your R2 bucket

Next, you’ll enable Sippy for the R2 bucket you created. During the beta, you can do this by using the API. Here’s an example of how to enable Sippy for an R2 bucket with cURL:

curl -X PUT https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/buckets/{bucket_name}/sippy \
--header "Authorization: Bearer " \
--data '{"provider": "AWS", "bucket": "", "zone": "","key_id": "", "access_key":"", "r2_key_id": "", "r2_access_key": ""}'

For more information on getting started, please refer to the documentation. Once enabled, requests to your bucket will now start copying data over from S3 if it’s not already present in your R2 bucket.

Finish your migration with Super Slurper

You can run your incremental migration for as long as you want, but eventually you may want to complete the migration to R2. To do this, you can pair Sippy with Super Slurper to easily migrate your remaining data that hasn’t been accessed to R2.

What’s next?

We’re excited about open beta, but it’s only the starting point. Next, we plan on making incremental migration configurable from the Cloudflare dashboard, complete with analytics that show you the progress of your migration and how much you are saving by not paying egress fees for objects that have been copied over so far.

If you are looking to start incrementally migrating your data to R2 and have any questions or feedback on what we should build next, we encourage you to join our Discord community to share!

BoringTun, a userspace WireGuard implementation in Rust

Vlad Krasnov — Wed, 27 Mar 2019 13:43:27 GMT

Today we are happy to release the source code of a project we’ve been working on for the past few months. It is called BoringTun, and is a userspace implementation of the WireGuard® protocol written in Rust.

A Bit About WireGuard

WireGuard is relatively new project that attempts to replace old VPN protocols, with a simple, fast, and safe protocol. Unlike legacy VPNs, WireGuard is built around the Noise Protocol Framework and relies only on a select few, modern, cryptographic primitives: X25519 for public key operations, ChaCha20-Poly1305 for authenticated encryption, and Blake2s for message authentication.

Like QUIC, WireGuard works over UDP, but its only goal is to securely encapsulate IP packets. As a result, it does not guarantee the delivery of packets, or that packets are delivered in the order they are sent.

The simplicity of the protocol means it is more robust than old, unmaintainable codebases, and can also be implemented relatively quickly. Despite its relatively young age, WireGuard is quickly gaining in popularity.

Starting From Scratch

While evaluating the potential value WireGuard could provide us, we first considered the existing implementations. Currently, there are three usable implementations

The WireGuard kernel module - written in C, it is tightly integrated with the Linux kernel, and is not usable outside of it. Due to its integration with the kernel it provides the best possible performance. It is licensed under the GPL-2.0 license.
wireguard-go - this is the only compliant userspace implementation of WireGuard. As its name suggests it is written in Go, a language that we love, and is licensed under the permissive MIT license.
TunSafe - written in C++, it does not implement the userspace protocol exactly, but rather a deviation of it. It supports several platforms, but by design it supports only a single peer. TunSafe uses the AGPL-1.0 license.

Whereas we were looking for:

Userspace
Cross-platform - including Linux, Windows, macOS, iOS and Android
Fast

Clearly we thought, only one of those fits the bill, and that is wireguard-go. However, benchmarks quickly showed that wireguard-go falls very short of the performance offered by the kernel module. This is because while the Go language is very good for writing servers, it is not so good for raw packet processing, which a VPN essentially does.

Choosing Rust

After we decided to create a userspace WireGuard implementation, there was the small matter of choosing the right language. While C and C++ are both high performance, low level languages, recent history has demonstrated that their memory model was too fragile for a modern cryptography and security-oriented project. Go was shown to be suboptimal for this use case by wireguard-go.

The obvious answer was Rust. Rust is a modern, safe language that is both as fast as C++ and is arguably safer than Go (it is memory safe and also imposes rules that allow for safer concurrency), while supporting a huge selection of platforms. We also have some of the best Rust talent in the industry working at the company.

In fact, another Rust implementation of WireGuard, wireguard-rs, exists. But wireguard-rs is very immature, and we strongly felt that it would benefit the WireGuard ecosystem if there was a completely independent implementation under a permissive license.

Thus BoringTun was born.

The name might sound a bit boring but there's a reason for it: BoringTun creates a tunnel by 'boring' it. And it’s a nod to Google’s BoringSSL which strips the complexity out of OpenSSL. We think WireGuard has the opportunity to do the same for legacy VPN protocols like OpenVPN. And we hope BoringTun can be a valuable tool as part of that ecosystem.

WireGuard is an incredible tool and we believe it has a chance of being the defacto standard for VPN-like technologies going forward. We're adding our Rust implementation of WireGuard to the ecosystem and hope people find it useful.

Next steps

BoringTun is currently under internal security review, and it is probably not fully ready to be used in mission critical tasks. We will fix issues as we find them, and we also welcome contributions from the community on the projects Github page. The project is licensed under the open source 3-Clause BSD License.

Note: WireGuard is a registered trademark of Jason A. Donenfeld.

Know your SCM_RIGHTS

Vlad Krasnov — Thu, 29 Nov 2018 09:54:22 GMT

As TLS 1.3 was ratified earlier this year, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of TLS 1.3 a little over two years ago. It was a very important decision, and we took it very seriously.

It is no secret that Cloudflare uses nginx to handle user traffic. A little less known fact, is that we have several instances of nginx running. I won’t go into detail, but there is one instance whose job is to accept connections on port 443, and proxy them to another instance of nginx that actually handles the requests. It has pretty limited functionality otherwise. We fondly call it nginx-ssl.

Back then we were using OpenSSL for TLS and Crypto in nginx, but OpenSSL (and BoringSSL) had yet to announce a timeline for TLS 1.3 support, therefore we had to implement our own TLS 1.3 stack. Obviously we wanted an implementation that would not affect any customer or client that would not enable TLS 1.3. We also needed something that we could iterate on quickly, because the spec was very fluid back then, and also something that we can release frequently without worrying about the rest of the Cloudflare stack.

The obvious solution was to implement it on top of OpenSSL. The OpenSSL version we were using was 1.0.2, but not only were we looking ahead to replace it with version 1.1.0 or with BoringSSL (which we eventually did), it was so ingrained in our stack and so fragile that we wouldn’t be able to achieve our stated goals, without risking serious bugs.

Instead, Filippo Valsorda and Brendan McMillion suggested that the easier path would be to implement TLS 1.3 on top of the Go TLS library and make a Go replica of nginx-ssl (go-ssl). Go is very easy to iterate and prototype, with a powerful standard library, and we had a great pool of Go talent to use, so it made a lot of sense. Thus tls-tris was born.

The question remained how would we have Go handle only TLS 1.3 while letting nginx handling all prior versions of TLS?

And herein lies the problem. Both TLS 1.3 and older versions of TLS communicate on port 443, and it is common knowledge that only one application can listen on a given TCP port, and that application is nginx, that would still handle the bulk of the TLS traffic. We could pipe all the TCP data into another connection in Go, effectively creating an additional proxy layer, but where is the fun in that? Also it seemed a little inefficient.

Meet SCM_RIGHTS

So how do you make two different processes, written in two different programming languages, share the same TCP socket?

Fortunately, Linux (or rather UNIX) provides us with just the tool that we need. You can use UNIX-domain sockets to pass file descriptors between applications, and like everything else in UNIX connections are files.Looking at man 7 unix we see the following:

   Ancillary messages
       Ancillary  data  is  sent and received using sendmsg(2) and recvmsg(2).
       For historical reasons the ancillary message  types  listed  below  are
       specified with a SOL_SOCKET type even though they are AF_UNIX specific.
       To send them  set  the  cmsg_level  field  of  the  struct  cmsghdr  to
       SOL_SOCKET  and  the cmsg_type field to the type.  For more information
       see cmsg(3).

       SCM_RIGHTS
              Send or receive a set of  open  file  descriptors  from  another
              process.  The data portion contains an integer array of the file
              descriptors.  The passed file descriptors behave as though  they
              have been created with dup(2).

Technically you do not send “file descriptors”. The “file descriptors” you handle in the code are simply indices into the processes' local file descriptor table, which in turn points into the OS' open file table, that finally points to the vnode representing the file. Thus the “file descriptor” observed by the other process will most likely have a different numeric value, despite pointing to the same file.

We can also check man 3 cmsg as suggested, to find a handy example on how to use SCM_RIGHTS:

   struct msghdr msg = { 0 };
   struct cmsghdr *cmsg;
   int myfds[NUM_FD];  /* Contains the file descriptors to pass */
   int *fdptr;
   union {         /* Ancillary data buffer, wrapped in a union
                      in order to ensure it is suitably aligned */
       char buf[CMSG_SPACE(sizeof(myfds))];
       struct cmsghdr align;
   } u;

   msg.msg_control = u.buf;
   msg.msg_controllen = sizeof(u.buf);
   cmsg = CMSG_FIRSTHDR(&msg);
   cmsg->cmsg_level = SOL_SOCKET;
   cmsg->cmsg_type = SCM_RIGHTS;
   cmsg->cmsg_len = CMSG_LEN(sizeof(int) * NUM_FD);
   fdptr = (int *) CMSG_DATA(cmsg);    /* Initialize the payload */
   memcpy(fdptr, myfds, NUM_FD * sizeof(int));

And that is what we decided to use. We let OpenSSL read the “Client Hello” message from an established TCP connection. If the “Client Hello” indicated TLS version 1.3, we would use SCM_RIGHTS to send it to the Go process. The Go process would in turn try to parse the rest of the “Client Hello”, if it were successful it would proceed with TLS 1.3 connection, and upon failure it would give the file descriptor back to OpenSSL, to handle regularly.

So how exactly do you implement something like that?

Since in our case we established that the C process will listen for TCP connections, our other process will have to listen on a UNIX socket, for connections C will want to forward.

For example in Go:

type scmListener struct {
	*net.UnixListener
}

type scmConn struct {
	*net.UnixConn
}

var path = "/tmp/scm_example.sock"

func listenSCM() (*scmListener, error) {
	syscall.Unlink(path)

	addr, err := net.ResolveUnixAddr("unix", path)
	if err != nil {
		return nil, err
	}

	ul, err := net.ListenUnix("unix", addr)
	if err != nil {
		return nil, err
	}

	err = os.Chmod(path, 0777)
	if err != nil {
		return nil, err
	}

	return &scmListener{ul}, nil
}

func (l *scmListener) Accept() (*scmConn, error) {
	uc, err := l.AcceptUnix()
	if err != nil {
		return nil, err
	}
	return &scmConn{uc}, nil
}

Then in the C process, for each connection we want to pass, we will connect to that socket first:

int connect_unix()
{
    struct sockaddr_un addr = {.sun_family = AF_UNIX,
                               .sun_path = "/tmp/scm_example.sock"};

    int unix_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (unix_sock == -1)
        return -1;

    if (connect(unix_sock, (struct sockaddr *)&addr, sizeof(addr)) == -1)
    {
        close(unix_sock);
        return -1;
    }

    return unix_sock;
}

To actually pass a file descriptor we utilize the example from man 3 cmsg:

int send_fd(int unix_sock, int fd)
{
    struct iovec iov = {.iov_base = ":)", // Must send at least one byte
                        .iov_len = 2};

    union {
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;

    struct msghdr msg = {.msg_iov = &iov,
                         .msg_iovlen = 1,
                         .msg_control = u.buf,
                         .msg_controllen = sizeof(u.buf)};

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    *cmsg = (struct cmsghdr){.cmsg_level = SOL_SOCKET,
                             .cmsg_type = SCM_RIGHTS,
                             .cmsg_len = CMSG_LEN(sizeof(fd))};

    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

    return sendmsg(unix_sock, &msg, 0);
}

Then to receive the file descriptor in Go:

func (c *scmConn) ReadFD() (*os.File, error) {
	msg, oob := make([]byte, 2), make([]byte, 128)

	_, oobn, _, _, err := c.ReadMsgUnix(msg, oob)
	if err != nil {
		return nil, err
	}

	cmsgs, err := syscall.ParseSocketControlMessage(oob[0:oobn])
	if err != nil {
		return nil, err
	} else if len(cmsgs) != 1 {
		return nil, errors.New("invalid number of cmsgs received")
	}

	fds, err := syscall.ParseUnixRights(&cmsgs[0])
	if err != nil {
		return nil, err
	} else if len(fds) != 1 {
		return nil, errors.New("invalid number of fds received")
	}

	fd := os.NewFile(uintptr(fds[0]), "")
	if fd == nil {
		return nil, errors.New("could not open fd")
	}

	return fd, nil
}

Rust

We can also do this in Rust, although the standard library in Rust does not yet support UNIX sockets, but it does let you address the C library via the libc crate. Warning, unsafe code ahead!

First we want to implement some UNIX socket functionality in Rust:

use libc::*;
use std::io::prelude::*;
use std::net::TcpStream;
use std::os::unix::io::FromRawFd;
use std::os::unix::io::RawFd;

fn errno_str() -> String {
    let strerr = unsafe { strerror(*__error()) };
    let c_str = unsafe { std::ffi::CStr::from_ptr(strerr) };
    c_str.to_string_lossy().into_owned()
}

pub struct UNIXSocket {
    fd: RawFd,
}

pub struct UNIXConn {
    fd: RawFd,
}

impl Drop for UNIXSocket {
    fn drop(&mut self) {
        unsafe { close(self.fd) };
    }
}

impl Drop for UNIXConn {
    fn drop(&mut self) {
        unsafe { close(self.fd) };
    }
}

impl UNIXSocket {
    pub fn new() -> Result {
        match unsafe { socket(AF_UNIX, SOCK_STREAM, 0) } {
            -1 => Err(errno_str()),
            fd @ _ => Ok(UNIXSocket { fd }),
        }
    }

    pub fn bind(self, address: &str) -> Result {
        assert!(address.len() < 104);

        let mut addr = sockaddr_un {
            sun_len: std::mem::size_of::() as u8,
            sun_family: AF_UNIX as u8,
            sun_path: [0; 104],
        };

        for (i, c) in address.chars().enumerate() {
            addr.sun_path[i] = c as i8;
        }

        match unsafe {
            unlink(&addr.sun_path as *const i8);
            bind(
                self.fd,
                &addr as *const sockaddr_un as *const sockaddr,
                std::mem::size_of::() as u32,
            )
        } {
            -1 => Err(errno_str()),
            _ => Ok(self),
        }
    }

    pub fn listen(self) -> Result {
        match unsafe { listen(self.fd, 50) } {
            -1 => Err(errno_str()),
            _ => Ok(self),
        }
    }

    pub fn accept(&self) -> Result {
        match unsafe { accept(self.fd, std::ptr::null_mut(), std::ptr::null_mut()) } {
            -1 => Err(errno_str()),
            fd @ _ => Ok(UNIXConn { fd }),
        }
    }
}

And the code to extract the file desciptor:

#[repr(C)]
pub struct ScmCmsgHeader {
    cmsg_len: c_uint,
    cmsg_level: c_int,
    cmsg_type: c_int,
    fd: c_int,
}

impl UNIXConn {
    pub fn recv_fd(&self) -> Result {
        let mut iov = iovec {
            iov_base: std::ptr::null_mut(),
            iov_len: 0,
        };

        let mut scm = ScmCmsgHeader {
            cmsg_len: 0,
            cmsg_level: 0,
            cmsg_type: 0,
            fd: 0,
        };

        let mut mhdr = msghdr {
            msg_name: std::ptr::null_mut(),
            msg_namelen: 0,
            msg_iov: &mut iov as *mut iovec,
            msg_iovlen: 1,
            msg_control: &mut scm as *mut ScmCmsgHeader as *mut c_void,
            msg_controllen: std::mem::size_of::() as u32,
            msg_flags: 0,
        };

        let n = unsafe { recvmsg(self.fd, &mut mhdr, 0) };

        if n == -1
            || scm.cmsg_len as usize != std::mem::size_of::()
            || scm.cmsg_level != SOL_SOCKET
            || scm.cmsg_type != SCM_RIGHTS
        {
            Err("Invalid SCM message".to_string())
        } else {
            Ok(scm.fd)
        }
    }
}

Conclusion

SCM_RIGHTS is a very powerful tool that can be used for many purposes. In our case we used to to introduce a new service in a non-obtrusive fashion. Other uses may be:

A/B testing
Phasing out of an old C based service in favor of new Go or Rust one
Passing connections from a privileged process to an unprivileged one

And more

You can find the full example here.

NEON is the new black: fast JPEG optimization on ARM server

Vlad Krasnov — Fri, 13 Apr 2018 16:38:47 GMT

As engineers at Cloudflare quickly adapt our software stack to run on ARM, a few parts of our software stack have not been performing as well on ARM processors as they currently do on our Xeon® Silver 4116 CPUs. For the most part this is a matter of Intel specific optimizations some of which utilize SIMD or other special instructions.

One such example is the venerable jpegtran, one of the workhorses behind our Polish image optimization service.

A while ago I optimized our version of jpegtran for Intel processors. So when I ran a comparison on my test image, I was expecting that the Xeon would outperform ARM:

vlad@xeon:~$ time  ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m2.305s
user    0m2.059s
sys     0m0.252s

vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m8.654s
user    0m8.433s
sys     0m0.225s

Ideally we want to have the ARM performing at or above 50% of the Xeon performance per core. This would make sure we have no performance regressions, and net performance gain, since the ARM CPUs have double the core count as our current 2 socket setup.

In this case, however, I was disappointed to discover an almost 4X slowdown.

Not one to despair, I figured out that applying the same optimizations I did for Intel would be trivial. Surely the NEON instructions map neatly to the SSE instructions I used before?

CC BY-SA 2.0 image by viZZZual.com

What is NEON

NEON is the ARMv8 version of SIMD, Single Instruction Multiple Data instruction set, where a single operation performs (generally) the same operation on several operands.

NEON operates on 32 dedicated 128-bit registers, similarly to Intel SSE. It can perform operations on 32-bit and 64-bit floating point numbers, or 8-bit, 16-bit, 32-bit and 64-bit signed or unsigned integers.

As with SSE you can program either in the assembly language, or in C using intrinsics. The intrinsics are usually easier to use, and depending on the application and the compiler can provide better performance, however intrinsics based code tends to be quite verbose.

If you opt to use the NEON intrinsics you have to include . While SSE intrinsic use __m128i for all SIMD integer operations, the intrinsics for NEON have distinct type for each integer and float width. For example operations on signed 16-bit integers use the int16x8_t type, which we are going to use. Similarly there is a uint16x8_t type for unsigned integer, as well as int8x16_t, int32x4_t and int64x2_t and their uint derivatives, that are self explanatory.

Getting started

Running perf tells me that the same two culprits are responsible for most of the CPU time spent:

perf record ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpeg
perf report
  71.24%  lt-jpegtran  libjpeg.so.9.1.0   [.] encode_mcu_AC_refine
  15.24%  lt-jpegtran  libjpeg.so.9.1.0   [.] encode_mcu_AC_first

Aha, encode_mcu_AC_refine and encode_mcu_AC_first, my old nemeses!

The straightforward approach

encode_mcu_AC_refine

Let's recoup the optimizations we applied to encode_mcu_AC_refine previously. The function has two loops, with the heavier loop performing the following operation:

for (k = cinfo->Ss; k <= Se; k++) {
  temp = (*block)[natural_order[k]];
  if (temp < 0)
    temp = -temp;      /* temp is abs value of input */
  temp >>= Al;         /* apply the point transform */
  absvalues[k] = temp; /* save abs value for main pass */
  if (temp == 1)
    EOB = k;           /* EOB = index of last newly-nonzero coef */
}

And the SSE solution to this problem was:

__m128i x1 = _mm_setzero_si128(); // Load 8 16-bit values sequentially
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+0]], 0);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+1]], 1);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+2]], 2);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+3]], 3);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+4]], 4);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+5]], 5);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+6]], 6);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+7]], 7);

x1 = _mm_abs_epi16(x1);       // Get absolute value of 16-bit integers
x1 = _mm_srli_epi16(x1, Al);  // >> 16-bit integers by Al bits

_mm_storeu_si128((__m128i*)&absvalues[k], x1);   // Store

x1 = _mm_cmpeq_epi16(x1, _mm_set1_epi16(1));     // Compare to 1
unsigned int idx = _mm_movemask_epi8(x1);        // Extract byte mask
EOB = idx? k + 16 - __builtin_clz(idx)/2 : EOB;  // Compute index

For the most part the transition to NEON is indeed straightforward.

To initialize a register to all zeros, we can use the vdupq_n_s16 intrinsic, that duplicates a given value across all lanes of a register. The insertions are performed with the vsetq_lane_s16 intrinsic. Use vabsq_s16 to get the absolute values.

The shift right instruction made me pause for a while. I simply couldn't find an instruction that can shift right by a non constant integer value. It doesn't exist. However the solution is very simple, you shift left by a negative amount! The intrinsic for that is vshlq_s16.

The absence of a right shift instruction is no coincidence. Unlike the x86 instruction set, that can theoretically support arbitrarily long instructions, and thus don't have to think twice before adding a new instruction, no matter how specialized or redundant it is, ARMv8 instruction set can only support 32-bit long instructions, and have a very limited opcode space. For this reason the instruction set is much more concise, and many instructions are in fact aliases to other instruction. Even the most basic MOV instruction is an alias for ORR (binary or). That means that programming for ARM and NEON sometimes requires greater creativity.

The final step of the loop, is comparing each element to 1, then getting the mask. Comparing for equality is performed with vceqq_s16. But again there is no operation to extract the mask. That is a problem. However, instead of getting a bitmask, it is possible to extract a whole byte from every lane into a 64-bit value, by first applying vuzp1q_u8 to the comparison result. vuzp1q_u8 interleaves the even indexed bytes of two vectors (whereas vuzp2q_u8 interleaves the odd indexes). So the solution would look something like that:

int16x8_t zero = vdupq_n_s16(0);
int16x8_t al_neon = vdupq_n_s16(-Al);
int16x8_t x0 = zero;
int16x8_t x1 = zero;

// Load 8 16-bit values sequentially
x1 = vsetq_lane_s16((*block)[natural_order[k+0]], x1, 0);
// Interleave the loads to compensate for latency
x0 = vsetq_lane_s16((*block)[natural_order[k+1]], x0, 1);
x1 = vsetq_lane_s16((*block)[natural_order[k+2]], x1, 2);
x0 = vsetq_lane_s16((*block)[natural_order[k+3]], x0, 3);
x1 = vsetq_lane_s16((*block)[natural_order[k+4]], x1, 4);
x0 = vsetq_lane_s16((*block)[natural_order[k+5]], x0, 5);
x1 = vsetq_lane_s16((*block)[natural_order[k+6]], x1, 6);
x0 = vsetq_lane_s16((*block)[natural_order[k+7]], x0, 7);
int16x8_t x = vorrq_s16(x1, x0);

x = vabsq_s16(x);            // Get absolute value of 16-bit integers
x = vshlq_s16(x, al_neon);   // >> 16-bit integers by Al bits

vst1q_s16(&absvalues[k], x); // Store
uint8x16_t is_one = vreinterpretq_u8_u16(vceqq_s16(x, one));  // Compare to 1
is_one = vuzp1q_u8(is_one, is_one);  // Compact the compare result into 64 bits

uint64_t idx = vgetq_lane_u64(vreinterpretq_u64_u8(is_one), 0); // Extract
EOB = idx ? k + 8 - __builtin_clzl(idx)/8 : EOB;                // Get the index

Note the intrinsics for explicit type casts. They don't actually emit any instructions, since regardless of the type the operands always occupy the same registers.

On to the second loop:

if ((temp = absvalues[k]) == 0) {
  r++;
  continue;
}

The SSE solution was:

__m128i t = _mm_loadu_si128((__m128i*)&absvalues[k]);
t = _mm_cmpeq_epi16(t, _mm_setzero_si128()); // Compare to 0
int idx = _mm_movemask_epi8(t);              // Extract byte mask
if (idx == 0xffff) {                         // Skip all zeros
  r += 8;
  k += 8;
  continue;
} else {                                     // Skip up to the first nonzero
  int skip = __builtin_ctz(~idx)/2;
  r += skip;
  k += skip;
  if (k>Se) break;      // Stop if gone too far
}
temp = absvalues[k];    // Load the next nonzero value

But we already know that there is no way to extract the byte mask. Instead of using NEON I chose to simply skip four zero values at a time, using 64-bit integers, like so:

uint64_t tt, *t = (uint64_t*)&absvalues[k];
if ( (tt = *t) == 0) while ( (tt = *++t) == 0); // Skip while all zeroes
int skip = __builtin_ctzl(tt)/16 + ((int64_t)t - 
           (int64_t)&absvalues[k])/2;           // Get index of next nonzero
k += skip;
r += skip;
temp = absvalues[k];

How fast are we now?

vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m4.008s
user    0m3.770s
sys     0m0.241s

Wow, that is incredible. Over 2X speedup!

encode_mcu_AC_first

The other function is quite similar, but the logic slightly differs on the first pass:

temp = (*block)[natural_order[k]];
if (temp < 0) {
  temp = -temp;             // Temp is abs value of input
  temp >>= Al;              // Apply the point transform
  temp2 = ~temp;
} else {
  temp >>= Al;              // Apply the point transform
  temp2 = temp;
}
t1[k] = temp;
t2[k] = temp2;

Here it is required to assign the absolute value of temp to t1[k], and its inverse to t2[k] if temp is negative, otherwise t2[k] assigned the same value as t1[k].

To get the inverse of a value, we use the vmvnq_s16 intrinsic, to check if the values are negative we need to compare with zero using the vcgezq_s16 and finally selecting based on the mask using vbslq_s16.

int16x8_t zero = vdupq_n_s16(0);
int16x8_t al_neon = vdupq_n_s16(-Al);

int16x8_t x0 = zero;
int16x8_t x1 = zero;

// Load 8 16-bit values sequentially
x1 = vsetq_lane_s16((*block)[natural_order[k+0]], x1, 0);
// Interleave the loads to compensate for latency
x0 = vsetq_lane_s16((*block)[natural_order[k+1]], x0, 1);
x1 = vsetq_lane_s16((*block)[natural_order[k+2]], x1, 2);
x0 = vsetq_lane_s16((*block)[natural_order[k+3]], x0, 3);
x1 = vsetq_lane_s16((*block)[natural_order[k+4]], x1, 4);
x0 = vsetq_lane_s16((*block)[natural_order[k+5]], x0, 5);
x1 = vsetq_lane_s16((*block)[natural_order[k+6]], x1, 6);
x0 = vsetq_lane_s16((*block)[natural_order[k+7]], x0, 7);
int16x8_t x = vorrq_s16(x1, x0);

uint16x8_t is_positive = vcgezq_s16(x); // Get positive mask

x = vabsq_s16(x);                 // Get absolute value of 16-bit integers
x = vshlq_s16(x, al_neon);        // >> 16-bit integers by Al bits
int16x8_t n = vmvnq_s16(x);       // Binary inverse
n = vbslq_s16(is_positive, x, n); // Select based on positive mask

vst1q_s16(&t1[k], x); // Store
vst1q_s16(&t2[k], n);

And the moment of truth:

vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m3.480s
user    0m3.243s
sys     0m0.241s

Overall 2.5X speedup from the original C implementation, but still 1.5X slower than Xeon.

Batch benchmark

While the improvement for the single image was impressive, it is not necessarily representative of all jpeg files. To understand the impact on overall performance I ran jpegtran over a set of 34,159 actual images from one of our caches. The total size of those images was 3,325,253KB. The total size after jpegtran was 3,067,753KB, or 8% improvement on average.

Using one thread, the Intel Xeon managed to process all those images in 14 minutes and 43 seconds. The original jpegtran on our ARM server took 29 minutes and 34 seconds. The improved jpegtran took only 13 minutes and 52 seconds, slightly outperforming even the Xeon processor, despite losing on the test image.

Going deeper

3.48 seconds, down from 8.654 represents a respectful 2.5X speedup.

It definitely meets the goal of being at least 50% as fast as Xeon, and it is faster in the batch benchmark, but it still feels like it is slower than it could be.

While going over the ARMv8 NEON instruction set, I found several unique instructions, that have no equivalent in SSE.

The first such instruction is TBL. It works as a lookup table, that can lookup 8 or 16 bytes from one to four consecutive registers. In the single register variant it is similar to the pshufb SSE instruction. In the four register variant, however, it can simultaneously lookup 16 bytes in a 64 byte table! What sorcery is that?

The intrinsic to use the 4 register variant is vqtbl4q_u8. Interestingly there is an instruction that can lookup 64 bytes in AVX-512, but we don't want to use that.

The next interesting thing I found, are instructions that can load or store and de/interleave data at the same time. They can load or store up to four registers simultaneously, while de/interleaving two, three or even four elements, of any supported width. The specifics are well presented in here. The load intrinsics used are of the form: vldNq_uW, where N can be 1,2,3,4 to indicate the interleave factor and W can be 8, 16, 32 or 64. Similarly vldNq_sW is used for signed types.

Finally very interesting instructions are the shift left/right and insert SLI and SRI. What they do is they shift the elements left or right, like a regular shift would, however instead of shifting in zero bits, the zeros are replaced with the original bits of the destination register! An intrinsic for that would look like vsliq_n_u16 or vsriq_n_u32.

Applying the new instructions

It might not be visible at first how those new instruction can help. Since I didn't have much time to dig into libjpeg or the jpeg spec, I had to resolve to heuristics.

From a quick look it became apparent that *block is defined as an array of 64 16-bit values. natural_order is an array of 32-bit integers that varies in length depending on the real block size, but is always padded with 16 entries. Also, despite the fact that it uses integers, the values are indexes in the range [0..63].

Another interesting observation is that blocks of size 64 are the most common by far for both encode_mcu_AC_refine and encode_mcu_AC_first. And it always makes sense to optimize for the most common case.

So essentially what we have here, is a 64 entry lookup table *block that uses natural_order as indices. Hmm, 64 entry lookup table, where did I see that before? Of course, the TBL instruction. Although TBL looks up bytes, and we need to lookup shorts, it is easy to do, since NEON lets us load and deinterleave the short into bytes in a single instruction using LD2, then we can use two lookups for each byte individually, and finally interleave again with ZIP1 and ZIP2. Similarly despite the fact that the indices are integers, and we only need the least significant byte of each, we can use LD4 to deinterleave them into bytes (the kosher way of course would be to rewrite the library to use bytes, but I wanted to avoid big changes).

After the data loading step is done, the point transforms for both functions remain the same, but in the end, to get a single bitmask for all 64 values we can use SLI and SRI to intelligently align the bits such that only one bit of each comparison mask remains, using TBL again to combine them.

For whatever reason, the compiler in that case produces somewhat suboptimal code, so I had to revert to assembly language for this specific optimization.

The code for encode_mcu_AC_refine:

    # Load and deintreleave the block
    ld2 {v0.16b - v1.16b}, [x0], 32
    ld2 {v16.16b - v17.16b}, [x0], 32
    ld2 {v18.16b - v19.16b}, [x0], 32
    ld2 {v20.16b - v21.16b}, [x0]

    mov v4.16b, v1.16b
    mov v5.16b, v17.16b
    mov v6.16b, v19.16b
    mov v7.16b, v21.16b
    mov v1.16b, v16.16b
    mov v2.16b, v18.16b
    mov v3.16b, v20.16b
    # Load the order 
    ld4 {v16.16b - v19.16b}, [x1], 64
    ld4 {v17.16b - v20.16b}, [x1], 64
    ld4 {v18.16b - v21.16b}, [x1], 64
    ld4 {v19.16b - v22.16b}, [x1]
    # Table lookup, LSB and MSB independently
    tbl v20.16b, {v0.16b - v3.16b}, v16.16b
    tbl v16.16b, {v4.16b - v7.16b}, v16.16b
    tbl v21.16b, {v0.16b - v3.16b}, v17.16b
    tbl v17.16b, {v4.16b - v7.16b}, v17.16b
    tbl v22.16b, {v0.16b - v3.16b}, v18.16b
    tbl v18.16b, {v4.16b - v7.16b}, v18.16b
    tbl v23.16b, {v0.16b - v3.16b}, v19.16b
    tbl v19.16b, {v4.16b - v7.16b}, v19.16b
    # Interleave MSB and LSB back
    zip1 v0.16b, v20.16b, v16.16b
    zip2 v1.16b, v20.16b, v16.16b
    zip1 v2.16b, v21.16b, v17.16b
    zip2 v3.16b, v21.16b, v17.16b
    zip1 v4.16b, v22.16b, v18.16b
    zip2 v5.16b, v22.16b, v18.16b
    zip1 v6.16b, v23.16b, v19.16b
    zip2 v7.16b, v23.16b, v19.16b
    # -Al
    neg w3, w3
    dup v16.8h, w3
    # Absolute then shift by Al
    abs v0.8h, v0.8h
    sshl v0.8h, v0.8h, v16.8h
    abs v1.8h, v1.8h
    sshl v1.8h, v1.8h, v16.8h
    abs v2.8h, v2.8h
    sshl v2.8h, v2.8h, v16.8h
    abs v3.8h, v3.8h
    sshl v3.8h, v3.8h, v16.8h
    abs v4.8h, v4.8h
    sshl v4.8h, v4.8h, v16.8h
    abs v5.8h, v5.8h
    sshl v5.8h, v5.8h, v16.8h
    abs v6.8h, v6.8h
    sshl v6.8h, v6.8h, v16.8h
    abs v7.8h, v7.8h
    sshl v7.8h, v7.8h, v16.8h
    # Store
    st1 {v0.16b - v3.16b}, [x2], 64
    st1 {v4.16b - v7.16b}, [x2]
    # Constant 1
    movi v16.8h, 0x1
    # Compare with 0 for zero mask
    cmeq v17.8h, v0.8h, #0
    cmeq v18.8h, v1.8h, #0
    cmeq v19.8h, v2.8h, #0
    cmeq v20.8h, v3.8h, #0
    cmeq v21.8h, v4.8h, #0
    cmeq v22.8h, v5.8h, #0
    cmeq v23.8h, v6.8h, #0
    cmeq v24.8h, v7.8h, #0
    # Compare with 1 for EOB mask
    cmeq v0.8h, v0.8h, v16.8h
    cmeq v1.8h, v1.8h, v16.8h
    cmeq v2.8h, v2.8h, v16.8h
    cmeq v3.8h, v3.8h, v16.8h
    cmeq v4.8h, v4.8h, v16.8h
    cmeq v5.8h, v5.8h, v16.8h
    cmeq v6.8h, v6.8h, v16.8h
    cmeq v7.8h, v7.8h, v16.8h
    # For both masks -> keep only one byte for each comparison
    uzp1 v0.16b, v0.16b, v1.16b
    uzp1 v1.16b, v2.16b, v3.16b
    uzp1 v2.16b, v4.16b, v5.16b
    uzp1 v3.16b, v6.16b, v7.16b

    uzp1 v17.16b, v17.16b, v18.16b
    uzp1 v18.16b, v19.16b, v20.16b
    uzp1 v19.16b, v21.16b, v22.16b
    uzp1 v20.16b, v23.16b, v24.16b
    # Shift left and insert (int16) to get a single bit from even to odd bytes
    sli v0.8h, v0.8h, 15
    sli v1.8h, v1.8h, 15
    sli v2.8h, v2.8h, 15
    sli v3.8h, v3.8h, 15

    sli v17.8h, v17.8h, 15
    sli v18.8h, v18.8h, 15
    sli v19.8h, v19.8h, 15
    sli v20.8h, v20.8h, 15
    # Shift right and insert (int32) to get two bits from off to even indices
    sri v0.4s, v0.4s, 18
    sri v1.4s, v1.4s, 18
    sri v2.4s, v2.4s, 18
    sri v3.4s, v3.4s, 18

    sri v17.4s, v17.4s, 18
    sri v18.4s, v18.4s, 18
    sri v19.4s, v19.4s, 18
    sri v20.4s, v20.4s, 18
    # Regular shift right to align the 4 bits at the bottom of each int64
    ushr v0.2d, v0.2d, 12
    ushr v1.2d, v1.2d, 12
    ushr v2.2d, v2.2d, 12
    ushr v3.2d, v3.2d, 12

    ushr v17.2d, v17.2d, 12
    ushr v18.2d, v18.2d, 12
    ushr v19.2d, v19.2d, 12
    ushr v20.2d, v20.2d, 12
    # Shift left and insert (int64) to combine all 8 bits into one byte
    sli v0.2d, v0.2d, 36
    sli v1.2d, v1.2d, 36
    sli v2.2d, v2.2d, 36
    sli v3.2d, v3.2d, 36

    sli v17.2d, v17.2d, 36
    sli v18.2d, v18.2d, 36
    sli v19.2d, v19.2d, 36
    sli v20.2d, v20.2d, 36
    # Combine all the byte mask insto a bit 64-bit mask for EOB and zero masks
    ldr d4, .shuf_mask
    tbl v5.8b, {v0.16b - v3.16b}, v4.8b
    tbl v6.8b, {v17.16b - v20.16b}, v4.8b
    # Extract lanes
    mov x0, v5.d[0]
    mov x1, v6.d[0]
    # Compute EOB
    rbit x0, x0
    clz x0, x0
    mov x2, 64
    sub x0, x2, x0
    # Not of zero mask (so 1 bits indecates non-zeroes)
    mvn x1, x1
    ret

If you look carefully at the code, you will see, that I decided that while generating the mask to find EOB is useful, I can use the same method to generate the mask for zero values, and then I can find the next nonzero value, and zero runlength this way:

uint64_t skip =__builtin_clzl(zero_mask << k);
r += skip;
k += skip;

Similarly for encode_mcu_AC_first:

    # Load the block
    ld2 {v0.16b - v1.16b}, [x0], 32
    ld2 {v16.16b - v17.16b}, [x0], 32
    ld2 {v18.16b - v19.16b}, [x0], 32
    ld2 {v20.16b - v21.16b}, [x0]

    mov v4.16b, v1.16b
    mov v5.16b, v17.16b
    mov v6.16b, v19.16b
    mov v7.16b, v21.16b
    mov v1.16b, v16.16b
    mov v2.16b, v18.16b
    mov v3.16b, v20.16b

    # Load the order 
    ld4 {v16.16b - v19.16b}, [x1], 64
    ld4 {v17.16b - v20.16b}, [x1], 64
    ld4 {v18.16b - v21.16b}, [x1], 64
    ld4 {v19.16b - v22.16b}, [x1]
    # Table lookup, LSB and MSB independently
    tbl v20.16b, {v0.16b - v3.16b}, v16.16b
    tbl v16.16b, {v4.16b - v7.16b}, v16.16b
    tbl v21.16b, {v0.16b - v3.16b}, v17.16b
    tbl v17.16b, {v4.16b - v7.16b}, v17.16b
    tbl v22.16b, {v0.16b - v3.16b}, v18.16b
    tbl v18.16b, {v4.16b - v7.16b}, v18.16b
    tbl v23.16b, {v0.16b - v3.16b}, v19.16b
    tbl v19.16b, {v4.16b - v7.16b}, v19.16b
    # Interleave MSB and LSB back
    zip1 v0.16b, v20.16b, v16.16b
    zip2 v1.16b, v20.16b, v16.16b
    zip1 v2.16b, v21.16b, v17.16b
    zip2 v3.16b, v21.16b, v17.16b
    zip1 v4.16b, v22.16b, v18.16b
    zip2 v5.16b, v22.16b, v18.16b
    zip1 v6.16b, v23.16b, v19.16b
    zip2 v7.16b, v23.16b, v19.16b
    # -Al
    neg w4, w4
    dup v24.8h, w4
    # Compare with 0 to get negative mask
    cmge v16.8h, v0.8h, #0
    # Absolute value and shift by Al
    abs v0.8h, v0.8h
    sshl v0.8h, v0.8h, v24.8h
    cmge v17.8h, v1.8h, #0
    abs v1.8h, v1.8h
    sshl v1.8h, v1.8h, v24.8h
    cmge v18.8h, v2.8h, #0
    abs v2.8h, v2.8h
    sshl v2.8h, v2.8h, v24.8h
    cmge v19.8h, v3.8h, #0
    abs v3.8h, v3.8h
    sshl v3.8h, v3.8h, v24.8h
    cmge v20.8h, v4.8h, #0
    abs v4.8h, v4.8h
    sshl v4.8h, v4.8h, v24.8h
    cmge v21.8h, v5.8h, #0
    abs v5.8h, v5.8h
    sshl v5.8h, v5.8h, v24.8h
    cmge v22.8h, v6.8h, #0
    abs v6.8h, v6.8h
    sshl v6.8h, v6.8h, v24.8h
    cmge v23.8h, v7.8h, #0
    abs v7.8h, v7.8h
    sshl v7.8h, v7.8h, v24.8h
    # ~
    mvn v24.16b, v0.16b
    mvn v25.16b, v1.16b
    mvn v26.16b, v2.16b
    mvn v27.16b, v3.16b
    mvn v28.16b, v4.16b
    mvn v29.16b, v5.16b
    mvn v30.16b, v6.16b
    mvn v31.16b, v7.16b
    # Select
    bsl v16.16b, v0.16b, v24.16b
    bsl v17.16b, v1.16b, v25.16b
    bsl v18.16b, v2.16b, v26.16b
    bsl v19.16b, v3.16b, v27.16b
    bsl v20.16b, v4.16b, v28.16b
    bsl v21.16b, v5.16b, v29.16b
    bsl v22.16b, v6.16b, v30.16b
    bsl v23.16b, v7.16b, v31.16b
    # Store t1
    st1 {v0.16b - v3.16b}, [x2], 64
    st1 {v4.16b - v7.16b}, [x2]
    # Store t2
    st1 {v16.16b - v19.16b}, [x3], 64
    st1 {v20.16b - v23.16b}, [x3]
    # Compute zero mask like before
    cmeq v17.8h, v0.8h, #0
    cmeq v18.8h, v1.8h, #0
    cmeq v19.8h, v2.8h, #0
    cmeq v20.8h, v3.8h, #0
    cmeq v21.8h, v4.8h, #0
    cmeq v22.8h, v5.8h, #0
    cmeq v23.8h, v6.8h, #0
    cmeq v24.8h, v7.8h, #0

    uzp1 v17.16b, v17.16b, v18.16b
    uzp1 v18.16b, v19.16b, v20.16b
    uzp1 v19.16b, v21.16b, v22.16b
    uzp1 v20.16b, v23.16b, v24.16b

    sli v17.8h, v17.8h, 15
    sli v18.8h, v18.8h, 15
    sli v19.8h, v19.8h, 15
    sli v20.8h, v20.8h, 15

    sri v17.4s, v17.4s, 18
    sri v18.4s, v18.4s, 18
    sri v19.4s, v19.4s, 18
    sri v20.4s, v20.4s, 18

    ushr v17.2d, v17.2d, 12
    ushr v18.2d, v18.2d, 12
    ushr v19.2d, v19.2d, 12
    ushr v20.2d, v20.2d, 12

    sli v17.2d, v17.2d, 36
    sli v18.2d, v18.2d, 36
    sli v19.2d, v19.2d, 36
    sli v20.2d, v20.2d, 36

    ldr d4, .shuf_mask
    tbl v6.8b, {v17.16b - v20.16b}, v4.8b

    mov x0, v6.d[0]
    mvn x0, x0
    ret

Final results and power

The final version of our jpegtran managed to reduce the test image in 2.756 seconds. Or an extra 1.26X speedup, that gets it incredibly close to the performance of the Xeon on that image. As a bonus batch performance also improved!

Another favorite part of mine, working with the Qualcomm Centriq CPU is the ability to take power readings, and be pleasantly surprised every time.

With the new implementation Centriq outperforms the Xeon at batch reduction for every number of workers. We usually run Polish with four workers, for which Centriq is now 1.3 times faster while also 6.5 times more power efficient.

Conclusion

It is evident that the Qualcomm Centriq is a powerful processor, that definitely provides a good bang for a buck. However, years of Intel leadership in the server and desktop space mean that a lot of software is better optimized for Intel processors.

For the most part writing optimizations for ARMv8 is not difficult, and we will be adjusting our software as needed, and publishing our efforts as we go.

You can find the updated code on our Github page.

Useful resources

How "expensive" is crypto anyway?

Vlad Krasnov — Thu, 28 Dec 2017 18:22:17 GMT

I wouldn’t be surprised if the title of this post attracts some Bitcoin aficionados, but if you are such, I want to disappoint you. For me crypto means cryptography, not cybermoney, and the price we pay for it is measured in CPU cycles, not USD.

If you got to this second paragraph you probably heard that TLS today is very cheap to deploy. Considerable effort was put to optimize the cryptography stacks of OpenSSL and BoringSSL, as well as the hardware that runs them. However, aside for the occasional benchmark, that can tell us how many GB/s a given algorithm can encrypt, or how many signatures a certain elliptic curve can generate, I did not find much information about the cost of crypto in real world TLS deployments.

CC BY-SA 2.0 image by Michele M. F.

As Cloudflare is the largest provider of TLS on the planet, one would think we perform a lot of cryptography related tasks, and one would be absolutely correct. More than half of our external traffic is now TLS, as well as all of our internal traffic. Being in that position means that crypto performance is critical to our success, and as it happens, every now and then we like to profile our production servers, to identify and fix hot spots.

In this post I want to share the latest profiling results that relate to crypto.

The profiled server is located in our Frankfurt data center, and sports 2 Xeon Silver 4116 processors. Every geography has a slightly different use pattern of TLS. In Frankfurt 73% of the requests are TLS, and the negotiated cipher-suites break down like so:

Processing all of those different ciphersuites, BoringSSL consumes just 1.8% of the CPU time. That’s right, mere 1.8%. And that is not even pure cryptography, there is a considerable overhead involved too.

Let’s take a deeper dive, shall we?

Ciphers

If we break down the negotiated cipher suites, by the AEAD used, we get the following breakdown:

BoringSSL speed tells us that AES-128-GCM, ChaCha20-Poly1305 and AES-128-CBC-SHA1 can achieve encryption speeds of 3,733.3 MB/s, 1,486.9 MB/s and 387.0 MB/s, but this speed varies greatly as a function of the record size. Indeed we see that GCM uses proportionately less CPU time.

Still the CPU time consumed by encryption and decryption depends on typical record size, as well as the amount of data processed, both metrics we don’t currently log. We do know that ChaCha20-Poly1305 is usually used by older phones, where the connections are short lived to save power, while AES-CBC is used for … well your guess is as good as mine who still uses AES-CBC and for what, but good thing its usage keeps declining.

Finally keep in mind that 6.8% of BoringSSL usage in the graph translates into 6.8% x 1.8% = 0.12% of total CPU time.

Public Key

Public key algorithms in TLS serve two functions.

The first function is as a key exchange algorithm, the prevalent algorithm here is ECDHE that uses the NIST P256 curve, the runner up is ECDHE using DJB’s x25519 curve. Finally there is a small fraction that still uses RSA for key exchange, the only key exchange algorithm currently used, that does not provide Forward Secrecy guarantees.

The second function is that of a signature used to sign the handshake parameters and thus authenticate the server to the client. As a signature RSA is very much alive, present in almost one quarter of the connections, the other three quarters using ECDSA.

BoringSSL speed reports that a single core on our server can perform 1,120 RSA2048 signatures/s, 120 RSA4096 signatures/s, 18,477 P256 ESDSA signatures/s, 9,394 P256 ECDHE operations/s and 9,278 x25519 ECDHE operations/s.

Looking at the CPU consumption, it is clear that RSA is very expensive. Roughly half the time BoringSSL performs an operation related to RSA. P256 consumes twice as much CPU time as x25519, but considering that it handles twice as much key-exchanges, while also being used as a signature, that is commendable.

If you want to make the internet a better place, please get an ECDSA signed certificate next time!

Hash functions

Only two hash function are currently used in TLS: SHA1 and SHA2 (including SHA384). SHA3 will probably debut with TLS1.3. Hash functions serve several purposes in TLS. First, they are used as part of the signature for both the certificate and the handshake, second they are used for key derivation, finally when using AES-CBC, SHA1 and SHA2 are used in HMAC to authenticate the records.

Here we see SHA1 consuming more resources than expected, but that is really because it is used as HMAC, whereas most cipher suites that negotiate SHA256 use AEADs. In terms of benchmarks BoringSSL speed reports 667.7 MB/s for SHA1, 309.0 MB/s for SHA256 and 436.0 MB/s for SHA512 (truncated to SHA384 in TLS, that is not visible in the graphs because its usage approaches 0%).

Conclusions

Using TLS is very cheap, even at the scale of Cloudflare. Modern crypto is very fast, with AES-GCM and P256 being great examples. RSA, once a staple of cryptography, that truly made SSL accessible to everyone, is now a dying dinosaur, replaced by faster and safer algorithms, still consumes a disproportionate amount of resources, but even that is easily manageable.

The future however is less clear. As we approach the era of Quantum computers it is clear that TLS must adapt sooner rather than later. We already support SIDH as a key exchange algorithm for some services, and there is a NIST competition in place, that will determine the most likely Post Quantum candidates for TLS adoption, but none of the candidates can outperform P256. I just hope that when I profile our edge two years from now, my conclusion won’t change to “Whoa, crypto is expensive!”.

Go, don't collect my garbage

Vlad Krasnov — Mon, 13 Nov 2017 10:31:09 GMT

Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. In that case the machine has 24 cores and 48 threads.

CC BY-SA 2.0 image by sponki25

I started with ECDSA P256 Sign, probably because I have warm feeling for that function, since I optimized it for amd64.

First, I ran the benchmark on a single goroutine: ECDSA-P256 Sign,30618.50, op/s

That looks good; next I ran it on 48 goroutines: ECDSA-P256 Sign,78940.67, op/s.

OK, that is not what I expected. Just over 2X speedup, from 24 physical cores? I must be doing something wrong. Maybe Go only uses two cores? I ran top, it showed 2,266% utilization. That is not the 4,800% I expected, but it is also way above 400%.

How about taking a step back, and running the benchmark on two goroutines? ECDSA-P256 Sign,55966.40, op/s. Almost double, so pretty good. How about four goroutines? ECDSA-P256 Sign,108731.00, op/s. That is actually faster than 48 goroutines, what is going on?

I ran the benchmark for every number of goroutines from 1 to 48:

Looks like the number of signatures per second peaks at 274,622, with 17 goroutines. And starts dropping rapidly after that.

Time to do some profiling.

(pprof) top 10
Showing nodes accounting for 47.53s, 50.83% of 93.50s total
Dropped 224 nodes (cum <= 0.47s)
Showing top 10 nodes out of 138
      flat  flat%   sum%        cum   cum%
     9.45s 10.11% 10.11%      9.45s 10.11%  runtime.procyield /state/home/vlad/go/src/runtime/asm_amd64.s
     7.55s  8.07% 18.18%      7.55s  8.07%  runtime.futex /state/home/vlad/go/src/runtime/sys_linux_amd64.s
     6.77s  7.24% 25.42%     19.18s 20.51%  runtime.sweepone /state/home/vlad/go/src/runtime/mgcsweep.go
     4.20s  4.49% 29.91%     16.28s 17.41%  runtime.lock /state/home/vlad/go/src/runtime/lock_futex.go
     3.92s  4.19% 34.11%     12.58s 13.45%  runtime.(*mspan).sweep /state/home/vlad/go/src/runtime/mgcsweep.go
     3.50s  3.74% 37.85%     15.92s 17.03%  runtime.gcDrain /state/home/vlad/go/src/runtime/mgcmark.go
     3.20s  3.42% 41.27%      4.62s  4.94%  runtime.gcmarknewobject /state/home/vlad/go/src/runtime/mgcmark.go
     3.09s  3.30% 44.58%      3.09s  3.30%  crypto/elliptic.p256OrdSqr /state/home/vlad/go/src/crypto/elliptic/p256_asm_amd64.s
     3.09s  3.30% 47.88%      3.09s  3.30%  runtime.(*lfstack).pop /state/home/vlad/go/src/runtime/lfstack.go
     2.76s  2.95% 50.83%      2.76s  2.95%  runtime.(*gcSweepBuf).push /state/home/vlad/go/src/runtime/mgcsweepbuf.go

Clearly Go spends a disproportionate amount of time collecting garbage. All my benchmark does is generates signatures and then dumps them.

So what are our options? The Go runtime states the following:

The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.

The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:

Let’s see what setting GODEBUG to gctrace=1 does.

gc 1 @0.021s 0%: 0.15+0.37+0.25 ms clock, 3.0+0.19/0.39/0.60+5.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 2 @0.024s 0%: 0.097+0.94+0.16 ms clock, 0.29+0.21/1.3/0+0.49 ms cpu, 4->4->1 MB, 5 MB goal, 48 P
gc 3 @0.027s 1%: 0.10+0.43+0.17 ms clock, 0.60+0.48/1.5/0+1.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 4 @0.028s 1%: 0.18+0.41+0.28 ms clock, 0.18+0.69/2.0/0+0.28 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 5 @0.031s 1%: 0.078+0.35+0.29 ms clock, 1.1+0.26/2.0/0+4.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 6 @0.032s 1%: 0.11+0.50+0.32 ms clock, 0.22+0.99/2.3/0+0.64 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 7 @0.034s 1%: 0.18+0.39+0.27 ms clock, 0.18+0.56/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 8 @0.035s 2%: 0.12+0.40+0.27 ms clock, 0.12+0.63/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 9 @0.036s 2%: 0.13+0.41+0.26 ms clock, 0.13+0.52/2.2/0+0.26 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 10 @0.038s 2%: 0.099+0.51+0.20 ms clock, 0.19+0.56/1.9/0+0.40 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 11 @0.039s 2%: 0.10+0.46+0.20 ms clock, 0.10+0.23/1.3/0.005+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 12 @0.040s 2%: 0.066+0.46+0.24 ms clock, 0.93+0.40/1.7/0+3.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 13 @0.041s 2%: 0.099+0.30+0.20 ms clock, 0.099+0.60/1.7/0+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 14 @0.042s 2%: 0.095+0.45+0.24 ms clock, 0.38+0.58/2.0/0+0.98 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 15 @0.044s 2%: 0.095+0.45+0.21 ms clock, 1.0+0.78/1.9/0+2.3 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 16 @0.045s 3%: 0.10+0.45+0.23 ms clock, 0.10+0.70/2.1/0+0.23 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 17 @0.046s 3%: 0.088+0.40+0.17 ms clock, 0.088+0.45/1.9/0+0.17 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
.
.
.
.
gc 6789 @9.998s 12%: 0.17+0.91+0.24 ms clock, 0.85+1.8/5.0/0+1.2 ms cpu, 4->6->1 MB, 6 MB goal, 48 P
gc 6790 @10.000s 12%: 0.086+0.55+0.24 ms clock, 0.78+0.30/4.2/0.043+2.2 ms cpu, 4->5->1 MB, 6 MB goal, 48 P

The first round of GC kicks in at 0.021s, then it starts collecting every 3ms and then every 1ms. That is insane, the benchmark runs for 10 seconds, and I saw 6,790 rounds of GC. The number that starts with @ is the time since program start, followed by a percentage that supposedly states the amount of time spent collecting garbage. This number is clearly misleading, because the performance indicates at least 90% of the time is wasted (indirectly) on GC, not 12%. The synchronization overhead is not taken into account. What really is interesting are the three numbers separated by arrows. They show the size of the heap at GC start, GC end, and the live heap size. Remember that a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage, and defaults to 100%.

I am running a benchmark, where all allocated data is immediately discarded, and collected at the next GC cycle. The only live heap is fixed to the Go runtime, and having more goroutines does not add to the live heap. In contrast the freshly allocated data grows much faster with each additional goroutine, triggering increasingly frequent, and expensive GC cycles.

Clearly what I needed to do next was to run the benchmark with the GC disabled, by setting GOGC=off. This lead to a dramatic improvement: ECDSA-P256 Sign,413740.30, op/s.

But still not the number I was looking for, and running an application without garbage collection is unsustainable in the long run. I started playing with the GOGC variable. First I set it to 2,400, which made sense since we have 24 cores, perhaps collecting garbage 24 times less frequently will do the trick: ECDSA-P256 Sign,671538.90, op/s, oh my that is getting better.

What if I tried 4,800, for the number of threads? ECDSA-P256 Sign,685810.90, op/s. Getting warmer.

I ran a script to find the best value, from 100 to 20,000, in increments of 100. This is what I got:

Looks like the optimal value for GOGC in that case is 11,300 and it gets us 691,054 signatures/second. That is 22.56X times faster than the single core score, and overall pretty good for a 24 core processor. Remember that when running on a single core, the CPU frequency is 3.0GHz, and only 2.1GHz when running on all cores.

Per goroutine performance when running with GOGC=11300 now looks like that:

The scaling looks much better, and even past 24 goroutines, when we run out of physical cores, and start sharing cores with hyper-threading, the overall performance improves.

The bottom line here is that although this type of benchmarking is definitely an edge case for garbage collection, where 48 threads allocate large amounts of short lived data, this situation can occur in real world scenarios. As many-core CPUs become a commodity, one should be aware of the pitfalls.

Most languages with garbage collection offer some sort of garbage collection control. Go has the GOGC variable, that can also be controlled with the SetGCPercent function in the runtime/debug package. Don't be afraid to tune the GC to suit your needs.

We're always looking for Go programmers, so if you found this blog post interesting, why not check out our jobs page?

On the dangers of Intel's frequency scaling

Vlad Krasnov — Fri, 10 Nov 2017 11:06:58 GMT

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well. On a single thread, it performed at the speed of approximately 2.89GB/s, whereas on 24 cores, and 48 threads it performed at just over 35 GB/s.

CC BY-SA 2.0 image by blumblaum

Now this is a very high number, but I would like to see something closer to 69GB/s. 35GB/s is just 1.46GB/s/core, or roughly 50% of the single core performance. AES-GCM scales much better, to 80% of single core performance, which is understandable, because the CPU can sustain higher frequency turbo on a single core, but not all cores.

Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full blown 64-bit instructions.

To keep power in check Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new, and has existed since Haswell introduced AVX2 three years ago.

The scaling gets worse when more cores execute AVX-512 and when multiplication is used.

If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.

OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.

So how does this affect you, if you mix a little AVX-512 with your real workload? We use the Xeon Silver 4116 CPUs, with a base frequency 2.1GHz, in a dual socket configuration. From a figure I found on wikichip it seems that running AVX-512 even just on one core on this CPU will reduce the base frequency to 1.8GHz. Running AVX-512 on all cores will reduce it to just 1.4GHz.

Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? That is the question I asked myself.

I compiled two versions of NGINX, one with OpenSSL1.1.1dev and the other with BoringSSL, and installed it on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.

I configured the server to serve a medium sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.

I then monitored the number of requests per second served under full load. This is what I got:

By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up on two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.

First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.

Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.

Finally, according to perf, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less then 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Irregardless the CPU throttles down, because that what it does when it sees AVX-512 running on all cores.

It is hard to say just how much each core is throttled at any given time, but doing some sampling using lscpu, I found out that when executing the openssl speed -evp chacha20-poly1305 -multi 48 benchmark, it shows CPU MHz: 1199.963, for OpenSSL with all AES-GCM connections I got CPU MHz: 2399.926 and for OpenSSL with all ChaCha20-Poly1305 connections I saw CPU MHz: 2184.338, which is obviously 9% slower.

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.

OpenSSL 1.1.1dev is still in development, therefore I suspect no one is affected by this issue yet. We switched to BoringSSL months ago, and our server performance is not affected by this issue.

What the future holds in unclear. Intel announced very cool new ISA extensions for the future generation of CPUs, that are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those for general purpose cryptography libraries will do (much) more harm than good.

The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance, on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. There are many libraries that use AVX and AVX2 instructions out there, they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

ARM Takes Wing: Qualcomm vs. Intel CPU comparison

Vlad Krasnov — Wed, 08 Nov 2017 20:03:14 GMT

One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.

Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.

Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turboboost and hyper-threading enabled, for a total of 40 threads per server.

Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.

In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turboboost. It also has smaller last level cache: 1.375 MiB/core, compared to 2.5 MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.

As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (code name Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the ahm ... ThunderX2 core?

The majestic Amberwing powered by the Falkor CPU CC BY-SA 2.0 image by DrPhotoMoto

Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.

The actual Amberwing in question

Overview

I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.

Platform	Grantley (Intel)	Purley (Intel)	Centriq (Qualcomm)
Core	Broadwell	Skylake	Falkor
Process	14nm	14nm	10nm
Issue	8 µops/cycle	8 µops/cycle	8 instructions/cycle
Dispatch	4 µops/cycle	5 µops/cycle	4 instructions/cycle
# Cores	10 x 2S + HT (40 threads)	12 x 2S + HT (48 threads)	46
Frequency	2.2GHz (3.1GHz turbo)	2.1GHz (3.0GHz turbo)	2.5 GHz
LLC	2.5 MB/core	1.35 MB/core	1.25 MB/core
Memory Channels	4	6	6
TDP	170W (85W x 2S)	170W (85W x 2S)	120W
Other features	AES CLMUL AVX2	AES CLMUL AVX512	AES CLMUL NEON Trustzone CRC32

Overall on paper Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, same as Skylake or Broadwell, and it has higher base frequency at a lower TDP rating.

Ecosystem readiness

Up until now, a major obstacle to the deployment of ARM servers was lack, or weak, support by the majority of the software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries support the 64-bit ARM architecture. Driver availability, however, is unclear at that point.

At Cloudflare, we run a complex software stack that consists of many integrated services, and running each of them efficiently is top priority.

On the edge we have the NGINX server software, that does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, therefore solid C compiler support is very important.

In addition, our flavor of NGINX is highly integrated with the lua-nginx-module, and we rely a lot on LuaJIT.

Finally, a lot of our services, such as our DNS server, RRDNS, are written in Go.

The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.

Go has official support for ARMv8 as well, and they improve the arm64 backend constantly.

As for LuaJIT, the stable version, 2.0.5 does not support ARMv8, but the beta version, 2.1.0 does. Let’s hope it gets out of beta soon.

Benchmarks

OpenSSL

The first benchmark I wanted to perform, was OpenSSL version 1.1.1 (development version), using the bundled openssl speed tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.

In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.

Public key cryptography

Public key cryptography is all about raw ALU performance. It is interesting, but not surprising to see that in the single core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.

Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction to perform 64-bit multiplication, instead it uses a pair of MUL and UMULH instructions.

Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at an RSA2048 signature, and only because RSA2048 does not have an optimized implementation for ARM. The ECDSA performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.

It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark, and only having 20% more cores. This can be explained by more efficient all-core turbo, and improved hyper-threading.

Symmetric key cryptography

Symmetric key performance of the Intel cores is outstanding.

AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since they have improved their performance. ARM introduced a set of similar instructions just recently, with their 64-bit instruction set, and as an optional extension. Fortunately every hardware vendor I know of implemented those. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.

ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz---keep that in mind if you are mixing AVX-512 and other code.

The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all the other crypto algorithms combined.

Compression

The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as having better compression saves bandwidth, and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.

Obviously the first benchmark would be the popular zlib library. At Cloudflare, we use an improved version of the library, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry, with little effort I adapted the library to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. In the process it is twice as fast as the generic library for some files.

The second benchmark is the emerging brotli library, it is written in C, and allows for a level playing field for all platforms.

All the benchmarks are performed on the HTML of blog.cloudflare.com, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.

gzip

When using gzip, at the single core level Skylake is the clear winner. Despite having lower frequency than Broadwell, it seems that having lower penalty for branch misprediction helps it pull ahead. The Falkor core is not far behind, especially with lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.

brotli

With brotli on single core the situation is similar. Skylake is the fastest, but Falkor is not very much behind, and with quality setting 9, Falkor is actually faster. Brotli with quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).

When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At level 7 and 8 we start seeing lower performance per core, bottoming with level 9, where we get less than 3x the performance of single core, running on all cores.

My understanding is that at those quality levels Brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.

Bottom line for brotli, Falkor wins, since we would not consider going above quality 7 for dynamic compression.

Golang

Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.

Go crypto

I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.

As far as Go crypto is concerned ARM and Intel are not even on the same playground. Go has very optimized assembly code for ECDSA, AES-GCM and Chacha20-Poly1305 on Intel. It also has Intel optimized math functions, used in RSA computations. All those are missing for ARMv8, putting it at a big disadvantage.

Nevertheless, the gap can be bridged with a relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function addMulVVW in assembly, lead to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.

Another interesting thing to note is that on Skylake, the Go Chacha20-Poly1305 code, that uses AVX2 performs almost identically to the OpenSSL AVX512 code, this is again due to AVX2 running at higher clock speeds.

Go gzip

Next in Go performance is gzip. Here again we have a reference point to pretty well optimized code, and we can compare it to Go. In the case of the gzip library, there are no Intel specific optimizations in place.

Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, and lags behind Skylake. Since we already know that Falkor outperforms both when C is used, it can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.

Go regexp

Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the built-in benchmarks on 32KB strings.

Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.

Doing some profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but generic implementation for Go. The easy regexp tests spend most time in this function, which explains the even wider gap.

Go strings

Another important library for a web server is the Go strings library. I only tested the basic Replacer class here.

In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has a highly optimized assembly code for amd64, that uses AVX2, but only very simple ARM assembly, that copies 8 bytes at a time. By changing three lines in that code, and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.

Go conclusion

Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. Is seems like the enablement effort so far was concentrated on the compiler back end, and the library was left largely untouched. There are a lot of low hanging optimization fruits out there, like my 20-minute fix for addMulVVW clearly shows. Qualcomm and other ARMv8 vendors intends to put significant engineering resources to amend this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.

LuaJIT

Lua is the glue that holds Cloudflare together.

Except for the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks, and is in almost a tie in a third one.

That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.

NGINX

For the NGINX workload, I decided to generate a load that would resemble an actual server.

I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.

It also uses LuaJIT to redirect the incoming request, remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.

Each server was configured to work with as many workers as it has virtual CPUs. 40 for Broadwell, 48 for Skylake and 46 for Falkor.

As the client for this test, I used the hey program, running from 3 Broadwell servers.

Concurrently with the test, we took power readings from the respective BMC units of each server.

With the NGINX workload Falkor handled almost the same amount of requests as the Skylake server, and both significantly outperform Broadwell. The power readings, taken from the BMC show that it did so while consuming less than half the power of other processors. That means Falkor managed to get 214 requests/watt vs the Skylake’s 99 requests/watt and Broadwell’s 77.

I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given both are manufactured with the same process, and Skylake has more cores.

The low power consumption of Falkor is not surprising, Qualcomm processors are known for their great power efficiency, which has allowed them to be a dominant player in the mobile phone CPU market.

Conclusion

The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.

The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.

Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come both with a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.

Currently, my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.

Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.

The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison, Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.

If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come [join us](https://www.cloudflare.com/careers/).

AES-CBC is going the way of the dodo

Vlad Krasnov — Fri, 21 Apr 2017 16:44:29 GMT

A little over a year ago, Nick Sullivan talked about the beginning of the end for AES-CBC cipher suites, following a plethora of attacks on this cipher mode.

Today we can safely confirm that this prediction is coming true, as for the first time ever the share of AES-CBC cipher suites on Cloudflare’s edge network dropped below that of ChaCha20-Poly1305 suites, and is fast approaching the 10% mark.

CC BY-SA 2.0 image by aesop

Over the course of the last six months, AES-CBC shed more than 33% of its “market” share, dropping from 20% to just 13.4%.

All of that share, went to AES-GCM, that currently encrypts over 71.2% of all connections. ChaCha20-Poly1305 is stable, with 15.3% of all connections opting for that cipher. Surprisingly 3DES is still around, with 0.1% of the connections.

The internal AES-CBC cipher suite breakdown as follows:

The majority of AES-CBC connections use ECDHE-RSA or RSA key exchange, and not ECDHE-ECDSA, which implies that we mostly deal with older clients.

RSA is also dying

In other good new, the use of ECDSA surpassed that of RSA at the beginning of the year. Currently more than 60% of all connections use the ECDSA signature.

Although 2048-bit RSA is not broken, it is generally considered less secure than 256-bit ECDSA, and is significantly slower to boot.

PFS is king

Last, but not least, 98.4% of all connections are PFS, using ECDHE for key exchange. That's up from 97.6% six months ago.

All in all we see the continuation of the positive trend on the web towards safer and faster cryptography. We believe this trend will continue with the finalization of TLS 1.3 later this year.

HPACK: the silent killer (feature) of HTTP/2

Vlad Krasnov — Mon, 28 Nov 2016 14:10:41 GMT

If you have experienced HTTP/2 for yourself, you are probably aware of the visible performance gains possible with HTTP/2 due to features like stream multiplexing, explicit stream dependencies, and Server Push.

There is however one important feature that is not obvious to the eye. This is the HPACK header compression. Current implementation of nginx, as well edge networks and CDNs using it, do not support the full HPACK implementation. We have, however, implemented the full HPACK in nginx, and upstreamed the part that performs Huffman encoding.

CC BY 2.0 image by Conor Lawless

This blog post gives an overview of the reasons for the development of HPACK, and the hidden bandwidth and latency benefits it brings.

Some Background

As you probably know, a regular HTTPS connection is in fact an overlay of several connections in the multi-layer model. The most basic connection you usually care about is the TCP connection (the transport layer), on top of that you have the TLS connection (mix of transport/application layers), and finally the HTTP connection (application layer).

In the the days of yore, HTTP compression was performed in the TLS layer, using gzip. Both headers and body were compressed indiscriminately, because the lower TLS layer was unaware of the transferred data type. In practice it meant both were compressed with the DEFLATE algorithm.

Then came SPDY with a new, dedicated, header compression algorithm. Although specifically designed for headers, including the use of a preset dictionary, it was still using DEFLATE, including dynamic Huffman codes and string matching.

Unfortunately both were found to be vulnerable to the CRIME attack, that can extract secret authentication cookies from compressed headers: because DEFLATE uses backward string matches and dynamic Huffman codes, an attacker that can control part of the request headers, can gradually recover the full cookie by modifying parts of the request and seeing how the total size of the request changes during compression.

Most edge networks, including Cloudflare, disabled header compression because of CRIME. That’s until HTTP/2 came along.

HPACK

HTTP/2 supports a new dedicated header compression algorithm, called HPACK. HPACK was developed with attacks like CRIME in mind, and is therefore considered safe to use.

HPACK is resilient to CRIME, because it does not use partial backward string matches and dynamic Huffman codes like DEFLATE. Instead, it uses these three methods of compression:

Static Dictionary: A predefined dictionary of 61 commonly used header fields, some with predefined values.
Dynamic Dictionary: A list of actual headers that were encountered during the connection. This dictionary has limited size, and when new entries are added, old entries might be evicted.
Huffman Encoding: A static Huffman code can be used to encode any string: name or value. This code was computed specifically for HTTP Response/Request headers - ASCII digits and lowercase letters are given shorter encodings. The shortest encoding possible is 5 bits long, therefore the highest compression ratio achievable is 8:5 (or 37.5% smaller).

HPACK flow

When HPACK needs to encode a header in the format name:value, it will first look in the static and dynamic dictionaries. If the full name:value is present, it will simply reference the entry in the dictionary. This will usually take one byte, and in most cases two bytes will suffice! A whole header encoded in a single byte! How crazy is that?

Since many headers are repetitive, this strategy has a very high success rate. For example, headers like :authority:www.cloudflare.com or the sometimes huge cookie headers are the usual suspects in this case.

When HPACK can't match a whole header in a dictionary, it will attempt to find a header with the same name. Most of the popular header names are present in the static table, for example: content-encoding, cookie, etag. The rest are likely to be repetitive and therefore present in the dynamic table. For example, Cloudflare assigns a unique cf-ray header to each response, and while the value of this field is always different, the name can be reused!

If the name was found, it can again be expressed in one or two bytes in most cases, otherwise the name will be encoded using either raw encoding or the Huffman encoding: the shorter of the two. The same goes for the value of the header.

We found that the Huffman encoding alone saves almost 30% of header size.

Although HPACK does string matching, for the attacker to find the value of a header, they must guess the entire value, instead of a gradual approach that was possible with DEFLATE matching, and was vulnerable to CRIME.

Request Headers

The gains HPACK provides for HTTP request headers are more significant than for response headers. Request headers get better compression, due to much higher duplication in the headers. For example, here are two requests for our own blog, using Chrome:

Request #1:

**:authority:**blog.cloudflare.com**:method:**GET:path: /**:scheme:**https**accept:**text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8**accept-encoding:**gzip, deflate, sdch, br**accept-language:**en-US,en;q=0.8cookie: 297 byte cookie**upgrade-insecure-requests:**1**user-agent:**Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36

I marked in red the headers that can be compressed with the use of the static dictionary. Three fields: :method:GET, :path:/ and :scheme:https are always present in the static dictionary, and will each be encoded in a single byte. Then some fields will only have their names compressed in a byte: :authority, accept, accept-encoding, accept-language, cookie and user-agent are present in the static dictionary.

Everything else, marked in green will be Huffman encoded.

Headers that were not matched, will be inserted into the dynamic dictionary for the following requests to use.

Let's take a look at a later request:

Request #2:

:authority:blog.cloudflare.com****:method:GET:path:/assets/images/cloudflare-sprite-small.png**:scheme:**https**accept:**image/webp,image/,/*;q=0.8**accept-encoding:**gzip, deflate, sdch, br**accept-language:**en-US,en;q=0.8**cookie:**same 297 byte cookiereferer:http://blog.cloudflare.com/assets/css/screen.css?v=2237be22c2**user-agent:**Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36

Here I added blue encoded fields. Those indicate fields that were matched from the dynamic dictionary. It is clear that most fields repeat between requests. In this case two fields are again present in the static dictionary and five more are repeated and therefore present in the dynamic dictionary, that means they can be encoded in one or two bytes each. One of those is the ~300 byte cookie header, and ~130 byte user-agent. That is 430 bytes encoded into mere 4 bytes, 99% compression!

All in all for the repeat request, only three short strings will be Huffman encoded.

This is how ingress header traffic appears on the Cloudflare edge network during a six hour period:

On average we are seeing a 76% compression for ingress headers. As the headers represent the majority of ingress traffic, it also provides substantial savings in the total ingress traffic:

We can see that the total ingress traffic is reduced by 53% as the result of HPACK compression!

In fact today, we process about the same number of requests for HTTP/1 and HTTP/2 over HTTPS, yet the ingress traffic for HTTP/2 is only half that of HTTP/1.

Response Headers

For the response headers (egress traffic) the gains are more modest, but still spectacular:

Response #1:

**cache-control:**public, max-age=30**cf-cache-status:**HITcf-h2-pushed:,**cf-ray:**2ded53145e0c1ffa-DFW**content-encoding:**gzip**content-type:**text/html; charset=utf-8**date:**Wed, 07 Sep 2016 21:41:23 GMT**expires:**Wed, 07 Sep 2016 21:41:53 GMTlink: ; rel=preload; as=script,<https://code.jquery.com/jquery-1.11.3.min.js>; rel=preload; as=script**server:**cloudflare-nginxstatus:200**vary:**Accept-Encoding**x-ghost-cache-status:**From Cache**x-powered-by:**Express

The majority of the first response will be Huffman encoded, with some of the field names being matched from the static dictionary.

Response #2:

**cache-control:**public, max-age=31536000**cf-bgj:imgq:**100**cf-cache-status:**HITcf-ray:2ded53163e241ffa-DFW**content-type:**image/png**date:**Wed, 07 Sep 2016 21:41:23 GMT**expires:**Thu, 07 Sep 2017 21:41:23 GMT**server:**cloudflare-nginx**status:**200**vary:**Accept-Encoding**x-ghost-cache-status:**From Cachex-powered-by:Express

Again, the blue color indicates matches from the dynamic table, red indicate matches from the static table, and the green ones represent Huffman encoded strings.

On the second response it is possible to fully match seven of twelve headers. For four of the remaining five, the name can be fully matched, and six strings will be efficiently encoded using the static Huffman encoding.

Although the two expires headers are almost identical, they can only be Huffman compressed, because they can't be matched in full.

The more requests are being processed, the bigger the dynamic table becomes, and more headers can be matched, leading to increased compression ratio.

This is how egress header traffic appears on the Cloudflare edge:

On average egress headers are compressed by 69%. The savings for the total egress traffic are not that significant however:

It is difficult to see, but we get 1.4% savings in the total egress HTTP/2 traffic. While it does not look like much, it is still more than increasing the compression level for data would give in many cases. This number is also significantly skewed by websites that serve very large files: we measured savings of well over 15% for some websites.

Test your HPACK

If you have nghttp2 installed, you can test the efficiency of HPACK compression on your website with a bundled tool called h2load.

For example:

h2load https://blog.cloudflare.com | tail -6 |head -1
traffic: 18.27KB (18708) total, 538B (538) headers (space savings 27.98%), 17.65KB (18076) data

We see 27.98% space savings in the headers. That is for a single request, and the gains are mostly due to the Huffman encoding. To test if the website utilizes the full power of HPACK, we need to issue two requests, for example:

h2load https://blog.cloudflare.com -n 2 | tail -6 |head -1
traffic: 36.01KB (36873) total, 582B (582) headers (space savings 61.15%), 35.30KB (36152) data

If for two similar requests the savings are 50% or more then it is very likely full HPACK compression is utilized.

Note that compression ratio improves with additional requests:

h2load https://blog.cloudflare.com -n 4 | tail -6 |head -1
traffic: 71.46KB (73170) total, 637B (637) headers (space savings 78.68%), 70.61KB (72304) data

Conclusion

By implementing HPACK compression for HTTP response headers we've seen a significant drop in egress bandwidth. HPACK has been enabled for all Cloudflare customers using HTTP/2, all of whom benefit from faster, smaller HTTP responses.

Announcing Support for HTTP/2 Server Push

Vlad Krasnov — Thu, 28 Apr 2016 13:00:00 GMT

Last November, we rolled out HTTP/2 support for all our customers. At the time, HTTP/2 was not in wide use, but more than 88k of the Alexa 2 million websites are now HTTP/2-enabled. Today, more than 70% of sites that use HTTP/2 are served via CloudFlare.

CC BY 2.0 image by Roger Price

Incremental Improvements On SPDY

HTTP/2’s main benefit is multiplexing, which allows multiple HTTP requests to share a single TCP connection. This has a huge impact on performance compared to HTTP/1.1, but it’s nothing new—SPDY has been multiplexing TCP connections since at least 2012.

Some of the most important aspects of HTTP/2 have yet to be implemented by major web servers or edge networks. The real promise of HTTP/2 comes from brand new features like Header Compression and Server Push. Since February, we’ve been quietly testing and deploying HTTP/2 Header Compression, which resulted in an average 30% reduction in header size for all of our clients using HTTP/2. That's awesome. However, the real opportunity for a quantum leap in web performance comes from Server Push.

Pushing Ahead

Today, we’re happy to announce HTTP/2 Server Push support for all of our customers. Server Push enables websites and APIs to speculatively deliver content to the web browser before the browser sends a request for it. This behavior is opportunistic, since in some cases, the content might already be in the client’s cache or not required at all.

It’s been touted as one of the major features of HTTP/2, and by enabling it, we cover the entire set of HTTP/2 features for all of our users. You can see it in action on our live Server Push demo.

Server Push provides significant performance gains if used judiciously. In its most basic form, Server Push allows the server to “bundle” assets that the client didn’t ask for. It works by sending PUSH_PROMISE—a declaration of the intention to send the assets, followed by the actual assets.

Upon the receipt of a PUSH_PROMISE, the client may respond with a RST_STREAM message, indicating this asset is not needed. Even in this case, due to the asynchronous nature of HTTP/2, the client may receive the asset before the server gets the RST_STREAM message.

The PUSH_PROMISE looks a lot like an HTTP/2 GET request, and the client attempts to match received push promises before sending outgoing requests.

Enabling Server Push

All of our customers using HTTP/2 now have Server Push enabled, but unfortunately, Server Push isn’t one of those features that just works—you’ll need to do a little bit of work to truly leverage the benefits of Server Push.

Our implementation follows the guidelines laid out in the W3C’s draft standard on the use of a preload keyword in Link headers[1], and we will continue to track that standard as it evolves.

So, if you want to push assets for a given request, you simply add a specially formatted Link header to the response:

Link: ; rel=preload; as=script

Those links can be manually added, but they’re already created automatically by many publishing tools or via plugins for popular content management systems (CMS) such as WordPress.

Only relative links will be pushed by our edge servers, which means Server Push won’t work with third-party resources.

Disabling Server Push

The Link header was originally designed to let browsers know that they should preload an asset. If you still need this behavior for legacy reasons, you can append the nopush directive to the header, like so:

Link: ; rel=preload; as=style; nopush

Is Server Push For Me?

Server Push has the potential for tremendous performance gains. However, it’s not able to speed up every website, and it can even degrade the performance somewhat if you get overzealous. Generally, the downside of pushing assets that will end up unused is only the wasted bandwidth, while the upside is a speedup equivalent to one round trip from the client to our edge network.

Both the downside and the upside are mostly evident on slow, lossy connections, such as mobile networks. We suggest you profile the behavior of your individual website/API with and without Server Push to estimate the benefits. In our testing, we saw around a 45% performance gain when using Server Push on a mobile network.

Also note that because Server Push operates over a given HTTP/2 connection, it can only be used to push resources from your domain. If your website is bottlenecked on third-party assets, Server Push is unlikely to help you.

Some of the best use cases for HTTP/2 Server Push are:

Uncacheable content - Content that is not cached on the edge benefits from Server Push, since it will be requested from the origin earlier in the connection.
All assets on a requested page - By pushing all the CSS, JS, and image assets on a given page, it’s possible to transfer the entire page in a single round trip. This is only useful when no third party assets are blocking the page rendering. If the majority of the assets are cached on the client’s browser, this behavior can be wasteful.
The most likely next page - If there is a link on the loaded page that is most likely clicked next (for example the most recent post in a blog) you could push both the HTML and all of that pages assets. When the user clicks the link, it will render almost instantly.

Profiling Server Push

There are currently several tools and browsers that support Server Push. However, to visualize the performance benefits of the feature, the current Canary version of Google Chrome is the best. Here is an example of a webpage with five images loading with and without Server Push, as depicted in the timeline:

Plain HTTP/2:

After the main page is loaded (and some processing time) the browser makes a request for the five images. After another round trip, those images are delivered and loaded.

HTTP/2 + Server Push:

With Server Push enabled, we can see that the images are delivered while the page is being processed, thus no additional round trip is required. As soon as the images are needed, Chrome matches them with existing Push Promises and immediately uses them.

In Canary Chrome, you can also see that the pushed assets are identified in the Initiator column as Push/Other.

Other browsers:

Firefox

Pushed assets will not get a timeline and are identified by a solid grey circle (unlike cached content that is identified by a non-solid circle, and a “cached” indication in the “Transferred tab”).

Plain HTTP/2:

HTTP/2 + Server Push:

Edge

Pushed assets won’t have the yellow bar in the timeline (TTFB), and under the Protocol tab, the protocol will appear as HTTPS and not HTTP/2.

Plain HTTP/2:

HTTP/2 + Server Push:

Safari

The Safari browser does not support Server Push at the moment, but support is expected in the near future.

What’s Next For HTTP/2?

In the future, we plan to develop new tools to help our users to make more educated decisions in regards to Server Push. Over time, CloudFlare will even be able to predict the best assets to push automatically.

This feature is very new, and CloudFlare is the first major provider to deploy it at scale. We look forward to hearing from customers who make use of Server Push, now that we’ve made it available for experimentation.

We believe that Server Push is the most important upgrade to the web since SPDY. It has enormous potential to speed up the Internet since, for the first time, it gives website owners the power to send information to a web browser before the browser even asks for it.We’ll also be monitoring the still ongoing process of HTTP/2 development, especially efforts to make Server Push more aware of the client’s cache (thus reducing the number of redundant pushes).

We are currently only looking at link elements in HTTP headers and not in page HTML. ↩︎

It takes two to ChaCha (Poly)

Vlad Krasnov — Mon, 04 Apr 2016 11:50:49 GMT

Not long ago we introduced support for TLS cipher suites based on the ChaCha20-Poly1305 AEAD, for all our customers. Back then those cipher suites were only supported by the Chrome browser and Google's websites, but were in the process of standardization. We introduced these cipher suites to give end users on mobile devices the best possible performance and security.

CC BY-ND 2.0 image by Edwin Lee

Today the standardization process is all but complete and implementations of the most recent specification of the cipher suites have begun to surface. Firefox and OpenSSL have both implemented the new cipher suites for upcoming versions, and Chrome updated its implementation as well.

We, as pioneers of ChaCha20-Poly1305 adoption on the web, also updated our open sourced patch for OpenSSL. It implements both the older "draft" version, to keep supporting millions of users of the existing versions of Chrome, and the newer "RFC" version that supports the upcoming browsers from day one.

In this blog entry I review the history of ChaCha20-Poly1305, its standardization process, as well as its importance for the future of the web. I will also take a peek at its performance, compared to the other standard AEAD.

What is an AEAD?

ChaCha20-Poly1305 is an AEAD, Authenticated Encryption with Additional Data cipher. AEADs support two operations: "seal" and "open". Another common AEAD in use for TLS connections is AES-GCM.

CC BY 2.0 image by Jeremy Keith

The "seal" operation receives the following as input:

The message to be encrypted - this is the plaintext.
A secret key.
A unique initialization value - aka the IV. It must be unique between invocations of the "seal" operation with the same key, otherwise the secrecy of the cipher is completely compromised.
Optionally some other, non-secret, additional data. This data will not be encrypted, but will be authenticated - this is the AD in AEAD.

The "seal" operation uses the key and the IV to encrypt the plaintext into a ciphertext of equal length, using the underlying cipher. For ChaCha20-Poly1305 the cipher is ChaCha20 and for AES-GCM (the other AEAD in common use) the cipher is AES in CounTeR mode (AES-CTR).

After the data is encrypted, "seal" uses the key (and optionally the IV) to generate a secondary key. The secondary key is used to generate a keyed hash of the AD, the ciphertext and the individual lengths of each. The hash used in ChaCha20-Poly1305, is Poly1305 and in AES-GCM the hash is GHASH.

The final step is to take the hash value and encrypt it too, generating the final MAC (Message Authentication Code) and appending it to the ciphertext.

CC BY-SA 2.0 image by Kai Schreiber

The "open" operation is the reverse of "seal". It takes the same key and IV and generates the MAC of the ciphertext and the AD, similarly to the way "seal" did. It then reads the MAC appended after the ciphertext, and compares the two. Any difference in the MAC values would mean the ciphertext or the AD was tampered with, and they should be discarded as unsafe. If the two match, however, the operation decrypts the ciphertext, returning the original plaintext.

What makes AEADs special?

AEADs are special in the sense that they combine two algorithms - cipher and MAC, into a single primitive, with provable security guarantees. Before AEADs, it was acceptable to take some cipher and some MAC, which were considered secure independently, and combine them into an insecure combination. For example some combinations were broken by reusing the same keys for encryption and MAC (AES-CBC with CBC-MAC), while others by performing MAC over plaintext instead of the ciphertext AES-CBC with HMAC in TLS.

The new kid in the block

Up until recently the only standard AEAD used was AES-GCM. The problem with that, is that if someone breaks the GCM mode, we are left with no alternative - you wouldn't want to jump out of a plane without a backup parachute, would you? ChaCha20-Poly1305 is this backup.

CC BY 2.0 image by lina smith

We can't really call either ChaCha20 or Poly1305 really new. Both are the brain children of Daniel J. Bernstein (DJB). ChaCha20 is based upon an earlier cipher developed by DJB called Salsa, that dates back to 2005, and was submitted to the eSTREAM competition. ChaCha20 itself was published in 2008. It slightly modifies the Salsa round, and the number 20 indicates that it repeats for 20 rounds in total. Similar to AES-CTR, ChaCha20 is a stream cipher. It generates a pseudo-random stream of bits from an incremented counter, the stream is then "XORed" with plaintext to encrypt it (or "XORed" with ciphertext to decrypt). Because you do not need to know the plaintext in advance to generate the stream, this approach allows both to be very efficient and parallelizable. ChaCha20 is a 256-bit cipher.

Poly1305 was published in 2004. Poly1305 is a MAC, and can be used with any encrypted or unencrypted message, to generate a keyed authentication token. The purpose of such tokens is to guarantee the integrity of a given message. Originally Poly1305 used AES as the underlying cipher (Poly1305-AES); now it uses ChaCha20. Again, similarly to GHASH, it is a polynomial evaluation hash. Unlike GHASH, its key changes for each new message, because it depends on the IV. When DJB developed this MAC, he made it especially suited for efficient execution on the floating point hardware present on the CPUs back then. Today it is much more efficient to execute using 64-bit integer instructions, and might have been designed slightly differently.

Both have received considerable scrutiny from the crypto community in the years since, and today are considered completely safe, albeit there is a concern about the monoculture that forms when one person is responsible for so many standards in the industry (DJB is also responsible for Curve25519 key exchange).

From zero to hero

The body that governs internet standards is the IETF - Internet Engineering Task Force. All the standards we use on the internet, including TLS, come from that organization. All standards that relate to encryption come from the TLS and CFRG workgroups of IETF. The standardization process is open to all, and the correspondence that relates to it is kept public in a special archive.

The first mention for ChaCha20-Poly1305 I found in the archive dates to 30 July 2013. It is still referred to as Salsa back then.

After some time and debate an initial draft was published by Adam Langley from Google in September 2013. The latest draft of the ChaCha20-Poly1305 for TLS including all the previous revisions can be found here. It is interesting to see the incremental process, and the gradual refinement. For example initially ChaCha20 was also supposed to work with HMAC-SHA1.

Another standard that defines the general usage of ChaCha20-Poly1305 is RFC7539. First published in January 2014, it was standardized in May 2015.

Differences

There are two key differences between the draft version we initially implemented and the current version of the cipher suites that make the two incompatible.

The first difference relates to how the cipher suite is used in TLS. The current version incorporated the TLS records sequence number into the IV generation, making it more resistant to dangerous IV reuse.

The second difference relates to how Poly1305 applies to the TLS record. Records are the equivalent of a TCP packet for TLS. When data is streamed over TLS, it is broken into many smaller records. Each record holds part of the data (encrypted), with the MAC calculated for that record. It also holds other information, such as the protocol version, record type and length. The maximum amount of data a single record can hold is 16KB.

The draft Poly1305 calculated the hash of the additional data, followed by the length of the additional data as an 8 byte string, followed by the ciphertext, followed by the length of the ciphertext as an 8 byte string. In the current iteration, the hash is generated over the additional data, padded with zeroes to 16 byte boundary, followed by ciphertext similarly padded with zeroes, followed by the length of the additional data as an 8 byte string, followed by the length of the ciphertext as an 8 byte string.

The older cipher suites can be identified by IDs {cc}{13}, {cc}{14} and {cc}{15}, while the newer cipher suites have IDs {cc}{a8} through {cc}{ae}.

Future of ChaCha20-Poly1305

Today we already see that almost 20% of all the request to sites using CloudFlare use ChaCha20-Poly1305. And that is with only one browser supporting it. In the coming months Firefox will join the party, potentially increasing this number.

More importantly, the IETF is currently finalizing another very important standard, TLS 1.3. Right now it looks like TLS 1.3 will allow AEADs only, leaving AES-GCM and ChaCha20-Poly1305 as the only two options (for now). This would definitely bring the usage of ChaCha20-Poly1305 up significantly.

Can you handle it?

Given the rising popularity of ChaCha20-Poly1305 suites, and TLS in general, it is important to have efficient implementations that does not hog too much of the servers' CPU time. ChaCha20-Poly1305 allows for highly efficient implementation using SIMD instructions. Most of our servers are based on Intel CPUs with 256-bit SIMD extensions called AVX2. We utilize those to get the maximal performance.

The main competition for ChaCha20-Poly1305 are the AES-GCM based cipher suites. The most widely used AES-GCM, uses AES with 128 bit key, however in terms of security AES-256 is more comparable to ChaCha20.

Usually cipher performance numbers are reported for large messages, to show asymptotic performance, but on our network we started using dynamic record sizing. In practice it means many connections will never reach the maximal size of a TLS record (16KB), but instead will use significantly smaller records (below 1400 bytes). The records dynamically grow as the connection progresses, scaling to about 4KB and eventually to 16KB. Most messages will also not fit precisely into a record, and all sizes are possible.

Below are two graphs, comparing the performance of our ChaCha20-Poly1305 to the implementation in OpenSSL 1.1.0 pre, and to AES-GCM. The performance is reported in CPU cycles per byte, for a plaintext of given length, when performing the "seal" operation on a given plaintext with 13 bytes of AD, similarly to TLS. The first graph covers sizes 64-1536, while the second covers the remaining sizes to 16KB.

CPU cycles per byte (Y-axis) vs. record size in bytes (X-axis)

We can see that our implementation significantly outperforms OpenSSL for short records, and is slightly faster for longer records. The average performance advantage is 7%. AES-128-GCM and AES-256-GCM both still beat ChaCha20-Poly1305 in pure performance for records larger than 320 bytes, but getting below 2 cycles/byte is a major performance achievement. Not many modes can beat this performance. It is also important to note that AES-GCM uses two dedicated CPU instructions (AESENC and CLMULQDQ), whereas both ChaCha and Poly only use the generic SIMD instructions.

Performance outlook

The current performance is outstanding. We measured the performance on a Haswell CPU. Broadwell and Skylake CPUs actually perform AES-GCM faster, but we don't use them in our servers yet.

In the future, processors with wider SIMD instructions are expected to bridge the performance gap. The AVX512 will provide instructions twice as wide, and potentially will improve the performance two fold as well, bringing it below 1 cycle/byte. Following AVX512, Intel is expected to release the AVX512IFMA extensions too, that will accelerate Poly1305 even further.

Conclusion

CloudFlare is constantly pushing the envelope in terms of TLS performance and availability of the most secure cipher suites and modes. We are actively involved in the development and specification of TLS 1.3 and are committed to open source by releasing our performance patches.

Results of experimenting with Brotli for dynamic web content

Vlad Krasnov — Fri, 23 Oct 2015 14:24:50 GMT

Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. On expensive mobile data plans, compression even saves money for consumers. However, compression is not free—it comes at a price. It is one of the most compute expensive operations our servers perform, and the better the compression rate we want, the more effort we have to spend.

The most popular compression format on the web is gzip. We put a great deal of effort into improving the performance of the gzip compression, so we can perform compression on the fly with fewer CPU cycles. Recently a potential replacement for gzip, called Brotli, was announced by Google. Being early adopters for many technologies, we at CloudFlare want to see for ourselves if it is as good as claimed.

This post takes a look at a bit of history behind gzip and Brotli, followed by a performance comparison.

Compression 101

Many popular lossless compression algorithms rely on LZ77 and Huffman coding, so it’s important to have a basic understanding of these two techniques before getting into gzip or Brotli.

LZ77

LZ77 is a simple technique developed by Abraham Lempel and Jacob Ziv in 1977 (hence the original name). Let's call the input to the algorithm a string (a sequence of bytes, not necessarily letters) and each consecutive sequence of bytes in the input a substring. LZ77 compresses the input string by replacing some of its substrings by pointers (or backreferences) to an identical substring previously encountered in the input.

The pointer usually has the form of , where length indicates the number of identical bytes found, and distance indicates how many bytes separate the current occurrence of the substring from the previous one. For example the string abcdeabcdf can be compressed with LZ77 to abcde<4,5>f and aaaaaaaaaa can be compressed to simply a<9, 1>. The decompressor when encountering a backreference will simply copy the required number of bytes from the already decompressed output, which makes it very fast for decompression. This is a nice illustration of LZ77 from my previous blog Improving compression with a preset DEFLATE dictionary:

Output (length tokens are blue, distance tokens are red):

149

157

141

The deflate algorithm managed to reduce the original text from 251 characters, to just 152 tokens! Those tokens are later compressed further by Huffman coding.

Huffman Coding

Huffman coding is another lossless compression algorithm. Developed by David Huffman back in the 50s, it is used for many compression algorithms, including JPEG. A Huffman code is a type of prefix coding, where, given an alphabet and an input, frequently occurring characters are replaced by shorter bit sequences and rarely occurring characters are replaced with longer sequences.

The code can be expressed as a binary tree, where the leaf nodes are the literals of the alphabet and the two edges from each node are marked with 0 and 1. To decode the next character, the decompressor can parse the tree from the root until it encounters a literal in the tree.

Each compression format uses Huffman coding differently, and for our little example we will create a Huffman code for an alphabet that includes only the literals and the length codes we actually used. To start, we must count the frequency of each letter in the LZ77 compressed text:

(space) - 19, o -14, e - 11, n - 8, t - 7, a - 6, d - 6, i - 6, 3 - 4, 5 - 4, h - 4, s - 4, u - 4, c - 3, m - 3, p - 3, r - 3, y - 3, (") - 2, F - 2, L - 2, b - 2, g - 2, l - 2, w - 2, (') - 1, (,) - 1, (.) - 1, A - 1, D - 1, G - 1, I - 1, 9 - 1, 20 - 1, S - 1, 56 - 1, W - 1, 6 - 1, f - 1.

We can then use the algorithm from Wikipedia to build this Huffman code (binary):

(space) - 101, o - 000, e - 1101, n - 0101, t - 0011, a - 11101, d - 11110, i - 11100, 3 - 01100, 5 - 01111, h - 10001, s - 11000, u - 10010, c - 00100, m - 111111, p - 110011, r - 110010, y - 111110, (") - 011010, F - 010000, L - 100000, b - 011101, g - 010011, l - 011100, w - 011011, (') - 0100101, (,) - 1001110, (.) - 1001100, A - 0100011, D - 0100010, G - 1000011, I - 0010110, 9 - 1000010, 20 - 0010111, S - 0010100, 56 - 0100100, W - 0010101, 6 - 1001101, f - 1001111.

Here the most frequent letters - space and 'o' got the shortest codes, only 3 bit long, whereas the letters that occur only once get 7 bit codes. If we were to represent the alphabet of 256 bytes and some length tokens, we would require 9 bits per every letter.

Now apply the code to the LZ77 output, (leaving the distance tokens untouched) and we get:

100000 11100 0011 0011 011100 1101 101 011101 10010 0101 0101 111110 101 010000 000 000 01111 4 0010101 1101 0101 0011 101 10001 000 110011 110011 11100 0101 010011 101 0011 10001 110010 000 10010 010011 01100 8 10001 1101 101 1001111 000 110010 1101 11000 0011 101 0010100 00100 000 000 01111 28 10010 110011 1001101 23 11100 1101 011100 11110 101 111111 11100 00100 1101 101 0100011 0101 11110 101 011101 1000010 58 1101 111111 101 000 0101 01111 35 10001 1101 11101 11110 101 0100010 000 011011 0101 101 00100 11101 111111 1101 01111 19 1000011 000 000 11110 101 010000 11101 11100 110010 111110 1001110 101 11101 01100 55 11000 01100 20 11000 11101 11100 11110 101 011010 100000 0010111 149 0010110 101 11110 000 0101 0100101 0011 101 011011 11101 01100 157 0011 000 101 11000 1101 1101 101 111110 000 10010 0100100 141 1001100 011010

Assuming all the distance tokens require one byte to store, the total output length is 898 bits. Compared to the 1356 bits we would require to store the LZ77 output, and 2008 bits for the original input. We achieved a compression ratio of 31% which is very good. In reality this is not the case, since we must also encode the Huffman tree we used, otherwise it would be impossible to decompress the text.

gzip

The compression algorithm used in gzip is called "deflate," and it’s a combination of the LZ77 and Huffman algorithms discussed above.

On the LZ77 side, gzip has a minimum size of 3 bytes and a maximum of 258 bytes for the length tokens and a maximum of 32768 bytes for the distance tokens. The maximum distance also defines the sliding window size the implementation uses for compression and decompression.

For Huffman coding, deflate has two alphabets. The first alphabet includes the input literals ("letters" 0-255), the end-of-block symbol (the "letter" 256) and the length tokens ("letters" 257-285). There are only 29 letters to encode all the possible lengths and the tokens 265-284 will always have additional bits encoded in the stream to cover the entire range from 3 to 258. For example the letter 257 indicates the minimal length of 3, whereas the letter 265 will indicate the length 11 if followed by the bit 0, and the length 12 if followed by the bit 1.

The second alphabet is for the distance tokens only. Its letters are the codewords 0 through 29. Similar to the length tokens, the codes 4-29 are followed by 1 to 13 additional bits to cover the whole range from 1 to 32768.

Using two distinct alphabets makes a lot of sense, because it allows us to represent the distance code with fewer bits while avoiding any ambiguity, since the distance codes always follow the length codes.

The most common implementation of gzip compression is the zlib library. At CloudFlare, we use a custom version of zlib optimized for our servers.

zlib has 9 preset quality settings for the deflate algorithm, labeled from 1 to 9. It can also be run with quality set to 0, in which case it does not perform any compression. These quality settings can be divided into two categories:

Fast compression (levels 1-3): When a sufficiently long backreference is found, it is emitted immediately, and the search moves to the position at the end of the matched string. If the match is longer than a few bytes, the entire matched string will not be hashed, meaning its substrings will never be referenced in the future. Clearly, this reduces the compression ratio.
Slow compression (levels 4-9): Here, every substring is hashed; therefore, any substring can be referenced in the future. In addition, slow compression enables "lazy" matches. If at a given position a sufficiently long match is found, it will not be immediately emitted. Instead, the algorithm attempts to find a match at the next position. If a longer match is found, the algorithm will emit a single literal at the current position instead the shorter match, and continue to the next position. As a rule, this results in a better compression ratio than levels 1-3.

Other differences between the quality levels are how far to search for a backreference and how long a match should be before stopping.

A very decent compression rate can be observed at levels 3-4, which are also quite fast. Increasing compression quality from level 4 and up gives incrementally smaller compression gains, while requiring substantially more time. Often, levels 8 and 9 will produce similarly compressed output, but level 9 will require more time to do it.

It is important to understand that the zlib implementation does not guarantee the best compression possible with the format, even when the quality setting is set to 9. Instead, it uses a set of heuristics and optimizations that allow for very good compression at reasonable speed.

Better gzip compression can be achieved by the zopfli library, but it is significantly slower than zlib.

Brotli

The Brotli format was developed by Google and has been refined for a while now. Here at CloudFlare, we built an nginx module that performs dynamic Brotli compression, and we deployed it on our test server that supports HTTP2 and other new features.

To check the Brotli compression on the test server you can use the nightly build of the Firefox browser, which also supports this format.

Brotli, deflate, and gzip

Brotli and deflate are very closely related. Brotli also uses the LZ77 and Huffman algorithms for compression. Both algorithms use a sliding window for backreferences. Gzip uses a fixed size, 32KB window, and Brotli can use any window size from 1KB to 16MB, in powers of 2 (minus 16 bytes). This means that the Brotli window can be up to 512 times larger window than the deflate window. This difference is almost irrelevant in web-server context, as text files larger than 32KB are the minority.

Other differences include smaller minimal match length (2 bytes minimum in Brotli, compared to 3 bytes minimum in deflate) and larger maximal match length (16779333 bytes in Broli, compared to 258 bytes in deflate).

Static dictionary

Brotli also features a static dictionary. The "dictionary" supported by deflate can greatly improve compression, but has to be supplied independently and can only be addressed as part of the sliding window. The Brotli dictionary is part of the implementation and can be referenced from anywhere in the stream, somewhat increasing its efficiency for larger files. Moreover, different transformations can be applied to words of the dictionary effectively increasing its size.

Context modeling

Brotli also supports something called context modeling. Context modeling is a feature that allows multiple Huffman trees for the same alphabet in the same block. For example, in deflate each block consists of a series of literals (bytes that could not be compressed by backreferencing) and pairs that define a backreference for copying. Literals and lengths form a single alphabet, while the distances are a different alphabet.

In Brotli, each block is composed of "commands". A command consists of 3 parts. The first part of each command is a word . "Insert" defines the number of literals that will follow the word, and it may have the value of 0. "Copy" defines the number of literals to copy from a back reference. The word is followed by a sequence of "insert" literals: … . Finally, the command ends with a . Distance defines the backreference from which to copy the previously defined number of bytes. Unlike the distance in deflate, Brotli distance can have additional meanings, such as references to the static dictionary, or references to recently used distances.

There are three alphabets here. One is for s and it simply covers all the possible byte values from 0 to 255. The other one is for s, and its size depends on the size of the sliding window and other parameters. The third alphabet is for the length pairs, with 704 letters. Here indicates a single letter in the alphabet, as opposed to pairs in deflate where length and distance are letters in distinct alphabets.

So why do we care about context modeling? It means that for any of the alphabets—up to 256 different Huffman trees—can be used in the same block. The switch between different trees is determined by "context". This can be useful when the compressed file consists of different types of characters. For example binary data interleaved with UTF-8 strings, or a multilingual dictionary.

Whereas the basic idea behind Brotli remains identical to that of deflate, the way the data is encoded is very different. Those improvements allow for significantly better compression, but they also require a significant amount of processing. To alleviate the performance cost somewhat, Brotli drops the error detection CRC check present in gzip.

Benchmarking

The heaviest operation in both deflate and Brotli is the search for backward references. A higher level in zlib generally means that the backward reference search will attempt to find a better match for longer, but not necessarily succeed. This leads to significant increase in processing time. In contrast, Brotli trades longer searches for lighter operations that give better ROI, such as context modeling, dictionary references and more efficient data encoding in general. For that reason Brotli is, in theory, capable of outperforming zlib at some stage for similar compression rates, as well as giving better compression at its maximal levels.

We decided to put those theories to the test. At CloudFlare, one of the primary use cases for Brotli/gzip is on-the-fly compression of textual web assets like HTML, CSS, and JavaScript, so that’s what we’ll be testing.

There is a tradeoff between compression speed and transfer speed. It is only beneficial to increase your compression ratio if you can reduce the number of bytes you have to transfer faster than you would actually transfer them. Slower compression will actually slow the connection down. CloudFlare currently uses our own zlib implementation with quality set to 8, and that is our benchmarking baseline.

The benchmark set consists of 10,655 HTML, CSS, and JavaScript files. The benchmarks were performed on an Intel 3.5GHz, E3-1241 v3 CPU.

The files were grouped into several size groups, since different websites have different size characteristics, and we must optimize for them all.

Although the quality setting in the Brotli implementation states the possible values are 0 to 11, we didn't see any differences between 0 and 1, or between 10 and 11, therefore only quality settings 1 to 10 are reported.

Compression quality

Compression quality is measured as (total size of all files after compression)/(total size of all files before compression)*100%. The columns represent the size distribution of the HTML, CSS, and JavaScript files used in the test in bytes (the file size ranges are shown using the [x,y) notation).

	[20, 1024)	[1024, 2048)	[2048, 3072)	[3072, 4096)	[4096, 8192)	[8192, 16384)	[16384, 32768)	[32768, 65536)	[65536, +∞)	All files
zlib 1	65.0%	46.8%	42.8%	38.7%	34.4%	32.0%	29.9%	31.1%	31.4%	31.5%
zlib 2	64.9%	46.5%	42.5%	38.3%	33.8%	31.4%	29.2%	30.3%	30.5%	30.6%
zlib 3	64.9%	46.4%	42.3%	38.0%	33.6%	31.1%	28.8%	29.8%	29.9%	30.1%
zlib 4	64.5%	45.8%	41.6%	37.1%	32.6%	30.0%	27.8%	28.5%	28.7%	28.9%
zlib 5	64.5%	45.5%	41.3%	36.7%	32.1%	29.4%	27.1%	27.8%	27.8%	28.0%
zlib 6	64.4%	45.5%	41.3%	36.7%	32.0%	29.3%	27.0%	27.6%	27.6%	27.8%
zlib 7	64.4%	45.5%	41.3%	36.6%	32.0%	29.2%	26.9%	27.5%	27.5%	27.7%
zlib 8	64.4%	45.5%	41.3%	36.6%	32.0%	29.2%	26.9%	27.5%	27.4%	27.7%
zlib 9	64.4%	45.5%	41.3%	36.6%	32.0%	29.2%	26.9%	27.5%	27.4%	27.7%
brotli 1	61.3%	46.6%	42.7%	38.4%	33.8%	30.8%	28.6%	29.5%	28.6%	29.0%
brotli 2	61.9%	46.9%	42.8%	38.3%	33.6%	30.6%	28.3%	29.1%	28.3%	28.6%
brotli 3	61.8%	46.8%	42.7%	38.2%	33.3%	30.4%	28.1%	28.9%	28.0%	28.4%
brotli 4	53.8%	40.8%	38.6%	34.8%	30.9%	28.7%	27.0%	28.0%	27.5%	27.7%
brotli 5	49.9%	37.7%	35.7%	32.3%	28.7%	26.6%	25.2%	26.2%	26.0%	26.1%
brotli 6	50.0%	37.7%	35.7%	32.3%	28.6%	26.5%	25.1%	26.0%	25.7%	25.9%
brotli 7	50.0%	37.6%	35.6%	32.3%	28.5%	26.4%	25.0%	25.9%	25.5%	25.7%
brotli 8	50.0%	37.6%	35.6%	32.3%	28.5%	26.4%	25.0%	25.9%	25.4%	25.6%
brotli 9	50.0%	37.6%	35.5%	32.2%	28.5%	26.4%	25.0%	25.8%	25.3%	25.5%
brotli 10	46.8%	33.4%	32.5%	29.4%	26.0%	23.9%	22.9%	23.8%	23.0%	23.3%

Clearly, the compression possible by Brotli is significant. On average, Brotli at the maximal quality setting produces 1.19X smaller results than zlib at the maximal quality. For files smaller than 1KB the result is 1.38X smaller on average, a very impressive improvement, that can probably be attributed to the use of static dictionary.

Compression speed

Compression speed is measured as (total size of files before compression)/(total time to compress all files) and is reported in MB/s.

	[20, 1024)	[1024, 2048)	[2048, 3072)	[3072, 4096)	[4096, 8192)	[8192, 16384)	[16384, 32768)	[32768, 65536)	[65536, +∞)	All files
zlib 1	5.9	21.8	34.4	43.4	62.1	89.8	117.9	127.9	139.6	125.5
zlib 2	5.9	21.7	34.3	43.0	61.2	87.6	114.3	123.1	130.7	118.9
zlib 3	5.9	21.7	34.0	42.4	60.5	84.8	108.0	114.5	114.9	106.9
zlib 4	5.8	20.9	32.2	39.8	54.8	74.9	93.3	97.5	96.1	90.7
zlib 5	5.8	20.6	31.4	38.3	51.6	68.4	82.0	81.3	73.2	71.6
zlib 6	5.8	20.6	31.2	37.9	50.6	64.0	73.7	70.2	57.5	58.0
zlib 7	5.8	20.5	31.0	37.4	49.6	60.8	67.4	64.6	51.0	52.0
zlib 8	5.8	20.5	31.0	37.2	48.8	53.2	56.6	56.5	41.6	43.1
zlib 9	5.8	20.6	30.8	37.3	48.6	51.7	56.6	54.2	40.4	41.9
brotli 1	3.4	12.8	20.4	25.9	37.8	57.3	80.0	94.1	105.8	91.3
brotli 2	3.4	12.4	19.5	24.4	35.2	52.3	71.2	82.0	89.0	78.8
brotli 3	3.4	12.3	19.0	23.7	34.0	49.8	67.4	76.3	81.5	73.0
brotli 4	2.0	7.6	11.9	15.2	22.2	33.1	44.7	51.9	58.5	51.0
brotli 5	2.0	5.2	8.0	10.3	15.0	22.0	29.7	33.3	32.8	30.3
brotli 6	1.8	3.8	5.5	7.0	10.5	16.3	23.5	28.6	28.4	25.6
brotli 7	1.5	2.3	3.1	3.7	4.9	7.2	10.7	15.5	19.6	16.2
brotli 8	1.4	2.3	2.7	3.1	4.0	5.3	7.1	10.6	15.1	12.2
brotli 9	1.3	2.1	2.4	2.8	3.4	4.3	5.5	7.0	10.6	8.8
brotli 10	0.2	0.4	0.4	0.5	0.5	0.6	0.6	0.6	0.5	0.5

On average for all files, we can see that Brotli at quality level 4 is slightly faster than zlib at quality level 8 (and 9) while having comparable compression ratio. However that is misleading. Most files are smaller than 64KB, and if we look only at those files then Brotli 4 is actually 1.48X slower than zlib level 8!

Connection speedup

For on-the-fly compression, the most important question is how much time to invest in compression to make data transfer faster. Because increasing compression quality only gives incremental improvement over a given level, we need the added compressed bytes to outweigh the additional time spent compressing those bytes.

Again, CloudFlare uses compression quality 8 with zlib, so that’s our baseline. For each quality setting of Brotli starting at 4 (which is somewhat comparable to zlib 8 in terms of both time and compression ratio), we compute the added compression speed as: ((total size for zlib 8) - (total size after compression with Brotli))/((total time for Brotli)-(total time for zlib 8)).

The results are reported in MB/s. Negative numbers indicate lower compression ratio.

	[20, 1024)	[1024, 2048)	[2048, 3072)	[3072, 4096)	[4096, 8192)	[8192, 16384)	[16384, 32768)	[32768, 65536)	[65536, +∞)	All files
brotli 4	0.33	0.56	0.52	0.47	0.42	0.44	-0.24	-2.86	0.02	0.00
brotli 5	0.44	0.55	0.60	0.61	0.71	0.97	1.01	1.08	2.24	1.58
brotli 6	0.36	0.36	0.37	0.37	0.45	0.63	0.69	0.86	1.52	1.12
brotli 7	0.28	0.20	0.19	0.18	0.19	0.23	0.24	0.34	0.72	0.52
brotli 8	0.26	0.20	0.17	0.15	0.15	0.17	0.15	0.21	0.48	0.35
brotli 9	0.25	0.19	0.15	0.13	0.13	0.13	0.11	0.13	0.30	0.24
brotli 10	0.03	0.04	0.04	0.04	0.03	0.03	0.02	0.02	0.02	0.02

Those numbers are quite low, due to the slow speed of Brotli. To get any speedup we really want to see those numbers being greater than the connection speed. It seems that for files greater than 64KB, Brotli at quality setting 5 can speed up slow connections.

Keep in mind, that on a real server, compression is only one of many tasks that share the CPU, and compression speeds would be slower there.

Conclusions

The current state of Brotli gives us some mixed impressions. There is no yes/no answer to the question "Is Brotli better than gzip?". It definitely looks like a big win for static content compression, but on the web where the content is dynamic we also need to consider on-the-fly compression.

The way I see it, Brotli already has an advantage over zlib for large files (larger than 64KB) on slow connections. However, those constitute only 20% of our sampled dataset (and 80% of the total size).

Our Brotli module has a minimal size setting for Brotli compression that allows us to use gzip for smaller files and Brotli only for large ones.

It is important to remember that zlib has the advantage of being the optimization target for years by the entire web community, while Brotli is the development effort of a small but capable and talented team. There is no doubt that the current implementation will only improve with time.

Doubling the speed of jpegtran with SIMD

Vlad Krasnov — Thu, 08 Oct 2015 10:10:41 GMT

It is no secret that at CloudFlare we put a great effort into accelerating our customers' websites. One way to do it is to reduce the size of the images on the website. This is what our Polish product is for. It takes various images and makes them smaller using open source tools, such as jpegtran, gifsicle and pngcrush.

However those tools are computationally expensive, and making them go faster, makes our servers go faster, and subsequently our customers' websites as well.

Recently, I noticed that we spent ten times as much time "polishing" jpeg images as we do when polishing pngs.

We already improved the performance of pngcrush by using our supercharged version of zlib. So it was time to look what can be done for jpegtran (part of the libjpeg distribution).

Quick profiling

To get fast results I usually use the Linux perf utility. It gives a nice, if simple, view of the hotspots in the code. I used this image for my benchmark.

CC BY 4.0 image by ESO

perf record ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpeg

And we get:

perf report54.90% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine28.58% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first

Wow. That is 83.5% of the time spent in just two functions! A quick look suggested that both functions deal with Huffman coding of the jpeg image. That is a fairly simple compression technique where frequently occurring characters are replaced with short encodings, and rare characters get longer encodings.

The time elapsed for the tested file was 6 seconds.

Where to start?

First let's look at the most used function, encode_mcu_AC_refine. It is the heaviest function, and therefore optimizing it would benefit us the most.

The function has two loops. The first loop:

for (k = cinfo->Ss; k <= Se; k++) {
    temp = (*block)[natural_order[k]];
    if (temp < 0)
      temp = -temp;      /* temp is abs value of input */
    temp >>= Al;         /* apply the point transform */
    absvalues[k] = temp; /* save abs value for main pass */
    if (temp == 1)
      EOB = k;           /* EOB = index of last newly-nonzero coef */
  }

This loop iterates over one array (natural_order), reads indices that it uses to read from *block into temp, computes the absolute value of temp, and shifts it to the right by some value. In addition it keeps the EOB variable pointing to the last value of the index for which temp was 1.

Generally looping over an array is the schoolbook exercise for SIMD (Single Instruction Multiple Data) instructions, however in this case we have two problems: 1) Indirect indices are hard for SIMD 2) Keeping the EOB value updated.

Optimizing

The first problem can not be solved on the function level. While Haswell introduced the new Gather instructions, that allow loading data using indices, those instructions are still slow, and more importantly they operate only on 32-bit and 64-bit values. We are dealing with 16-bit values here.

Our only option is therefore to load all the values sequentially. One SIMD XMM register is 128-bits wide and can hold eight 16-bit integers, therefore we will load eight values each iteration of the loop. (Note: we could also choose to use 256-bit wide YMM registers here, but that would only work on the newest CPUs, while gaining little performance).

    __m128i x1 = _mm_setzero_si128(); // Load 8 16-bit values sequentially
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+0]], 0);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+1]], 1);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+2]], 2);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+3]], 3);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+4]], 4);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+5]], 5);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+6]], 6);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+7]], 7);

What you see above are intrinsic functions. Such intrinsics usually map to a specific CPU instruction. By using intrinsics we can integrate high performance code inside a C function, without writing assembly code. Usually they give more than enough flexibility and performance. The __m128i type corresponds to a 128-bit XMM register.

Performing the transformation is straightforward:

    x1 = _mm_abs_epi16(x1);        // Get absolute value of 16-bit integers
    x1 = _mm_srli_epi16(x1, Al);   // >> 16-bit integers by Al bits
    _mm_storeu_si128((__m128i*)&absvalues[k], x1); // Store

As you can see we only need two instructions to perform the simple transformation on eight values, and no branch is required as a bonus.

The next tricky part is to update EOB, so it points to the last nonzero index. For that we need to find the index of the highest nonzero value inside x1, and if there is one update EOB accordingly.

We can easily compare all the values inside x1 to one:

    x1 = _mm_cmpeq_epi16(x1, _mm_set1_epi16(1));  // Compare to 1

The result of this operation is a mask. For all equal values all bits are set to 1 and for all non-equal values all bits are cleared to 0. However, how do we find the highest index whose bits are all ones? Well, there is an instruction that extracts the top bit of each byte inside an __m128i value, and puts it into a regular integer. From there we can find the index by finding the top set bit in that integer:

    unsigned int idx = _mm_movemask_epi8(x1);        // Extract byte mask
    EOB = idx? k + 16 - __builtin_clz(idx)/2 : EOB;  // Compute index

That is it. EOB will be updated correctly.

What does that simple optimization give us? Running jpegtran again shows that the same image now takes 5.2 seconds! That is a 1.15x speedup right there. But no reason to stop now.

Keep going

Next the loop iterates over the array we just prepared (absvalues). At the beginning of the loop we see:

  if ((temp = absvalues[k]) == 0) {
      r++;
      continue;
    }

So it simply skips all the 0 values in the array, until it finds a nonzero value. Is it worth optimizing? Running a quick test, I found out that it is common to have long sequences of zero values. And iterating over them individually can be expensive. Let's try our previous method of finding the top nonzero element in an array here:

    __m128i t = _mm_loadu_si128((__m128i*)&absvalues[k]);
    t = _mm_cmpeq_epi16(t, _mm_setzero_si128()); // Compare to 0
    int idx = _mm_movemask_epi8(t);              // Extract byte mask
    if (idx == 0xffff) {                         // If all the values are zero skip them all
      r += 8;
      k += 8;
      continue;
    } else {             // If some are nonzero, skip up to the first nonzero
      int skip = __builtin_ctz(~idx)/2;
      r += skip;
      k += skip;
      if (k>Se) break;   // Stop if gone too far
    }
    temp = absvalues[k]; // Load the first nonzero value

What do we get now? Our test image now takes only 3.7 seconds. We are already at 1.62x speedup.

And what does perf show?

45.94% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first28.73% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine

We can see that the tables have turned. Now encode_mcu_AC_first is the slower function of the two, and takes almost 50% of the time.

Moar optimizations!

Looking at encode_mcu_AC_first we see one loop, that performs similar operations as the function we already optimized. Why not reuse the optimizations we already know are good?

We shall split the one big loop into two smaller loops, one would perform the required transformations, and store them into auxiliary arrays, the other would read those values skipping the zero entries.

Note that transformations are a little different now:

  if (temp < 0) {
      temp = -temp;
      temp >>= Al;
      temp2 = ~temp;
    } else {
      temp >>= Al;
      temp2 = temp;
    }

The equivalent for SIMD would be:

    __m128i neg = _mm_cmpgt_epi16(_mm_setzero_si128(), x1);// Negative mask
    x1 = _mm_abs_epi16(x1);                                // Absolute value
    x1 = _mm_srli_epi16(x1, Al);                           // >>
    __m128i x2 = _mm_andnot_si128(x1, neg);                // Invert x1, and apply negative mask
    x2 = _mm_xor_si128(x2, _mm_andnot_si128(neg, x1));     // Apply non-negative mask

And skipping zero values is performed in the same manner as before.

Measuring the performance now gives as 2.65 seconds for the test image! And that is 2.25X faster than the original jpegtran.

Conclusions

Let's take a final look at the perf output:

40.78% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine26.01% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first

We can see that the two functions we optimized still take a significant portion of the execution. But now it is just 67% of the entire program. If you are asking yourself how we got 2.25X speedup with those functions going from 83.5% to 67%, read about the Potato Paradox.

It is definitely worth making further optimizations here. But it might be better looking outside the functions into how the data is arranged and passed around.

The bottom line is: optimizing code is fun, and using SIMD instructions can give you significant boost. Such as in this case.

We at CloudFlare make sure that our servers run at top notch performance, so our customers' websites do as well!.

You can find the final code on our GitHub repository: https://github.com/cloudflare/jpegtran

Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code

Vlad Krasnov — Wed, 08 Jul 2015 13:28:00 GMT

Recently I was contacted by Dr. Igor Kozin from The Institute of Cancer Research in London. He asked about the optimal way to compile CloudFlare's open source fork of zlib. It turns out that zlib is widely used to compress the SAM/BAM files that are used for DNA sequencing. And it turns out our zlib fork is the best open source solution for that file format.

CC BY-SA 2.0 image by Shaury Nash

The files used for this kind of research reach hundreds of gigabytes and every time they are compressed and decompressed with our library many important seconds are saved, bringing the cure for cancer that much closer. At least that's what I am going to tell myself when I go to bed.

This made me realize that the benefits of open source go much farther than one can imagine, and you never know where a piece of code may end up. Open sourcing makes sophisticated algorithms and software accessible to individuals and organizations that would not have the resources to develop them on their own, or the money pay for a proprietary solution.

It also made me wonder exactly what we did to zlib that makes it stand out from other zlib forks.

Recap

Zlib is a compression library that supports two formats: deflate and gzip. Both formats use the same algorithm also called DEFLATE, but with different headers and checksum functions. The deflate algorithm is described here.

Both formats are supported by the absolute majority of web browsers, and we at CloudFlare compress all text content on the fly using the gzip format. Moreover DEFLATE is also used by the PNG file format, and our fork of zlib also accelerates our image optimization engine Polish. You can find the optimized fork of pngcrush here.

Given the amount of traffic we must handle, compression optimization really makes sense for us. Therefore we included several improvements over the default implementation.

First of all it is important to understand the current state of zlib. It is a very old library, one of the oldest that is still used as is to this day. It is so old it was written in K&R C. It is so old USB was not invented yet. It is so old that DOS was still a thing. It is so old (insert your favorite so old joke here). More precisely it dates back to 1995. Back to the days 16-bit computers with 64KB addressable space were still in use.

Still it represents one of the best pieces of code ever written, and even modernizing it gives only modest performance boost. Which shows the great skill of its authors and the long way compilers have come since 1995.

Below is a list of some of the improvements in our fork of zlib. This work was done by me, my colleague Shuxin Yang, and also includes improvements from other sources.

uint64_t as the standard type - the default fork used 16-bit types.
Using an improved hash function - we use the iSCSI CRC32 function as the hash function in our zlib. This specific function is implemented as a hardware instruction on Intel processors. It has very fast performance and better collision properties.
Search for matches of at least 4 bytes, instead the 3 bytes the format suggests. This leads to fewer hash collisions, and less effort wasted on insignificant matches. It also improves the compression rate a little bit for the majority of cases (but not all).
Using SIMD instructions for window rolling.
Using the hardware carry-less multiplication instruction PLCMULQDQ for the CRC32 checksum.
Optimized longest-match function. This is the most performance demanding function in the library. It is responsible for finding the (length, distance) matches in the current window.

In addition, we have an experimental branch that implements an improved version of the linked list used in zlib. It has much better performance for compression levels 6 to 9, while retaining the same compression ratio. You can find the experimental branch here.

Benchmarking

You can find independent benchmarks of our library here and here. In addition, I performed some in-house benchmarking, and put the results here for your convenience.

All the benchmarks were performed on an i5-4278U CPU. The compression was performed from and to a ramdisk. All libraries were compiled with gcc version 4.8.4 with the compilation flags: "-O3 -march=native".

I tested the performance of the main zlib fork, optimized implementation by Intel, our own main branch, and our experimental branch.

Four data sets were used for the benchmarks. The Calgary corpus, the Canterbury corpus, the Large Canterbury corpus and the Silesia corpus.

Calgary corpus

Performance:

Compression rates:

For this benchmark, Intel only outperforms our implementation for level 1, but at the cost of 1.39X larger files. This difference is far greater than even the difference between levels 1 and 9, and should probably be regarded as a different compression level. CloudFlare is faster on all other levels, and outperforms significantly for levels 6 to 9. The experimental implementation is even faster for those levels.

Canterbury corpus

Performance:

Compression rates:

Here we see a similar situation. Intel at level 1 gets 1.44X larger files. CloudFlare is faster for levels 2 to 9. On level 9, the experimental branch outperforms the reference implementation by 2X.

Large corpus

Performance:

Compression rates:

This time Intel is slightly faster for levels 5 and 6 than the CloudFlare implementation. The experimental CloudFlare implementation is faster still on level 6. The compression rate for Intel level 1 is 1.58 lower than CloudFlare. On level 9, the experimental fork is 7.5X(!) faster than reference.

Silesia corpus

Performance:

Compression rates:

Here again, CloudFlare is the fastest on levels 2 to 9. On level 9 the difference in speed between the experimental fork and the reference fork is 2.44X.

Conclusion

As evident from the benchmarks, the CloudFlare implementation outperforms the competition in the vast majority of settings. We put great effort in making it as fast as possible on our servers.

If you intend to use our library, you should check for yourself if it delivers the best balance of performance and compression for your dataset. As between different file format and sizes performance can vary.

And if you like open source software, don't forget to give back to the community, by contributing your own code!

Go crypto: bridging the performance gap

Vlad Krasnov — Thu, 07 May 2015 10:06:00 GMT

It is no secret that we at CloudFlare love Go. We use it, and we use it a LOT. There are many things to love about Go, but what I personally find appealing is the ability to write assembly code!

CC BY 2.0 image by Jon Curnow

That is probably not the first thing that pops to your mind when you think of Go, but yes, it does allow you to write code "close to the metal" if you need the performance!

Another thing we do a lot in CloudFlare is... cryptography. To keep your data safe we encrypt everything. And everything in CloudFlare is a LOT.

Unfortunately the built-in cryptography libraries in Go do not perform nearly as well as state-of-the-art implementations such as OpenSSL. That is not acceptable at CloudFlare's scale, therefore we created assembly implementations of Elliptic Curves and AES-GCM for Go on the amd64 architecture, supporting the AES and CLMUL NI to bring performance up to par with the OpenSSL implementation we use for Universal SSL.

We have been using those improved implementations for a while, and attempting to make them part of the official Go build for the good of the community. For now you can use our special fork of Go to enjoy the improved performance.

Both implementations are constant-time and side-channel protected. In addition the fork includes small improvements to Go's RSA implementation.

The performance benefits of this fork over the standard Go 1.4.2 library on the tested Haswell CPU are:

                         CloudFlare          Go 1.4.2        Speedup
AES-128-GCM           2,138.4 MB/sec          91.4 MB/sec     23.4X

P256 operations:
Base Multiply           26,249 ns/op        737,363 ns/op     28.1X
Multiply/ECDH comp     110,003 ns/op      1,995,365 ns/op     18.1X
Generate Key/ECDH gen   31,824 ns/op        753,174 ns/op     23.7X
ECDSA Sign              48,741 ns/op      1,015,006 ns/op     20.8X
ECDSA Verify           146,991 ns/op      3,086,282 ns/op     21.0X

RSA2048:
Sign                 3,733,747 ns/op      7,979,705 ns/op      2.1X
Sign 3-prime         1,973,009 ns/op      5,035,561 ns/op      2.6X

AES-GCM in a brief

So what is AES-GCM and why do we care? Well, it is an AEAD - Authenticated Encryption with Associated Data. Specifically AEAD is a special combination of a cipher and a MAC (Message Authentication Code) algorithm into a single robust algorithm, using a single key. This is different from the other method of performing authenticated encryption "encrypt-then-MAC" (or as TLS does it with CBC-SHAx, "MAC-then-encrypt"), where you can use any combination of cipher and MAC.

Using a dedicated AEAD reduces the dangers of bad combinations of ciphers and MACs, and other mistakes, such as using related keys for encryption and authentication.

Given the many vulnerabilities related to the use of AES-CBC with HMAC, and the weakness of RC4, AES-GCM is the de-facto secure standard on the web right now, as the only IETF-approved AEAD to use with TLS at the moment.

Another AEAD you may have heard of is ChaCha20-Poly1305, which CloudFlare also supports, but it is not a standard just yet.

That is why we use AES-GCM as the preferred cipher for customer HTTPS only prioritizing ChaCha20-Poly1305 for mobile browsers that support it. You can see it in our configuration. As such today more than 60% of our client facing traffic is encrypted with AES-GCM, and about 10% is encrypted with ChaCha20-Poly1305. This percentage grows every day, as browser support improves. We also use AES-GCM to encrypt all the traffic between our data centers.

CC BY 2.0 image by 3:19

AES-GCM as an AEAD

As I mentioned AEAD is a special combination of a cipher and a MAC. In the case of AES-GCM the cipher is the AES block cipher in Counter Mode (AES-CTR). For the MAC it uses a universal hash called GHASH, encrypted with AES-CTR.

The inputs to the AES-GCM AEAD encryption are as follows:

The secret key (K), that may be 128, 192 or 256 bit long. In TLS, the key is usually valid for the entire connection.
A special unique value called IV (initialization value) - in TLS it is 96 bits. The IV is not secret, but the same IV may not be used for more than one message with the same key under any circumstance! To achieve that, usually part of the IV is generated as a nonce value, and the rest of it is incremented as a counter. In TLS the IV counter is also the record sequence number. The IV of GCM is unlike the IV in CBC mode, which must also be unpredictable.The disadvantage of using this type of IV, is that in order to avoid collisions, one must change the encryption key, before the IV counter overflows.
The additional data (A) - this data is not secret, and therefore not encrypted, but it is being authenticated by the GHASH. In TLS the additional data is 13 bytes, and includes data such as the record sequence number, type, length and the protocol version.
The plaintext (P) - this is the secret data, it is both encrypted and authenticated.

The operation outputs the ciphertext (C) and the authentication tag (T). The length of C is identical to that of P, and the length of T is 128 bits (although some applications allow for shorter tags). The tag T is computed over A and C, so if even a single bit of either of them is changed, the decryption process will detect the tampering attempt and refuse to use the data. In TLS, the tag T is attached at the end of the ciphertext C.

When decrypting the data, the function will receive A, C and T and compute the authentication tag of the received A and C. It will compare the resulting value to T, and only if both are equal it will output the plaintext P.

By supporting the two state of the art AEADs - AES-GCM and ChaCha20-Poly1305, together with ECDSA and ECDH algorithms, CloudFlare is able to provide the fastest, most flexible and most secure TLS experience possible to date on all platforms, be it a PC or a mobile phone.

Bottom line

Go is a very easy to learn and fun to use, yet it is one of the most powerful languages for system programming. It allows us to deliver robust web-scale software in short time frames. With the performance improvements CloudFlare brings to its crypto stack, Go can now be used for high performance TLS servers as well!

Improving compression with a preset DEFLATE dictionary

Vlad Krasnov — Mon, 30 Mar 2015 09:21:21 GMT

A few years ago Google made a proposal for a new HTTP compression method, called SDCH (SanDwiCH). The idea behind the method is to create a dictionary of long strings that appear throughout many pages of the same domain (or popular search results). The compression is then simply searching for the appearance of the long strings in a dictionary and replacing them with references to the aforementioned dictionary. Afterwards the output is further compressed with DEFLATE.

_{CC BY SA 2.0}_image_by_{Quinn Dombrowski}

With the right dictionary for the right page the savings can be spectacular, even 70% smaller than gzip alone. In theory, a whole file can be replaced by a single token.

The drawbacks of the method are twofold: first - the dictionary that is created is fairly large and must be distributed as a separate file, in fact the dictionary is often larger than the individual pages it compresses; second - the dictionary is usually absolutely useless for another set of pages.

For large domains that are visited repeatedly the advantage is huge: at a cost of single dictionary download, all the following page views can be compressed with much higher efficiency. Currently we aware of Google and LinkedIn compressing content with SDCH.

SDCH for millions of domains

Here at CloudFlare our task is to support millions of domains, which have little in common, and creating a single SDCH dictionary is very difficult. Nevertheless better compression is important, because it produces smaller payloads, which result in content being delivered faster to our clients. That is why we set out to find that little something that is common to all the pages and to see if we could compress them further.

Besides SDCH (which is only supported by the Chrome browser), the common compression methods over HTTP are gzip and DEFLATE. Some do not know it but the compression they perform is identical. The two formats differ in the content headers they use, with gzip having slightly larger headers, and the error detection function - gzip uses CRC32, whereas DEFLATE uses Adler32.

Usually the servers opt to compress with gzip, however its cousin DEFLATE supports a neat feature called "Preset Dictionary". This dictionary is not like the dictionary used by SDCH, in fact it is not a real dictionary. To understand how this "dictionary" can be used to our advantage, it is important to understand how the DEFLATE algorithm works.

_{CC BY 2.0}_image_by_{Caleb Roenigk}

The DEFLATE algorithm consists of two stages, first it performs the LZ77 algorithm, where it simply goes over the input, and replaces occurrences of previously encountered strings with (short) "directions" where the same string can be found in the previous input. The directions are a tuple of (length, distance), where distance tells how far back in the input the string was and length tells how many bytes were matched. The minimal length deflate will match is 3 (4 in the highly optimized implementation CloudFlare uses), the maximal length is 258, and the farthest distance back is 32KB.

This is an illustration of the LZ77 algorithm:

Input:

L	i	t	t	l	e		b	u	n	n	y		F	o	o		F	o	o
	W	e	n	t		h	o	p	p	i	n	g		t	h	r	o	u	g
h		t	h	e		f	o	r	e	s	t		S	c	o	o	p	i	n
g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
t	h	e		h	e	a	d		D	o	w	n		c	a	m	e		t
h	e		G	o	o	d		F	a	i	r	y	,		a	n	d		s
h	e		s	a	i	d		"	L	i	t	t	l	e		b	u	n	n
y		F	o	o		F	o	o		I		d	o	n	'	t		w	a
n	t		t	o		s	e	e		y	o	u		S	c	o	o	p	i
n	g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
	A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
	t	h	e		h	e	a	d	.	"

Output (length tokens are blue, distance tokens are red):

L	i	t	t	l	e		b	u	n	n	y		F	o	o	5	4	W	e
n	t		h	o	p	p	i	n	g		t	h	r	o	u	g	h	3	8
e		f	o	r	e	s	t		S	c	o	o	5	28	u	p	6	23	i
e	l	d		m	i	c	e		A	n	d		b	9	58	e	m		o
n	5	35	h	e	a	d		D	o	w	n		c	a	m	e	5	19	G
o	o	d		F	a	i	r	y	,		a	3	55	s	3	20	s	a	i
d		"	L	20	149	I		d	o	n	'	t		w	a	3	157	t	o
	s	e	e		y	o	o	56	141	.	"

DEFLATE managed to reduce the original text from 251 characters, to just 152 tokens! Those tokens are later compressed further by Huffman encoding, which is the second stage.

How long and devotedly the algorithm searches for a string before it stops is defined by the compression level. For example with compression level 4 the algorithm will be happy to find a match of 16 bytes, whereas with level 9 it will attempt to look for the maximal 258 byte match. If a match was not found the algorithm outputs the input as is, uncompressed.

Clearly at the beginning of the input, there can be no references to previous strings, and it is always uncompressed. Similarly the first occurrence of any string in the input will never be compressed. For example almost all HTML files start with the string "

To solve this problem the deflate dictionary effectively acts as an initial back reference for possible matches. So if we add the aforementioned string "

To illustrate, lets compress the children’s song with the help of a 42 byte dictionary, containing the following: Little bunny Foo hopping forest Good Fairy. The compressed output will then be:


17	42	4	4	W	e	n	t	9	51	t	h		o	u	g	h	3	8	e
8	63	S	c	o	o	5	28	u	p	6	23	i	e	l	d		m	i	c
e		A	n	d		b	9	58	e	m		o	n	5	35	h	e	a	d
	D	o	w	n		c	a	m	e	5	19	10	133	,	a	3	55	s	3
20	s	a	i	d		"	21	149	I		d	o	n	'	t		w	a	3
157	t	o		s	e	e		y	o	o	56	141	.	"

Now, even strings at the very beginning of the input are compressed and strings that only appear once in the file are compressed as well. With the help of the dictionary we are down to 115 tokens. That means roughly 25% better compression rate.

An experiment

We wanted to see if we could make a dictionary that would benefit ALL the HTML pages we serve, and not just a specific domain. To that end we scanned over 20,000 publicly available random HTML pages that passed through our servers on a random sunny day, we took the first 16KB of each page, and used them to prepare two dictionaries, one of 16KB and the other of 32KB. Using a larger dictionary is useless, because then it would be larger than the LZ77 window used by deflate.

To build a dictionary, I made a little go program that takes a set of files and performs "pseudo" LZ77 over them, finding strings that DEFLATE would not compress in the first 16KB of each input file. It then performs a frequency count of the individual strings, and scores them according to their length and frequency. In the end the highest scoring strings are saved into the dictionary file.

Our benchmark consists of another set of pages obtained in similar manner. The number of benchmarked files was about 19,000 with total size of 563MB.

deflate -4	deflate -9	deflate -4 + 16K dict	deflate -9 + 16K dict	deflate -4 + 32K dict	deflate -9 + 32K dict
Size (KB)	169,176	166,012	161,896	158,352	161,212	157,444
Time (sec)	6.90	11.56	7.15	11.80	7.88	11.82

We can see from the results that the compression we gain for level 4 is almost 5% better than without the dictionary, which is even greater than the compression gained by using level 9 compression, while being substantially faster. For level 9, the gain is greater than 5% without a significant performance hit.

The results highly depend on the dataset used for the dictionary and on the compressed pages. For example when making a dicitonary aimed at a specific web site, the compression rate for that site increased by up to 30%.

For very small pages, such as error pages, with size less than 1KB, a DEFLATE dictionary was able to gain compression rates of up to 50% smaller than DEFLATE alone.

Of course different dictionaries may be used for different file types. In fact we think that it would make sense to create a standard set of dictionaries that could be used accross the web.

The utility to make a dictionary for deflate can be found at https://github.com/vkrasnov/dictator.The optimized version of zlib used by CloudFlare can be found at https://github.com/cloudflare/zlib

Improving PicoHTTPParser further with AVX2

Vlad Krasnov — Thu, 18 Dec 2014 16:21:00 GMT

Vlad Krasnov recently joined CloudFlare to work on low level optimization of CloudFlare's servers. This is the first of a number of blog posts that will include code he's optimized and open sourced.

In a recent post, Kazuho's Weblog describes an improvement to PicoHTTPParser. This improvement utilizes the SSE4.2 instruction PCMPESTRI in order to find the delimiters in a HTTP request/response and parse them accordingly. This update, compared to the previous version of the code, is impressive.

CC BY-SA 2.0 image by Intel Free Press

PCMPESTRI is a versatile instruction that allows scanning of up to 16 bytes at once for occurrences of up to 16 distinct characters (bytes), or up to 8 ranges of characters (bytes). It can also be used for string and substring comparison. However, there are a few drawbacks: the instruction has a high latency of 11 cycles, and is limited to 16 bytes per instruction. It's also under utilized for range comparison in PicoHTTPParser, because it only tests two or three ranges per invocation (out of eight it is capable of). Furthermore, some simple math (16 bytes / 11 cycles) shows that using this instruction limits the parser to 1.45 bytes/cycle throughput.

Is this really as fast as this parser can go?

Today on the latest Haswell processors, we have the potent AVX2 instructions. The AVX2 instructions operate on 32 bytes, and most of the boolean/logic instructions perform at a throughput of 0.5 cycles per instruction. This means that we can execute roughly 22 AVX2 instructions in the same amount of time it takes to execute a single PCMPESTRI. Why not give it a shot?

Let's look at the logic first. If, for example, we search for bytes in the ranges 0-8, 10-31, and 127, we see that to convert the PCMPESTRI logic to AVX2 logic we first need to check for each byte X: ((-1 < X < 32) AND !(X = 9)) OR (X = 127). Luckily, we can treat the bytes as signed because the AVX2 instruction set does not provide a single instruction for unsigned 'greater equal' operations. This logic, when implemented with AVX2 instructions, would mark all the qualifying bytes. Second, to get the position of the first qualifying byte, as PCMPESTRI does, we should use the PMOVMSKB instructions to get the byte mask of the AVX2 register, and then apply the new TZCNT instruction (to get the position of the first set bit in the register). The latter two instructions have a latency of 3 cycles each, and the logic itself has a latency of around 4-5 cycles.

In theory, we traded a single instruction that has latency of 11 cycles for a complex flow with a theoretical latency of 10 cycles. But we would make it up in throughput, right? Apparently not. This is because the fields in the HTTP request are usually short so we don't gain from the increased throughput.

So what can we do?

The solution is to change the logical flow of the program from latency bound to throughput oriented. Instead of searching for the next occurrence of a token, we create a bitmap of all the occurrences in a long string. We did this when we used the PMOVMSKB instruction before; however, instead of discarding all but the first bit, we use all the bits! Moreover, to gain more from the high throughput of the AVX2 instructions, we created a bitmask for 128 bytes of the request at a time, and in addition, we create bitmasks for two types of token together—both the name/value delimiter and the new line delimiter.

To parse the resulting bit string, we use the TZCNT instruction repeatedly resulting in improved performance. This was done by compiling with gcc 4.9.2, with -mavx2 -mbmi2 -O3 for both versions. Here are the results from Haswell i5-4278U, with turbo disabled for consistency:

             PCMPESTRI   AVX2       Improvement
bench.c      3,900,156   6,963,788  1.79x
fukamachi.c  4,807,692   8,064,516  1.68x

While we managed to squeeze significant performance from the parser by using AVX2 instructions, there is still plenty of room to improve the code—feel free to try it yourself!

To read the actual code behind this blog post see this pull request.

L	i	t	t	l	e		b	u	n	n	y		F	o	o		F	o	o
	W	e	n	t		h	o	p	p	i	n	g		t	h	r	o	u	g
h		t	h	e		f	o	r	e	s	t		S	c	o	o	p	i	n
g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
t	h	e		h	e	a	d		D	o	w	n		c	a	m	e		t
h	e		G	o	o	d		F	a	i	r	y	,		a	n	d		s
h	e		s	a	i	d		"	L	i	t	t	l	e		b	u	n	n
y		F	o	o		F	o	o		I		d	o	n	'	t		w	a
n	t		t	o		s	e	e		y	o	u		S	c	o	o	p	i
n	g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
	A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
	t	h	e		h	e	a	d	.	"

L	i	t	t	l	e		b	u	n	n	y		F	o	o	5	4	W	e
n	t		h	o	p	p	i	n	g		t	h	r	o	u	g	h	3	8
e		f	o	r	e	s	t		S	c	o	o	5	28	u	p	6	23	i
e	l	d		m	i	c	e		A	n	d		b	9	58	e	m		o
n	5	35	h	e	a	d		D	o	w	n		c	a	m	e	5	19	G
o	o	d		F	a	i	r	y	,		a	3	55	s	3	20	s	a	i
d		"	L	20	149	I		d	o	n	'	t		w	a	3	157	t	o
	s	e	e		y	o	o	56	141	.	"


17	42	4	4	W	e	n	t	9	51	t	h		o	u	g	h	3	8	e
8	63	S	c	o	o	5	28	u	p	6	23	i	e	l	d		m	i	c
e		A	n	d		b	9	58	e	m		o	n	5	35	h	e	a	d
	D	o	w	n		c	a	m	e	5	19	10	133	,	a	3	55	s	3
20	s	a	i	d		"	21	149	I		d	o	n	'	t		w	a	3
157	t	o		s	e	e		y	o	o	56	141	.	"

L	i	t	t	l	e		b	u	n	n	y		F	o	o		F	o	o
	W	e	n	t		h	o	p	p	i	n	g		t	h	r	o	u	g
h		t	h	e		f	o	r	e	s	t		S	c	o	o	p	i	n
g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
t	h	e		h	e	a	d		D	o	w	n		c	a	m	e		t
h	e		G	o	o	d		F	a	i	r	y	,		a	n	d		s
h	e		s	a	i	d		"	L	i	t	t	l	e		b	u	n	n
y		F	o	o		F	o	o		I		d	o	n	'	t		w	a
n	t		t	o		s	e	e		y	o	u		S	c	o	o	p	i
n	g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
	A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
	t	h	e		h	e	a	d	.	"

L	i	t	t	l	e		b	u	n	n	y		F	o	o	5	4	W	e
n	t		h	o	p	p	i	n	g		t	h	r	o	u	g	h	3	8
e		f	o	r	e	s	t		S	c	o	o	5	28	u	p	6	23	i
e	l	d		m	i	c	e		A	n	d		b	9	58	e	m		o
n	5	35	h	e	a	d		D	o	w	n		c	a	m	e	5	19	G
o	o	d		F	a	i	r	y	,		a	3	55	s	3	20	s	a	i
d		"	L	20	149	I		d	o	n	'	t		w	a	3	157	t	o
	s	e	e		y	o	o	56	141	.	"


17	42	4	4	W	e	n	t	9	51	t	h		o	u	g	h	3	8	e
8	63	S	c	o	o	5	28	u	p	6	23	i	e	l	d		m	i	c
e		A	n	d		b	9	58	e	m		o	n	5	35	h	e	a	d
	D	o	w	n		c	a	m	e	5	19	10	133	,	a	3	55	s	3
20	s	a	i	d		"	21	149	I		d	o	n	'	t		w	a	3
157	t	o		s	e	e		y	o	o	56	141	.	"

L	i	t	t	l	e		b	u	n	n	y		F	o	o		F	o	o
	W	e	n	t		h	o	p	p	i	n	g		t	h	r	o	u	g
h		t	h	e		f	o	r	e	s	t		S	c	o	o	p	i	n
g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
t	h	e		h	e	a	d		D	o	w	n		c	a	m	e		t
h	e		G	o	o	d		F	a	i	r	y	,		a	n	d		s
h	e		s	a	i	d		"	L	i	t	t	l	e		b	u	n	n
y		F	o	o		F	o	o		I		d	o	n	'	t		w	a
n	t		t	o		s	e	e		y	o	u		S	c	o	o	p	i
n	g		u	p		t	h	e		f	i	e	l	d		m	i	c	e
	A	n	d		b	o	p	p	i	n	g		t	h	e	m		o	n
	t	h	e		h	e	a	d	.	"

L	i	t	t	l	e		b	u	n	n	y		F	o	o	5	4	W	e
n	t		h	o	p	p	i	n	g		t	h	r	o	u	g	h	3	8
e		f	o	r	e	s	t		S	c	o	o	5	28	u	p	6	23	i
e	l	d		m	i	c	e		A	n	d		b	9	58	e	m		o
n	5	35	h	e	a	d		D	o	w	n		c	a	m	e	5	19	G
o	o	d		F	a	i	r	y	,		a	3	55	s	3	20	s	a	i
d		"	L	20	149	I		d	o	n	'	t		w	a	3	157	t	o
	s	e	e		y	o	o	56	141	.	"


17	42	4	4	W	e	n	t	9	51	t	h		o	u	g	h	3	8	e
8	63	S	c	o	o	5	28	u	p	6	23	i	e	l	d		m	i	c
e		A	n	d		b	9	58	e	m		o	n	5	35	h	e	a	d
	D	o	w	n		c	a	m	e	5	19	10	133	,	a	3	55	s	3
20	s	a	i	d		"	21	149	I		d	o	n	'	t		w	a	3
157	t	o		s	e	e		y	o	o	56	141	.	"