
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 21:12:27 GMT</lastBuildDate>
        <item>
            <title><![CDATA[An introduction to three-phase power and PDUs]]></title>
            <link>https://blog.cloudflare.com/an-introduction-to-three-phase-power-and-pdus/</link>
            <pubDate>Fri, 04 Dec 2020 14:05:37 GMT</pubDate>
            <description><![CDATA[ This blog is a quick Electrical Engineering 101 session going over specifically how 3-phase PDUs work, along with some good practices on how we use them ]]></description>
            <content:encoded><![CDATA[ <p>Our fleet of over 200 locations comprises various generations of servers and routers. And with the ever-changing landscape of services and computing demands, it’s imperative that we manage power in our data centers right. This blog is a brief Electrical Engineering 101 session going over specifically how power distribution units (PDUs) work, along with some good practices on how we use them. It appears to me that we could all use a bit more knowledge on this topic, and more love and appreciation of something that’s critical but usually taken for granted, like hot showers and opposable thumbs.</p><p>A PDU is a device used in data centers to distribute power to multiple rack-mounted machines. It’s an industrial grade power strip typically designed to power an average consumption of about <a href="https://www.eia.gov/tools/faqs/faq.php?id=97&amp;t=3">seven US households</a>. Advanced models have monitoring features and can be accessed via <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a> or a web GUI to turn power outlets on and off. How we choose a PDU depends on which country the data center is in and what that country provides in terms of <a href="https://www.worldstandards.eu/electricity/three-phase-electric-power/">voltage, phase, and plug type</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VQToUAxoKYdAqSWgqOyDS/15afd55e2d243d5ae444bf39929cc88c/Artboard-1.png" />
            
            </figure><p>For each of our racks, all of our dual power-supply (PSU) servers are cabled to one of the two vertically mounted PDUs. As shown in the picture above, one PDU feeds a server’s PSU via a red cable, and the other PDU feeds that server’s other PSU via a blue cable. This is to ensure we have power redundancy maximizing our service uptime; in case one of the PDUs or server PSUs fail, the redundant power feed will be available keeping the server alive.</p>
    <div>
      <h3>Faraday’s Law and Ohm’s Law</h3>
      <a href="#faradays-law-and-ohms-law">
        
      </a>
    </div>
    <p>Like most high-power electrical equipment, PDUs and servers are designed to use AC power. This means voltage and current aren’t constant — they’re sine waves with magnitudes that alternate between positive and negative at a certain frequency. For example, a voltage feed of 100V is not constantly at 100V, but it bounces between 100V and -100V like a sine wave. One complete sine wave cycle is one phase of 360 degrees, and running at 50Hz means there are 50 cycles per second.</p><p>The sine wave can be explained by Faraday’s Law and by looking at how an AC power generator works. Faraday’s Law tells us that a current is induced to flow due to a changing magnetic field. The figure below illustrates a simple generator with a permanent magnet rotating at constant speed and a coil within the magnet’s magnetic field. Magnetic force is strongest at the North and South ends of the magnet. So as the magnet rotates near the coil, the current flowing in the coil fluctuates. One complete rotation of the magnet represents one phase. As the North end approaches the coil, current increases from zero. Once the North end leaves, current decreases to zero. As the South end approaches in turn, the current “increases” in the opposite direction. And finishing the phase, the South end leaves, returning the current back to zero. Current <i>alternates</i> its direction at every half cycle, hence the name Alternating Current.</p>
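            <p>In equation form (a restatement of the above, with V<sub>peak</sub> as the waveform’s maximum value and f its frequency):</p>
            <p>$$v(t) = V_{peak}\sin(2\pi f t)$$</p>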
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CcXLX2BlNWmNeVBk9YOJz/68fd03d71ec5a29a4c13454e4ce5fb72/Artboard-2-1.png" />
            
            </figure><p>Current and voltage in AC power fluctuate in-phase, or “in tandem”, with each other. Since a resistive load obeys Ohm's Law (V = I x R), voltage and current stay in phase, and by P = V x I the power will always be positive. Notice on the graph below that AC power (Watts) has two peaks per cycle. But for practical purposes, we'd like to use constant values. We do that by converting the fluctuating voltage and current into their root-mean-square (RMS) equivalents, which for a sine wave means taking the peak value and dividing it by √2. For example, in the US, our conditions are 208V 24A at 60Hz. When we look at spec sheets, all of these values can be assumed to be RMS'd into their constant DC-equivalent values, and the power they give is the average power. When we say we're fully utilizing a PDU's max capacity of 5kW, the instantaneous power consumption of our machines actually bounces between 0 and roughly 10kW (2 x 5kW) within every cycle.</p>
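            <p>For a purely resistive load, the key relationships (summarizing the above) are:</p>
            <p>$$V_{rms} = \frac{V_{peak}}{\sqrt 2}, \quad I_{rms} = \frac{I_{peak}}{\sqrt 2}, \quad P_{avg} = V_{rms} \times I_{rms}$$</p>
            <p>$$p_{peak} = V_{peak} \times I_{peak} = 2 \times P_{avg}$$</p>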
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3fM63pmekFqVlqHI6opB4V/c26220c91da05301a7573407d513eafe/image13.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6n22LgiBvxJ4hrF6UdL0em/058bcddee79584ca845603f0ae6ca706/image6.png" />
            
            </figure><p>It’s also critical to figure out the sum of power our servers will need in a rack so that it falls under the PDU’s design max power capacity. For our US example, a PDU is typically 5kW (208 volts x 24 amps); therefore, we're budgeting 5kW and fitting as many machines as we can under that. If we need more machines and the total sum power goes above 5kW, we'd need to provision another power source. That would possibly lead to another set of PDUs and racks that we may not fully use depending on demand; in other words, more underutilized capacity and cost. All we can do is abide by P = V x I.</p><p>However there is a way we can increase the max power capacity economically — a 3-phase PDU. Compared to single phase, its max capacity is √3 or 1.7 times higher. A 3-phase PDU of the same US specs above has a capacity of 8.6kW (5kW x √3), allowing us to power more machines under the same source. A 3-phase setup might mean thicker cables and a bigger plug; but despite being more expensive than a 1-phase, its value is higher compared to two 1-phase rack setups for these reasons:</p><ul><li><p>It’s more cost-effective, because there are fewer hardware resources to buy</p></li><li><p>Say the computing demand adds up to 215kW of hardware: we would need 25 3-phase racks compared to 43 1-phase racks (see the quick sketch after this list).</p></li><li><p>Each rack needs two PDUs for power redundancy. Using the example above, we would need 50 3-phase PDUs compared to 86 1-phase PDUs to power 215kW worth of hardware.</p></li><li><p>That also means a smaller rack footprint and fewer power sources provided and charged by the data center, cutting that part of our opex by up to the same √3, or 1.7x, factor.</p></li><li><p>It’s more resilient, because there are more circuit breakers in a 3-phase PDU — one more than in a 1-phase. For example, a 48-outlet PDU that is 1-phase would be split into two circuits of 24 outlets, while a 3-phase one would be split into three circuits of 16 outlets. If a breaker tripped, we’d lose 16 outlets using a 3-phase PDU instead of 24.</p></li></ul>
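            <p>The rack math above can be checked with a few lines of arithmetic. Here is a minimal Go sketch (illustrative only, using the rounded 5kW and 8.6kW capacities from this post) of the 1-phase vs 3-phase comparison:</p>
            <pre><code>package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		singlePhasePDU = 5000.0   // W per PDU: 208V x 24A
		threePhasePDU  = 8600.0   // W per PDU: sqrt(3) x 208V x 24A
		demand         = 215000.0 // W of hardware to power
	)

	// One PDU capacity per rack; two PDUs per rack for redundancy.
	racks1P := math.Ceil(demand / singlePhasePDU)
	racks3P := math.Ceil(demand / threePhasePDU)

	fmt.Printf("1-phase: %.0f racks, %.0f PDUs\n", racks1P, racks1P*2) // 43 racks, 86 PDUs
	fmt.Printf("3-phase: %.0f racks, %.0f PDUs\n", racks3P, racks3P*2) // 25 racks, 50 PDUs
}</code></pre>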
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j4fgzSlb8GU9fsGwywC11/a761ff04c713ff3cbc5a5b04f5e067b0/image5-1.png" />
            
            </figure><p>The PDU shown above is a 3-phase model of 48 outlets. We can see three pairs of circuit breakers for the three branches that are intertwined with each other: white, grey, and black. Industry demands today pressure engineers to maximize compute performance and minimize physical footprint, making the 3-phase PDU a widely-used part of operating a data center.</p>
    <div>
      <h3>What is 3-phase?</h3>
      <a href="#what-is-3-phase">
        
      </a>
    </div>
    <p>A 3-phase AC generator has three coils instead of one, with the coils 120 degrees apart inside the cylindrical core, as shown in the figure below. Just like in the 1-phase generator, current flow is induced by the rotation of the magnet, creating power from each coil sequentially at every one-third of the magnet’s rotation cycle. In other words, we’re generating three 1-phase power feeds, each offset by 120 degrees.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7JNipgmV5CwJESOZLyeiy2/8a8cd916f8af4b820b8de6d93ed237ce/Artboard-3-1.png" />
            
            </figure><p>A 3-phase feed is set up by joining any of its three coils into line pairs. L1, L2, and L3 coils are live wires with each on their own <i>phase</i> carrying their own <i>phase</i> voltage and <i>phase</i> current. Two phases joining together form one <i>line</i> carrying a common <i>line</i> voltage and <i>line</i> current. L1 and L2 phase voltages create the L1/L2 line voltage. L2 and L3 phase voltages create the L2/L3 line voltage. L1 and L3 phase voltages create the L1/L3 line voltage.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/aEGubJ4ZOZYPBdqTyT2If/6d9ec3e4af13c9a9b8edfc3ddf4235ab/image9-1.jpg" />
            
            </figure><p>Let’s take a moment to clarify the terminology. Some other sources may refer to line voltage (or current) as line-to-line or phase-to-phase voltage (or current). It can get confusing, because line voltage is the same as phase voltage in 1-phase circuits, as there’s only one phase. Also, the <i>magnitude</i> of the line voltage is equal to the <i>magnitude</i> of the phase voltage in 3-phase Delta circuits, while the <i>magnitude</i> of the line current is equal to the <i>magnitude</i> of the phase current in 3-phase Wye circuits.</p><p>Conversely, the line current equals the phase current times √3 in Delta circuits. In Wye circuits, the line voltage equals the phase voltage times √3.</p><p>In Delta circuits:</p><p>V<sub>line</sub> = V<sub>phase</sub></p><p>I<sub>line</sub> = √3 x I<sub>phase</sub></p><p>In Wye circuits:</p><p>V<sub>line</sub> = √3 x V<sub>phase</sub></p><p>I<sub>line</sub> = I<sub>phase</sub></p><p>Delta and Wye circuits are the two ways the three wires can be joined together. This happens both at the power source with three coils and at the PDU end with three branches of outlets. Note that the generator and the PDU don’t need to match each other’s circuit types.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/yBiiAFcHTedr9s3s3QSPL/52c6822b94b0816b139570c334c62998/image8.png" />
            
            </figure><p>On PDUs, these phases join when we plug servers into the outlets. So we can conceptually take the coil wirings above and replace the coils with resistors to represent servers. Below is a simplified wiring diagram of a 3-phase Delta PDU showing the three line pairs as three modular branches. Each branch carries two phase currents and one common voltage drop of its own.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xV9Yds4Dzc7VOVFEmrg7X/f398859d93ca13341b641ca3ec533753/image15.png" />
            
            </figure><p>And the diagram below is of a 3-phase Wye PDU. Note that Wye circuits have an additional wire known as the neutral line, where all three phases meet at one point. Here each branch carries one phase and the neutral line, and therefore one common current. The neutral line isn’t considered one of the phases.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6P2JKqoppmgeEjomv9jmmY/8fe420607f4a0f0abba13cdf82aa117a/image20.png" />
            
            </figure><p>Thanks to the neutral line, a Wye PDU can offer a second voltage source that is √3 times lower for smaller devices, like laptops or monitors. Common voltages for Wye PDUs are 230V/400V, or 120V/208V in North America.</p>
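            <p>As a quick worked example of that second voltage source:</p>
            <p>$$\frac{400\text{V}}{\sqrt 3} \approx 230\text{V}, \qquad \frac{208\text{V}}{\sqrt 3} \approx 120\text{V}$$</p>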
    <div>
      <h3>Where does the √3 come from?</h3>
      <a href="#where-does-the-3-come-from">
        
      </a>
    </div>
    <p>Why are we multiplying by √3? As the name implies, we are adding phasors. Phasors are complex numbers representing sine wave functions. Adding phasors is like adding vectors. Say your GPS tells you to walk 1 mile East (vector a), then walk 1 mile North (vector b). You walked 2 miles, but you only moved 1.4 miles NE from the original location (vector a+b). That 1.4 miles of “work” is what we want.</p><p>Let’s apply this to L1 and L2 in a Delta circuit. When we combine phases L1 and L2, we get the L1/L2 line. We assume the two coils are identical. Let’s say α represents the voltage magnitude for each phase. The two phases are offset by 120 degrees, as designed in the 3-phase power generator:</p>
            <pre><code>|L1| = |L2| = α
L1 = |L1|∠0° = α∠0°
L2 = |L2|∠-120° = α∠-120°</code></pre>
            <p>The line voltage is the difference between the two phase voltages, so we solve for L1/L2 using vector addition of L1 and -L2:</p>
            <pre><code>L1/L2 = L1 - L2</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NtOvL0EuHAqcE84gSFJsf/64042e5e1175757f216824d49222b611/image10.png" />
            
            </figure><p>$$\text{L1/L2} = α∠\text{0°} - α∠\text{-120°}$$</p><p>$$\text{L1/L2} = (α\cos{\text{0°}}+jα\sin{\text{0°}})-(α\cos{\text{-120°}}+jα\sin{\text{-120°}})$$</p><p>$$\text{L1/L2} = (α\cos{\text{0°}}-α\cos{\text{-120°}})+j(α\sin{\text{0°}}-α\sin{\text{-120°}})$$</p><p>$$\text{L1/L2} = \frac{3}{2}α+j\frac{\sqrt3}{2}α$$</p><p>Convert L1/L2 into polar form:</p><p>$$\text{L1/L2} = \sqrt{(\frac{3}{2}α)^2 + (\frac{\sqrt3}{2}α)^2}∠\tan^{-1}(\frac{\frac{\sqrt3}{2}α}{\frac{3}{2}α})$$</p><p>$$\text{L1/L2} = \sqrt 3α∠\text{30°}$$</p><p>Since voltage is a scalar, we’re only interested in the “work”:</p>
            <pre><code>|L1/L2| = √3α</code></pre>
            <p>Given that α also applies to L3, this means that for any of the three line pairs, we multiply the phase voltage by √3 to calculate the line voltage.</p><p>V<sub>line</sub> = √3 x V<sub>phase</sub></p><p>Now with the three branches carrying equal power, we can add them all to get the overall effective power. Expressed in the line quantities found on a PDU’s spec sheet (in a Delta circuit each branch carries I<sub>line</sub>/√3 of current, while in a Wye circuit each branch sees V<sub>line</sub>/√3 of voltage), each branch delivers the same power and the result below works for both Delta and Wye circuits.</p><p>P<sub>overall</sub> = 3 x P<sub>branch</sub></p><p>P<sub>overall</sub> = 3 x (V<sub>line</sub> x I<sub>line</sub> / √3)</p><p>P<sub>overall</sub> = √3 x V<sub>line</sub> x I<sub>line</sub></p><p>Using the US example, V<sub>line</sub> is 208V and I<sub>line</sub> is 24A. This leads to an overall 3-phase power of 8646W (√3 x 208V x 24A), or 8.6kW. There lies the biggest advantage of using 3-phase systems. By adding two sets of coils and wires (ignoring the neutral wire), we’ve built a generator that can produce √3, or 1.7, times more power!</p>
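            <p>To make the arithmetic concrete, here is a minimal Go sketch (illustrative only, not from the original post) that verifies the √3 factor numerically with complex-number phasors and then computes the overall PDU capacity from the line voltage and line current:</p>
            <pre><code>package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

func main() {
	// Phase voltages as phasors: L1 at 0° and L2 at -120°, both with magnitude 1 (α = 1).
	l1 := cmplx.Rect(1, 0)
	l2 := cmplx.Rect(1, -120*math.Pi/180)

	// The L1/L2 line voltage is the difference of the two phase voltages.
	line := l1 - l2
	fmt.Printf("|L1/L2| = %.4f (sqrt(3) = %.4f)\n", cmplx.Abs(line), math.Sqrt(3))

	// Overall 3-phase capacity from the line voltage and line current limits.
	vLine, iLine := 208.0, 24.0 // US example from above
	fmt.Printf("P_overall = %.0f W\n", math.Sqrt(3)*vLine*iLine) // ~8646 W
}</code></pre>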
    <div>
      <h3>Dealing with 3-phase</h3>
      <a href="#dealing-with-3-phase">
        
      </a>
    </div>
    <p>The derivation in the section above assumes that the magnitude at all three phases is equal, but we know in practice that's not always the case. In fact, it almost never is. We rarely have servers and switches evenly distributed across all three branches on a PDU. Each machine may have different loads and different specs, so power could be wildly different, potentially causing a dramatic phase imbalance. Having a heavily imbalanced setup could hinder the PDU's available capacity.</p><p>A perfectly balanced and fully utilized PDU at 8.6kW means that each of its three branches has 2.88kW of power consumed by machines. Laid out simply, it's spread 2.88 + 2.88 + 2.88. This is the best case scenario. If we were to take 1kW worth of machines out of one branch, the spread becomes 2.88 + 1.88 + 2.88: imbalance is introduced and the PDU is underutilized, but we're fine. However, if we were to put back that 1kW into another branch — like 3.88 + 1.88 + 2.88 — the PDU is over capacity, even though the sum is still 8.6kW. In fact, it would be over capacity even if you just added 500W instead of 1kW on the wrong branch, thus reaching 3.18 + 1.88 + 2.88 (8.1kW).</p><p>That's because an 8.6kW PDU is spec'd to have a maximum of 24A for each phase current. Overloading one of the branches can force phase currents to go over 24A. Theoretically, we could load one branch until its current reaches 24A and leave the other two branches unused. That would effectively turn it into a 1-phase PDU, losing the benefit of the √3 multiplier. In reality, the branch would have fuses rated lower than 24A (usually 20A) to ensure we won't reach that high and cause overcurrent issues. Therefore the same 8.6kW PDU would have one of its branches tripped at 4.2kW (208V x 20A).</p><p>Loading up one branch is the easiest way to overload the PDU. Being heavily imbalanced significantly lowers PDU capacity and increases risk of failure. To help minimize that, we must:</p><ul><li><p>Ensure that total power consumption of all machines is under the PDU's max power capacity</p></li><li><p>Try to be as phase-balanced as possible by spreading cabling evenly across the three branches</p></li><li><p>Ensure that the sum of phase currents from powered machines at each branch is under the fuse rating at the circuit breaker.</p></li></ul><p><a href="https://www.raritan.com/blog/how-to-calculate-current-on-a-3-phase-208v-rack-pdu-power-strip">This spreadsheet</a> from Raritan is very useful when designing racks.</p><p>For the sake of simplicity, let's ignore other machines like switches. Our latest 2U4N servers are rated at 1800W. That means we can only fit a maximum of four of these 2U4N chassis (8600W / 1800W = 4.7 chassis). Rounding them up to 5 would reach a total rack level power consumption of 9kW, so that's a no-no.</p><p>Splitting 4 chassis evenly across 3 branches is impossible, forcing one of the branches to carry 2 chassis. That would lead to a non-ideal phase balancing:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1vAVIr6zDrLo4vIWjmtQYL/50f9ae2af265da7f953e05e96003f47f/image17.png" />
            
            </figure><p>To keep phase currents under 24A, there's only 1.1A (24A - 22.9A) of headroom on L1 or L2 before the PDU gets overloaded. Say we want to add as many machines as we can under the PDU’s power capacity. One solution is to add up to 242W on the L1/L2 branch, at which point both the L1 and L2 currents reach their 24A limit.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NP9EOPb0h2lEXuvlSdg3B/9a6f53163c45fd3b8f2ee01964c0bd86/image18.png" />
            
            </figure><p>Alternatively, we can add up to 298W on the L2/L3 branch until L2 current reaches 24A. Note we can also add another 298W on the L3/L1 branch until L1 current reaches 24A.</p>
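            <p>The line currents quoted in these layouts can be reproduced directly from the per-branch loads. Below is a rough Go sketch of that calculation (illustrative only; it assumes a Delta PDU and purely resistive loads, and uses the same pairwise formula as the Raritan spreadsheet: each line conductor carries the phasor sum of its two adjacent branch currents):</p>
            <pre><code>package main

import (
	"fmt"
	"math"
)

// lineCurrent returns the current in the line conductor shared by two adjacent
// Delta branches, given those branches' currents (which are 120° apart).
func lineCurrent(ia, ib float64) float64 {
	return math.Sqrt(ia*ia + ib*ib + ia*ib)
}

func main() {
	const vLine = 208.0 // line-to-line voltage

	// Per-branch loads: two 1800W chassis on L1/L2, one each on L2/L3 and L3/L1.
	i12 := 3600.0 / vLine
	i23 := 1800.0 / vLine
	i31 := 1800.0 / vLine

	// Line conductor currents; each must stay under the 24A phase limit.
	fmt.Printf("I_L1 = %.1f A\n", lineCurrent(i12, i31)) // ~22.9 A
	fmt.Printf("I_L2 = %.1f A\n", lineCurrent(i12, i23)) // ~22.9 A
	fmt.Printf("I_L3 = %.1f A\n", lineCurrent(i23, i31)) // ~15.0 A
}</code></pre>
            <p>Adding 242W to the L1/L2 branch in this sketch pushes both I<sub>L1</sub> and I<sub>L2</sub> to 24.0A, matching the example above.</p>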
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64ECQoHBzqhPuq38sNizat/6070a8f2acad6cb76fae4e32f1c566df/image16.png" />
            
            </figure><p>In the examples above, we can see that various solutions are possible. Adding two 298W machines, one at L2/L3 and one at L3/L1, is the most phase-balanced solution, given the parameters. Nonetheless, PDU capacity isn't fully utilized, topping out at 7.8kW.</p><p>Dealing with an 1800W server is not ideal, because whichever branch we choose to power it from will significantly swing the phase balance. Thankfully, our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Gen X servers</a> take up less space and are more power efficient. A smaller footprint allows us to have more flexibility and fine-grained control over our racks in many of our diverse data centers. Assuming each 1U server is 450W, as if we physically split the 1800W 2U4N into four nodes each with their own power supplies, we're now able to fit 18 nodes. That's 2 more nodes than the four-chassis 2U4N setup:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eBaV5AwmB0D31FSgEw65b/bfd1bd88c6e6078edf84523c2852081a/image21.png" />
            
            </figure><p>Adding two more servers here means we've increased our value by 12.5%. While there are more factors not considered here to calculate the Total Cost of Ownership, this is still a great way to show we can be smarter with asset costs.</p><p>Cloudflare provides the back-end services so that our customers can enjoy the performance, reliability, security, and global scale of our edge network. Meanwhile, we manage all of our hardware in over 100 countries with various power standards and compliances, and ensure that our physical infrastructure is as reliable as it can be.</p><p>There’s no Cloudflare without hardware, and there’s no Cloudflare without power. Want to know more? Watch this Cloudflare TV segment about power: <a href="https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl">https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl</a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4Qx2ZCOM4gEYCEIljEKvAi</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[An EPYC trip to Rome: AMD is Cloudflare's 10th-generation Edge server CPU]]></title>
            <link>https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/</link>
            <pubDate>Tue, 25 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ As we looked at production ready systems to power our Gen X solution, we’re moving on from Gen 9's 48-core setup of dual socket Intel Xeon Platinum 6162's to a 48-core single socket AMD EPYC 7642. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>More than 1 billion unique IP addresses pass through the <a href="/tag/cloudflare-network/">Cloudflare Network</a> each day, serving on average 11 million HTTP requests per second and operating within 100ms of 95% of the Internet-connected population globally. Our network spans 200 cities in more than 90 countries, and our engineering teams have built an extremely <a href="/tag/speed-and-reliability/">fast and reliable infrastructure</a>.</p><p>We’re extremely proud of our work and are determined to help make the Internet a <a href="https://www.fastcompany.com/most-innovative-companies/2019/sectors/security">better and more secure place</a>. Cloudflare engineers who are involved with hardware get down to servers and their components to understand and select the best hardware to maximize the performance of our stack.</p><p>Our software stack is compute intensive and is very much CPU bound, driving our engineers to work continuously at optimizing Cloudflare’s performance and reliability at all layers of our stack. With the server, a straightforward solution for increasing computing power is to have more CPU cores. The more cores we can include in a server, the more output we can expect. This is important for us since the diversity of our products and customers has grown over time with increasing demand that requires our servers to do more. To help us drive compute performance, we needed to increase core density and that's what we did. Below is the processor detail for servers we’ve deployed since 2015, including the core counts:</p><table><tr><td><p>---</p></td><td><p><b>Gen 6</b></p></td><td><p><b>Gen 7</b></p></td><td><p><b>Gen 8</b></p></td><td><p><b>Gen 9</b></p></td></tr><tr><td><p>Start of service</p></td><td><p>2015</p></td><td><p>2016</p></td><td><p>2017</p></td><td><p>2018</p></td></tr><tr><td><p>CPU</p></td><td><p>Intel Xeon E5-2630 v3</p></td><td><p>Intel Xeon E5-2630 v4</p></td><td><p>Intel Xeon Silver 4116</p></td><td><p>Intel Xeon Platinum 6162</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 8</p></td><td><p>2 x 10</p></td><td><p>2 x 12</p></td><td><p>2 x 24</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 150W</p></td></tr><tr><td><p>TDP per Core</p></td><td><p>10.65W</p></td><td><p>8.50W</p></td><td><p>7.08W</p></td><td><p>6.25W</p></td></tr></table><p>In 2018, we made a big jump in total number of cores per server with <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>. Our physical footprint was reduced by 33% compared to Gen 8, giving us increased capacity and computing power per rack. Thermal Design Power (TDP aka typical power usage) are mentioned above to highlight that we've also been more power efficient over time. Power efficiency is important to us: first, because we'd like to be as <a href="/a-carbon-neutral-north-america/">carbon friendly</a> as we can; and second, so we can better utilize our provisioned power supplied by the data centers. But we know we can do better.</p><p>Our main defining metric is Requests per Watt. We can increase our Requests per Second number with more cores, but we have to stay within our power budget envelope. We are constrained by the data centers’ power infrastructure which, along with our selected power distribution units, leads us to power cap for each server rack. Adding servers to a rack obviously adds more power draw increasing power consumption at the rack level. 
Our Operational Costs significantly increase if we go over a rack’s power cap and have to provision another rack. What we need is more compute power inside the same power envelope which will drive a higher (better) Requests per Watt number – our key metric.</p><p>As you might imagine, we look at power consumption carefully in the design stage. From the above you can see that it’s not worth the time for us to deploy more power-hungry CPUs if TDP per Core is higher than our current generation which would hurt our Requests per Watt metric. As we started looking at production ready systems to power our Gen X solution, we took a long look at what is available to us in the market today, and we’ve made our decision. We’re moving on from Gen 9's 48-core setup of dual socket <a href="https://ark.intel.com/content/www/us/en/ark/products/136869/intel-xeon-platinum-6162-processor-33m-cache-1-90-ghz.html">Intel® Xeon® Platinum 6162</a>'s to a 48-core single socket <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC™ 7642</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Ub4NDKC4KKIMsXY2cFPJW/2bb5250e410aa4c75cad9651ab1b84f7/1.jpg" />
            
            </figure><p>                                      Gen X server setup with single socket 48-core AMD EPYC 7642.</p><table><tr><td><p><b>---</b></p></td><td><p><b>Intel</b></p></td><td><p><b>AMD</b></p></td></tr><tr><td><p>CPU</p></td><td><p>Xeon Platinum 6162</p></td><td><p>EPYC 7642</p></td></tr><tr><td><p>Microarchitecture</p></td><td><p>“Skylake”</p></td><td><p>“Zen 2”</p></td></tr><tr><td><p>Codename</p></td><td><p>“Skylake SP”</p></td><td><p>“Rome”</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>7nm</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 24</p></td><td><p>48</p></td></tr><tr><td><p>Frequency</p></td><td><p>1.9 GHz</p></td><td><p>2.4 GHz</p></td></tr><tr><td><p>L3 Cache / socket</p></td><td><p>24 x 1.375MiB</p></td><td><p>16 x 16MiB</p></td></tr><tr><td><p>Memory / socket</p></td><td><p>6 channels, up to DDR4-2400</p></td><td><p>8 channels, up to DDR4-3200</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 150W</p></td><td><p>225W</p></td></tr><tr><td><p>PCIe / socket</p></td><td><p>48 lanes</p></td><td><p>128 lanes</p></td></tr><tr><td><p>ISA</p></td><td><p>x86-64</p></td><td><p>x86-64</p></td></tr></table><p>From the specs, we see that with the AMD chip we get to keep the same amount of cores and lower TDP. Gen 9's TDP per Core was 6.25W, Gen X's will be 4.69W... That's a 25% decrease. With higher frequency, and perhaps going to a more simplified setup of single socket, we can speculate that the AMD chip will perform better. We're walking through a series of tests, simulations, and live production results in the rest of this blog to see how much better AMD performs.</p><p>As a side note before we go further, TDP is a simplified metric from the manufacturers’ datasheets that we use in the early stages of our server design and CPU selection process. A quick Google search leads to thoughts that AMD and Intel define TDP differently, which basically makes the spec unreliable. Actual CPU power draw, and more importantly server system power draw, are what we really factor in our final decisions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ue5oOlAdewYkerMSsEPWN/5cf921da1bc31eccd575af6670af66d3/2.png" />
            
            </figure>
    <div>
      <h3>Ecosystem Readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>At the beginning of our journey to choose our next CPU, we got a variety of processors from different vendors that could fit well with our software stack and services, which are written in C, LuaJIT, and Go. More details about benchmarking for our stack were explained when we benchmarked <a href="/arm-takes-wing/">Qualcomm's ARM® chip in the past</a>. We're going to go through the same suite of tests as Vlad's blog this time around since it is a quick and easy "sniff test". This allows us to test a bunch of CPUs within a manageable time period before we commit to spending more engineering effort and applying our full software stack.</p><p>We tried a variety of CPUs with different numbers of cores, sockets, and frequencies. Since we're explaining how we chose the AMD EPYC 7642, all the graphs in this blog focus on how AMD compares with our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9's Intel Xeon Platinum 6162 CPU</a> as a baseline.</p><p>Our results are at the server-node level for both CPUs tested; meaning the numbers pertain to 2x 24-core processors for Intel, and 1x 48-core processor for AMD – a two socket Intel based server and a one socket AMD EPYC powered server. Before we started our testing, we changed the Cloudflare lab test servers’ BIOS settings to match our production server settings. This yielded average CPU frequencies of 3.03 GHz for AMD and 2.50 GHz for Intel, with very little variation. As a gross simplification, we expect that with the same number of cores, AMD would perform about 21% better than Intel. Let’s start with our crypto tests.</p>
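            <p>That rough expectation comes straight from the frequency ratio:</p>
            <p>$$\frac{3.03\text{ GHz}}{2.50\text{ GHz}} \approx 1.21$$</p>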
    <div>
      <h3>Cryptography</h3>
      <a href="#cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3AZYQDj2nhVofkFNuK7fhb/dea112fd9ae2b2e91514b62e21d242d6/3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OpcfNqi2hraQrS8fZV0Lw/29fcc9dcf3ec73400a8fe784a12b26f5/4.png" />
            
            </figure><p>Looking promising for AMD. In public key cryptography, it does 18% better. Meanwhile, for symmetric key, AMD loses on AES-128-GCM, but it’s comparable overall.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>We do a lot of compression at the edge to save bandwidth and help deliver content faster. We go through both the zlib and brotli libraries, which are written in C. All tests are done on a <a href="/">blog.cloudflare.com</a> HTML file held in memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IVKryfCIXY2c6vXt8KsPc/eab9765ff8650fc9fbe97a82cf89f0cf/5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ycU6O5cbLREXNJV29sPDI/4472bc4ae36cfedae67535d08e5e10f8/6.png" />
            
            </figure><p>AMD wins by an average of 29% using gzip across all quality levels. It does even better with brotli at levels below quality 7, which we use for dynamic compression. There’s a throughput cliff starting at brotli-9; <a href="/arm-takes-wing/">Vlad’s explanation</a> is that Brotli consumes lots of memory and thrashes the cache. Nevertheless, AMD wins by a healthy margin.</p><p>A lot of our services are written in Go. In the following graphs we’re redoing the crypto and compression tests in Go, along with RegExp on 32KB strings and the strings library.</p>
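            <p>As a rough illustration of what such a Go micro-benchmark looks like (a sketch, not the actual Cloudflare harness; the input file name is a placeholder), a gzip test over an in-memory HTML file can be written with the standard testing package:</p>
            <pre><code>package bench

import (
	"bytes"
	"compress/gzip"
	"os"
	"testing"
)

// BenchmarkGzip compresses an HTML page held in memory, similar in spirit
// to the zlib/brotli tests above. "blog.html" is a placeholder input file.
func BenchmarkGzip(b *testing.B) {
	html, err := os.ReadFile("blog.html")
	if err != nil {
		b.Fatal(err)
	}
	b.SetBytes(int64(len(html)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		w, _ := gzip.NewWriterLevel(&buf, gzip.BestSpeed) // level is valid, error can be ignored
		if _, err := w.Write(html); err != nil {
			b.Fatal(err)
		}
		w.Close()
	}
}</code></pre>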
    <div>
      <h3>Go Cryptography</h3>
      <a href="#go-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JRYfDXEC61iGL6Ds7LjL1/16b8980b2b2cf5bdda22769d0dad5fa6/7.png" />
            
            </figure>
    <div>
      <h3>Go Compression</h3>
      <a href="#go-compression">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5U6GEnTLU1VVvqbZtJZ1df/0e3a0fa77bbb3ce709e36503e9eaba66/Screen-Shot-2020-02-24-at-14.37.27.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YQTrE7SfEiCXjivU7NXXf/c5b22a8a38b2eb48c0658d4b9b3a1f63/9.png" />
            
            </figure>
    <div>
      <h3>Go Regexp</h3>
      <a href="#go-regexp">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZQn6rtg7fE6IJ3wNgpCf/9886faae1855b658347fccfe72802a0a/10.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vvUuNzVuRrgGqVrGOxagJ/720667e78cb5b34eda7f4ca0d445bd9f/11.png" />
            
            </figure>
    <div>
      <h3>Go Strings</h3>
      <a href="#go-strings">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BY6NTdgjUWGiDBwC0NMNv/8943902a6f047eb8d2f9079e9b1e432b/12.png" />
            
            </figure><p>AMD performs better in all of our Go benchmarks except for ECDSA P256 Sign losing by 38%, which is peculiar since with the test in C it does 24% better. It’s worth investigating what’s going on here. Other than that, AMD doesn’t win by as much of a margin, but it still proves to be better.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>We rely a lot on LuaJIT in our stack. As <a href="/arm-takes-wing/">Vlad said</a>, it’s the glue that holds Cloudflare together. We’re glad to show that AMD wins here as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VM41wrmwYIpaJSvR70ztl/95f5f48fad6f48034422ea69b62c73b6/13.png" />
            
            </figure><p>Overall our tests show a single EPYC 7642 to be more competitive than two Xeon Platinum 6162. While there are a couple of tests where AMD loses out such as OpenSSL AES-128-GCM and Go OpenSSL ECDSA-P256 Sign, AMD wins in all the others. By scanning quickly and treating all tests equally, AMD does on average 25% better than Intel.</p>
    <div>
      <h3>Performance Simulations</h3>
      <a href="#performance-simulations">
        
      </a>
    </div>
    <p>After our ‘sniff’ tests, we put our servers through another series of emulations which apply synthetic workloads simulating our edge software stack. Here we are simulating workloads of scenarios with different types of requests we see in production. Types of requests vary from asset size, whether they go through HTTP or HTTPS, <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a>, Workers, or one of many additional variables. Below shows the throughput comparison between the two CPUs of the types of requests we see most typically.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Rj1NqqXfYroVsefnGQQgS/c598bbf7b324ab6944236e91d5f5e28a/14.png" />
            
            </figure><p>The results above are ratios using Gen 9's Intel CPUs as the baseline, normalized at 1.0 on the X-axis. For example, looking at simple requests of 10KiB assets over HTTPS, we see that AMD does 1.50x better than Intel in Requests per Second. On average for the tests shown on the graph above, AMD performs 34% better than Intel. Considering that the TDP for the single AMD EPYC 7642 is 225W, compared to the two Intels' combined 300W, we're looking at AMD delivering up to 2.0x better Requests per Watt vs. the Intel CPUs!</p><p>By this time, we were already leaning heavily toward a single socket setup with AMD EPYC 7642 as our CPU for Gen X. We were excited to see exactly how well AMD EPYC servers would do in production, so we immediately shipped a number of the servers out to some of our data centers.</p>
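            <p>For reference, that Requests per Watt estimate works out as follows (using the TDP figures above):</p>
            <p>$$\frac{1.50 / 225\text{W}}{1.00 / (2 \times 150\text{W})} = 1.50 \times \frac{300}{225} = 2.0$$</p>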
    <div>
      <h3>Live Production</h3>
      <a href="#live-production">
        
      </a>
    </div>
    <p>Step one of course was to get all our test servers set up for a production environment. All of our machines in the fleet are loaded with the same processes and services which makes for a great apples-to-apples comparison.  Like data centers everywhere, we have multiple generations of servers deployed, and we deploy our servers in clusters such that each cluster is pretty homogeneous by server generation. In some environments this can lead to varying utilization curves between clusters.  This is not the case for us. Our engineers have optimized CPU utilization across all server generations so that no matter if the machine's CPU has 8 cores or 24 cores, CPU usage is generally the same.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CiOx8dprb9NuYH2kUrZQ/ef686808c946d6cc55aa2983dc50c839/15.png" />
            
            </figure><p>As you can see above and to illustrate our ‘similar CPU utilization’ comment, there is no significant difference in CPU usage between Gen X AMD powered servers and Gen 9 Intel based servers. This means both test and baseline servers are equally loaded. Good. This is exactly what we want to see with our setup, to have a fair comparison. The 2 graphs below show the comparative number of requests processed at the CPU single core and all core (server) level.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18XOFPOFmKWcIuheu6HWl8/e84fa7ff38c6c3f39115b60882a7c106/16.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rwWBZrTmfJvIs7Em9aCXT/19945cb240e37ca3ba61853b3e72bf5b/17.png" />
            
            </figure><p>We see that AMD does on average about 23% more requests. That's really good! We talked a lot about bringing more muscle in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>. We have the same number of cores, yet AMD does more work, and does it with less power. Just by looking at the specs for number of cores and TDP in the beginning, it's really nice to see that AMD also delivers significantly more performance with better power efficiency.</p><p>But as we mentioned earlier, TDP isn’t a standardized spec across manufacturers so let’s look at real power usage below. Measuring server power consumption along with requests per second (RPS) yields the graph below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DnPzngVCU8da71wGJUleo/b132d4256abe8c60809cf7b2c6d27ea4/18.png" />
            
            </figure><p>Observing our servers' request rate over their power consumption, the AMD Gen X server performs 28% better. While we could have expected more out of AMD since its TDP is 25% lower, keep in mind that TDP is very ambiguous. In fact, we saw that AMD's actual power draw ran nearly at its spec TDP while clocking well above its base frequency; Intel's was far from it. That's another reason why TDP is becoming a less reliable estimate of power draw. Moreover, the CPU is just one component contributing to the overall power of the system. Remember that the Intel CPUs are integrated in a multi-node system as described in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>, while AMD is in a regular 1U form-factor machine. That actually doesn’t favor AMD, since multi-node systems are designed for high-density capabilities at lower power per node, yet it still outperformed the Intel system on a power-per-node basis anyway.</p><p>Through the majority of comparisons from the datasheets, test simulations, and live production performance, the 1P AMD EPYC 7642 configuration performed significantly better than the 2P Intel Xeon 6162. We’ve seen in some environments that AMD can do up to 36% better in live production, and we believe we can achieve that consistently with some optimization on both our hardware and software.</p><p>So that's it. AMD wins.</p><p>The additional graphs below show the median and p99 NGINX processing (mostly on-CPU time) latencies between the two CPUs throughout 24 hours. On average, AMD processes about 25% faster. At p99, it does about 20-50% better depending on the time of day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Fty1XuK9w4yupvRyHzLmu/0f1e8bc02e8f04d0c33752f6ff67fd98/19.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yWj3MsDnro619KjtJwVAW/0868d32c8aa3f5c594b8e660e608eb6c/20.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Hardware and Performance engineers at Cloudflare do significant research and testing to figure out the best server configuration for our customers. Solving big problems like this is why <a href="https://www.fastcompany.com/best-workplaces-for-innovators/2019">we love working here</a>, and we're also helping to solve yours with our services like <a href="/introducing-cloudflare-workers/">serverless</a> edge compute and the array of security solutions such as <a href="/magic-transit-network-functions/">Magic Transit</a>, <a href="/argo-tunnel/">Argo Tunnel</a>, and <a href="/unmetered-mitigation/">DDoS protection</a>. All of our servers on the Cloudflare Network are designed to make our products work reliably, and we strive to make each new generation of our server design better than its predecessor. We believe the AMD EPYC 7642 is the answer for our Gen X's processor question.</p><p>With <a href="https://developers.cloudflare.com/workers/">Cloudflare Workers</a>, developers have enjoyed deploying their applications to our <a href="/tag/cloudflare-network/">Network</a>, which is ever expanding across the globe. We’ve been proud to empower our customers by letting them focus on writing their code while we are managing the security and reliability in the cloud. We are now even more excited to say that their work will be deployed on our Gen X servers powered by 2nd Gen AMD EPYC processors.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yI4yio6UovXiaJi79a78l/d248e04fd2b5c98c68b7312af953c54f/21.jpg" />
            
            </figure><p>Expanding Rome to a data center near you</p><p>Thanks to AMD, using the EPYC 7642 allows us to increase our capacity and expand into more cities more easily. Rome wasn’t built in one day, but it will be <a href="/scaling-the-cloudflare-global/">very close to many of you</a>.</p><p>In the last couple of years, we've been experimenting with many Intel and AMD x86 chips along with ARM CPUs. We look forward to having these CPU manufacturers partner with us for future generations so that together we can help build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7dSLKQwG5XrggG5HDcZm1x</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Tour Inside Cloudflare's G9 Servers]]></title>
            <link>https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/</link>
            <pubDate>Wed, 10 Oct 2018 15:17:00 GMT</pubDate>
            <description><![CDATA[ This is about our latest generation G9 server. From a G4 server comprising of 12 Intel Sandybridge CPU cores, our G9 server has 192 Intel Skylake CPU cores ready to handle today’s load across Cloudflare’s network. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare operates at a significant scale, handling nearly 10% of all Internet HTTP requests, which at peak is more than 25 trillion requests through our network every month. To ensure this is as efficient as possible, we own and operate all the equipment in our <a href="http://www.cloudflare.com/network-map">154 locations around the world</a> in order to process the volume of traffic that flows through our network. We spend a significant amount of time speccing and designing the servers that make up our network to meet our ever-changing and growing demands. At regular intervals, we take everything we've learned about our last generation of hardware and refresh each component with the next generation…</p><p>If the above paragraph sounds familiar, it’s a look back at where we were <a href="/a-tour-inside-cloudflares-latest-generation-servers/">5 years ago</a>, updated with today’s numbers. We’ve made so much progress over the years engineering and developing our tools with the latest tech, pushing ourselves to get smarter in what we do.</p><p>Here though we’re going to blog about muscle.</p><p>Since the last time we blogged about our G4 servers, we’ve iterated one generation in each of the past 5 years. Our latest generation is now the G9 server. From a G4 server comprising 12 Intel Sandybridge CPU cores, our G9 server has 192 Intel Skylake CPU cores ready to handle today’s load across Cloudflare’s network. This server is <a href="https://www.qct.io/product/index/Server/rackmount-server/Multi-node-Server/QuantaPlex-T42S-2U-4Node">QCT’s T42S-2U multi-node server</a>, where we have 4 nodes per chassis; each node therefore has 48 cores. Maximizing compute density is the primary goal since rental colocation space and power are costly. This 2U4N chassis form factor has served us well for the past 3 generations, so we’re revisiting this option once more.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IL5Zupe1s0pPvlWHoaB6E/e3fcd9a2ba9447a0ec093e1ebb1d6488/MSUVbbiNbqCozo5hb0cnaWlQeUUxq3JdbJoZvUrGnY0ly9unm8JfFMKeTcKHp0P49HmMHdJ8f1Fvw7lHu8R1FmEQKKpbUjtP4qdkWMviOf0-NKmksk_Uy_7rFEh.jpeg" />
            
            </figure><p><i>Exploded picture of the G9 server’s main components. 4 sleds represent 4 nodes, each with 2 24-core Intel CPUs</i></p><p>Each high-level hardware component has gone through its own upgrade as well for a balanced scale-up that keeps our stack CPU bound, making this generation the most radical revision since we moved on from HP 5 years ago. Let’s glance through each of those components.</p>
    <div>
      <h3>Hardware Changes</h3>
      <a href="#hardware-changes">
        
      </a>
    </div>
    <p>CPU</p><ul><li><p>Previously: 2x 12 core Intel Xeon Silver 4116 2.1GHz 85W</p></li><li><p>Now: 2x 24 core Intel custom off-roadmap 1.9GHz 150W</p></li></ul><p>The performance of our infrastructure is heavily directed by how much compute we can squeeze into a given physical space and power budget. In essence, requests per second (RPS) per Watt is a critical metric, one on which <a href="/arm-takes-wing/">Qualcomm’s ARM64 46-core Falkor chip</a> had a <a href="https://www.datacenterknowledge.com/design/cloudflare-bets-arm-servers-it-expands-its-data-center-network">big advantage</a> over Intel’s Skylake 4116.</p><p>Intel proposed to co-innovate with us on an off-roadmap 24-core Xeon Gold CPU specifically made for our workload, offering considerable value in Performance per Watt. For this generation, we continue using Intel as system solutions are widely available while we’re working on <a href="/porting-our-software-to-arm64/">realizing ARM64’s benefits to production</a>. We expect this CPU to deliver better RPS per Watt right off the bat: doubling the number of cores should increase RPS to roughly 200% of Gen 8’s, while increasing the power consumption to about 176% (each CPU’s TDP going from 85W to 150W).</p>
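            <p>Taken together, those two ratios give a back-of-the-envelope estimate of the expected RPS per Watt improvement, before any real measurement:</p>
            <p>$$\frac{200\%}{176\%} \approx 1.14$$</p>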
    <div>
      <h3>Disk</h3>
      <a href="#disk">
        
      </a>
    </div>
    <ul><li><p>Previously: 6x Micron 1100 512G SATA SSD</p></li><li><p>Now: 6x Intel S4500 480G SATA SSD</p></li></ul><p>With all the requests we foresee G9 processing, we need to tame the outlying and long-tail latencies we have seen in our previous SSDs. Lowering p99 and p999 latency has been a serious endeavor. Saving milliseconds in disk response time for the slowest 1% or even 0.1% of all the traffic we see <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">isn’t a joke</a>!</p><p>Datacenter-grade SSDs in the form of the <a href="http://static6.arrow.com/aropdfconversion/8cb17527534c0532edc2710e8132aeea4d8e6f24/pgurl_intel-ssd-dc-s4500-series-240gb-2_5in-sata-6gbs-3d1-tlcordering.p.pdf">Intel S4500</a> will proliferate across our fleet. These disks come with better endurance, lasting over the expected service life of our servers, and better performance consistency with lower p95+ latency.</p>
    <div>
      <h3>Network</h3>
      <a href="#network">
        
      </a>
    </div>
    <ul><li><p>Previously: dual-port 10G Solarflare Flareon Ultra SFN8522</p></li><li><p>Now: dual-port 25G Mellanox ConnectX-4 Lx OCP</p></li></ul><p>Our <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">DDoS mitigation program</a> is all done in userspace, so the network adapter can be any model on the market as long as it <a href="https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf">supports XDP</a>. We went with Mellanox for their solid reliability and their readily available 2x25G CX4 model. Upgrading to a 25G intra-rack Ethernet network is easy future-proofing since the 10G SFP+ Ethernet port shares the same physical form factor as 25G’s SFP28. Switch and NIC vendors offer models that can be configured as either 10G or 25G.</p><p>Another change is the adapter’s form factor itself, which is now an OCP mezzanine instead of the more conventional Low Profile card. QCT is a server system integrator participating in the <a href="https://www.opencompute.org/">Open Compute Project</a>, a non-profit organization establishing an open source hardware ecosystem, founded by Facebook, Intel, and Rackspace. Their T42S-2U motherboards each include 2 PCIe x16 Gen3 expansion slots: 1 fit for a regular I/O card and 1 for an OCP mezzanine. The form factor change allows us to occupy the OCP slot, leaving the regular slot free to integrate something else that may not be offered in an OCP form factor, like a high-capacity NVMe SSD or a GPU. We like that our server has room for upgrades if needed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3VEGuLvx4UBK3DSKxOvhWC/fbf6c16637f5f5d534d5f7cbfc0a4048/image3.png" />
            
            </figure><p><i>Both Low Profile and OCP adapters offer the same throughput and features</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64VwKTV9M42lnWZEmdsymo/ab9cd751bf6836b818fc4c58cdf5c35b/qncCFb7XX0ZvJyvXxWcJdv6rJ_6Yqs_BYRj5xZokMll0d3sFgJX5eBqi1N8KUIuqyBV_niKttp0PsYgkDsnpQsQBOmXGEYPCkgANeTldyf-75KjYkR3DpueCUK2h.png" />
            
            </figure><p><i>Rear side of G9 chassis showing all 4 sled nodes, each leaving room to add on a PCI card</i></p>
    <div>
      <h4>Memory</h4>
      <a href="#memory">
        
      </a>
    </div>
    <ul><li><p>Previously: 192G (12x16G) DDR4 2400Mhz RDIMM</p></li><li><p>Now: 256G (8x32G) DDR4 2666Mhz RDIMM</p></li></ul><p>Going from 192G (12x16G) to 256G (8x32G) made practical sense. The motherboard has 12 DIMM channels, which were all populated in the G8. We want to have the ability to upgrade just in case, as well as keeping memory configuration balanced and at optimal bandwidth capacity. 8x32G works well leaving 4 channels open for future upgrades.</p>
    <div>
      <h3>Physical stress test</h3>
      <a href="#physical-stress-test">
        
      </a>
    </div>
    <p>Our software stack scales nicely enough that we can confidently assume we’ll double the number of requests with twice the number of CPU cores compared to G8. What we need to ensure before we ship any G9 servers out to our current 154 and future PoPs is that there won’t be any design issues pertaining to thermal or power failures. In the extreme case that all of our cores run at 100% load, would that cause our server to run above its operating temperature? How much power would a whole server with 192 cores totaling 1200W TDP consume? We set out to record both by applying a stress test to the whole system.</p><p>Temperature readings were recorded off of <i>ipmitool sdr list</i>, then graphed to show socket and motherboard temperature. With 2U4N being such a compact form factor, it’s worth verifying that a server running hot (busy) isn’t <i>literally</i> running hot. The red lines represent the 4 nodes that compose the whole G9 server under test; blue lines represent G8 nodes (we didn’t stress the G8s, so their temperature readings are constant).</p>
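            <p>As a sketch of how such readings can be collected (not the exact tooling we used; it assumes ipmitool’s usual <i>name | reading | status</i> output format), a small Go snippet like the one below could poll the temperature sensors behind the graphs that follow:</p>
            <pre><code>package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Run `ipmitool sdr list` and pick out the temperature sensors.
	out, err := exec.Command("ipmitool", "sdr", "list").Output()
	if err != nil {
		log.Fatal(err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Split(line, "|")
		if len(fields) == 3 && strings.Contains(fields[1], "degrees C") {
			fmt.Printf("%s: %s\n", strings.TrimSpace(fields[0]), strings.TrimSpace(fields[1]))
		}
	}
}</code></pre>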
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LhJHUDripavgzh5htsLtC/43748096e2380de60cbbb3efe76891d8/ddFg5dcptq_R6eQp6I6X2RqrzzPCDL1OWBN1qb0KBGSixf-oJaDxnnC1gjtownqAaru6WyexTLpubnapjdKhE8CB3BXrVIZa6tnOy86CX6JrtPOXUAtvFGBtT4dv.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3zhx6iZsGzDgGAP5jrOYYj/3c38aba998d3308dd20e643e2d139833/Xchiev2ccIEYdrFIiB2dpF1SI1rbOIz24S0o_mTJklyy--Pjcz8LWPrxhEIXX3zg-0b9VEEHA2V5v-SzQPV4_F4i6NOBofxl8j7SS0qa_6FVqdmzezaYGXXA39Oz.png" />
            
            </figure><p>Both graphs look fine and under control, mostly thanks to the T42S-2U’s 4 80mm x 80mm fans capable of blowing over 90CFM; during the test we drove them to their max spec RPM.</p><p>The new system’s max power consumption is critical information we need to properly design our rack stack, choosing the right Power Distribution Unit and ensuring we’re below the budgeted power while keeping adequate phase balance. For example, a typical 3-phase US-rated 24-Amp PDU gives you a max power rating of 8.6 kilowatts. We wouldn’t be able to fit 9 servers on that same PDU if each were running at 1kW without any way to cap them.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XzW1YCfCUohp7dzwmAuEw/00278a8993faa08dffe588720ef93ad3/pirTvrYl4aQuSd575px16hVHA55qFKhqTDzan9mn6HCz9G1Z83CjXO_y9z50wjgveFP07rQFfAlMK3h7ts8e85zgMYXcy1OiVh0bLdbQx8pORSNkZr7MOCtBDqLb.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1lYbqk8qyPWJa6GF4t5CnG/1c523b77c0a71336acee7e829cf5e9d0/GujZRUCJu-WHNb4H8K-D16X7kyCBd_-Wqg18shh4_4nFAES6uhYu4CpR1fESeL4wUT_rhy1-WOsLVzP5lW-3gekMqJzrCjKSteBm_99OLWu_edtJplUQgNAo6cke.png" />
            
            </figure><p>The right graph above shows our maximum power draw at 1.9kW (the red line), or roughly 475W per node, which is excellent for a modern server. Notice the blue and yellow lines representing the G9’s two power supplies, which sum to the total power. The yellow-line PSU appearing off is intentional, part of our testing procedure to show the PSUs’ resilience to abrupt power changes.</p><p>Stressing all available CPU, I/O, and memory while maxing out fan RPMs is a good indicator of the highest heat output and power draw this server can produce. Hopefully we’ll never see such an extreme case in a live production environment, and we expect actual results to be much milder (read: we don’t think catastrophic failures are possible).</p>
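    <p>One way to approximate this kind of "everything at 100%" run on similar hardware (not necessarily the harness we used) is the open-source stress-ng tool. The flags and duration below are illustrative only, and fan speeds would be driven separately through the BMC:</p>
    <pre><code>#!/usr/bin/env python3
# Sketch: drive CPU, memory, and disk I/O to full load for a fixed window
# using stress-ng. Illustrative flags only, not our exact test harness.
import subprocess

subprocess.run([
    "stress-ng",
    "--cpu", "0",          # 0 = one worker per online CPU
    "--vm", "4",           # 4 memory-thrashing workers...
    "--vm-bytes", "20%",   # ...each touching 20% of RAM (80% total)
    "--io", "4",           # sync()-heavy I/O workers
    "--hdd", "4",          # filesystem write workers
    "--timeout", "30m",    # run for 30 minutes
    "--metrics-brief",     # print a per-stressor summary at the end
], check=True)
</code></pre>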
    <div>
      <h3>First Impression in live production</h3>
      <a href="#first-impression-in-live-production">
        
      </a>
    </div>
    <p>We increased capacity at one of our most loaded PoPs by adding G9 servers. The following time graphs cover a 24-hour range and show how G9 performance compares with G8 in a live PoP.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17ZqA2VbWMkEsK4jUkj7wb/a1932b12e697992c0ccd918486b69a6d/Screen-Shot-2018-10-03-at-6.44.41-PM-1.png" />
            
            </figure><p>Great! They're handling over 2x the requests compared to G8 with about 10% less CPU usage. Note that all results here are from non-optimized systems, so we could add more load to the G9s and bring their CPU usage up to levels comparable to the G8. Additionally, they're handling that volume with better CPU processing time, shown here as NGINX execution time. You can see the latency gap between generations widening as we move towards the 99.9th percentile:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fAxaonFhRQRydhYQXErzS/6f31936f98783eb59358101aff4f707e/Screen-Shot-2018-10-03-at-6.53.13-PM-1.png" />
            
            </figure><p><i>Long-tail latencies for NGINX CPU processing time (lower is better)</i></p><p>Speaking of latency, let’s check how our new SSDs are doing on that front:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6GzzQt3teUZbjXAuJvW9Cj/cec3aac363553d3b1b23acfb3cb3638f/Screen-Shot-2018-10-04-at-4.59.41-PM.png" />
            
            </figure><p><i>Cache disk IOPS and latency (lower is better)</i></p><p>The trend still holds: G9 is doing better. It’s a good thing that the G9’s SSDs aren’t seeing as many IOPS, since it means we’re not hitting the cache disks as often and can store and process more in CPU and memory. We’ve cut read hits on the cache disks, and their latency, in half. Fewer writes also mean better performance consistency and longevity.</p><p>One metric where the G9 does do more is power consumption, drawing about 55% more than the G8. That’s not a number to brag about, but it is expected when moving from CPUs once rated at 85W TDP to ones rated at 150W TDP, and when considering how much work the G9 servers do:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Kzoz5pwmyzAPfHZRPPcGu/44145ef50ae1385f85ab0a605d14e57e/Screen-Shot-2018-10-03-at-7.12.17-PM-1.png" />
            
            </figure><p>The G9 is actually 1.5x more power efficient than the G8. Temperature readings were checked as well: inlet and outlet chassis temps, as well as CPU temps, are all well within operating range.</p><p>Now that’s muscle! In other words, for every 3 G8 servers, just 2 G9s can take on the same workload. If one of our racks would normally hold 9 G8 servers, we can swap those out for only 6 G9s. Put the other way around, turning up a cage of 10 G9 racks gives us the same capacity as 15 G8 racks!</p><p>We have <a href="https://www.bizjournals.com/sanfrancisco/news/2018/06/14/cloudflare-to-expand-to-200-cities-by-2019.html">big plans</a> to cover our entire network with G9 servers, with most of them destined for the existing cities your site most likely uses. By 2019, you’ll benefit from increased bandwidth and lower wait times, and we’ll benefit from expanding and turning up data centers more quickly and efficiently.</p>
    <div>
      <h3>What's next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Gen X? These are exciting times at Cloudflare. Many teams and engineers are testing, porting, and implementing new things that can help us lower operating costs, explore new products and possibilities, and improve Quality of Service. We’re tackling problems and taking on projects that are unique in the industry.</p><p>Serverless computing like <a href="/introducing-cloudflare-workers/">Cloudflare Workers</a> and beyond will pose new challenges to our infrastructure, as all of our customers can program their own features on Cloudflare’s edge network.</p><p>The network architecture that was conventionally made up of routers, switches, and servers has been merged into 3-in-1 box solutions, allowing Cloudflare services to be set up in locations that weren’t possible before.</p><p>The advent of NVMe and persistent memory, along with the possibility of using SSDs as if they were DRAM, is redefining how we design cache servers and handle <a href="https://www.cloudflare.com/products/argo-smart-routing/">tiered caching</a>. SSDs and memory are no longer treated as the separate entities they used to be.</p><p>Hardware brings the company together like a rug in a living room. Look at how many links I’ve included above to see how we’re one team dedicated to building a better Internet. Everything we do here comes down to how we manage the tons of aluminum and silicon we’ve invested in. There’s a lot of work ahead to develop our hardware and help Cloudflare grow to where we envision ourselves. If you’d like to contribute, we’d love to hear from you.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">7JbK49f9fhN36UtcbBglkp</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
    </channel>
</rss>