
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 21:18:29 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Talk Transcript: How Cloudflare Thinks About Security]]></title>
            <link>https://blog.cloudflare.com/talk-transcript-how-cloudflare-thinks-about-security/</link>
            <pubDate>Tue, 08 Oct 2019 09:00:00 GMT</pubDate>
            <description><![CDATA[ This is the text I used for a talk at artificial intelligence powered translation platform, Unbabel, in Lisbon on September 25, 2019. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Image courtesy of <a href="https://twitter.com/Unbabel/status/1176425247224057856">Unbabel</a></p><p>This is the text I used for a talk at artificial intelligence powered translation platform, <a href="https://unbabel.com">Unbabel</a>, in Lisbon on September 25, 2019.</p><p><i>Bom dia. Eu sou John Graham-Cumming o CTO do Cloudflare. E agora eu vou falar em inglês.</i></p><p>Thanks for inviting me to talk about Cloudflare and how we think about security. I’m about to move to Portugal permanently so I hope I’ll be able to do this talk in Portuguese in a few months.</p><p>I know that most of you don’t have English as a first language so I’m going to speak a little more deliberately than usual. And I’ll make the text of this talk available for you to read.</p><p>But there are no slides today.</p><p>I’m going to talk about how Cloudflare thinks about internal security, how we protect ourselves and how we secure our day to day work. This isn’t a talk about Cloudflare’s products.</p>
    <div>
      <h3>Culture</h3>
      <a href="#culture">
        
      </a>
    </div>
    <p>Let’s begin with culture.</p><p>Many companies have culture statements. I think almost 100% of these are pure nonsense. Culture is how you act every day, not words written on the wall.</p><p>One significant piece of company culture is the internal Security Incident mailing list which anyone in the company can send a message to. And they do! So far this month there have been 55 separate emails to that list reporting a security problem.</p><p>These mails come from all over the company, from every department. Two to three per day. And each mail is investigated by the internal security team. Each mail is assigned a Security Incident issue in our internal Atlassian Jira instance.</p><p>People send: reports that their laptop or phone has been stolen (their credentials get immediately invalidated), suspicions about a weird email that they’ve received (it might be phishing or malware in an attachment), a concern about physical security (for example, someone wanders into the office and starts asking odd questions), that they clicked on a bad link, that they lost their access card, and, occasionally, a security concern about our product.</p><p>Things like stolen or lost laptops and phones happen way more often than you’d imagine. We seem to lose about two per month. For that reason and many others we use full disk encryption on devices, complex passwords and two factor auth on every service employees need to access. And we discourage anyone storing anything on their laptop and ask them to primarily use cloud apps for work. Plus we centrally manage machines and can remote wipe.</p><p>We have a 100% blame free culture. You clicked on a weird link? We’ll help you. Lost your phone? We’ll help you. Think you might have been phished? We’ll help you.</p><p>This has led to a culture of reporting problems, however minor, when they occur. 
It’s our first line of internal defense.</p><p>Just this month I clicked on a link that sent my web browser crazy hopping through redirects until I ended up at a bad place. I reported that to the mailing list.</p><p>I’ve never worked anywhere with such a strong culture of reporting security problems big and small.</p>
    <div>
      <h3>Hackers</h3>
      <a href="#hackers">
        
      </a>
    </div>
    <p>We also use HackerOne to let people report security problems from the outside. This month we’ve received 14 reports of security problems. To be honest, most of what we receive through HackerOne is very low priority. People run automated scanning tools and report the smallest of configuration problems, or, quite often, things that they don’t understand but that look like security problems to them. But we triage and handle them all.</p><p>And people do on occasion report things that we need to fix.</p><p>We also have a private paid bug bounty program where we work with a group of individual hackers (around 150 right now) who get paid for the vulnerabilities that they’ve found.</p><p>We’ve found that this combination of a public responsible disclosure program and then a private paid program is working well. We invite the best hackers who come in through the public program to work with us closely in the private program.</p>
    <div>
      <h3>Identity</h3>
      <a href="#identity">
        
      </a>
    </div>
    <p>So, that’s all about people, internal and external, reporting problems, vulnerabilities, or attacks. A very short step from that is knowing who the people are.</p><p>And that’s where identity and authentication become critical. In fact, as an industry trend identity management and authentication are one of the biggest areas of spending by CSOs and <a href="https://www.cloudflare.com/ciso/">CISOs</a>. And Cloudflare is no different.</p><p>OK, well it is different: instead of spending a lot on identity and authentication we’ve built our own solutions.</p><p>We did not always have good identity practices. In fact, for many years our systems had different logins and passwords and it was a complete mess. When a new employee started, accounts had to be made on Google for email and calendar, on Atlassian for Jira and Wiki, on the VPN, on the WiFi network and then on a myriad of other systems for the blog, HR, SSH, build systems, etc. etc.</p><p>And when someone left, all that had to be undone. And frequently this was done incorrectly. People would leave and accounts would still be left running for a period of time. This was a huge headache for us and is a huge headache for literally every company.</p><p>If I could tell companies one thing they can do to improve their security it would be: sort out identity and authentication. We did and it made things so much better.</p><p>This makes the process of bringing someone on board much smoother and the same when they leave. We can control who accesses what systems from a single control panel.</p><p>I have one login via a product we built called Cloudflare Access and I can get access to pretty much everything. I looked in my LastPass Vault while writing this talk and there are a total of just five username and password combinations, and two of those needed deleting because we’ve migrated those systems to Access.</p><p>So, yes, we use password managers. 
And we lock down everything with high quality passwords and two factor authentication. Everyone at Cloudflare has a Yubikey and access to TOTP (such as Google Authenticator). There are three golden rules: all passwords should be created by the password manager, all authentication has to have a second factor and the second factor cannot be SMS.</p><p>We had great fun rolling out Yubikeys to the company because we did it during our annual retreat in a single company wide sitting. Each year Cloudflare gets the entire company together (now over 1,000 people) in a hotel for two to three days of working together, learning from outside experts and physical and cultural activities.</p><p>Last year the security team gave everyone a pair of physical security tokens (a Yubikey and a Titan Key from Google for Bluetooth) and in an epic session configured everyone’s accounts to use them.</p><p>Note: do not attempt to get 500 people to sync Bluetooth devices in the same room at the same time. Bluetooth cannot cope.</p><p>Another important thing we implemented is automatic timeout of access to a system. If you don’t use access to a system you lose it. That way we don’t have accounts that might have access to sensitive systems that could potentially be exploited.</p>
    <div>
      <h3>Openness</h3>
      <a href="#openness">
        
      </a>
    </div>
    <p>To return to the subject of Culture for a moment, an important Cloudflare trait is openness.</p><p>Some of you may know that back in 2017 Cloudflare had a horrible bug in our software that came to be called Cloudbleed. This bug leaked memory from inside our servers into people’s web browsing. Some of that web browsing was being done by search engine crawlers and ended up in the caches of search engines like Google.</p><p>We had to do two things: stop the actual bug (this was relatively easy and was done in under an hour) and then clean up the equivalent of an oil spill of data. That took longer (about a week to ten days) and was very complicated.</p><p>But from the very first night when we were informed of the problem we began documenting what had happened and what we were doing. I opened an EMACS buffer in the dead of night and started keeping a record.</p><p>That record turned into a giant disclosure blog post that contained the gory details of the error we made, its consequences and how we reacted once the error was known.</p><p>We followed up a few days later with a further long blog post assessing the impact and risk associated with the problem.</p><p>This approach to being totally open ended up being a huge success for us. It increased trust in our product and made people want to work with us more.</p><p>I was on my way to Berlin to give a talk to a large retailer about Cloudbleed when I suddenly realized that the company I was giving the talk at was NOT a customer. And I asked the salesperson I was with what I was doing.</p><p>I walked into their 1,000 person engineering team all assembled to hear my talk. Afterwards the VP of Engineering thanked me saying that our transparency had made them want to work with us rather than their current vendor. 
My talk was really a sales pitch.</p><p>Similarly, at RSA last year I gave a talk about Cloudbleed and a very large company’s CSO came up and asked to use my talk internally to try to encourage their company to be just as open.</p><p>When on July 2 this year we had an outage, which wasn’t security related, we once again blogged in incredible detail about what happened. And once again we heard from people about how our transparency mattered to them.</p><p>The lesson is that being open about mistakes increases trust. And if people trust you then they’ll tend to tell you when there are problems. I get a ton of reports of potential security problems via Twitter or email.</p>
    <div>
      <h3>Change</h3>
      <a href="#change">
        
      </a>
    </div>
    <p>After Cloudbleed we started changing how we write software. Cloudbleed was caused, in part, by the use of memory-unsafe languages. In that case it was C code that could run past the end of a buffer.</p><p>We didn’t want that to happen again and so we’ve prioritized languages where that simply cannot happen, such as Go and Rust. We were very well known for using Go. If you’ve ever visited a Cloudflare website, or used an app that uses us for its API (and given our scale, you have), then your first step was a DNS query to one of our servers.</p><p>That DNS query will have been responded to by a Go program called RRDNS.</p><p>There’s also a lot of Rust being written at Cloudflare and some of our newer products are being created using it. For example, Firewall Rules, which do arbitrary filtering of requests to our customers, are handled by a Rust program that needs to be low latency, stable and secure.</p>
    <div>
      <h3>Security is a company wide commitment</h3>
      <a href="#security-is-a-company-wide-commitment">
        
      </a>
    </div>
    <p>The other post-Cloudbleed change was that any crashes on our machines came under the spotlight from the very top. If a process crashes I personally get emailed about it. And if the team doesn’t take those crashes seriously they get me poking at them until they do.</p><p>We missed the fact that Cloudbleed was crashing our machines and we won’t let that happen again. We use Sentry to correlate information about crashes and the Sentry output is one of the first things I look at in the morning.</p><p>Which, I think, brings up an important point. I spoke earlier about our culture of “If you see something weird, say something” but it’s equally important that security comes from the top down.</p><p>Our CSO, Joe Sullivan, doesn’t report to me, he reports to the CEO. That sends a clear message about where security sits in the company. But, also, the security team itself isn’t sitting quietly in the corner securing everything.</p><p>They are setting standards, acting as trusted advisors, and helping deal with incidents. But their biggest role is to be a source of knowledge for the rest of the company. Everyone at Cloudflare plays a role in keeping us secure.</p><p>You might expect me to have access to all our systems, a passcard that gets me into any room, a login for any service. But the opposite is true: I don’t have access to most things. I don’t need it to get my job done and so I don’t have it.</p><p>This makes me a less attractive target for hackers, and we apply the same rule to everyone. If you don’t need access for your job you don’t get it. That’s made a lot easier by the identity and authentication systems and by our rule about timing out access if you don’t use a service. You probably didn’t need it in the first place.</p><p>The flip side of all of us owning security is that deliberately doing the wrong thing has severe consequences.</p><p>Making a mistake is just fine. 
The person who wrote the bad line of code that caused Cloudbleed didn’t get fired; the person who wrote the bad regex that brought our service to a halt on July 2 is still with us.</p>
    <div>
      <h3>Detection and Response</h3>
      <a href="#detection-and-response">
        
      </a>
    </div>
    <p>Naturally, things do go wrong internally. Things that didn’t get reported. To deal with them we need to detect problems quickly. This is an area where the security team does have real expertise and data.</p><p>We do this by collecting data about how our endpoints (my laptop, a company phone, servers on the edge of our network) are behaving. And this is fed into a homebuilt data platform that allows the security team to alert on anomalies.</p><p>It also allows them to look at historical data in case of a problem that occurred in the past, or to understand when a problem started.</p><p>Initially the team was going to use a commercial data platform or SIEM but they quickly realized that these platforms are incredibly expensive and they could build their own at a considerably lower price.</p><p>Also, Cloudflare handles a huge amount of data. When you’re looking at operating system level events on machines in 194 cities plus every employee you’re dealing with a huge stream. And the commercial data platforms love to charge by the size of that stream.</p><p>We are integrating internal DNS data, activity on individual machines, network netflow information, badge reader logs and operating system level events to get a complete picture of what’s happening on any machine we own.</p><p>When someone joins Cloudflare they travel to our head office in San Francisco for a week of training. Part of that training involves getting their laptop and setting it up and getting familiar with our internal systems and security.</p><p>During one of these orientation weeks a new employee managed to download malware while setting up their laptop. 
Our internal detection systems spotted this happening and the security team popped over to the orientation room and helped the employee get a fresh laptop.</p><p>The time between the malware being downloaded and detected was about 40 minutes.</p><p>If you don’t want to build something like this yourself, take a look at Google’s Chronicle product. It’s very cool.</p><p>One really rich source of data about your organization is DNS. For example, you can often spot malware just by the DNS queries it makes from a machine. If you do one thing then make sure all your machines use a single DNS resolver and get its logs.</p>
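<p>As a toy illustration of that last point, here is a minimal sketch of scanning resolver logs for known-bad names. The log format and the blocklist entries are invented for the example; they are not Cloudflare’s actual format or data.</p>

```python
# Hypothetical sketch: flag queries for known-bad names in a DNS resolver log.
# The log line format and blocklist below are assumptions for illustration.
BLOCKLIST = {"evil-c2.example", "malware-drop.example"}

def suspicious_queries(log_lines):
    """Yield (client_ip, qname) for every query whose name is blocklisted."""
    for line in log_lines:
        parts = line.split()  # assumed: "<timestamp> <client-ip> <qname> <qtype>"
        if len(parts) < 4:
            continue          # skip malformed lines
        client, qname = parts[1], parts[2].rstrip(".").lower()
        if qname in BLOCKLIST:
            yield client, qname

log = [
    "2019-10-08T09:00:01Z 10.0.0.7 evil-c2.example. A",
    "2019-10-08T09:00:02Z 10.0.0.9 blog.cloudflare.com. AAAA",
]
print(list(suspicious_queries(log)))  # [('10.0.0.7', 'evil-c2.example')]
```

<p>A real pipeline would stream logs continuously and correlate hits with the other endpoint data sources mentioned above; the point is just how much signal a single resolver’s query log carries.</p>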
    <div>
      <h3>Edge Security</h3>
      <a href="#edge-security">
        
      </a>
    </div>
    <p>In some ways the most interesting part of Cloudflare is the least interesting from a security perspective. Not because there aren’t great technical challenges to securing machines in 194 cities but because some of the more apparently mundane things I’ve talked about have such huge impact.</p><p><i>Identity, Authentication, Culture, Detection and Response.</i></p><p>But, of course, the edge needs securing. And it’s a combination of physical data center security and software.</p><p>To give you one example let’s talk about SSL private keys. Those keys need to be distributed to our machines so that when an SSL connection is made to one of our servers we can respond. But SSL private keys are… private!</p><p>And we have a lot of them. So we have to distribute private key material securely. This is a hard problem. We encrypt the private keys while at rest and in transport with a separate key that is distributed to our edge machines securely.</p><p>Access to that key is tightly controlled so that no one can start decrypting keys in our database. And if our database leaked then the keys couldn’t be decrypted since the key needed is stored separately.</p><p>And that key is itself GPG encrypted.</p><p>But wait… there’s more!</p><p>We don’t actually want decrypted keys stored in any process that is accessible from the Internet. So we use a technology called Keyless SSL where the keys are kept by a separate process and accessed only when needed to perform operations.</p><p>And Keyless SSL can run anywhere. For example, it doesn’t have to be on the same machine as the machine handling an SSL connection. It doesn’t even have to be in the same country. Some of our customers make use of that to specify where their keys are distributed to.</p>
    <div>
      <h3>Use Cloudflare to secure Cloudflare</h3>
      <a href="#use-cloudflare-to-secure-cloudflare">
        
      </a>
    </div>
    <p>One key strategy of Cloudflare is to eat our own dogfood. If you’ve not heard that term before it’s quite common in the US. The idea is that if you’re making food for dogs you should be so confident in its quality that you’d eat it yourself.</p><p>Cloudflare does the same for security. We use our own products to secure ourselves. But more than that, if we see that there’s a product we don’t currently have in our security toolkit then we’ll go and build it.</p><p>Since Cloudflare is a cybersecurity company we face the same challenges as our customers, but we can also build our way out of those challenges. In this way, our internal security team is also a product team. They help to build or influence the direction of our own products.</p><p>The team is also a Cloudflare customer, using our products to secure us, and we get feedback internally on how well our products work. That makes us more secure and our products better.</p>
    <div>
      <h3>Our customers’ data is more precious than ours</h3>
      <a href="#our-customers-data-is-more-precious-than-ours">
        
      </a>
    </div>
    <p>The data that passes through Cloudflare’s network is private and often very personal. Just think of your web browsing or app use. So we take great care of it.</p><p>We’re handling that data on behalf of our customers. They are trusting us to handle it with care and so we think of it as more precious than our own internal data.</p><p>Of course, we secure both because the security of one is related to the security of the other. But it’s worth thinking about the data you have that, in a way, belongs to your customer and is only in your care.</p>
    <div>
      <h3>Finally</h3>
      <a href="#finally">
        
      </a>
    </div>
    <p>I hope this talk has been useful. I’ve tried to give you a sense of how Cloudflare thinks about security and operates. We don’t claim to be the ultimate geniuses of security and would love to hear your thoughts, ideas and experiences so we can improve.</p><p>Security is not static and requires constant attention, and part of that attention is listening to what’s worked for others.</p><p>Thank you.</p> ]]></content:encoded>
            <category><![CDATA[Portugal]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Events]]></category>
            <guid isPermaLink="false">jiQCzqRII3wd1I0J8JRZq</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[When TCP sockets refuse to die]]></title>
            <link>https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/</link>
            <pubDate>Fri, 20 Sep 2019 15:53:33 GMT</pubDate>
            <description><![CDATA[ We noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!

We naively thought enabling TCP keepalives would be enough... but it isn't! ]]></description>
            <content:encoded><![CDATA[ <p>While working on our <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum server</a>, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4J7NyByY5rLMwGCjildxuX/a80e344de39529860fe89230fff4259c/Tcp_state_diagram_fixed_new.svga.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg">Image</a> by Sergiodc2 CC BY SA 3.0</p><p>In our code, we wanted to make sure we don't hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough... but it isn't. It turns out a fairly modern <a href="https://tools.ietf.org/html/rfc5482">TCP_USER_TIMEOUT</a> socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. <a href="http://codearcana.com/posts/2015/08/28/tcp-keepalive-is-a-lie.html">Many people</a> are confused by this.</p><p>In this blog post, we'll try to show how these options work. We'll show how a TCP socket can time out during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we'll mix the outputs of the <code>tcpdump</code> and the <code>ss -o</code> commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.</p>
    <div>
      <h2>SYN-SENT</h2>
      <a href="#syn-sent">
        
      </a>
    </div>
    <p>Let's start from the simplest case - what happens when one attempts to establish a connection to a server which discards inbound SYN packets?</p><p>The scripts used here <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">are available on our GitHub</a>.</p><p><code>$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.044 IP host.2 &gt; host.1: Flags [S] # second retry
00:07.236 IP host.2 &gt; host.1: Flags [S] # third retry
00:15.427 IP host.2 &gt; host.1: Flags [S] # fourth retry
00:31.560 IP host.2 &gt; host.1: Flags [S] # fifth retry
01:04.324 IP host.2 &gt; host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT</code></p><p>Ok, this was easy. After the <code>connect()</code> syscall, the operating system sends a SYN packet. Since it didn't get any response the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:</p><p><code>$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6</code></p><p>It's possible to overwrite this setting per-socket with the TCP_SYNCNT setsockopt:</p><p><code>int syn_retries = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &amp;syn_retries, sizeof(syn_retries));</code></p><p>The retries are staggered at 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default, the whole process takes 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms will cause the following interaction:</p><p><code>$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.032 IP host.2 &gt; host.1: Flags [S] # second retry
00:05.016 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.024 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.036 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.044 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT</code></p><p>Even though we set the user timeout to 5s, we still saw six SYN retransmissions on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent - at the 1s and 3s marks - and the socket to expire at the 5s mark. Instead, the socket did expire at the 5s mark, but we also saw a further four retransmitted SYN packets aligned to that mark - which makes no sense. Anyhow, we learned a thing - the TCP_USER_TIMEOUT does affect the behaviour of <code>connect()</code>.</p>
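<p>For reference, both knobs can be set from Python before calling <code>connect()</code>. This is a Linux-only sketch using the values from the experiments above; TCP_SYNCNT caps the number of SYN retries and TCP_USER_TIMEOUT is given in milliseconds.</p>

```python
import socket

# Sketch (Linux-only): cap SYN retries and set a 5000 ms user timeout
# before connect(), mirroring the experiment above.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, 6)           # at most 6 SYN retries
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 5000)  # milliseconds

# Read the values back before connecting anywhere.
syn_retries = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT)
user_timeout = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)
print(syn_retries, user_timeout)
s.close()
```

<p>A subsequent <code>connect()</code> to a host that drops SYNs would then fail with ETIMEDOUT on whichever limit is hit first, as shown in the traces above.</p>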
    <div>
      <h2>SYN-RECV</h2>
      <a href="#syn-recv">
        
      </a>
    </div>
    <p>SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about <a href="/syn-packet-handling-in-the-wild/">the SYN and Accept queues in the past</a>. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.</p><p>In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></p><p>Here is how it looks on the wire:</p><p><code>$ sudo ./test-syn-recv.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 &gt; host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 &gt; host.2: Flags [S.] # first retry
00:03.045 IP host.1 &gt; host.2: Flags [S.] # second retry
00:07.301 IP host.1 &gt; host.2: Flags [S.] # third retry
00:15.493 IP host.1 &gt; host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 &gt; host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears</code></p><p>With default settings, the SYN+ACK is re-transmitted at the 1s, 3s, 7s, 15s and 31s marks, and the SYN-RECV socket disappears at the 64s mark.</p><p>Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affects the lifetime of SYN-RECV sockets.</p>
    <div>
      <h2>Final handshake ACK</h2>
      <a href="#final-handshake-ack">
        
      </a>
    </div>
    <p>After receiving the second packet in the TCP handshake - the SYN+ACK - the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.</p><p>Losing this ACK doesn't change anything - the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 &gt; host.2: Flags [S.]
00:01.014 IP host.2 &gt; host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1</code></p><p>As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn't really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature - it basically causes the third ACK to be silently dropped. With this flag set the socket remains in SYN-RECV state until it receives the first packet with actual data:</p><p><code>$ sudo ./test-syn-ack.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 &gt; host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1</code></p><p>The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB - and moved from the SYN to the accept queue - after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket <a href="https://marc.info/?l=linux-netdev&amp;m=118793048828251&amp;w=2">discards the data-less inbound ACK</a>.</p>
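<p>On Linux the same flag is exposed to Python as <code>socket.TCP_DEFER_ACCEPT</code>. A minimal sketch of a listener using it - the value is in seconds, and the kernel internally converts it to a number of SYN+ACK retransmission periods, so reading it back may return a larger, rounded-up value:</p>

```python
import socket

# Sketch (Linux-only): a listening socket with TCP_DEFER_ACCEPT set.
# accept() then only surfaces connections once data has arrived
# (or the defer timer has expired).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, 8)  # seconds
srv.bind(("127.0.0.1", 0))  # ephemeral port for the example
srv.listen(16)

# The kernel rounds up to whole retransmission periods, so this
# may read back as more than the 8 seconds we asked for.
deferred = srv.getsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT)
print(deferred)
srv.close()
```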
    <div>
      <h2>Idle ESTAB is forever</h2>
      <a href="#idle-estab-is-forever">
        
      </a>
    </div>
    <p>Let's move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like:</p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2</code></p><p>These sockets have no running timer by default - they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question - what to do if you don't plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?</p><p>This is where TCP keepalives come in. Let's see it in action - in this example we used the following toggles:</p><ul><li><p>SO_KEEPALIVE = 1 - Let's enable keepalives.</p></li><li><p>TCP_KEEPIDLE = 5 - Send first keepalive probe after 5 seconds of idleness.</p></li><li><p>TCP_KEEPINTVL = 3 - Send subsequent keepalive probes after 3 seconds.</p></li><li><p>TCP_KEEPCNT = 3 - Time out after three failed probes.</p></li></ul><p><code>$ sudo ./test-idle.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 &gt; host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 &gt; host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 &gt; host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 &gt; host.1: Flags [R.], seq 1, ack 1</code></p><p>Indeed! We can clearly see the first probe sent at the 5s mark, and the two remaining probes 3s apart - exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and the final RST is transmitted.</p><p>For keepalives to work, the send buffer must be empty. You can see the keepalive timer active in the "timer:(keepalive)" line.</p>
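The toggles above translate directly into setsockopt calls. In Python on Linux, the configuration used in this experiment might look like this (a sketch of the option setup only, not the full test script):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s of idleness
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # subsequent probes every 3s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after three failed probes
```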
    <div>
      <h2>Keepalives with TCP_USER_TIMEOUT are confusing</h2>
      <a href="#keepalives-with-tcp_user_timeout-are-confusing">
        
      </a>
    </div>
    <p>We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. <a href="https://linux.die.net/man/7/tcp">The tcp(7) manpage</a> is somewhat confusing:</p><p><i>Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.</i></p><p>The original commit message has slightly more detail:</p><ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c75e7e545694a9dd6288553f55c53e2a3a3">tcp: Add TCP_USER_TIMEOUT socket option</a></p></li></ul><p>To understand the semantics, we need to look at the <a href="https://github.com/torvalds/linux/blob/b41dae061bbd722b9d7fa828f35d22035b218e18/net/ipv4/tcp_timer.c#L693-L697">kernel code in linux/net/ipv4/tcp_timer.c:693</a>:</p><p><code>if ((icsk-&gt;icsk_user_timeout != 0 &amp;&amp;
elapsed &gt;= msecs_to_jiffies(icsk-&gt;icsk_user_timeout) &amp;&amp;
icsk-&gt;icsk_probes_out &gt; 0) ||</code></p><p>For the user timeout to have any effect, the <code>icsk_probes_out</code> must not be zero. The check for user timeout is done only <i>after</i> the first probe went out. Let's check it out. Our connection settings:</p><ul><li><p>TCP_USER_TIMEOUT = 5*1000 - 5 seconds</p></li><li><p>SO_KEEPALIVE = 1 - enable keepalives</p></li><li><p>TCP_KEEPIDLE = 1 - send first probe quickly - 1 second idle</p></li><li><p>TCP_KEEPINTVL = 11 - subsequent probes every 11 seconds</p></li><li><p>TCP_KEEPCNT = 3 - send three probes before timing out</p></li></ul><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 &gt; host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response, the TCP stack woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.</p><p>What if we bump TCP_USER_TIMEOUT to a larger value, say one falling between the second and third probe? Then the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:</p><p><code>00:01.022 IP host.2 &gt; host.1: Flags [.] # first probe
00:12.094 IP host.2 &gt; host.1: Flags [.] # second probe
00:23.102 IP host.2 &gt; host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:</p><p><code>00:01.027 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 &gt; host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 &gt; host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 &gt; host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 &gt; host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 &gt; host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 &gt; host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer</code></p><p>We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:</p>
            <pre><code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code></pre>
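In code, picking such a value might look like this (a sketch; the 500 ms margin below the keepalive budget is an arbitrary choice, and the keepalive values are the ones from the earlier example):

```python
import socket

KEEPIDLE, KEEPINTVL, KEEPCNT = 5, 3, 3  # seconds, seconds, probe count

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_USER_TIMEOUT takes milliseconds. Keep it just under
# TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT, so the connection is
# aborted on the KEEPCNT-th probe timer rather than later.
user_timeout_ms = (KEEPIDLE + KEEPINTVL * KEEPCNT) * 1000 - 500
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, user_timeout_ms)
```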
            
    <div>
      <h2>Busy ESTAB socket is not forever</h2>
      <a href="#busy-estab-socket-is-not-forever">
        
      </a>
    </div>
    <p>Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.</p><p>Let's prepare another experiment - after the three-way handshake, let's set up a firewall to drop all packets. Then, let's do a <code>send</code> on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 &gt; host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 &gt; host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 &gt; host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 &gt; host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 &gt; host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 &gt; host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 &gt; host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 &gt; host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 &gt; host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 &gt; host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 &gt; host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 &gt; host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 &gt; host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 &gt; host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 &gt; host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 &gt; host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 &gt; host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT</code></p><p>The data packet is retransmitted 15 times, as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15</code></p><p>From the <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt"><code>ip-sysctl.txt</code></a> documentation:</p><p><i>The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.</i></p><p>The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn't matter at all if we set SO_KEEPALIVE - when the "on" timer is running, keepalives are not engaged.</p><p>TCP_USER_TIMEOUT keeps on working though. The connection will be aborted <i>exactly</i> after the user-specified timeout has elapsed since the last received packet. With the user timeout set, the <code>tcp_retries2</code> value is ignored.</p>
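The hypothetical 924.6-second figure is easy to reproduce: sum the retransmission intervals, with the RTO starting at the 200 ms minimum, doubling on every retry, and capped at the 120 s maximum. This is a back-of-the-envelope sketch, assuming the minimum RTO was in effect from the start:

```python
def hypothetical_timeout(retries, rto=0.2, rto_max=120.0):
    # Sum retries + 1 intervals: the connection is declared dead one
    # full RTO after the last retransmission goes out.
    total = 0.0
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(hypothetical_timeout(15))  # ~924.6 seconds for tcp_retries2 = 15
```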
    <div>
      <h2>Zero window ESTAB is... forever?</h2>
      <a href="#zero-window-estab-is-forever">
        
      </a>
    </div>
    <p>There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.</p><p>In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here's how it looks on the wire:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.], win 1152
00:00.000 IP host.2 &gt; host.1: Flags [.]</code></p><p><code>00:00.202 IP host.2 &gt; host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 &gt; host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 &gt; host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window</code></p><p><code>00:00.456 IP host.2 &gt; host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # nope, still zero-window</code></p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)</code></p><p>The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. Both were immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.</p><p>But the sender is eager to send more! The last two packets show the first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.</p><p>The socket information shows three important things:</p><ul><li><p>The read buffer of the reader is filled - thus the "zero window" throttling is expected.</p></li><li><p>The write buffer of the sender is filled - we have more data to send.</p></li><li><p>The sender has a "persist" timer running, counting the time until the next "window probe".</p></li></ul><p>In this blog post we are interested in timeouts - what will happen if the window probes are lost? Will the sender notice?</p><p>By default, the window probe is retried 15 times - adhering to the usual <code>tcp_retries2</code> setting.</p><p>The TCP timer is in <code>persist</code> state, so the TCP keepalives will <i>not</i> be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.</p><p>As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, as with user-timeout on keepalives, it's engaged only when the retransmission timer fires. During such an event, if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.</p>
    <div>
      <h2>Note about using application timeouts</h2>
      <a href="#note-about-using-application-timeouts">
        
      </a>
    </div>
    <p>In the past we have shared an interesting war story:</p><ul><li><p><a href="/the-curious-case-of-slow-downloads/">The curious case of slow downloads</a></p></li></ul><p>Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug - a slow connection might have been draining the send buffer correctly, just slowly, but the application server didn't notice that.</p><p>We abruptly dropped slow downloads, even though this wasn't our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.</p><p>But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn't drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.</p><p>For typical applications sending data to the Internet, I would recommend:</p><ol><li><p>Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.</p></li><li><p>Set TCP_USER_TIMEOUT to <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>.</p></li><li><p>Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don't stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use <code>ioctl(TIOCOUTQ)</code> for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use the TCP_INFO <code>tcpi_notsent_bytes</code> parameter, which reports only the former counter.</p></li></ol><p>An example of checking the draining pace:</p><p><code>while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second &gt; 12000:
        # pace is above effective rate of 96Kbps, ok!
        ...
    else:
        # socket is too slow...
        ...</code></p><p>There are ways to further improve this logic. We could use <a href="https://lwn.net/Articles/560082/"><code>TCP_NOTSENT_LOWAT</code></a>, although it's generally only useful for situations where the send buffer is relatively empty. Then we could use the <a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"><code>SO_TIMESTAMPING</code></a> interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it's possible to just call <code>close()</code> and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.</p>
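The <code>ioctl(TIOCOUTQ)</code> probe mentioned above can be sketched in Python like this (a minimal illustration; <code>outq_bytes</code> is a hypothetical helper name):

```python
import fcntl
import socket
import struct
import termios

def outq_bytes(sock):
    # TIOCOUTQ reports the bytes still queued in the socket's send buffer:
    # both not-yet-sent data and sent-but-unacknowledged data.
    buf = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, b'\0' * 4)
    return struct.unpack('i', buf)[0]
```

Because it includes in-flight bytes, this overestimates the "not sent" counter; `tcpi_notsent_bytes` from `getsockopt(TCP_INFO)` excludes them, at the cost of unpacking `struct tcp_info` yourself.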
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we discussed five cases where the TCP connection may notice the other party going away:</p><ul><li><p>SYN-SENT: The duration of this state can be controlled by <code>TCP_SYNCNT</code> or <code>tcp_syn_retries</code>.</p></li><li><p>SYN-RECV: It's usually hidden from the application. It is tuned by <code>tcp_synack_retries</code>.</p></li><li><p>An idle ESTABLISHED connection will never notice any issues. A solution is to use TCP keepalives.</p></li><li><p>A busy ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li><li><p>A zero-window ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li></ul><p>The last two ESTABLISHED cases in particular can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection after so many seconds have passed since the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives it should be set to a value slightly lower than <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.</p><p>In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the <code>tcpdump</code> packet capture with the output of <code>ss -o</code> is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.</p><p>Finally, it's surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and the currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. 
The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting - without it sockets with large send buffers were lingering around for way longer than we intended.</p><p>The scripts used in this article <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">can be found on our GitHub</a>.</p><p>Figuring this out has been a collaboration across three Cloudflare offices. Thanks to <a href="https://twitter.com/Hirenpanchasara">Hiren Panchasara</a> from San Jose, <a href="https://twitter.com/warrncn">Warren Nelson</a> from Austin and <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> from Warsaw. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">PTYUwpDIf4wDZ50CejAvL</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[mmproxy - Creative Linux routing to preserve client IP addresses in L7 proxies]]></title>
            <link>https://blog.cloudflare.com/mmproxy-creative-way-of-preserving-client-ips-in-spectrum/</link>
            <pubDate>Tue, 17 Apr 2018 22:11:00 GMT</pubDate>
            <description><![CDATA[ In a previous blog post we discussed how we use the TPROXY iptables module to power Cloudflare Spectrum. With TPROXY we solved a major technical issue on the server side, and we thought we might find another use for it on the client side of our product. ]]></description>
            <content:encoded><![CDATA[ <p>In a previous blog post we discussed <a href="/how-we-built-spectrum/">how we use the <code>TPROXY</code> iptables module</a> to power <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Cloudflare Spectrum</a>. With <code>TPROXY</code> we solved a major technical issue on the server side, and we thought we might find another use for it on the client side of our product.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bo1gYWkQihp0vs8Nk10Xr/5fc224388fa52c30a2f25e982178b5d3/Address-machine-1_-ru-tech-enc-.png" />
            
            </figure><p>This is <a href="https://en.wikipedia.org/wiki/Addressograph">Addressograph</a>. Source <a href="https://upload.wikimedia.org/wikipedia/commons/b/b0/Address-machine-1_%28ru-tech-enc%29.png">Wikipedia</a></p><p>When building an application level proxy, the first consideration is always about retaining real client source IP addresses. Some protocols make it easy, e.g. HTTP has a defined <code>X-Forwarded-For</code> header<a href="#fn1">[1]</a>, but there isn't a similar thing for generic TCP tunnels.</p><p>Others have faced this problem before us, and have devised three general solutions:</p>
    <div>
      <h4>(1) Ignore the client IP</h4>
      <a href="#1-ignore-the-client-ip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6zazUmVWQqBmngOT2c0nro/bcb758fa95b1439b28c41ee2257b18e8/Screen-Shot-2018-04-15-at-12.26.16-PM.png" />
            
            </figure><p>For certain applications it may be okay to ignore the real client IP address. For example, sometimes the client needs to identify itself with a username and password anyway, so the source IP doesn't really matter. In general, it's not a good practice because...</p>
    <div>
      <h4>(2) Nonstandard TCP header</h4>
      <a href="#2-nonstandard-tcp-header">
        
      </a>
    </div>
    <p>A second method was developed by Akamai: the client IP is saved inside a custom option in the TCP header in the SYN packet. Early implementations of this method didn't conform to any standard, e.g. using <a href="https://support.radware.com/app/answers/answer_view/a_id/16143/~/client-ip-visibility-from-akamai-servers-appshape%2B%2B-script-sample">option field 28</a>, but recently <a href="https://tools.ietf.org/html/rfc7974">RFC7974</a> was ratified for this option. We don't support this method for a number of reasons:</p><ul><li><p>The space in TCP headers is very limited. It's insufficient to store the full 128 bits of client IPv6 addresses, especially with 15%+ of Cloudflare’s traffic being IPv6.</p></li><li><p>No software or hardware supports RFC7974 yet.</p></li><li><p>It's surprisingly hard to add support for RFC7974 in real world applications. One option is to patch the operating system and override the <code>getpeername(2)</code> and <code>accept4(2)</code> syscalls; another is to use <code>getsockopt(TCP_SAVED_SYN)</code> to extract the client IP from a SYN packet in the userspace application. Neither technique is simple.</p></li></ul>
    <div>
      <h4>(3) Use the PROXY protocol</h4>
      <a href="#3-use-the-proxy-protocol">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1o8aOHx1OoHCBYvExEoE7S/e81802cb18ce686ad50071d3cc4a1de0/Screen-Shot-2018-04-15-at-12.26.04-PM.png" />
            
            </figure><p>Finally, there is the last method. HAProxy developers, faced with this problem, developed <a href="http://www.haproxy.org/download/1.8/doc/proxy-protocol.txt">the "PROXY protocol"</a>. The premise of this protocol is to prepend client metadata in front of the original data stream. For example, this string could be sent to the origin server in front of proxied data:</p>
            <pre><code>PROXY TCP4 192.0.2.123 104.16.112.25 19235 80\r\n</code></pre>
            <p>As you can see, the PROXY protocol is rather trivial to implement, and is generally sufficient for most use cases. However, it requires application support. The PROXY protocol (v1) is supported by Cloudflare Spectrum, and we highly encourage using it over other methods of keeping client source IP addresses.</p>
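To illustrate just how little there is to it, here is a minimal PROXY protocol v1 header parser in Python (<code>parse_proxy_v1</code> is a hypothetical helper covering only the happy path; a real implementation must also validate the header and bound its length):

```python
def parse_proxy_v1(header: bytes):
    # Parses e.g. b"PROXY TCP4 192.0.2.123 104.16.112.25 19235 80\r\n"
    parts = header.rstrip(b'\r\n').split(b' ')
    if parts[0] != b'PROXY':
        raise ValueError('not a PROXY protocol v1 header')
    proto = parts[1].decode()
    if proto == 'UNKNOWN':
        # The sender could not determine the client addresses.
        return proto, None, None, None, None
    src_ip, dst_ip = parts[2].decode(), parts[3].decode()
    src_port, dst_port = int(parts[4]), int(parts[5])
    return proto, src_ip, dst_ip, src_port, dst_port

print(parse_proxy_v1(b'PROXY TCP4 192.0.2.123 104.16.112.25 19235 80\r\n'))
# ('TCP4', '192.0.2.123', '104.16.112.25', 19235, 80)
```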
    <div>
      <h3>Mmproxy to the rescue</h3>
      <a href="#mmproxy-to-the-rescue">
        
      </a>
    </div>
    <p>But sometimes adding PROXY protocol support to the application isn't an option. This can be the case when the application isn’t open source, or when it's hard to edit. A good example is "sshd" - it doesn't support PROXY protocol and adding the support would be far from trivial. For such applications it may just be impossible to use any application level load balancer whatsoever. This is very unfortunate.</p><p>Fortunately we think we found a workaround.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/24geDq3Y5M6aq37IdIW5oA/aa1c73aeaf35267720d2b395763730c8/Screen-Shot-2018-04-15-at-12.26.28-PM-1.png" />
            
            </figure><p>Allow me to present <code>mmproxy</code>, a PROXY protocol gateway. <code>mmproxy</code> listens for remote connections coming from an application level load balancer, like Spectrum. It then reads a PROXY protocol header, opens a localhost connection to the target application, and duly proxies data in and out.</p><p>Such a proxy wouldn't be too useful if not for one feature - the localhost connection from <code>mmproxy</code> to the target application is sent with the real client source IP.</p><p>That's right, <code>mmproxy</code> spoofs the client IP address. From the application’s point of view, this spoofed connection, coming through Spectrum and <code>mmproxy</code>, is indistinguishable from a real one, connecting directly to the application.</p><p>This technique requires some Linux routing trickery. The <code>mmproxy</code> daemon will walk you through the necessary details, but here are the important bits:</p><ul><li><p><code>mmproxy</code> works only on Linux.</p></li><li><p>Since it forwards traffic over the loopback interface, it must be run on the same machine as the target application.</p></li><li><p>It requires kernel 2.6.28 or newer.</p></li><li><p>It guides the user to add four <code>iptables</code> firewall rules, and four <code>iproute2</code> routing rules, covering both IPv4 and IPv6.</p></li><li><p>For IPv4, <code>mmproxy</code> requires the <code>route_localnet</code> sysctl to be set.</p></li><li><p>For IPv6, it needs a working IPv6 configuration. A working <code>ping6 cloudflare.com</code> is a prerequisite.</p></li><li><p><code>mmproxy</code> needs root or <code>CAP_NET_RAW</code> permissions to set the <code>IP_TRANSPARENT</code> socket option. Once started, it jails itself with <code>seccomp-bpf</code> for a bit of added security.</p></li></ul>
    <div>
      <h3>How to run mmproxy</h3>
      <a href="#how-to-run-mmproxy">
        
      </a>
    </div>
    <p>To run <code>mmproxy</code>, first download the <a href="https://github.com/cloudflare/mmproxy">source</a> and compile it:</p>
            <pre><code>git clone https://github.com/cloudflare/mmproxy.git --recursive
cd mmproxy
make</code></pre>
            <p><a href="https://github.com/cloudflare/mmproxy/issues">Please report any issues on GitHub</a>.</p><p>Then set up the needed configuration:</p>
            <pre><code>sudo iptables -t mangle -I PREROUTING -m mark --mark 123 -j CONNMARK --save-mark
sudo iptables -t mangle -I OUTPUT -m connmark --mark 123 -j CONNMARK --restore-mark
sudo ip rule add fwmark 123 lookup 100
sudo ip route add local 0.0.0.0/0 dev lo table 100
sudo ip6tables -t mangle -I PREROUTING -m mark --mark 123 -j CONNMARK --save-mark
sudo ip6tables -t mangle -I OUTPUT -m connmark --mark 123 -j CONNMARK --restore-mark
sudo ip -6 rule add fwmark 123 lookup 100
sudo ip -6 route add local ::/0 dev lo table 100</code></pre>
            <p>You will also need <code>route_localnet</code> to be set on your default outbound interface, for example for <code>eth0</code>:</p>
            <pre><code>echo 1 | sudo tee /proc/sys/net/ipv4/conf/eth0/route_localnet</code></pre>
            <p>Finally, verify your IPv6 connectivity:</p>
            <pre><code>$ ping6 cloudflare.com
PING cloudflare.com(2400:cb00:2048:1::c629:d6a2) 56 data bytes
64 bytes from 2400:cb00:2048:1::c629:d6a2: icmp_seq=1 ttl=61 time=0.650 ms</code></pre>
            <p>Now, you are ready to run <code>mmproxy</code>. For example, forwarding localhost SSH would look like this:</p>
            <pre><code>$ sudo ./mmproxy --allowed-subnets ./cloudflare-ip-ranges.txt \
      -l 0.0.0.0:2222 \
      -4 127.0.0.1:22 -6 '[::1]:22'
[ ] Remember to set the reverse routing rules correctly:
iptables -t mangle -I PREROUTING -m mark --mark 123 -m comment --comment mmproxy -j CONNMARK --save-mark        # [+] VERIFIED
iptables -t mangle -I OUTPUT -m connmark --mark 123 -m comment --comment mmproxy -j CONNMARK --restore-mark     # [+] VERIFIED
ip6tables -t mangle -I PREROUTING -m mark --mark 123 -m comment --comment mmproxy -j CONNMARK --save-mark       # [+] VERIFIED
ip6tables -t mangle -I OUTPUT -m connmark --mark 123 -m comment --comment mmproxy -j CONNMARK --restore-mark    # [+] VERIFIED
ip rule add fwmark 123 lookup 100               # [+] VERIFIED
ip route add local 0.0.0.0/0 dev lo table 100   # [+] VERIFIED
ip -6 rule add fwmark 123 lookup 100            # [+] VERIFIED
ip -6 route add local ::/0 dev lo table 100     # [+] VERIFIED
[+] OK. Routing to 127.0.0.1 points to a local machine.
[+] OK. Target server 127.0.0.1:22 is up and reachable using conventional connection.
[+] OK. Target server 127.0.0.1:22 is up and reachable using spoofed connection.
[+] OK. Routing to ::1 points to a local machine.
[+] OK. Target server [::1]:22 is up and reachable using conventional connection.
[+] OK. Target server [::1]:22 is up and reachable using spoofed connection.
[+] Listening on 0.0.0.0:2222</code></pre>
            <p>On startup, <code>mmproxy</code> performs a number of self checks. Since we prepared the necessary routing and firewall rules, its self check passes with a "VERIFIED" mark. It's important to confirm these pass.</p><p>We're almost ready to go! The last step is to create a Spectrum application that sends PROXY protocol traffic to <code>mmproxy</code>, port 2222. Here is an example configuration<a href="#fn2">[2]</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6dCc5n5QPZDEHlGMDd5jL2/a2c48e81f2519be09ce8a8ad379b23f9/Screen-Shot-2018-04-15-at-4.06.17-PM.png" />
            
            </figure><p>With Spectrum we are forwarding TCP/22 on domain "ssh.example.org", to our origin at 192.0.2.1, port 2222. We’ve enabled the PROXY protocol toggle.</p>
    <div>
      <h3>mmproxy in action</h3>
      <a href="#mmproxy-in-action">
        
      </a>
    </div>
    <p>Now we can see if it works. My testing VPS has IP address 79.1.2.3. Let's see if the whole setup behaves:</p>
            <pre><code>vps$ nc ssh.example.org 22
SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.1</code></pre>
            <p>Hurray, this worked! The "ssh.example.org" on port 22 is indeed tunneled over Spectrum. Let's see <code>mmproxy</code> logs:</p>
            <pre><code>[+] 172.68.136.1:32654 connected, proxy protocol source 79.1.2.3:0,
        local destination 127.0.0.1:22</code></pre>
            <p>The log confirmed what happened - Cloudflare IP 172.68.136.1 has connected, advertised client IP 79.1.2.3 over the PROXY protocol, and established a spoofed connection to 127.0.0.1:22. The ssh daemon logs show:</p>
            <pre><code>$ tail /var/log/auth.log
Apr 15 14:39:09 ubuntu sshd[7703]: Did not receive identification
        string from 79.1.2.3</code></pre>
            <p>Hurray! It all works! sshd recorded the real client IP address, and with <code>mmproxy</code>’s help it never noticed that the traffic was actually flowing through Cloudflare Spectrum.</p>
    <div>
      <h3>Under the hood</h3>
      <a href="#under-the-hood">
        
      </a>
    </div>
    <p>Under the hood <code>mmproxy</code> relies on two hacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/55C3wnXIZ6x95nvIFJZZcu/c7bfeb61122486e53231927cfa464e44/Screen-Shot-2018-04-15-at-12.26.44-PM-1.png" />
            
            </figure><p>The first hack is about setting source IP on outgoing connections. We are using the well-known <a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">bind-before-connect</a> technique to do this.</p><p>Normally, it's only possible to set a valid source IP that is actually handled by a local machine. We can override this by using the <code>IP_TRANSPARENT</code> socket option. With it set, we can select arbitrary source IP addresses before establishing a legitimate connection handled by the kernel. For example, we can have a localhost socket between, say, 8.8.8.8 and 127.0.0.1, even though 8.8.8.8 may not be explicitly assigned to our server.</p><p>It's worth saying that <code>IP_TRANSPARENT</code> was not created for this use case. This socket option was specifically added to support <a href="/how-we-built-spectrum/">the TPROXY module</a>.</p><p>The second hack is about routing. Normally, response packets coming from the application are routed to the Internet - via a default gateway. We must prevent that from happening, and instead direct these packets towards the loopback interface. To achieve this, we rely on <code>CONNMARK</code> and an additional routing table selected by <code>fwmark</code>. <code>mmproxy</code> sets a MARK value of 123 (by default) on packets it sends, which is preserved at the <code>CONNMARK</code> layer, and restored for the return packets. Then we route the packets with MARK == 123 to a specific routing table (number 100 by default), which force-routes everything back to the loopback interface. We do this by totally <a href="/how-we-built-spectrum/">abusing the AnyIP trick</a> and assigning 0.0.0.0/0 to "local" - meaning that the entire Internet shall be treated as belonging to our machine.</p>
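The first hack can be sketched in a few lines of Python (a bare-bones illustration, not mmproxy's actual code: it needs root or CAP_NET_RAW, the addresses are hypothetical, and the IP_TRANSPARENT fallback constant comes from linux/in.h):

```python
import socket

# Newer Pythons expose socket.IP_TRANSPARENT; fall back to the
# value from linux/in.h otherwise.
IP_TRANSPARENT = getattr(socket, 'IP_TRANSPARENT', 19)

def spoofed_connect(src_ip, dst):
    # Bind-before-connect with an arbitrary (spoofed) source address.
    # Requires root or CAP_NET_RAW for IP_TRANSPARENT.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)
    s.bind((src_ip, 0))   # the source IP need not be assigned to this machine
    s.connect(dst)        # the SYN leaves with the spoofed source address
    return s
```

For example, `spoofed_connect('8.8.8.8', ('127.0.0.1', 22))` would yield a loopback connection that the application sees as coming from 8.8.8.8 - provided the CONNMARK routing rules described below are in place to steer the return packets back to loopback.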
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
<p><code>mmproxy</code> is not the only tool that uses this IP spoofing technique to preserve real client IP addresses. One example is <a href="https://man.openbsd.org/relayd.conf.5">OpenBSD's <code>relayd</code></a> in "transparent" mode. Another is the <a href="https://github.com/UlricE/pen/wiki/Transparent-Reverse-Proxy"><code>pen</code> load balancer</a>. Compared to <code>mmproxy</code>, these tools look heavyweight and require more complex routing.</p><p><code>mmproxy</code> is the first daemon to do just one thing: unwrap the PROXY protocol and spoof the client IP address on local connections going to the application process. While it requires some firewall and routing setup, it's small enough to make an <code>mmproxy</code> deployment acceptable in many situations.</p><p>We hope that <code>mmproxy</code>, while a gigantic hack, can help some of our customers onboard onto Cloudflare Spectrum.</p><p>However, frankly speaking, we don't know. <code><i>mmproxy</i></code><i> should be treated as a great experiment</i>. If you find it useful, let us know! If you find a problem, <a href="https://github.com/cloudflare/mmproxy/issues">please report it</a>! We are looking for feedback. If our users find the <code>mmproxy</code> approach useful, we will repackage it and release it as an easier-to-use tool.</p><hr /><p><i>Does low-level socket work sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland</i>.</p><hr /><ol><li><p>In addition to supporting the standard <code>X-Forwarded-For</code> HTTP header, Cloudflare supports a custom <code>CF-Connecting-IP</code> header. <a href="#fnref1">↩︎</a></p></li><li><p>Spectrum is available for Enterprise plan domains and can be enabled by your account manager. <a href="#fnref2">↩︎</a></p></li></ol>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2t7J0btuLV7WMxngCOuKEP</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Go as a scripting language in Linux]]></title>
            <link>https://blog.cloudflare.com/using-go-as-a-scripting-language-in-linux/</link>
            <pubDate>Tue, 20 Feb 2018 19:49:17 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we like Go. We use it in many in-house software projects as well as parts of bigger pipeline systems. But can we take Go to the next level and use it as a scripting language for our favourite operating system, Linux? ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we like Go. We use it in many <a href="/what-weve-been-doing-with-go/">in-house software projects</a> as well as parts of <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">bigger pipeline systems</a>. But can we take Go to the next level and use it as a scripting language for our favourite operating system, Linux?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xPPT6bpUh23aWgb47ugak/64a8b0f1ad6d51a000034829b8d26fd4/gopher-tux-1.png" />
            
</figure><p><a href="https://golang.org/doc/gopher/gophercolor.png">gopher image</a> <a href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a> <a href="http://reneefrench.blogspot.com/">Renee French</a> | <a href="https://pixabay.com/en/linux-penguin-tux-2025536/">Tux image</a> <a href="https://creativecommons.org/publicdomain/zero/1.0/deed.en">CC0 BY</a> <a href="https://pixabay.com/en/users/OpenClipart-Vectors-30363/">OpenClipart-Vectors</a></p>
    <div>
      <h3>Why consider Go as a scripting language</h3>
      <a href="#why-consider-go-as-a-scripting-language">
        
      </a>
    </div>
<p>Short answer: why not? Go is relatively easy to learn, not too verbose, and there is a huge ecosystem of libraries which can be reused to avoid writing all the code from scratch. Some other potential advantages it might bring:</p><ul><li><p>Go-based build system for your Go project: the <code>go build</code> command is mostly suitable for small, self-contained projects. More complex projects usually adopt some build system/set of scripts. Why not have these scripts written in Go as well?</p></li><li><p>Easy non-privileged package management out of the box: if you want to use a third-party library in your script, you can simply <code>go get</code> it. And because the code will be installed in your <code>GOPATH</code>, getting a third-party library does not require administrative privileges on the system (unlike some other scripting languages). This is especially useful in large corporate environments.</p></li><li><p>Quick code prototyping in the early stages of a project: when you're writing the first iteration of the code, it usually takes a lot of edits even to make it compile, and you waste a lot of keystrokes on the <i>"edit-&gt;build-&gt;check"</i> cycle. Instead you can skip the "build" part and just immediately execute your source file.</p></li><li><p>Strongly-typed scripting language: if you make a small typo somewhere in the middle of a script, most scripting languages will execute everything up to that point and fail on the typo itself. This might leave your system in an inconsistent state. With a strongly-typed language, many typos can be caught at compile time, so the buggy script will not run in the first place.</p></li></ul>
    <div>
      <h3>Current state of Go scripting</h3>
      <a href="#current-state-of-go-scripting">
        
      </a>
    </div>
<p>At first glance Go scripts seem easy to implement with Unix support of <a href="https://en.wikipedia.org/wiki/Shebang_(Unix)">shebang lines</a> for scripts. A shebang line is the first line of the script, which starts with <code>#!</code> and specifies the script interpreter to be used to execute the script (for example, <code>#!/bin/bash</code> or <code>#!/usr/bin/env python</code>), so the system knows exactly how to execute the script regardless of the programming language used. And Go already supports interpreter-like invocation for <code>.go</code> files with the <code>go run</code> command, so it should be just a matter of adding a proper shebang line, something like <code>#!/usr/bin/env go run</code>, to any <code>.go</code> file, setting the executable bit, and we're good to go.</p><p>However, there are problems around using <code>go run</code> directly. <a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">This great post</a> describes in detail all the issues around <code>go run</code> and potential workarounds, but the gist is:</p><ul><li><p><code>go run</code> does not properly return the script's error code back to the operating system. This is important for scripts, because error codes are one of the most common ways multiple scripts interact with each other and with the operating system environment.</p></li><li><p>you can't have a shebang line in a valid <code>.go</code> file, because Go does not know how to process lines starting with <code>#</code>. Other scripting languages do not have this problem, because for most of them <code>#</code> is a way to specify comments, so the final interpreter just ignores the shebang line. Go comments, however, start with <code>//</code>, and <code>go run</code> on invocation will just produce an error like:</p></li></ul>
            <pre><code>package main:
helloscript.go:1:1: illegal character U+0023 '#'</code></pre>
<p><a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">The post</a> describes several workarounds for the above issues, including using a custom wrapper program, <a href="https://github.com/erning/gorun">gorun</a>, as the interpreter, but none of them provides an ideal solution. You either:</p><ul><li><p>have to use a non-standard shebang line, which starts with <code>//</code>. This is technically not even a shebang line, but rather the way the <code>bash</code> shell processes executable text files, so this solution is <code>bash</code> specific. Also, because of the specific behaviour of <code>go run</code>, this line is rather complex and not obvious (see the <a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">original post</a> for examples).</p></li><li><p>have to use a custom wrapper program, <a href="https://github.com/erning/gorun">gorun</a>, in the shebang line. This works well; however, you end up with <code>.go</code> files which are not compilable with the standard <code>go build</code> command because of the illegal <code>#</code> character.</p></li></ul>
    <div>
      <h3>How Linux executes files</h3>
      <a href="#how-linux-executes-files">
        
      </a>
    </div>
<p>OK, it seems the shebang approach does not provide us with an all-round solution. Is there anything else we could use? Let's take a closer look at how the Linux kernel executes binaries in the first place. When you try to execute a binary/script (or any file, for that matter, which has the executable bit set), your shell in the end will just use the Linux <code>execve</code> <a href="https://en.wikipedia.org/wiki/System_call">system call</a>, passing it the filesystem path of the binary in question, the command line parameters and the currently defined environment variables. Then the kernel is responsible for correctly parsing the file and creating a new process with the code from the file. Most of us know that Linux (and many other Unix-like operating systems) uses the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF binary format</a> for its executables.</p><p>However, one of the core principles of Linux kernel development is to avoid "vendor/format lock-in" for any subsystem which is part of the kernel. Therefore, Linux implements a "pluggable" system, which allows any binary format to be supported by the kernel - all you have to do is write a correct module which can parse the format of your choosing. And if you take a closer look at the kernel source code, you'll see that Linux supports more binary formats out of the box. For example, for the recent <code>4.14</code> Linux kernel <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs?h=linux-4.14.y">we can see</a> that it supports at least 7 binary formats (in-tree modules for various binary formats usually have the <code>binfmt_</code> prefix in their names). It is worth noting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/binfmt_script.c?h=linux-4.14.y">binfmt_script</a> module, which is responsible for parsing the above mentioned shebang lines and executing scripts on the target system (not everyone knows that shebang support is actually implemented in the kernel itself and not in the shell or another daemon/process).</p>
    <div>
      <h3>Extending supported binary formats from userspace</h3>
      <a href="#extending-supported-binary-formats-from-userspace">
        
      </a>
    </div>
<p>Since we concluded that shebang is not the best option for our Go scripting, it seems we need something else. Surprisingly, the Linux kernel already has a <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/binfmt_misc.c?h=linux-4.14.y">"something else" binary support module</a> with the appropriate name <code>binfmt_misc</code>. The module allows an administrator to dynamically add support for various executable formats directly from userspace through a well-defined <code>procfs</code> interface and is <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">well-documented</a>.</p><p>Let's follow <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">the documentation</a> and try to set up a binary format description for <code>.go</code> files. First of all, the guide tells you to mount the special <code>binfmt_misc</code> filesystem to <code>/proc/sys/fs/binfmt_misc</code>. If you're using a relatively recent systemd-based Linux distribution, it is highly likely the filesystem is already mounted for you, because systemd by default installs special <a href="https://github.com/systemd/systemd/blob/master/units/proc-sys-fs-binfmt_misc.mount">mount</a> and <a href="https://github.com/systemd/systemd/blob/master/units/proc-sys-fs-binfmt_misc.automount">automount</a> units for this purpose. To double-check, just run:</p>
            <pre><code>$ mount | grep binfmt_misc
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=27,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)</code></pre>
<p>Another way is to check if you have any files in <code>/proc/sys/fs/binfmt_misc</code>: a properly mounted <code>binfmt_misc</code> filesystem will create at least two special files, named <code>register</code> and <code>status</code>, in that directory.</p><p>Next, since we want our <code>.go</code> scripts to be able to properly pass the exit code to the operating system, we need the custom <a href="https://github.com/erning/gorun">gorun</a> wrapper as our "interpreter":</p>
            <pre><code>$ go get github.com/erning/gorun
$ sudo mv ~/go/bin/gorun /usr/local/bin/</code></pre>
<p>Technically we don't need to move <code>gorun</code> to <code>/usr/local/bin</code> or any other system path, as <code>binfmt_misc</code> requires the full path to the interpreter anyway, but the system may run this executable with arbitrary privileges, so it is a good idea to limit access to the file from a security perspective.</p><p>At this point, let's create a simple toy Go script <code>helloscript.go</code> and verify we can successfully "interpret" it. The script:</p>
            <pre><code>package main

import (
	"fmt"
	"os"
)

func main() {
	s := "world"

	if len(os.Args) &gt; 1 {
		s = os.Args[1]
	}

	fmt.Printf("Hello, %v!", s)
	fmt.Println("")

	if s == "fail" {
		os.Exit(30)
	}
}</code></pre>
            <p>Checking if parameter passing and error handling works as intended:</p>
            <pre><code>$ gorun helloscript.go
Hello, world!
$ echo $?
0
$ gorun helloscript.go gopher
Hello, gopher!
$ echo $?
0
$ gorun helloscript.go fail
Hello, fail!
$ echo $?
30</code></pre>
<p>Now we need to tell the <code>binfmt_misc</code> module how to execute our <code>.go</code> files with <code>gorun</code>. Following <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">the documentation</a>, we need this configuration string: <code>:golang:E::go::/usr/local/bin/gorun:OC</code>, which basically tells the system: "if you encounter an executable file with the <code>.go</code> extension, please execute it with the <code>/usr/local/bin/gorun</code> interpreter". The <code>OC</code> flags at the end of the string make sure that the script will be executed according to the owner information and permission bits set on the script itself, and not the ones set on the interpreter binary. This makes Go script execution behaviour the same as that of other executables and scripts in Linux.</p><p>Let's register our new Go script binary format:</p>
            <pre><code>$ echo ':golang:E::go::/usr/local/bin/gorun:OC' | sudo tee /proc/sys/fs/binfmt_misc/register
:golang:E::go::/usr/local/bin/gorun:OC</code></pre>
            <p>If the system successfully registered the format, a new file <code>golang</code> should appear under <code>/proc/sys/fs/binfmt_misc</code> directory. Finally, we can natively execute our <code>.go</code> files:</p>
            <pre><code>$ chmod u+x helloscript.go
$ ./helloscript.go
Hello, world!
$ ./helloscript.go gopher
Hello, gopher!
$ ./helloscript.go fail
Hello, fail!
$ echo $?
30</code></pre>
<p>That's it! Now we can edit <code>helloscript.go</code> to our liking and the changes will be immediately visible the next time the file is executed. Moreover, unlike with the previous shebang approach, we can compile this file at any time into a real executable with <code>go build</code>.</p><hr /><p><i>Whether you like Go or digging into Linux internals, we have positions for either of these, and even both at once. Check out </i><a href="https://www.cloudflare.com/careers/"><i>our careers page.</i></a></p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">5Mg46SluRhqUD4mQaKqahI</guid>
            <dc:creator>Ignat Korchagin</dc:creator>
        </item>
        <item>
            <title><![CDATA[Results of experimenting with Brotli for dynamic web content]]></title>
            <link>https://blog.cloudflare.com/results-experimenting-brotli/</link>
            <pubDate>Fri, 23 Oct 2015 14:24:50 GMT</pubDate>
            <description><![CDATA[ Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. ]]></description>
            <content:encoded><![CDATA[ <p>Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. On expensive mobile data plans, compression even saves money for consumers. However, compression is not free—it comes at a price. It is one of the most compute expensive operations our servers perform, and the better the compression rate we want, the more effort we have to spend.</p><p>The most popular compression format on the web is gzip. We put a great deal of effort into improving the performance of the gzip compression, so we can perform compression on the fly with fewer CPU cycles. Recently a potential replacement for gzip, called Brotli, was announced by Google. Being early adopters for many technologies, we at CloudFlare want to see for ourselves if it is as good as claimed.</p><p>This post takes a look at a bit of history behind gzip and Brotli, followed by a performance comparison.</p>
    <div>
      <h3>Compression 101</h3>
      <a href="#compression-101">
        
      </a>
    </div>
    <p>Many popular lossless compression algorithms rely on LZ77 and Huffman coding, so it’s important to have a basic understanding of these two techniques before getting into gzip or Brotli.</p>
    <div>
      <h4>LZ77</h4>
      <a href="#lz77">
        
      </a>
    </div>
<p>LZ77 is a simple technique developed by Abraham Lempel and Jacob Ziv in 1977 (hence the name). Let's call the input to the algorithm a string (a sequence of bytes, not necessarily letters) and each consecutive sequence of bytes in the input a substring. LZ77 compresses the input string by replacing some of its substrings with pointers (or backreferences) to an identical substring previously encountered in the input.</p><p>The pointer usually has the form <code>&lt;length, distance&gt;</code>, where length indicates the number of identical bytes found, and distance indicates how many bytes separate the current occurrence of the substring from the previous one. For example, the string <code>abcdeabcdf</code> can be compressed with LZ77 to <code>abcde&lt;4,5&gt;f</code>, and <code>aaaaaaaaaa</code> can be compressed to simply <code>a&lt;9,1&gt;</code>. The decompressor, when encountering a backreference, simply copies the required number of bytes from the already decompressed output, which makes decompression very fast.
This is a nice illustration of LZ77 from my previous blog <a href="/improving-compression-with-preset-deflate-dictionary/">Improving compression with a preset DEFLATE dictionary</a>:</p><table><tr><td><p>L</p></td><td><p>i</p></td><td><p>t</p></td><td><p>t</p></td><td><p>l</p></td><td><p>e</p></td><td><p></p></td><td><p>b</p></td><td><p>u</p></td><td><p>n</p></td><td><p>n</p></td><td><p>y</p></td><td><p></p></td><td><p>F</p></td><td><p>o</p></td><td><p>o</p></td><td><p></p></td><td><p>F</p></td><td><p>o</p></td><td><p>o</p></td></tr><tr><td><p></p></td><td><p>W</p></td><td><p>e</p></td><td><p>n</p></td><td><p>t</p></td><td><p></p></td><td><p>h</p></td><td><p>o</p></td><td><p>p</p></td><td><p>p</p></td><td><p>i</p></td><td><p>n</p></td><td><p>g</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>r</p></td><td><p>o</p></td><td><p>u</p></td><td><p>g</p></td></tr><tr><td><p>h</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>f</p></td><td><p>o</p></td><td><p>r</p></td><td><p>e</p></td><td><p>s</p></td><td><p>t</p></td><td><p></p></td><td><p>S</p></td><td><p>c</p></td><td><p>o</p></td><td><p>o</p></td><td><p>p</p></td><td><p>i</p></td><td><p>n</p></td></tr><tr><td><p>g</p></td><td><p></p></td><td><p>u</p></td><td><p>p</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>f</p></td><td><p>i</p></td><td><p>e</p></td><td><p>l</p></td><td><p>d</p></td><td><p></p></td><td><p>m</p></td><td><p>i</p></td><td><p>c</p></td><td><p>e</p></td><td><p></p></td></tr><tr><td><p>A</p></td><td><p>n</p></td><td><p>d</p></td><td><p></p></td><td><p>b</p></td><td><p>o</p></td><td><p>p</p></td><td><p>p</p></td><td><p>i</p></td><td><p>n</p></td><td><p>g</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p>m</p></td><td><p></p></td><td><p>o</p></td><td><p>n</p></td><td><p></p></td></tr><tr><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>h<
/p></td><td><p>e</p></td><td><p>a</p></td><td><p>d</p></td><td><p></p></td><td><p>D</p></td><td><p>o</p></td><td><p>w</p></td><td><p>n</p></td><td><p></p></td><td><p>c</p></td><td><p>a</p></td><td><p>m</p></td><td><p>e</p></td><td><p></p></td><td><p>t</p></td></tr><tr><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>G</p></td><td><p>o</p></td><td><p>o</p></td><td><p>d</p></td><td><p></p></td><td><p>F</p></td><td><p>a</p></td><td><p>i</p></td><td><p>r</p></td><td><p>y</p></td><td><p>,</p></td><td><p></p></td><td><p>a</p></td><td><p>n</p></td><td><p>d</p></td><td><p></p></td><td><p>s</p></td></tr><tr><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>s</p></td><td><p>a</p></td><td><p>i</p></td><td><p>d</p></td><td><p></p></td><td><p>"</p></td><td><p>L</p></td><td><p>i</p></td><td><p>t</p></td><td><p>t</p></td><td><p>l</p></td><td><p>e</p></td><td><p></p></td><td><p>b</p></td><td><p>u</p></td><td><p>n</p></td><td><p>n</p></td></tr><tr><td><p>y</p></td><td><p></p></td><td><p>F</p></td><td><p>o</p></td><td><p>o</p></td><td><p></p></td><td><p>F</p></td><td><p>o</p></td><td><p>o</p></td><td><p></p></td><td><p>I</p></td><td><p></p></td><td><p>d</p></td><td><p>o</p></td><td><p>n</p></td><td><p>'</p></td><td><p>t</p></td><td><p></p></td><td><p>w</p></td><td><p>a</p></td></tr><tr><td><p>n</p></td><td><p>t</p></td><td><p></p></td><td><p>t</p></td><td><p>o</p></td><td><p></p></td><td><p>s</p></td><td><p>e</p></td><td><p>e</p></td><td><p></p></td><td><p>y</p></td><td><p>o</p></td><td><p>u</p></td><td><p></p></td><td><p>S</p></td><td><p>c</p></td><td><p>o</p></td><td><p>o</p></td><td><p>p</p></td><td><p>i</p></td></tr><tr><td><p>n</p></td><td><p>g</p></td><td><p></p></td><td><p>u</p></td><td><p>p</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>f</p></td><td><p>i</p></td><td><p>e</p></td><td><p>l</p></td><td><p>d</p></td><td><p></p></td><td><p>m</p></td><td><p>i</p></td><td><p>c</p></td><td><p>e</p></td></tr><tr><td><p
></p></td><td><p>A</p></td><td><p>n</p></td><td><p>d</p></td><td><p></p></td><td><p>b</p></td><td><p>o</p></td><td><p>p</p></td><td><p>p</p></td><td><p>i</p></td><td><p>n</p></td><td><p>g</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p>m</p></td><td><p></p></td><td><p>o</p></td><td><p>n</p></td></tr><tr><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>e</p></td><td><p></p></td><td><p>h</p></td><td><p>e</p></td><td><p>a</p></td><td><p>d</p></td><td><p>.</p></td><td><p>"</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr></table><p>Output (length tokens are blue, distance tokens are red):</p><table><tr><td><p>L</p></td><td><p>i</p></td><td><p>t</p></td><td><p>t</p></td><td><p>l</p></td><td><p>e</p></td><td><p></p></td><td><p>b</p></td><td><p>u</p></td><td><p>n</p></td><td><p>n</p></td><td><p>y</p></td><td><p></p></td><td><p>F</p></td><td><p>o</p></td><td><p>o</p></td><td><p>5</p></td><td><p>4</p></td><td><p>W</p></td><td><p>e</p></td></tr><tr><td><p>n</p></td><td><p>t</p></td><td><p></p></td><td><p>h</p></td><td><p>o</p></td><td><p>p</p></td><td><p>p</p></td><td><p>i</p></td><td><p>n</p></td><td><p>g</p></td><td><p></p></td><td><p>t</p></td><td><p>h</p></td><td><p>r</p></td><td><p>o</p></td><td><p>u</p></td><td><p>g</p></td><td><p>h</p></td><td><p>3</p></td><td><p>8</p></td></tr><tr><td><p>e</p></td><td><p></p></td><td><p>f</p></td><td><p>o</p></td><td><p>r</p></td><td><p>e</p></td><td><p>s</p></td><td><p>t</p></td><td><p></p></td><td><p>S</p></td><td><p>c</p></td><td><p>o</p></td><td><p>o</p></td><td><p>5</p></td><td><p>28</p></td><td><p>u</p></td><td><p>p</p></td><td><p>6</p></td><td><p>23</p></td><td><p>i</p></td></tr><tr><td><p>e</p></td><td><p>l</p></td><td><p>d</p></td><td><p></p></td><td><p>m</p></td><td><p>i</p></td><td><p>c</p></td><td><p>e</p></td><td><p></p></td><td><p>A</p></td><td><p>n</p></td><td><p>d</p></td
><td><p></p></td><td><p>b</p></td><td><p>9</p></td><td><p>58</p></td><td><p>e</p></td><td><p>m</p></td><td><p></p></td><td><p>o</p></td></tr><tr><td><p>n</p></td><td><p>5</p></td><td><p>35</p></td><td><p>h</p></td><td><p>e</p></td><td><p>a</p></td><td><p>d</p></td><td><p></p></td><td><p>D</p></td><td><p>o</p></td><td><p>w</p></td><td><p>n</p></td><td><p></p></td><td><p>c</p></td><td><p>a</p></td><td><p>m</p></td><td><p>e</p></td><td><p>5</p></td><td><p>19</p></td><td><p>G</p></td></tr><tr><td><p>o</p></td><td><p>o</p></td><td><p>d</p></td><td><p></p></td><td><p>F</p></td><td><p>a</p></td><td><p>i</p></td><td><p>r</p></td><td><p>y</p></td><td><p>,</p></td><td><p></p></td><td><p>a</p></td><td><p>3</p></td><td><p>55</p></td><td><p>s</p></td><td><p>3</p></td><td><p>20</p></td><td><p>s</p></td><td><p>a</p></td><td><p>i</p></td></tr><tr><td><p>d</p></td><td><p></p></td><td><p>"</p></td><td><p>L</p></td><td><p>20</p></td><td><p>149</p></td><td><p>I</p></td><td><p></p></td><td><p>d</p></td><td><p>o</p></td><td><p>n</p></td><td><p>'</p></td><td><p>t</p></td><td><p></p></td><td><p>w</p></td><td><p>a</p></td><td><p>3</p></td><td><p>157</p></td><td><p>t</p></td><td><p>o</p></td></tr><tr><td><p></p></td><td><p>s</p></td><td><p>e</p></td><td><p>e</p></td><td><p></p></td><td><p>y</p></td><td><p>o</p></td><td><p>u</p></td><td><p>56</p></td><td><p>141</p></td><td><p>.</p></td><td><p>"</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr></table><p>The deflate algorithm managed to reduce the original text from 251 characters, to just 152 tokens! Those tokens are later compressed further by Huffman coding.</p>
    <div>
      <h4>Huffman Coding</h4>
      <a href="#huffman-coding">
        
      </a>
    </div>
<p>Huffman coding is another lossless compression algorithm. Developed by David Huffman back in the 1950s, it is used in many compression formats, including JPEG. A Huffman code is a type of prefix code in which, given an alphabet and an input, frequently occurring characters are replaced by shorter bit sequences and rarely occurring characters by longer ones.</p><p>The code can be expressed as a binary tree, where the leaf nodes are the literals of the alphabet and the two edges from each node are marked with 0 and 1. To decode the next character, the decompressor walks the tree from the root until it reaches a leaf.</p><p>Each compression format uses Huffman coding differently, and for our little example we will create a Huffman code for an alphabet that includes only the literals and the length codes we actually used. To start, we must count the frequency of each letter in the LZ77-compressed text:</p><p>(space) - 19, o - 14, e - 11, n - 8, t - 7, a - 6, d - 6, i - 6, 3 - 4, 5 - 4, h - 4, s - 4, u - 4, c - 3, m - 3, p - 3, r - 3, y - 3, (") - 2, F - 2, L - 2, b - 2, g - 2, l - 2, w - 2, (') - 1, (,) - 1, (.) - 1, A - 1, D - 1, G - 1, I - 1, 9 - 1, 20 - 1, S - 1, 56 - 1, W - 1, 6 - 1, f - 1.</p><p>We can then use the algorithm from <a href="https://en.wikipedia.org/wiki/Huffman_coding">Wikipedia</a> to build this Huffman code (binary):</p><p>(space) - 101, o - 000, e - 1101, n - 0101, t - 0011, a - 11101, d - 11110, i - 11100, 3 - 01100, 5 - 01111, h - 10001, s - 11000, u - 10010, c - 00100, m - 111111, p - 110011, r - 110010, y - 111110, (") - 011010, F - 010000, L - 100000, b - 011101, g - 010011, l - 011100, w - 011011, (') - 0100101, (,) - 1001110, (.)
- 1001100, A - 0100011, D - 0100010, G - 1000011, I - 0010110, 9 - 1000010, 20 - 0010111, S - 0010100, 56 - 0100100, W - 0010101, 6 - 1001101, f - 1001111.</p><p>Here the most frequent letters, space and 'o', got the shortest codes, only 3 bits long, whereas the letters that occur only once get 7-bit codes. If we were to represent the full alphabet of 256 bytes plus some length tokens, we would require 9 bits per letter.</p><p>Now apply the code to the LZ77 output (leaving the distance tokens untouched) and we get:</p><p>100000 11100 0011 0011 011100 1101 101 011101 10010 0101 0101 111110 101 010000 000 000 01111 4 0010101 1101 0101 0011 101 10001 000 110011 110011 11100 0101 010011 101 0011 10001 110010 000 10010 010011 01100 8 10001 1101 101 1001111 000 110010 1101 11000 0011 101 0010100 00100 000 000 01111 28 10010 110011 1001101 23 11100 1101 011100 11110 101 111111 11100 00100 1101 101 0100011 0101 11110 101 011101 1000010 58 1101 111111 101 000 0101 01111 35 10001 1101 11101 11110 101 0100010 000 011011 0101 101 00100 11101 111111 1101 01111 19 1000011 000 000 11110 101 010000 11101 11100 110010 111110 1001110 101 11101 01100 55 11000 01100 20 11000 11101 11100 11110 101 011010 100000 0010111 149 0010110 101 11110 000 0101 0100101 0011 101 011011 11101 01100 157 0011 000 101 11000 1101 1101 101 111110 000 10010 0100100 141 1001100 011010</p><p>Assuming all the distance tokens require one byte to store, the total output length is 898 bits, compared to the 1356 bits we would require to store the LZ77 output and 2008 bits for the original input. We achieved a compression ratio of 31%, which is very good. In reality the gain is smaller, since we must also encode the Huffman tree we used; otherwise it would be impossible to decompress the text.</p>
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
    <p>The compression algorithm used in gzip is called "deflate," and it’s a combination of the LZ77 and Huffman algorithms discussed above.</p><p>On the LZ77 side, gzip has a minimum size of 3 bytes and a maximum of 258 bytes for the length tokens and a maximum of 32768 bytes for the distance tokens. The maximum distance also defines the sliding window size the implementation uses for compression and decompression.</p><p>For Huffman coding, deflate has two alphabets. The first alphabet includes the input literals ("letters" 0-255), the end-of-block symbol (the "letter" 256) and the length tokens ("letters" 257-285). There are only 29 letters to encode all the possible lengths and the tokens 265-284 will always have additional bits encoded in the stream to cover the entire range from 3 to 258. For example the letter 257 indicates the minimal length of 3, whereas the letter 265 will indicate the length 11 if followed by the bit 0, and the length 12 if followed by the bit 1.</p><p>The second alphabet is for the distance tokens only. Its letters are the codewords 0 through 29. Similar to the length tokens, the codes 4-29 are followed by 1 to 13 additional bits to cover the whole range from 1 to 32768.</p><p>Using two distinct alphabets makes a lot of sense, because it allows us to represent the distance code with fewer bits while avoiding any ambiguity, since the distance codes always follow the length codes.</p><p>The most common implementation of gzip compression is the zlib library. At CloudFlare, we use a custom version of zlib <a href="/cloudflare-fights-cancer/">optimized for our servers</a>.</p><p>zlib has 9 preset quality settings for the deflate algorithm, labeled from 1 to 9. It can also be run with quality set to 0, in which case it does not perform any compression. 
These quality settings can be divided into two categories:</p><ul><li><p>Fast compression (levels 1-3): When a sufficiently long backreference is found, it is emitted immediately, and the search moves to the position at the end of the matched string. If the match is longer than a few bytes, the entire matched string will not be hashed, meaning its substrings will never be referenced in the future. Clearly, this reduces the compression ratio.</p></li><li><p>Slow compression (levels 4-9): Here, <i>every</i> substring is hashed; therefore, any substring can be referenced in the future. In addition, slow compression enables "lazy" matches. If at a given position a sufficiently long match is found, it will not be immediately emitted. Instead, the algorithm attempts to find a match at the next position. If a longer match is found, the algorithm will emit a single literal at the current position instead of the shorter match, and continue to the next position. As a rule, this results in a better compression ratio than levels 1-3.</p></li></ul><p>Other differences between the quality levels are how far to search for a backreference and how long a match should be before stopping.</p><p>A very decent compression rate can be observed at levels 3-4, which are also quite fast. Increasing compression quality from level 4 and up gives incrementally smaller compression gains, while requiring substantially more time. Often, levels 8 and 9 will produce similarly compressed output, but level 9 will require more time to do it.</p><p>It is important to understand that the zlib implementation does not guarantee the best compression possible with the format, even when the quality setting is set to 9. Instead, it uses a set of heuristics and optimizations that allow for very good compression at reasonable speed.</p><p>Better gzip compression can be achieved by the <a href="https://github.com/google/zopfli">zopfli</a> library, but it is significantly slower than zlib.</p>
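<p>The effect of the quality levels is easy to observe with the zlib bindings that ship with Python (a quick sketch, not our server-side code):</p>

```python
import zlib

# Repetitive input, so backreferences dominate the output.
data = b"hello compression " * 400

# Levels 1-3 use the fast strategy, 4-9 the slow (lazy-matching) one.
sizes = {level: len(zlib.compress(data, level)) for level in range(1, 10)}

# Higher levels never compress worse here; they just spend more time.
assert sizes[9] <= sizes[1] < len(data)
```

<p>On real web assets the gap between levels is larger than on this toy input, but the shape is the same: diminishing size reductions for increasing CPU cost.</p>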
    <div>
      <h4>Brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
    <p>The Brotli format was developed by Google and has been refined for a while now. Here at CloudFlare, we built an nginx module that performs dynamic Brotli compression, and we deployed it on our <a href="https://http2.cloudflare.com/">test server</a> that supports HTTP/2 and other new features.</p><p>To check Brotli compression on the test server you can use a nightly build of the Firefox browser, which also supports the format.</p>
    <div>
      <h4>Brotli, deflate, and gzip</h4>
      <a href="#brotli-deflate-and-gzip">
        
      </a>
    </div>
    <p>Brotli and deflate are very closely related. Brotli also uses the LZ77 and Huffman algorithms for compression. Both algorithms use a sliding window for backreferences. Gzip uses a fixed-size 32KB window, while Brotli can use any window size from 1KB to 16MB, in powers of 2 (minus 16 bytes). This means that the Brotli window can be up to 512 times larger than the deflate window. This difference is almost irrelevant in a web-server context, as text files larger than 32KB are the minority.</p><p>Other differences include a smaller minimal match length (2 bytes in Brotli, compared to 3 bytes in deflate) and a larger maximal match length (16779333 bytes in Brotli, compared to 258 bytes in deflate).</p>
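<p>The window sizes described above follow a simple pattern, sketched here in Python (the bit-width range 10 to 24 is an assumption matching the 1KB-16MB range quoted above):</p>

```python
# Brotli windows: powers of two minus 16 bytes, from ~1KB up to ~16MB.
brotli_windows = [(1 << bits) - 16 for bits in range(10, 25)]
deflate_window = 32 * 1024  # gzip/deflate: fixed 32KB

assert brotli_windows[0] == 1008         # smallest, ~1KB
assert brotli_windows[-1] == 16777200    # largest, ~16MB
# The largest Brotli window is ~512x the deflate window.
assert round(brotli_windows[-1] / deflate_window) == 512
```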
    <div>
      <h4>Static dictionary</h4>
      <a href="#static-dictionary">
        
      </a>
    </div>
    <p>Brotli also features a static dictionary. The "dictionary" supported by deflate can greatly improve compression, but has to be supplied independently and can only be addressed as part of the sliding window. The Brotli dictionary is part of the implementation and can be referenced from anywhere in the stream, somewhat increasing its efficiency for larger files. Moreover, different transformations can be applied to words of the dictionary effectively increasing its size.</p>
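<p>The out-of-band nature of the deflate dictionary can be demonstrated with the <code>zdict</code> parameter of Python's zlib bindings (a sketch with made-up sample data):</p>

```python
import zlib

sample = b"The quick brown fox jumps over the lazy dog"
dictionary = b"quick brown fox lazy dog"

# deflate's preset dictionary must be shipped out-of-band to both sides.
comp = zlib.compressobj(9, zdict=dictionary)
with_dict = comp.compress(sample) + comp.flush()
plain = zlib.compress(sample, 9)
assert len(with_dict) < len(plain)  # the dictionary pays for itself

# The decompressor needs the exact same dictionary.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(with_dict) == sample
```

<p>Brotli removes this coordination problem: its dictionary is baked into the format itself, so every decoder already has it.</p>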
    <div>
      <h4>Context modeling</h4>
      <a href="#context-modeling">
        
      </a>
    </div>
    <p>Brotli also supports something called context modeling. Context modeling is a feature that allows multiple Huffman trees for the same alphabet in the same block. For example, in deflate each block consists of a series of literals (bytes that could not be compressed by backreferencing) and <code>&lt;length, distance&gt;</code> pairs that define a backreference for copying. Literals and lengths form a single alphabet, while the distances are a different alphabet.</p><p>In Brotli, each block is composed of "commands". A command consists of 3 parts. The first part of each command is a word <code>&lt;insert, copy&gt;</code>. "Insert" defines the number of literals that will follow the word, and it may have the value of 0. "Copy" defines the number of bytes to copy from a backreference. The word <code>&lt;insert,copy&gt;</code> is followed by a sequence of "insert" literals: <code>&lt;lit&gt; … &lt;lit&gt;</code>. Finally, the command ends with a <code>&lt;distance&gt;</code>. Distance defines the backreference from which to copy the previously defined number of bytes. Unlike the distance in deflate, a Brotli distance can have additional meanings, such as references to the static dictionary, or references to recently used distances.</p><p>There are three alphabets here. One is for <code>&lt;lit&gt;</code>s and it simply covers all the possible byte values from 0 to 255. The other one is for <code>&lt;distance&gt;</code>s, and its size depends on the size of the sliding window and other parameters. The third alphabet is for the <code>&lt;insert,copy&gt;</code> length pairs, with 704 letters. Here <code>&lt;insert, copy&gt;</code> indicates a single letter in the alphabet, as opposed to <code>&lt;length,distance&gt;</code> pairs in deflate where length and distance are letters in distinct alphabets.</p><p>So why do we care about context modeling? It means that for each of the alphabets, up to 256 different Huffman trees can be used in the same block. 
The switch between different trees is determined by "context". This can be useful when the compressed file consists of different types of characters. For example binary data interleaved with UTF-8 strings, or a multilingual dictionary.</p><p>Whereas the basic idea behind Brotli remains identical to that of deflate, the way the data is encoded is very different. Those improvements allow for significantly better compression, but they also require a significant amount of processing. To alleviate the performance cost somewhat, Brotli drops the error detection CRC check present in gzip.</p>
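<p>The command semantics described above can be sketched with a toy decoder (a hypothetical Python helper; real Brotli entropy-codes every field and adds the special distance meanings):</p>

```python
def apply_commands(commands):
    """Toy decoder for Brotli-style commands (an illustration only).

    Each command is (literals, copy_len, distance): append the literal
    bytes, then copy copy_len bytes starting distance bytes back in the
    output so far.
    """
    out = bytearray()
    for literals, copy_len, distance in commands:
        out += literals
        for _ in range(copy_len):
            out.append(out[-distance])  # byte-at-a-time, so copies may overlap
    return bytes(out)

# Insert "ban", then copy 3 bytes from 2 bytes back: an overlapping match.
assert apply_commands([(b"ban", 3, 2)]) == b"banana"
```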
    <div>
      <h3>Benchmarking</h3>
      <a href="#benchmarking">
        
      </a>
    </div>
    <p>The heaviest operation in both deflate and Brotli is the search for backward references. A higher level in zlib generally means that the backward reference search will attempt to find a better match for longer, but not necessarily succeed. This leads to a significant increase in processing time. In contrast, Brotli trades longer searches for lighter operations that give better ROI, such as context modeling, dictionary references and more efficient data encoding in general. For that reason Brotli is, in theory, capable of outperforming zlib at some stage for similar compression rates, as well as giving better compression at its maximal levels.</p><p>We decided to put those theories to the test. At CloudFlare, one of the primary use cases for Brotli/gzip is on-the-fly compression of textual web assets like HTML, CSS, and JavaScript, so that’s what we’ll be testing.</p><p>There is a tradeoff between compression speed and transfer speed. It is only beneficial to increase your compression ratio if you can reduce the number of bytes you have to transfer faster than you would actually transfer them. Slower compression will actually slow the connection down. CloudFlare currently uses our own zlib implementation with quality set to 8, and that is our benchmarking baseline.</p><p>The benchmark set consists of 10,655 HTML, CSS, and JavaScript files. The benchmarks were performed on an Intel Xeon E3-1241 v3 CPU running at 3.5GHz.</p><p>The files were grouped into several size groups, since different websites have different size characteristics, and we must optimize for them all.</p><p>Although the Brotli implementation accepts quality settings from 0 to 11, we didn't see any differences between 0 and 1, or between 10 and 11, therefore only quality settings 1 to 10 are reported.</p>
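<p>The throughput metric used in the speed benchmark below, (bytes in)/(seconds spent), can be sketched like this (a Python illustration with synthetic files, not the actual benchmark harness):</p>

```python
import time
import zlib

def compress_speed(files, level):
    """Total input bytes divided by total compression time, in MB/s."""
    total_bytes = sum(len(f) for f in files)
    start = time.perf_counter()
    for f in files:
        zlib.compress(f, level)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

# Synthetic stand-ins for the benchmarked web assets.
files = [b"body { margin: 0; } " * 100] * 50
fast = compress_speed(files, 1)
slow = compress_speed(files, 9)
```

<p>Aggregating per size bucket, as the tables below do, just means grouping the files before applying the same formula.</p>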
    <div>
      <h3>Compression quality</h3>
      <a href="#compression-quality">
        
      </a>
    </div>
    <p>Compression quality is measured as (total size of all files after compression)/(total size of all files before compression)*100%. The columns represent the size distribution of the HTML, CSS, and JavaScript files used in the test in bytes (the file size ranges are shown using the [x,y) notation).</p><table><tr><td><p>
</p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>zlib 1</p></td><td><p>65.0%</p></td><td><p>46.8%</p></td><td><p>42.8%</p></td><td><p>38.7%</p></td><td><p>34.4%</p></td><td><p>32.0%</p></td><td><p>29.9%</p></td><td><p>31.1%</p></td><td><p>31.4%</p></td><td><p>31.5%</p></td></tr><tr><td><p>zlib 2</p></td><td><p>64.9%</p></td><td><p>46.5%</p></td><td><p>42.5%</p></td><td><p>38.3%</p></td><td><p>33.8%</p></td><td><p>31.4%</p></td><td><p>29.2%</p></td><td><p>30.3%</p></td><td><p>30.5%</p></td><td><p>30.6%</p></td></tr><tr><td><p>zlib 3</p></td><td><p>64.9%</p></td><td><p>46.4%</p></td><td><p>42.3%</p></td><td><p>38.0%</p></td><td><p>33.6%</p></td><td><p>31.1%</p></td><td><p>28.8%</p></td><td><p>29.8%</p></td><td><p>29.9%</p></td><td><p>30.1%</p></td></tr><tr><td><p>zlib 4</p></td><td><p>64.5%</p></td><td><p>45.8%</p></td><td><p>41.6%</p></td><td><p>37.1%</p></td><td><p>32.6%</p></td><td><p>30.0%</p></td><td><p>27.8%</p></td><td><p>28.5%</p></td><td><p>28.7%</p></td><td><p>28.9%</p></td></tr><tr><td><p>zlib 5</p></td><td><p>64.5%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.7%</p></td><td><p>32.1%</p></td><td><p>29.4%</p></td><td><p>27.1%</p></td><td><p>27.8%</p></td><td><p>27.8%</p></td><td><p>28.0%</p></td></tr><tr><td><p>zlib 6</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.7%</p></td><td><p>32.0%</p></td><td><p>29.3%</p></td><td><p>27.0%</p></td><td><p>27.6%</p></td><td><p>27.6%</p></td><td><p>27.8%</p></td></tr><tr><td><p>zlib 7</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.5%</p></td><td><p>27.7%</p></td></tr><tr><td><p>zlib 8</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.4%</p></td><td><p>27.7%</p></td></tr><tr><td><p>zlib 
9</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.4%</p></td><td><p>27.7%</p></td></tr><tr><td><p>brotli 1</p></td><td><p>61.3%</p></td><td><p>46.6%</p></td><td><p>42.7%</p></td><td><p>38.4%</p></td><td><p>33.8%</p></td><td><p>30.8%</p></td><td><p>28.6%</p></td><td><p>29.5%</p></td><td><p>28.6%</p></td><td><p>29.0%</p></td></tr><tr><td><p>brotli 2</p></td><td><p>61.9%</p></td><td><p>46.9%</p></td><td><p>42.8%</p></td><td><p>38.3%</p></td><td><p>33.6%</p></td><td><p>30.6%</p></td><td><p>28.3%</p></td><td><p>29.1%</p></td><td><p>28.3%</p></td><td><p>28.6%</p></td></tr><tr><td><p>brotli 3</p></td><td><p>61.8%</p></td><td><p>46.8%</p></td><td><p>42.7%</p></td><td><p>38.2%</p></td><td><p>33.3%</p></td><td><p>30.4%</p></td><td><p>28.1%</p></td><td><p>28.9%</p></td><td><p>28.0%</p></td><td><p>28.4%</p></td></tr><tr><td><p>brotli 4</p></td><td><p>53.8%</p></td><td><p>40.8%</p></td><td><p>38.6%</p></td><td><p>34.8%</p></td><td><p>30.9%</p></td><td><p>28.7%</p></td><td><p>27.0%</p></td><td><p>28.0%</p></td><td><p>27.5%</p></td><td><p>27.7%</p></td></tr><tr><td><p>brotli 5</p></td><td><p>49.9%</p></td><td><p>37.7%</p></td><td><p>35.7%</p></td><td><p>32.3%</p></td><td><p>28.7%</p></td><td><p>26.6%</p></td><td><p>25.2%</p></td><td><p>26.2%</p></td><td><p>26.0%</p></td><td><p>26.1%</p></td></tr><tr><td><p>brotli 6</p></td><td><p>50.0%</p></td><td><p>37.7%</p></td><td><p>35.7%</p></td><td><p>32.3%</p></td><td><p>28.6%</p></td><td><p>26.5%</p></td><td><p>25.1%</p></td><td><p>26.0%</p></td><td><p>25.7%</p></td><td><p>25.9%</p></td></tr><tr><td><p>brotli 7</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.6%</p></td><td><p>32.3%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.9%</p></td><td><p>25.5%</p></td><td><p>25.7%</p></td></tr><tr><td><p>brotli 
8</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.6%</p></td><td><p>32.3%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.9%</p></td><td><p>25.4%</p></td><td><p>25.6%</p></td></tr><tr><td><p>brotli 9</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.5%</p></td><td><p>32.2%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.8%</p></td><td><p>25.3%</p></td><td><p>25.5%</p></td></tr><tr><td><p>brotli 10</p></td><td><p>46.8%</p></td><td><p>33.4%</p></td><td><p>32.5%</p></td><td><p>29.4%</p></td><td><p>26.0%</p></td><td><p>23.9%</p></td><td><p>22.9%</p></td><td><p>23.8%</p></td><td><p>23.0%</p></td><td><p>23.3%</p></td></tr></table><p>Clearly, the compression possible by Brotli is significant. On average, Brotli at the maximal quality setting produces 1.19X smaller results than zlib at the maximal quality. For files smaller than 1KB the result is 1.38X smaller on average, a very impressive improvement, that can probably be attributed to the use of static dictionary.</p>
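<p>The headline multipliers follow directly from the table; a reader can verify the arithmetic:</p>

```python
# "All files" column: zlib 9 compresses to 27.7%, brotli 10 to 23.3%;
# for the [20, 1024) bucket: zlib 9 gives 64.4%, brotli 10 gives 46.8%.
assert round(27.7 / 23.3, 2) == 1.19  # ~1.19x smaller output overall
assert round(64.4 / 46.8, 2) == 1.38  # ~1.38x smaller for the smallest files
```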
    <div>
      <h3>Compression speed</h3>
      <a href="#compression-speed">
        
      </a>
    </div>
    <p>Compression speed is measured as (total size of files before compression)/(total time to compress all files) and is reported in MB/s.</p><table><tr><td><p></p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>zlib 1</p></td><td><p>5.9</p></td><td><p>21.8</p></td><td><p>34.4</p></td><td><p>43.4</p></td><td><p>62.1</p></td><td><p>89.8</p></td><td><p>117.9</p></td><td><p>127.9</p></td><td><p>139.6</p></td><td><p>125.5</p></td></tr><tr><td><p>zlib 2</p></td><td><p>5.9</p></td><td><p>21.7</p></td><td><p>34.3</p></td><td><p>43.0</p></td><td><p>61.2</p></td><td><p>87.6</p></td><td><p>114.3</p></td><td><p>123.1</p></td><td><p>130.7</p></td><td><p>118.9</p></td></tr><tr><td><p>zlib 3</p></td><td><p>5.9</p></td><td><p>21.7</p></td><td><p>34.0</p></td><td><p>42.4</p></td><td><p>60.5</p></td><td><p>84.8</p></td><td><p>108.0</p></td><td><p>114.5</p></td><td><p>114.9</p></td><td><p>106.9</p></td></tr><tr><td><p>zlib 4</p></td><td><p>5.8</p></td><td><p>20.9</p></td><td><p>32.2</p></td><td><p>39.8</p></td><td><p>54.8</p></td><td><p>74.9</p></td><td><p>93.3</p></td><td><p>97.5</p></td><td><p>96.1</p></td><td><p>90.7</p></td></tr><tr><td><p>zlib 5</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>31.4</p></td><td><p>38.3</p></td><td><p>51.6</p></td><td><p>68.4</p></td><td><p>82.0</p></td><td><p>81.3</p></td><td><p>73.2</p></td><td><p>71.6</p></td></tr><tr><td><p>zlib 6</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>31.2</p></td><td><p>37.9</p></td><td><p>50.6</p></td><td><p>64.0</p></td><td><p>73.7</p></td><td><p>70.2</p></td><td><p>57.5</p></td><td><p>58.0</p></td></tr><tr><td><p>zlib 7</p></td><td><p>5.8</p></td><td><p>20.5</p></td><td><p>31.0</p></td><td><p>37.4</p></td><td><p>49.6</p></td><td><p>60.8</p></td><td><p>67.4</p></td><td><p>64.6</p></td><td><p>51.0</p></td><td><p>52.0</p></td></tr><tr><td><p>zlib 8</p></td><td><p>5.8</p></td><td><p>20.5</p></td><td><p>31.0</p></td><td><p>37.2</p></td><td><p>48.8</p></td><td><p>53.2</p></td><td><p>56.6</p></td><td><p>56.5</p></td><td><p>41.6</p></td><td><p>43.1</p></td></tr><tr><td><p>zlib 
9</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>30.8</p></td><td><p>37.3</p></td><td><p>48.6</p></td><td><p>51.7</p></td><td><p>56.6</p></td><td><p>54.2</p></td><td><p>40.4</p></td><td><p>41.9</p></td></tr><tr><td><p>brotli 1</p></td><td><p>3.4</p></td><td><p>12.8</p></td><td><p>20.4</p></td><td><p>25.9</p></td><td><p>37.8</p></td><td><p>57.3</p></td><td><p>80.0</p></td><td><p>94.1</p></td><td><p>105.8</p></td><td><p>91.3</p></td></tr><tr><td><p>brotli 2</p></td><td><p>3.4</p></td><td><p>12.4</p></td><td><p>19.5</p></td><td><p>24.4</p></td><td><p>35.2</p></td><td><p>52.3</p></td><td><p>71.2</p></td><td><p>82.0</p></td><td><p>89.0</p></td><td><p>78.8</p></td></tr><tr><td><p>brotli 3</p></td><td><p>3.4</p></td><td><p>12.3</p></td><td><p>19.0</p></td><td><p>23.7</p></td><td><p>34.0</p></td><td><p>49.8</p></td><td><p>67.4</p></td><td><p>76.3</p></td><td><p>81.5</p></td><td><p>73.0</p></td></tr><tr><td><p>brotli 4</p></td><td><p>2.0</p></td><td><p>7.6</p></td><td><p>11.9</p></td><td><p>15.2</p></td><td><p>22.2</p></td><td><p>33.1</p></td><td><p>44.7</p></td><td><p>51.9</p></td><td><p>58.5</p></td><td><p>51.0</p></td></tr><tr><td><p>brotli 5</p></td><td><p>2.0</p></td><td><p>5.2</p></td><td><p>8.0</p></td><td><p>10.3</p></td><td><p>15.0</p></td><td><p>22.0</p></td><td><p>29.7</p></td><td><p>33.3</p></td><td><p>32.8</p></td><td><p>30.3</p></td></tr><tr><td><p>brotli 6</p></td><td><p>1.8</p></td><td><p>3.8</p></td><td><p>5.5</p></td><td><p>7.0</p></td><td><p>10.5</p></td><td><p>16.3</p></td><td><p>23.5</p></td><td><p>28.6</p></td><td><p>28.4</p></td><td><p>25.6</p></td></tr><tr><td><p>brotli 7</p></td><td><p>1.5</p></td><td><p>2.3</p></td><td><p>3.1</p></td><td><p>3.7</p></td><td><p>4.9</p></td><td><p>7.2</p></td><td><p>10.7</p></td><td><p>15.5</p></td><td><p>19.6</p></td><td><p>16.2</p></td></tr><tr><td><p>brotli 
8</p></td><td><p>1.4</p></td><td><p>2.3</p></td><td><p>2.7</p></td><td><p>3.1</p></td><td><p>4.0</p></td><td><p>5.3</p></td><td><p>7.1</p></td><td><p>10.6</p></td><td><p>15.1</p></td><td><p>12.2</p></td></tr><tr><td><p>brotli 9</p></td><td><p>1.3</p></td><td><p>2.1</p></td><td><p>2.4</p></td><td><p>2.8</p></td><td><p>3.4</p></td><td><p>4.3</p></td><td><p>5.5</p></td><td><p>7.0</p></td><td><p>10.6</p></td><td><p>8.8</p></td></tr><tr><td><p>brotli 10</p></td><td><p>0.2</p></td><td><p>0.4</p></td><td><p>0.4</p></td><td><p>0.5</p></td><td><p>0.5</p></td><td><p>0.6</p></td><td><p>0.6</p></td><td><p>0.6</p></td><td><p>0.5</p></td><td><p>0.5</p></td></tr></table><p>On average for all files, we can see that Brotli at quality level 4 is slightly faster than zlib at quality level 8 (and 9) while having comparable compression ratio. However that is misleading. Most files are smaller than 64KB, and if we look only at those files then Brotli 4 is actually 1.48X slower than zlib level 8!</p>
    <div>
      <h3>Connection speedup</h3>
      <a href="#connection-speedup">
        
      </a>
    </div>
    <p>For on-the-fly compression, the most important question is how much time to invest in compression to make data transfer faster. Because increasing compression quality only gives incremental improvement over a given level, we need the bytes saved by a higher setting to outweigh the extra time spent saving them.</p><p>Again, CloudFlare uses compression quality 8 with zlib, so that’s our baseline. For each quality setting of Brotli starting at 4 (which is somewhat comparable to zlib 8 in terms of both time and compression ratio), we compute the added compression speed as: ((total size for zlib 8) - (total size after compression with Brotli))/((total time for Brotli) - (total time for zlib 8)).</p><p>The results are reported in MB/s. Negative numbers indicate that Brotli compressed worse than the zlib 8 baseline.</p><table><tr><td><p>
</p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>brotli 4</p></td><td><p>0.33</p></td><td><p>0.56</p></td><td><p>0.52</p></td><td><p>0.47</p></td><td><p>0.42</p></td><td><p>0.44</p></td><td><p>-0.24</p></td><td><p>-2.86</p></td><td><p>0.02</p></td><td><p>0.00</p></td></tr><tr><td><p>brotli 5</p></td><td><p>0.44</p></td><td><p>0.55</p></td><td><p>0.60</p></td><td><p>0.61</p></td><td><p>0.71</p></td><td><p>0.97</p></td><td><p>1.01</p></td><td><p>1.08</p></td><td><p>2.24</p></td><td><p>1.58</p></td></tr><tr><td><p>brotli 6</p></td><td><p>0.36</p></td><td><p>0.36</p></td><td><p>0.37</p></td><td><p>0.37</p></td><td><p>0.45</p></td><td><p>0.63</p></td><td><p>0.69</p></td><td><p>0.86</p></td><td><p>1.52</p></td><td><p>1.12</p></td></tr><tr><td><p>brotli 7</p></td><td><p>0.28</p></td><td><p>0.20</p></td><td><p>0.19</p></td><td><p>0.18</p></td><td><p>0.19</p></td><td><p>0.23</p></td><td><p>0.24</p></td><td><p>0.34</p></td><td><p>0.72</p></td><td><p>0.52</p></td></tr><tr><td><p>brotli 8</p></td><td><p>0.26</p></td><td><p>0.20</p></td><td><p>0.17</p></td><td><p>0.15</p></td><td><p>0.15</p></td><td><p>0.17</p></td><td><p>0.15</p></td><td><p>0.21</p></td><td><p>0.48</p></td><td><p>0.35</p></td></tr><tr><td><p>brotli 9</p></td><td><p>0.25</p></td><td><p>0.19</p></td><td><p>0.15</p></td><td><p>0.13</p></td><td><p>0.13</p></td><td><p>0.13</p></td><td><p>0.11</p></td><td><p>0.13</p></td><td><p>0.30</p></td><td><p>0.24</p></td></tr><tr><td><p>brotli 10</p></td><td><p>0.03</p></td><td><p>0.04</p></td><td><p>0.04</p></td><td><p>0.04</p></td><td><p>0.03</p></td><td><p>0.03</p></td><td><p>0.02</p></td><td><p>0.02</p></td><td><p>0.02</p></td><td><p>0.02</p></td></tr></table><p>Those numbers are quite low, due to the slow speed of Brotli. To get any speedup we really want to see those numbers being greater than the connection speed. 
It seems that for files greater than 64KB, Brotli at quality setting 5 can speed up slow connections.</p><p>Keep in mind that on a real server, compression is only one of many tasks that share the CPU, and compression speeds would be slower there.</p>
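<p>The added-compression-speed formula above can be written out directly (a sketch; the numbers in the example are hypothetical, not taken from the benchmark):</p>

```python
def added_compression_speed(size_zlib8, time_zlib8, size_brotli, time_brotli):
    """Extra bytes saved per extra second spent compressing, in MB/s."""
    return (size_zlib8 - size_brotli) / (time_brotli - time_zlib8) / 1e6

# Hypothetical numbers: saving 100KB at the cost of 40ms of extra CPU time
# "adds" 2.5MB/s. On a connection faster than that, the extra compression
# time costs more than the saved bytes are worth.
speedup = added_compression_speed(1_000_000, 0.01, 900_000, 0.05)
```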
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>The current state of Brotli gives us some mixed impressions. There is no yes/no answer to the question "Is Brotli better than gzip?". It definitely looks like a big win for static content compression, but on the web, where content is dynamic, we also need to consider on-the-fly compression.</p><p>The way I see it, Brotli already has an advantage over zlib for large files (larger than 64KB) on slow connections. However, those constitute only 20% of our sampled dataset (and 80% of the total size).</p><p>Our Brotli module has a minimal size setting for Brotli compression that allows us to use gzip for smaller files and Brotli only for large ones.</p><p>It is important to remember that zlib has had the advantage of years of optimization by the entire web community, while Brotli is the development effort of a small but capable and talented team. There is no doubt that the current implementation will only improve with time.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">499uvv9lvLrJ47pbY7HZTq</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Single RX queue kernel bypass in Netmap for high packet rate networking]]></title>
            <link>https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/</link>
            <pubDate>Fri, 09 Oct 2015 10:26:42 GMT</pubDate>
            <description><![CDATA[ In a previous post we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput.  ]]></description>
            <content:encoded><![CDATA[ <p>In <a href="/kernel-bypass/">a previous post</a> we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput. Unfortunately, none of the discussed open source solutions supported our needs. To improve the situation we decided to contribute to the <a href="http://info.iet.unipi.it/~luigi/netmap">Netmap project</a>. In this blog post we'll describe our proposed changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wUCqE4w1PE0nbGFJLjIX3/8e30b5ce1a117131d4929b29bf536fc1/122715232_32da8cd353_o-1.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/binary_koala/122715232">image</a> by Binary Koala</p>
    <div>
      <h3>Our needs</h3>
      <a href="#our-needs">
        
      </a>
    </div>
    <p>At CloudFlare we are constantly dealing with large packet floods. Our network receives a large volume of packets, often coming from many simultaneous attacks. In fact, it is entirely possible that the server which just served you this blog post is dealing with a flood of many millions of packets per second <i>right now</i>.</p><p>Since the Linux kernel can't really handle a large volume of packets, we need to work around it. During packet floods we offload selected network flows (belonging to a flood) to a user space application. This application filters the packets at very high speed. Most of the packets are dropped, as they belong to a flood. The small number of "valid" packets are injected back into the kernel and handled in the same way as usual traffic.</p><p>It’s important to emphasize that the kernel bypass is enabled only for selected flows, which means that all other packets go to the kernel as usual.</p><p>This setup works perfectly on our servers with Solarflare network cards - we can use the <code>ef_vi</code> API to achieve the kernel bypass. Unfortunately, we don’t have this functionality on our servers with Intel IXGBE NICs.</p><p>This is when <a href="http://info.iet.unipi.it/~luigi/netmap/">Netmap</a> comes in.</p>
    <div>
      <h4>Netmap</h4>
      <a href="#netmap">
        
      </a>
    </div>
    <p>Over the last few months we’ve been thinking hard about how to achieve bypass for selected flows (aka: bifurcated driver) on non-Solarflare network cards.</p><p>We’ve considered PF_RING, DPDK and other custom solutions, but sadly all of them take over the whole network card. Eventually we decided that the best way would be to patch Netmap with the functionality we need.</p><p>We chose Netmap because:</p><ul><li><p>It’s fully open source and released under a BSD license.</p></li><li><p>It has a great NIC-agnostic API.</p></li><li><p>It’s very fast: can reach line rate easily.</p></li><li><p>The project is well maintained and reasonably mature.</p></li><li><p>The code is very high quality.</p></li><li><p>The driver-specific modifications are trivial: most of the magic happens in the shared Netmap module. It’s easy to add support for new hardware.</p></li></ul>
    <div>
      <h3>Introducing the single RX queue mode</h3>
      <a href="#introducing-the-single-rx-queue-mode">
        
      </a>
    </div>
    <p>Usually, when a network card goes into the Netmap mode, all the RX queues get disconnected from the kernel and are available to the Netmap applications.</p><p>We don't want that. We want to keep most of the RX queues back in the kernel mode, and enable Netmap mode only on selected RX queues. We call this functionality: "single RX queue mode".</p><p>The intention was to expose a minimal API which could:</p><ul><li><p>Open a network interface in "a single RX queue mode".</p></li><li><p>This would allow netmap applications to receive packets from that specific RX queue.</p></li><li><p>While leaving all the other queues attached to the host network stack.</p></li><li><p>On demand add or remove RX queues from the "single RX queue mode".</p></li><li><p>Eventually remove the interface from the Netmap mode and reattach the RX queues to the host stack.</p></li></ul><p>The patch to Netmap is awaiting code review and is available here:</p><ul><li><p><a href="https://github.com/luigirizzo/netmap/pull/87">https://github.com/luigirizzo/netmap/pull/87</a></p></li></ul><p>The minimal program receiving packets from <code>eth3</code> RX queue #4 would look like:</p>
            <pre><code>struct nm_desc *d = nm_open("netmap:eth3~4", NULL, 0, 0);
while (1) {
    struct pollfd fds = { .fd = d-&gt;fd, .events = POLLIN };
    poll(&amp;fds, 1, -1);

    struct netmap_ring *ring = NETMAP_RXRING(d-&gt;nifp, 4);
    while (!nm_ring_empty(ring)) {
        unsigned int i = ring-&gt;cur;
        char *buf = NETMAP_BUF(ring, ring-&gt;slot[i].buf_idx);
        uint16_t len = ring-&gt;slot[i].len;
        // process(buf, len);
        ring-&gt;head = ring-&gt;cur = nm_ring_next(ring, i);
    }
}</code></pre>
            <p>This code is very close to a Netmap example program. Indeed the only difference is the <code>nm_open()</code> call, which uses the new syntax <code>netmap:ifname~queue_number</code>.</p><p>Once again, when running this code only packets arriving on the RX queue #4 will go to the netmap program. All other RX and TX queues will be handled by the Linux kernel network stack.</p><p>You can find a more complete example here:</p><ul><li><p><a href="https://github.com/jibi/nm-single-rx-queue">https://github.com/jibi/nm-single-rx-queue</a></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2a0bUYxkJn3js1Nnve8vV6/cabf675555d51efb16d90372405aa07e/RX_bypass.png" />
            
            </figure>
    <div>
      <h4>Isolating a queue</h4>
      <a href="#isolating-a-queue">
        
      </a>
    </div>
    <p>In multiqueue network cards, any packet can end up in almost any RX queue due to RSS. This is why before enabling the single RX mode it is necessary to make sure only the selected flow goes to the Netmap queue.</p><p>To do so it is necessary to:</p><ul><li><p>Modify the <b>indirection table</b> to ensure no new RSS-hashed packets will go there.</p></li><li><p>Use <b>flow steering</b> to specifically direct some flows to the isolated queue.</p></li><li><p>Work around <b>RFS</b> - make sure no other application is running on the CPU Netmap will run on.</p></li></ul><p>For example:</p>
            <pre><code>$ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
$ ethtool -K eth3 ntuple on
$ ethtool -N eth3 flow-type udp4 dst-port 53 action 4</code></pre>
            <p>Here we are setting the indirection table to prevent traffic from going to RX queue #4. Then we are enabling flow steering to enqueue all UDP traffic with destination port 53 into queue #4.</p>
    <div>
      <h4>Trying it out</h4>
      <a href="#trying-it-out">
        
      </a>
    </div>
    <p>Here's how to run it with the IXGBE NIC. First grab the sources:</p>
            <pre><code>$ git clone https://github.com/jibi/netmap.git
$ cd netmap
$ git checkout -B single-rx-queue-mode
$ ./configure --drivers=ixgbe --kernel-sources=/path/to/kernel</code></pre>
            <p>Load the netmap-patched modules and setup the interface:</p>
            <pre><code>$ insmod ./LINUX/netmap.ko
$ insmod ./LINUX/ixgbe/ixgbe.ko
$ # Distribute the interrupts:
$ (let CPU=0; cd /sys/class/net/eth3/device/msi_irqs/; for IRQ in *; do \
  echo $CPU &gt; /proc/irq/$IRQ/smp_affinity_list; let CPU+=1
         done)
$ # Enable flow steering (ntuple filters):
$ ethtool -K eth3 ntuple on</code></pre>
            <p>At this point we started flooding the interface with 6M short UDP packets per second. <code>htop</code> shows the server being totally busy with handling the flood:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7etmtoEPnxIeh6CPP8mrJd/ef05cbd9ff6626ab3adf6d4ded361877/htop1-1.png" />
            
            </figure><p>To counter the flood we started Netmap. First, we needed to edit the indirection table, to isolate the RX queue #4:</p>
            <pre><code>$ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
$ ethtool -N eth3 flow-type udp4 dst-port 53 action 4</code></pre>
            <p>This caused all the flood packets to go to RX queue #4.</p><p>Before putting an interface in Netmap mode it is necessary to turn off hardware offload features:</p>
            <pre><code>$ ethtool -K eth3 lro off gro off</code></pre>
            <p>Finally we launched the netmap offload:</p>
            <pre><code>$ sudo taskset -c 15 ./nm_offload eth3 4
[+] starting test02 on interface eth3 ring 4
[+] UDP pps: 5844714
[+] UDP pps: 5996166
[+] UDP pps: 5863214
[+] UDP pps: 5986365
[+] UDP pps: 5867302
[+] UDP pps: 5964911
[+] UDP pps: 5909715
[+] UDP pps: 5865769
[+] UDP pps: 5906668
[+] UDP pps: 5875486</code></pre>
            <p>As you can see, the netmap program on a single RX queue was able to receive about 5.8M packets per second.</p><p>For completeness, here's an <code>htop</code> showing only a single core being busy with Netmap:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/72C2hRcoZTkoaTJZLaqgoi/5563f1256fe254287e9d77272da0ea97/htop2-1.png" />
            
            </figure>
    <div>
      <h4>Thanks</h4>
      <a href="#thanks">
        
      </a>
    </div>
    <p>We would like to thank Pavel Odintsov who suggested the possibility of using Netmap this way. He even prepared <a href="http://www.stableit.ru/2015/06/how-to-run-netmap-on-single-queue-and.html">the initial hack</a> we based our work on.</p><p>We would also like to thank Luigi Rizzo, for his Netmap work and great feedback on our patches.</p>
    <div>
      <h4>Final words</h4>
      <a href="#final-words">
        
      </a>
    </div>
    <p>At CloudFlare our application stack is based on open source software. We’re grateful to so many open source programmers for their awesome work. Whenever we can we try to contribute back to the community - we hope "the single RX Netmap mode" will be useful to others.</p><p>You can find more CloudFlare open source <a href="https://cloudflare.github.io/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4Sezx7V7TGi5C7AwEQSlvL</guid>
            <dc:creator>Gilberto Bertin</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to receive a million packets per second]]></title>
            <link>https://blog.cloudflare.com/how-to-receive-a-million-packets/</link>
            <pubDate>Tue, 16 Jun 2015 13:47:27 GMT</pubDate>
            <description><![CDATA[ Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!" ]]></description>
            <content:encoded><![CDATA[ <p>Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"</p><p>That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what <i>is</i> the Linux networking stack capable of? Let's rephrase that to make it more fun:</p><p><i>On Linux, how hard is it to write a program that receives 1 million UDP packets per second?</i></p><p>Hopefully, answering this question will be a good lesson about the design of a modern networking stack.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TxTgfzBLcWrVIiBMzzmQu/db363b43ce3424121e5e76f7121518c5/3208129302_7e14f00492_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/mccaffrey_uk/3208129302">image</a> by <a href="https://www.flickr.com/photos/mccaffrey_uk">Bob McCaffrey</a></p><p>First, let us assume:</p><ul><li><p>Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve high Bps by better pipelining and sending longer packets. Improving pps is much harder.</p></li><li><p>Since we're interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer.</p></li><li><p>For the experiments we will use two physical servers: "receiver" and "sender".</p></li><li><p>They both have two six-core 2GHz Xeon processors. With hyperthreading (HT) enabled that amounts to 24 logical processors on each box. The boxes have a multi-queue 10G network card by Solarflare, with 11 receive queues configured. More on that later.</p></li><li><p>The source code of the test programs is available here: <a href="https://github.com/majek/dump/blob/master/how-to-receive-a-million-packets/udpsender.c"><code>udpsender</code></a>, <a href="https://github.com/majek/dump/blob/master/how-to-receive-a-million-packets/udpreceiver1.c"><code>udpreceiver</code></a>.</p></li></ul>
    <div>
      <h3>Prerequisites</h3>
      <a href="#prerequisites">
        
      </a>
    </div>
    <p>Let's use port 4321 for our UDP packets. Before we start we must ensure the traffic won't be interfered with by the <code>iptables</code>:</p>
            <pre><code>receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT
receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK</code></pre>
            <p>A couple of explicitly defined IP addresses will come in handy later:</p>
            <pre><code>receiver$ for i in `seq 1 20`; do \
              ip addr add 192.168.254.$i/24 dev eth2; \
          done
sender$ ip addr add 192.168.254.30/24 dev eth3</code></pre>
            <h3>1. The naive approach</h3><p>To start let's do the simplest experiment. How many packets will be delivered for a naive send and receive?</p><p>The sender pseudo code:</p>
            <pre><code>fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 65400)) # select source port to reduce nondeterminism
fd.connect(("192.168.254.1", 4321))
while True:
    fd.sendmmsg(["\x00" * 32] * 1024)</code></pre>
            <p>While we could have used the usual <code>send</code> syscall, it wouldn't be efficient. Context switches to the kernel have a cost and it is better to avoid them. Fortunately, a handy syscall was recently added to Linux: <a href="http://man7.org/linux/man-pages/man2/sendmmsg.2.html"><code>sendmmsg</code></a>. It allows us to send many packets in one go. Let's do 1,024 packets at once.</p><p>The receiver pseudo code:</p>
            <pre><code>fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 4321))
while True:
    packets = [None] * 1024
    fd.recvmmsg(packets, MSG_WAITFORONE)</code></pre>
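<p>If you want to play with batched sends without writing C, the real <code>sendmmsg</code> syscall can be reached from Python through <code>ctypes</code>. The sketch below is Linux-only and assumes the glibc struct layouts shown; it's an illustration of the batching idea, not the actual <code>udpsender</code> implementation:</p>

```python
import ctypes
import socket

libc = ctypes.CDLL("libc.so.6", use_errno=True)

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p),
                ("iov_len", ctypes.c_size_t)]

class msghdr(ctypes.Structure):
    _fields_ = [("msg_name", ctypes.c_void_p),
                ("msg_namelen", ctypes.c_uint),
                ("msg_iov", ctypes.POINTER(iovec)),
                ("msg_iovlen", ctypes.c_size_t),
                ("msg_control", ctypes.c_void_p),
                ("msg_controllen", ctypes.c_size_t),
                ("msg_flags", ctypes.c_int)]

class mmsghdr(ctypes.Structure):
    _fields_ = [("msg_hdr", msghdr),
                ("msg_len", ctypes.c_uint)]

def sendmmsg(sock, payloads):
    """Send all payloads on a connected UDP socket with one syscall."""
    n = len(payloads)
    bufs = [ctypes.create_string_buffer(p, len(p)) for p in payloads]
    iovs = (iovec * n)()
    msgs = (mmsghdr * n)()
    for i in range(n):
        iovs[i].iov_base = ctypes.cast(bufs[i], ctypes.c_void_p)
        iovs[i].iov_len = len(payloads[i])
        msgs[i].msg_hdr.msg_iov = ctypes.pointer(iovs[i])
        msgs[i].msg_hdr.msg_iovlen = 1
    sent = libc.sendmmsg(sock.fileno(), msgs, n, 0)
    if sent < 0:
        raise OSError(ctypes.get_errno(), "sendmmsg failed")
    return sent

# Demo over loopback: 8 x 32-byte datagrams in a single syscall.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())
sent = sendmmsg(tx, [b"\x00" * 32] * 8)
print("datagrams sent:", sent)
```

<p>The batching is exactly why the pseudo code sends 1,024 packets per call: the cost of one context switch is amortized over the whole batch.</p>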
            <p>Similarly, <a href="http://man7.org/linux/man-pages/man2/recvmmsg.2.html"><code>recvmmsg</code></a> is a more efficient version of the common <code>recv</code> syscall.</p><p>Let's try it out:</p>
            <pre><code>sender$ ./udpsender 192.168.254.1:4321
receiver$ ./udpreceiver1 0.0.0.0:4321
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb</code></pre>
            <p>With the naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. Pinning the processes to CPUs will help:</p>
            <pre><code>sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb</code></pre>
            <p>Now, the kernel scheduler keeps the processes on the defined CPUs. This improves processor cache locality and makes the numbers more consistent, just what we wanted.</p><h3>2. Send more packets</h3><p>While 370k pps is not bad for a naive program, it's still quite far from the goal of 1Mpps. To receive more, first we must send more packets. How about sending independently from two threads:</p>
            <pre><code>sender$ taskset -c 1,2 ./udpsender \
            192.168.254.1:4321 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.349M pps  10.651MiB /  89.343Mb
  0.354M pps  10.815MiB /  90.724Mb
  0.354M pps  10.806MiB /  90.646Mb
  0.354M pps  10.811MiB /  90.690Mb</code></pre>
            <p>The numbers on the receiving side didn't increase. <code>ethtool -S</code> will reveal where the packets actually went:</p>
            <pre><code>receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx_nodesc_drop_cnt:    451.3k/s
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:     0.5/s
     rx-4.rx_packets:  355.2k/s
     rx-5.rx_packets:     0.0/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s</code></pre>
            <p>Through these stats, the NIC reports that it had successfully delivered around 350kpps to RX queue #4. The <code>rx_nodesc_drop_cnt</code> is a Solarflare-specific counter saying the NIC failed to deliver 450kpps to the kernel.</p><p>Sometimes it's not obvious why the packets weren't delivered. In our case though, it's very clear: RX queue #4 delivers packets to CPU #4. And CPU #4 can't do any more work - it's totally busy just reading the 350kpps. Here's how that looks in <code>htop</code>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5I5jyc0W0N38xUaZAvFH1C/d4801ac32042da1179e32efca1dd6de9/htop-onecpu.png" />
            
            </figure>
    <div>
      <h3>Crash course to multi-queue NICs</h3>
      <a href="#crash-course-to-multi-queue-nics">
        
      </a>
    </div>
    <p>Historically, network cards had a single RX queue that was used to pass packets between hardware and kernel. This design had an obvious limitation - it was impossible to deliver more packets than a single CPU could handle.</p><p>To utilize multicore systems, NICs began to support multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, therefore, by delivering packets to all the RX queues a NIC can utilize all CPUs. But it raises a question: given a packet, how does the NIC decide to which RX queue to push it?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6FixwaNWSVfaIRCp02g76q/8ddf9aad161b4f23572f52b9534f4654/multiqueue.png" />
            
            </figure><p><a href="https://www.cloudflare.com/learning/dns/glossary/round-robin-dns/">Round-robin balancing</a> is not acceptable, as it might introduce reordering of packets within a single connection. An alternative is to use a hash computed from the packet to decide the RX queue number. The hash is usually computed from the tuple (src IP, dst IP, src port, dst port). This guarantees that packets for a single flow will always end up on exactly the same RX queue, and reordering of packets within a single flow can't happen.</p><p>In our case, the hash could have been used like this:</p>
            <pre><code>RX_queue_number = hash('192.168.254.30', '192.168.254.1', 65400, 4321) % number_of_queues</code></pre>
            
    <div>
      <h3>Multi-queue hashing algorithms</h3>
      <a href="#multi-queue-hashing-algorithms">
        
      </a>
    </div>
    <p>The hash algorithm is configurable with <code>ethtool</code>. On our setup it is:</p>
            <pre><code>receiver$ ethtool -n eth2 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA</code></pre>
            <p>This reads as: for IPv4 UDP packets, the NIC will hash (src IP, dst IP) addresses. i.e.:</p>
            <pre><code>RX_queue_number = hash('192.168.254.30', '192.168.254.1') % number_of_queues</code></pre>
            <p>This is pretty limited, as it ignores the port numbers. Many NICs allow customization of the hash. Again, using <code>ethtool</code> we can select the tuple (src IP, dst IP, src port, dst port) for hashing:</p>
            <pre><code>receiver$ ethtool -N eth2 rx-flow-hash udp4 sdfn
Cannot change RX network flow hashing options: Operation not supported</code></pre>
            <p>Unfortunately our NIC doesn't support it - we are constrained to (src IP, dst IP) hashing.</p>
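<p>The effect of two-tuple hashing can be seen with a toy model. The MD5-based <code>rx_queue</code> function below is a made-up stand-in (real NICs typically use a Toeplitz hash with a configurable key), but the flow-to-queue property is the same:</p>

```python
import hashlib

NUM_QUEUES = 11  # RX queues configured on the card in this post

def rx_queue(src_ip, dst_ip):
    # Hypothetical stand-in for the NIC's (src IP, dst IP) hash.
    digest = hashlib.md5(f"{src_ip}|{dst_ip}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_QUEUES

# Ports play no part: every flow between these two hosts lands on the
# same RX queue, no matter how many sender threads or ports are used.
print(rx_queue("192.168.254.30", "192.168.254.1"))

# Only a different destination IP can select a different queue, which
# is exactly the workaround used later in this post.
print(rx_queue("192.168.254.30", "192.168.254.2"))
```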
    <div>
      <h3>A note on NUMA performance</h3>
      <a href="#a-note-on-numa-performance">
        
      </a>
    </div>
    <p>So far all our packets flow to only one RX queue and hit only one CPU. Let's use this as an opportunity to benchmark the performance of different CPUs. In our setup the receiver host has two separate processor banks, each a different <a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA node</a>.</p><p>We can pin the single-threaded receiver to one of four interesting CPUs in our setup. The four options are:</p><ol><li><p>Run the receiver on another CPU, but on the same NUMA node as the RX queue. The performance as we saw above is around 360kpps.</p></li><li><p>With the receiver on exactly the same CPU as the RX queue we can get up to ~430kpps. But it creates high variability. The performance drops down to zero if the NIC is overwhelmed with packets.</p></li><li><p>When the receiver runs on the HT counterpart of the CPU handling the RX queue, the performance is half the usual number at around 200kpps.</p></li><li><p>With the receiver on a CPU on a different NUMA node than the RX queue we get ~330k pps. The numbers aren't too consistent though.</p></li></ol><p>While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. On some tests I was able to squeeze out only 250kpps per core. On all the cross-NUMA tests the variability was bad. The performance penalty across NUMA nodes is even more visible at higher throughput. In one of the tests I got a 4x penalty when running the receiver on a bad NUMA node.</p><h3>3. Multiple receive IPs</h3><p>Since the hashing algorithm on our NIC is pretty limited, the only way to distribute the packets across RX queues is to use many IP addresses. Here's how to send packets to different destination IPs:</p>
            <pre><code>sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321</code></pre>
            <p><code>ethtool</code> confirms the packets go to distinct RX queues:</p>
            <pre><code>receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:  355.2k/s
     rx-4.rx_packets:     0.5/s
     rx-5.rx_packets:  297.0k/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s</code></pre>
            <p>The receiving part:</p>
            <pre><code>receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.609M pps  18.599MiB / 156.019Mb
  0.657M pps  20.039MiB / 168.102Mb
  0.649M pps  19.803MiB / 166.120Mb</code></pre>
            <p>Hurray! With two cores busy handling RX queues, and a third running the application, it's possible to get ~650k pps!</p><p>We can increase this number further by sending traffic to three or four RX queues, but soon the application will hit another limit. This time the <code>rx_nodesc_drop_cnt</code> is not growing, but the <code>netstat</code> "receiver errors" are:</p>
            <pre><code>receiver$ watch 'netstat -s --udp'
Udp:
      437.0k/s packets received
        0.0/s packets to unknown port received.
      386.9k/s packet receive errors
        0.0/s packets sent
    RcvbufErrors:  123.8k/s
    SndbufErrors: 0
    InCsumErrors: 0</code></pre>
            <p>This means that while the NIC is able to deliver the packets to the kernel, the kernel is not able to deliver the packets to the application. In our case it is able to deliver only 440kpps; the remaining 390kpps + 123kpps are dropped because the application is not receiving them fast enough.</p><h3>4. Receive from many threads</h3><p>We need to scale out the receiver application. The naive approach, to receive from many threads, won't work well:</p>
            <pre><code>sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
receiver$ taskset -c 1,2 ./udpreceiver1 0.0.0.0:4321 2
  0.495M pps  15.108MiB / 126.733Mb
  0.480M pps  14.636MiB / 122.775Mb
  0.461M pps  14.071MiB / 118.038Mb
  0.486M pps  14.820MiB / 124.322Mb</code></pre>
            <p>The receiving performance is down compared to a single-threaded program. That's caused by lock contention on the UDP receive buffer. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for a lock around the UDP receive buffer. <a href="http://www.jcc2014.ucm.cl/jornadas/WORKSHOP/WSDP%202014/WSDP-4.pdf">This paper</a> describes the problem in more detail.</p><p>Using many threads to receive from a single descriptor is not optimal.</p><h3>5. SO_REUSEPORT</h3><p>Fortunately, there is a workaround recently added to Linux: <a href="https://lwn.net/Articles/542629/">the SO_REUSEPORT flag</a>. When this flag is set on a socket descriptor, Linux will allow many processes to bind to the same port. In fact, any number of processes will be allowed to bind and the load will be spread across them.</p><p>With <code>SO_REUSEPORT</code> each of the processes will have a separate socket descriptor. Therefore each will own a dedicated UDP receive buffer. This avoids the contention issues previously encountered:</p>
            <pre><code>receiver$ taskset -c 1,2,3,4 ./udpreceiver1 0.0.0.0:4321 4 1
  1.114M pps  34.007MiB / 285.271Mb
  1.147M pps  34.990MiB / 293.518Mb
  1.126M pps  34.374MiB / 288.354Mb</code></pre>
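<p>The mechanics are easy to try on loopback. This standalone Python sketch (not the <code>udpreceiver1</code> program) binds two UDP sockets to the same port with <code>SO_REUSEPORT</code> and lets the kernel split incoming flows between them; the exact split varies between runs because it depends on the flow hash:</p>

```python
import socket

def reuseport_socket(addr):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(addr)
    return s

# Two UDP sockets bound to the very same address and port.
a = reuseport_socket(("127.0.0.1", 0))
port = a.getsockname()[1]
b = reuseport_socket(("127.0.0.1", port))

# Each sender socket gets its own ephemeral source port, i.e. each is
# a distinct flow; the kernel hashes the flow to pick which member of
# the SO_REUSEPORT group receives it.
for _ in range(32):
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(b"x", ("127.0.0.1", port))
    tx.close()

counts = [0, 0]
for i, s in enumerate((a, b)):
    s.settimeout(0.2)
    try:
        while True:
            s.recv(16)
            counts[i] += 1
    except socket.timeout:
        pass
print("packets per socket:", counts)
```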
            <p>This is more like it! The throughput is decent now!</p><p>More investigation will reveal further room for improvement. Even though we started four receiving threads, the load is not being spread evenly across them:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7oGZaWKk0wzDkBTilJrIbb/1b4b3a3bb3c8755fd132b0201771eeda/port-uneven.png" />
            
            </figure><p>Two threads received all the work and the other two got no packets at all. This is caused by a hashing collision, but this time it is at the <code>SO_REUSEPORT</code> layer.</p>
    <div>
      <h3>Final words</h3>
      <a href="#final-words">
        
      </a>
    </div>
    <p>I've done some further tests, and with perfectly aligned RX queues and receiver threads on a single NUMA node it was possible to get 1.4Mpps. Running the receiver on a different NUMA node caused the numbers to drop, achieving at best 1Mpps.</p><p>To sum up, if you want perfect performance you need to:</p><ul><li><p>Ensure traffic is distributed evenly across many RX queues and <code>SO_REUSEPORT</code> processes. In practice, the load is usually well distributed as long as there are a large number of connections (or flows).</p></li><li><p>Have enough spare CPU capacity to actually pick up the packets from the kernel.</p></li><li><p>To make things harder, keep both RX queues and receiver processes on a single NUMA node.</p></li></ul><p>While we have shown that it is technically possible to receive 1Mpps on a Linux machine, the application was not doing any actual processing of received packets - it didn't even look at the content of the traffic. Don't expect performance like that for any practical application without a lot more work.</p><p><i>Interested in this sort of low-level, high-performance packet wrangling? CloudFlare is </i><a href="https://www.cloudflare.com/join-our-team"><i>hiring</i></a><i> in London, San Francisco and Singapore</i>.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">7EypQJlfF0hL9PpqpHWwLu</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Courage to change things]]></title>
            <link>https://blog.cloudflare.com/courage-to-change-things/</link>
            <pubDate>Fri, 11 Jul 2014 13:00:00 GMT</pubDate>
            <description><![CDATA[ This was an internal email that I sent to the CloudFlare team about how we are not afraid to throw away old code. We thought it was worth sharing with a wider audience. ]]></description>
            <content:encoded><![CDATA[ <p>This was an internal email that I sent to the CloudFlare team about how we are not afraid to throw away old code. We thought it was worth sharing with a wider audience.</p><p>Date: Thu, 10 Jul 2014 10:24:21 +0100
Subject: Courage to change things
From: John Graham-Cumming
To: Everyone</p><p>Folks,</p><p>At the Q3 planning meeting I started by making some remarks about how much
code we are changing at CloudFlare. I understand that there were audio
problems and people may not have heard these clearly, so I'm just going to
reiterate them in writing.</p><p>One of the things that CloudFlare is being brave about is looking at old code
and deciding to rewrite it. Lots of companies live with legacy code and build
on it and it eventually becomes a maintenance nightmare and slows the company
down.</p><p>Over the last year we've made major strides in rewriting parts of our code
base so that they are faster, more maintainable, and easier to enhance. There
are many parts of the Q3 roadmap that include replacing old parts of our
stack. This is incredibly important as it enables us to be more agile and
more stable in future.</p><p>We should feel good about this.</p><p>We're not just rewriting so that engineers have fun with new stuff, we're
making things better across the board. Many, many companies don't have the
courage to rewrite; even fewer have the courage to do things we've done like
write a brand new DNS server or replace our core request handling
functionality.</p><p>We should also not feel bad about this either: it's not a sign that we did
things wrong in the first place. We operate in a very fast moving environment.
That's the nature of a start-up. As we grow our requirements change (we have
more sites on us, more traffic, more variety and more ideas) and so our code
needs to.</p><p>So don't be surprised by items on the roadmap that talk about replacing code.
Like a well maintained aircraft we take CloudFlare's code in for regular
maintenance and change what's worn.</p><p>John.</p><p>Want to help write new code and replace the old? We're <a href="https://www.cloudflare.com/careers/">hiring</a> in San Francisco and London.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Life at Cloudflare]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4jfIbYJcikqjYR3mK7Mx43</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Technical Details Behind a 400Gbps NTP Amplification DDoS Attack]]></title>
            <link>https://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack/</link>
            <pubDate>Thu, 13 Feb 2014 01:00:00 GMT</pubDate>
            <description><![CDATA[ On Monday we mitigated a large DDoS that targeted one of our customers. The attack peaked just shy of 400Gbps. We've seen a handful of other attacks at this scale, but this is the largest attack we've seen that uses NTP amplification. ]]></description>
            <content:encoded><![CDATA[ <p>On Monday we mitigated a large DDoS that targeted one of our customers. The attack peaked just shy of 400Gbps. We've seen a handful of other attacks at this scale, but this is the largest attack we've seen that uses NTP amplification. This style of attack has grown dramatically over the last six months and poses a significant new threat to the web. Monday's attack serves as a good case study to examine how these attacks work.</p>
    <div>
      <h3>NTP Amplification 101</h3>
      <a href="#ntp-amplification-101">
        
      </a>
    </div>
    <p>Before diving into the particular details of this attack, it's important to understand the basic mechanics of how NTP amplification attacks work. This is a quick overview of how these attacks occur. John Graham-Cumming on our team previously wrote a <a href="/understanding-and-mitigating-ntp-based-ddos-attacks">detailed primer on NTP amplification attacks</a> if you're interested in further technical details. If you're interested in amplification attacks, you may also find <a href="/deep-inside-a-dns-amplification-ddos-attack">our posts about DNS Amplification attacks</a> interesting. These attacks use a similar method but target open DNS resolvers rather than NTP servers.</p><p>An NTP amplification attack begins with a server controlled by an attacker on a network that allows source IP address spoofing (e.g., it does not follow <a href="http://tools.ietf.org/html/bcp38">BCP38</a>). The attacker generates a large number of UDP packets spoofing the source IP address to make it appear the packets are coming from the intended target. These UDP packets are sent to Network Time Protocol servers (port 123) that support the MONLIST command.</p><p>I'd personally be curious to talk with whoever added MONLIST as a command to NTP servers. The command seems of such little practical use -- it returns a list of up to 600 IP addresses that most recently accessed the NTP server -- and yet it can do so much harm. If an NTP server has its list fully populated, the response to a MONLIST request will be 206 times larger than the request. In the attack, since the source IP address is spoofed and UDP does not require a handshake, the amplified response is sent to the intended target. An attacker with a 1Gbps connection can theoretically generate more than 200Gbps of DDoS traffic.</p>
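<p>The arithmetic behind that last claim is simple to check:</p>

```python
# Back-of-the-envelope numbers from the paragraph above.
amplification = 206          # MONLIST response size vs. request size
attacker_bps = 1 * 10**9     # a single machine on a 1Gbps uplink

reflected_bps = attacker_bps * amplification
print(reflected_bps / 10**9)  # 206.0 (Gbps), i.e. "more than 200Gbps"
```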
    <div>
      <h3>Not Just Theoretical</h3>
      <a href="#not-just-theoretical">
        
      </a>
    </div>
    <p>Monday's DDoS proved these attacks aren't just theoretical. To generate approximately 400Gbps of traffic, the attacker used 4,529 NTP servers running on 1,298 different networks. On average, each of these servers sent 87Mbps of traffic to the intended victim on CloudFlare's network. Remarkably, it is possible that the attacker used only a single server running on a network that allowed source IP address spoofing to initiate the requests.</p><p>While NTP servers that support MONLIST are less common than open DNS resolvers, they tend to run on beefier servers with fatter connections to the network. Combined with the high amplification factor, this allows a much smaller number of NTP servers to generate very large attacks. For comparison, the attack that targeted Spamhaus used 30,956 open DNS resolvers to <a href="/the-ddos-that-almost-broke-the-internet">generate a 300Gbps DDoS</a>. On Monday, with 1/7th the number of vulnerable servers, the attacker was able to generate an attack that was 33% larger than the Spamhaus attack.</p>
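<p>Those figures are easy to sanity-check:</p>

```python
# Sanity-checking the attack figures quoted above.
servers = 4529               # NTP servers used in Monday's attack
avg_mbps_per_server = 87     # average traffic per server

total_gbps = servers * avg_mbps_per_server / 1000
print(round(total_gbps))     # 394, i.e. "just shy of 400Gbps"

# Comparison with the 300Gbps Spamhaus attack:
spamhaus_resolvers = 30956
print(round(spamhaus_resolvers / servers))   # 7, about 1/7th the servers
print(round((400 - 300) / 300 * 100))        # 33 percent larger
```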
    <div>
      <h3>Globally Distributed Threat</h3>
      <a href="#globally-distributed-threat">
        
      </a>
    </div>
    <p>We saw attack traffic hitting every one of CloudFlare's data centers. While we were generally able to mitigate the attack, it was large enough that it caused network congestion in parts of Europe. The map above shows the global distribution of the 4,529 NTP servers used in the attack. The chart below lists the AS Numbers and names of the top 24 networks we saw traffic from in the attack, as well as the number of exploited NTP servers running on each.</p><pre><code>ASN   Network                                                          Count
9808  CMNET-GD Guangdong Mobile Communication Co.Ltd.                    136
4134  CHINANET-BACKBONE No.31,Jin-rong Street                            116
16276 OVH OVH Systems                                                    114
4837  CHINA169-BACKBONE CNCGROUP China169 Backbone                        81
3320  DTAG Deutsche Telekom AG                                            69
39116 TELEHOUSE Telehouse Inter. Corp. of Europe Ltd                      61
10796 SCRR-10796 - Time Warner Cable Internet LLC                         53
6830  LGI-UPC Liberty Global Operations B.V.                              48
6663  TTI-NET Euroweb Romania SA                                          46
9198  KAZTELECOM-AS JSC Kazakhtelecom                                     45
2497  IIJ Internet Initiative Japan Inc.                                  39
3269  ASN-IBSNAZ Telecom Italia S.p.a.                                    39
9371  SAKURA-C SAKURA Internet Inc.                                       39
12322 PROXAD Free SAS                                                     37
20057 AT&amp;T Wireless Service                                               37
30811 EPiServer AB                                                        36
137   ASGARR GARR Italian academic and research network                   34
209   ASN-QWEST-US NOVARTIS-DMZ-US                                        33
6315  XMISSION - XMission, L.C.                                           33
52967 NT Brasil Tecnologia Ltda. ME                                       32
4713  OCN NTT Communications Corporation                                  31
56041 CMNET-ZHEJIANG-AP China Mobile communications corporation           31
1659  ERX-TANET-ASN1 Tiawan Academic Network (TANet) Information Center   30
4538  ERX-CERNET-BKB China Education and Research Network Center          30</code></pre><p>At this time, we've decided not to publish the full list of the IP addresses of the NTP servers involved in the attack out of concern that it could give even more attackers access to a powerful weapon. However, we have published a <a href="https://docs.google.com/spreadsheet/ccc?key=0AhuvvqAkGlindHFtS0pJa0lYZGNlLXNONWtlY01qanc&amp;usp=sharing">spreadsheet with the complete list of the networks with NTP servers that participated in the attack</a>. While the per-server amplification makes these attacks troubling, the smaller number of servers and networks involved gives us some hope that we can make a dent in getting them cleaned up. We are reaching out to network operators whose resources were used in the attack to encourage them to restrict access to their NTP servers and disable the MONLIST command.</p><p>Somewhat ironically, the large French hosting provider OVH was one of the largest sources of our attack and also a victim of a large-scale NTP amplification attack around the same time. The company's founder Tweeted:</p><blockquote><p>We see today lot of new DDoS attacks from Internet to our network. Type: NTP AMP Size: &gt;350Gbps. No issue. VAC is great :) — Oles (@olesovhcom) <a href="https://twitter.com/olesovhcom/statuses/433631778620702721">February 12, 2014</a></p></blockquote>
    <div>
      <h3>Time to Clean Up the Problem</h3>
      <a href="#time-to-clean-up-the-problem">
        
      </a>
    </div>
    <p>If you're a network administrator and on Monday you saw network graphs like the one in the Tweet below, then you are running a vulnerable NTP server.</p><blockquote><p>and here's what it looks like when a device participates in the NTP DDOS against <a href="https://twitter.com/CloudFlare">@CloudFlare</a> <a href="http://t.co/QcrPGxbcUz">pic.twitter.com/QcrPGxbcUz</a></p><p>— Eric C (@ctrl_alt_esc) <a href="https://twitter.com/ctrl_alt_esc/statuses/433629994351214592">February 12, 2014</a></p></blockquote><p>You can check whether there are open NTP servers that support the MONLIST command running on your network by visiting the <a href="http://openntpproject.org/">Open NTP Project</a>. Even if you don't think you're running an NTP server, you should check your network because you may be running one inadvertently. For example, some firmware on Supermicro's IPMI controllers <a href="http://blog.gmane.org/gmane.network.ntp.pool/month=20140201">shipped with a MONLIST-enabled NTP server turned on by default</a>. More details on NTP attacks and instructions on how to disable the MONLIST command can be found on the <a href="https://isc.sans.edu/diary/NTP+reflection+attack/17300">Internet Storm Center's NTP attack advisory</a>.</p><p>NTP and all other UDP-based amplification attacks rely on source IP address spoofing. If attackers weren't able to spoof the source IP address then they would only be able to DDoS themselves. If you're running a network then you should ensure that you are following BCP38 and preventing packets with spoofed source addresses from leaving your network. You can test whether your network currently follows BCP38 using tools from MIT's <a href="http://spoofer.cmand.org/summary.php">Spoofer Project</a>.
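</p><p>For the curious, the MONLIST probe itself is tiny: a single UDP datagram in NTP mode 7 (the "private" mode), implementation 3 (XNTPD), request code 42 (MON_GETLIST_1). The Python sketch below builds such a probe for auditing servers you administer; the helper names are ours, and the zero-padded layout follows common scanner practice rather than any formal API:</p>

```python
import socket
import struct

def build_monlist_probe() -> bytes:
    # Byte 0: response=0, more=0, version=2, mode=7 (NTP "private" mode)
    first_byte = (2 << 3) | 7                     # 0x17
    # Bytes 1-3: sequence 0, implementation 3 (XNTPD), request code 42 (MON_GETLIST_1)
    header = struct.pack('!BBBB', first_byte, 0, 3, 42)
    # Remaining request fields (err/nitems/mbz/size) zeroed, as common scanners do
    return header + b'\x00' * 4

def answers_monlist(host: str, timeout: float = 2.0) -> bool:
    """Send the probe to host:123; any reply means MONLIST is exposed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_monlist_probe(), (host, 123))
        try:
            sock.recvfrom(4096)
            return True
        except socket.timeout:
            return False
```

<p>Any reply at all to this eight-byte probe indicates the server will answer MONLIST queries, and can therefore be abused as an amplifier.</p><p>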
If you're running a naughty network that allows source IP address spoofing, you can easily implement BCP38 by following the <a href="http://www.bcp38.info/index.php/Main_Page">instructions listed at BCP38.info</a>.</p><p>Finally, if you think NTP is bad, just wait for what's next. SNMP has a theoretical 650x amplification factor. We've already seen evidence that attackers have begun to experiment with using it as a DDoS vector. Buckle up.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2NhGou3aWU1B09ebcKWB5s</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Protect Your Sites With Rapidly Deployed WAF Rules]]></title>
            <link>https://blog.cloudflare.com/protect-your-sites-with-rapidly-deployed-waf-rules/</link>
            <pubDate>Tue, 21 Jan 2014 16:00:00 GMT</pubDate>
            <description><![CDATA[ An attack on your site could be catastrophic. Even a small attack can have major implications. Responding quickly to an attack is imperative.

 ]]></description>
            <content:encoded><![CDATA[ <p><i>This blog post originally appeared as a </i><a href="http://www.rackspace.com/blog/protect-your-sites-with-rapidly-deployed-waf-rules/"><i>guest post</i></a><i> on the </i><a href="http://www.rackspace.com/blog/"><i>Rackspace blog</i></a></p><p>An attack on your site could be catastrophic. Even a small attack can have major implications. Responding quickly to an attack is imperative.</p><p>In August 2013, we at CloudFlare rolled out a new global <a href="/cloudflares-new-waf-compiling-to-lua">Web Application Firewall</a> (WAF) that runs common sets of firewall rules such as the open source <a href="https://www.owasp.org/index.php/Category:OWASP_ModSecurity_Core_Rule_Set_Project">OWASP</a> rules that protect against SQL injection, cross-site scripting and more. But the WAF also has a custom rule language that CloudFlare’s support engineers use to write one-of-a-kind rules to protect specific customers or specific software.</p><p>Since the deployment of the new WAF we’ve rolled out many global and customer-specific rules to protect websites from attacks. This blog post looks at a few of them and the process of getting a custom rule deployed.</p>
    <div>
      <h3>Custom Rules For Customers</h3>
      <a href="#custom-rules-for-customers">
        
      </a>
    </div>
    <p>Commonly, a customer will complain of an attack that’s hitting their web site and will provide us with their web server logs showing the attack (or give us permission to examine the traffic going to their web site). The logs will often show a common pattern. For example, here’s an anonymized attack against a customer web site showing up in their logs:</p><p>12/30 01:42:32 23.75.345.200 example.com /index.php?2346354=-349087 WordPress/3.7.2
12/30 01:42:31 23.75.345.200 example.com /index.php?7231344=4454226 WordPress/3.3.1
12/30 01:42:25 23.75.345.200 example.com /index.php?1243847=9161112 WordPress/3.7.2
12/30 01:42:23 23.75.345.200 example.com /index.php?8809549=4423410 WordPress/3.3.1
12/30 01:42:21 23.75.345.200 example.com /index.php?1834306=3447145 WordPress/3.5.1
12/30 01:42:16 23.75.345.200 example.com /index.php?-234069=6121852 WordPress/3.3.3
12/30 01:42:16 23.75.345.200 example.com /index.php?-152536=6922268 WordPress/3.3.1
12/30 01:42:14 23.75.345.200 example.com /index.php?3433701=7147876 WordPress/3.4.2
12/30 01:42:14 23.75.345.200 example.com /index.php?6732828=-106444 WordPress/3.2.2</p><p>The attacker was using a botnet to hit the script /index.php on example.com. To make each request different they are appending a query string in the form number=number where each number is randomly chosen. The User-Agent in the requests is spoofed to look like WordPress.</p>
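<p>Before writing a rule, it helps to sanity-check the suspected pattern against the raw logs. A hypothetical Python check mirroring the pattern above (the function name and regular expressions are ours, not the WAF rule language):</p>

```python
import re

# Requests like GET /index.php?2346354=-349087 from a User-Agent
# claiming to be WordPress; either number may be negative.
PATH_RE = re.compile(r'^/index\.php\?-?\d+=-?\d+$')
UA_RE = re.compile(r'^WordPress/')

def looks_like_numbers_botnet(path: str, user_agent: str) -> bool:
    """Return True if a request matches the random number=number pattern."""
    return bool(PATH_RE.match(path)) and bool(UA_RE.match(user_agent))
```

<p>Running every suspect log line through such a check confirms the pattern is tight enough to block without catching legitimate traffic.</p>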
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aspwt221CBPj5b2BZCkK2/2425ccd8658bb2fe8c5e856a85ab2ddb/cloudflare.1.png" />
            
            </figure><p>Although this attack doesn’t cause CloudFlare a problem it can cause the customer’s web server to become overloaded. The customer contacted us and we put in place a custom rule for them that blocks the attack. The rule looks like this:</p><p>rule 12345678 WordPress Numbers Botnet
REQUEST_HEADERS:User-Agent matches ^WordPress\/ and
REQUEST_METHOD is GET and
REQUEST_URI matches /index.php\?-?\d+=-?\d+
deny</p><p>The support engineer wrote the rule (and an associated set of tests to make sure that it works correctly) and then pushed the rule to our git repository. From there it gets deployed to CloudFlare’s servers around the world in under 30 seconds.</p><p>The customer sees the rule appear in the Manage WAF page in CloudFlare Settings. They can turn the rule on and off and see how many times it has activated (i.e. blocked a request).</p><p>Another customer saw an attack that claimed to come from “Anonymous.” The attack hit the same page on their web server over and over again with a message from Anonymous (in French). A typical request looked like this:</p><p>example.com/?msg=Nous%20sommes%20Anonymous%20Nous%20sommes%20L%C3%A9g
ion%20Nous%20ne%20pardonnons%20pas%20Nous%20n%E2%80%99oublions%20pas</p><p>The rule for this one was simple:</p><p>rule 12345679 Anonymous attack
REQUEST_METHOD is GET and
REQUEST_URI begins /?msg=Nous%20sommes%20Anonymous
deny</p><p>Another simple attack that was very troublesome for the customer came in the form of a POST request to a URL that didn’t exist on the customer’s site. Once again, the request rate wasn’t troublesome for CloudFlare, but was for the customer, so we put in place a rule that would block all POSTs to that URL. This took the load off the customer web site: CloudFlare took the load for them.</p><p>The attack used a botnet to send the request POST /q to the customer’s web site. The rule to block that was very simple.</p><p>rule 1234567A Simple POST botnet
REQUEST_METHOD is POST and
REQUEST_URI is /q
deny</p><p>As simple as that looks, it took the load off the customer’s web server and in the first 12 hours after deployment was activated over 27 million times!</p>
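<p>As an aside, the /?msg= payload in the Anonymous attack above is just percent-encoded UTF-8. It decodes to "Nous sommes Anonymous Nous sommes Légion Nous ne pardonnons pas Nous n’oublions pas" ("We are Anonymous. We are Legion. We do not forgive. We do not forget."), and Python's standard library decodes it directly:</p>

```python
from urllib.parse import unquote

# The query string exactly as it appeared in the attack traffic
encoded = ('Nous%20sommes%20Anonymous%20Nous%20sommes%20L%C3%A9g'
           'ion%20Nous%20ne%20pardonnons%20pas%20Nous%20n%E2%80%99oublions%20pas')

# %20 is a space, %C3%A9 is é, %E2%80%99 is a right single quote
decoded = unquote(encoded)
```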
    <div>
      <h3>Custom Global Rules</h3>
      <a href="#custom-global-rules">
        
      </a>
    </div>
    <p>Other attacks are likely to be much more widespread and not focused on a single customer. These attacks are often directed at specific technologies such as WordPress, Joomla or other software.</p><p>Back in October a <a href="http://blog.whmcs.com/?t=79427">zero-day exploit for WHMCS</a> (the popular web hosting management software) was released. CloudFlare patched that by <a href="/patching-a-whmcs-zero-day-on-day-zero">deploying</a> a complex WAF rule on the day the zero-day became public. The WHMCS maintainers also patched the problem.</p><p>Shortly after that, two new WHMCS vulnerabilities were found and we patched them as well by deploying or modifying WHMCS WAF rules.</p><p>We’ve also added custom rules to protect Plone-based web sites, a PHP 5.x <a href="http://www.exploit-db.com/exploits/29290/">remote code execution</a> exploit, and a number of other rules to protect against common botnets and other attack software.</p><p>All those are on top of the generic protection offered by the 2,372 other rules that CloudFlare’s WAF runs to check every request.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The flexibility of the CloudFlare WAF language and the ability to deploy WAF rules in seconds means we can respond to attacks very rapidly. Visit our learning center to learn more about <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewalls</a>.</p> ]]></content:encoded>
            <category><![CDATA[WAF Rules]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">3xJD4CNCEk8BdEkSDblBx6</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Efficiently compressing dynamically generated web content]]></title>
            <link>https://blog.cloudflare.com/efficiently-compressing-dynamically-generated-53805/</link>
            <pubDate>Thu, 06 Dec 2012 09:19:00 GMT</pubDate>
            <description><![CDATA[ With the widespread adoption of high bandwidth Internet connections in the home, offices and on mobile devices, limitations in available bandwidth to download web pages have largely been eliminated. ]]></description>
            <content:encoded><![CDATA[ <p><i>I originally wrote this article for the </i><a href="http://calendar.perfplanet.com/2012/efficiently-compressing-dynamically-generated-web-content/"><i>Web Performance Calendar website</i></a><i>, which is a terrific resource of expert opinions on making your website as fast as possible. We thought CloudFlare users would be interested so we reproduced it here. Enjoy!</i></p>
    <div>
      <h3>Efficiently compressing dynamically generated web content</h3>
      <a href="#efficiently-compressing-dynamically-generated-web-content">
        
      </a>
    </div>
    <p>With the widespread adoption of high bandwidth Internet connections in the home, offices and on mobile devices, limitations in available bandwidth to download web pages have largely been eliminated.</p><p>At the same time, latency remains a major problem. According to a recent presentation by Google, broadband Internet latency is 18ms for fiber technologies, 26ms for cable-based services, 43ms for DSL and 150ms-400ms for mobile devices. Ultimately, bandwidth can be expanded greatly with new technologies but latency is limited by the speed of light. The latency of an Internet connection directly affects the speed with which a web page can be downloaded.</p><p>The latency problem occurs because the TCP protocol requires round trips to acknowledge received information (since packets can and do get lost while traversing the Internet), and because, to prevent Internet congestion, TCP limits the amount of data sent per round trip until it has learnt how much it can send without causing congestion.</p><p>The collision between the speed of light and the TCP protocol is made worse by the fact that web site owners are likely to choose the cheapest hosting available without thinking about its physical location. In fact, the move to ‘the cloud’ encourages the idea that web sites are simply ‘out there’ without taking into account the very real problem of latency introduced by the distance between the end user's web browser and the server. It is not uncommon, for example, to see web sites aimed at UK consumers being hosted in the US. A web user in London accessing a .co.uk site that is actually hosted in Chicago incurs an additional 60ms round trip time because of the distance traversed.</p><p>Dealing with speed-of-light induced latency requires moving web content closer to the users who are browsing, or making the web content smaller so that fewer round trips are required (or both).</p>
    <div>
      <h3>The caching challenge</h3>
      <a href="#the-caching-challenge">
        
      </a>
    </div>
    <p>Caching technologies and content delivery services mean that static content (such as images, CSS, JavaScript) can be cached close to end users, helping to reduce latency when it is loaded. CloudFlare sees on average that about 65% of web content is cacheable.</p><p>But the most critical part of a web page, the actual HTML content, is often dynamically generated and cannot be cached. Because none of the relatively fast-to-load content in the cache can be loaded before the HTML, any delay in the web browser receiving the HTML affects the entire web browsing experience.</p><p>Thus being able to deliver the page HTML as quickly as possible even in high latency environments is vital to ensuring a good browsing experience. Studies have shown that the slower the page load time the more likely the user is to give up and move elsewhere. A recent Google study said that a response time of less than 100ms is perceived by a human as ‘instant’ (a human eye blink is somewhere in the 100ms to 400ms range); at less than 300ms the computer merely seems sluggish; above 1s the user's train of thought is lost to distraction or other thoughts. TCP's congestion avoidance algorithm means that many round trips are necessary when downloading a web page. For example, getting just the HTML for the CNN home page takes approximately 15 round trips; it's not hard to see how long latency can quickly multiply into a situation where the end-user is losing patience with the web site.</p><p>Unfortunately, it is not possible to cache the HTML of most web pages because it is dynamically generated. Dynamic pages are commonplace because the HTML is programmatically generated and not static. For example, a news web site will generate fresh HTML as news stories change or to show a different page depending on the geographical location of the end user. Many web pages are also dynamically generated because they are personalized for the end user — each person's Facebook page is unique. And web application frameworks, such as WordPress, encourage the use of dynamically generated HTML by default and mark the content as uncacheable.</p>
    <div>
      <h3>Compression to the rescue</h3>
      <a href="#compression-to-the-rescue">
        
      </a>
    </div>
    <p>Given that web pages need to be dynamically generated, the only viable option is to reduce the page size so that fewer TCP round trips are needed, minimizing the effect of latency. The current best option for doing this is the use of the gzip encoding. On typical web page content gzip encoding will reduce the page size to about 20-25% of the original size. But this still results in multiple kilobytes of page data being transmitted, incurring the TCP congestion avoidance and latency penalty; in the CNN example above there were 15 round-trips even though the page was gzip compressed.</p><p>Gzip encoding is completely generic. It does not take into account any special features of the content it is compressing. It is also self-referential: a gzip encoded page is entirely self-contained. This is advantageous because it means that a system that uses gzipped content can be stateless, but it means forgoing the even larger compression ratios that external dictionaries of common content would make possible.</p><p>External dictionaries increase compression ratios dramatically because the compressed data can refer to items from the dictionary. Those references can be very small (a few bytes each) but expand to very large content from the dictionary.</p><p>For example, imagine that it's necessary to transmit The King James Bible to a user. The plain text version from Project Gutenberg is 4,452,097 bytes and compressed with gzip it is 1,404,452 bytes (a reduction in size to 31%). But imagine the case where the compressor knows that the end user has a separate copy of the Old Testament and New Testament in a dictionary of useful content. Instead of transmitting a megabyte of gzip compressed content they can transmit an instruction of the form &lt;Insert Old Testament&gt;&lt;Insert New Testament&gt;. 
That instruction will just be a few bytes long.</p><p>Clearly, that's an extreme and unusual case but it highlights the usefulness of external shared dictionaries of common content that can be used to reconstruct an original, uncompressed document. External dictionaries can be applied to dynamically generated web content to achieve compression that exceeds that possible with gzip.</p>
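<p>The benefit of a shared dictionary is easy to demonstrate with zlib, whose DEFLATE implementation accepts a preset dictionary via its zdict parameter. In this sketch (the boilerplate and page contents are invented) the payload compresses noticeably smaller when the compressor can reference content both sides already hold:</p>

```python
import zlib

# Boilerplate both ends are assumed to hold in advance: the external dictionary
DICTIONARY = (b'<html><head><title>Example News</title></head>'
              b'<body><div class="nav">Home | World | Tech</div>') * 3

# A "page" that is mostly shared boilerplate plus a little fresh content
PAGE = DICTIONARY + b'<p>Breaking: something happened today.</p></body></html>'

def deflate(data: bytes, zdict: bytes = None) -> bytes:
    """Compress data, optionally with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(9)
    else:
        comp = zlib.compressobj(9, zlib.DEFLATED, 15, 8,
                                zlib.Z_DEFAULT_STRATEGY, zdict)
    return comp.compress(data) + comp.flush()

without_dict = deflate(PAGE)
with_dict = deflate(PAGE, DICTIONARY)

# The receiver must hold the same dictionary to inflate the result
inflated = zlib.decompressobj(zdict=DICTIONARY).decompress(with_dict)
```

<p>With the shared boilerplate in the preset dictionary, the long common prefix is encoded as back-references into the dictionary rather than as literal bytes, so the dictionary-aware output is smaller than plain DEFLATE on the same input.</p>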
    <div>
      <h3>Caching page parts</h3>
      <a href="#caching-page-parts">
        
      </a>
    </div>
    <p>On the web, shared dictionaries make sense because dynamic web content contains large chunks that are the same for all users and over time. Consider, for example, the BBC News homepage, which is approximately 116KB of HTML. That page is dynamically generated and the HTTP caching headers are set so that it is not cached. Even though the news stories on the page are frequently updated, a large amount of boilerplate HTML does not change from request to request (or even user to user). The first 32KB of the page (28% of the HTML) consists of embedded JavaScript, headers, navigational elements and styles. If that ‘header block’ were stored by web browsers in a local dictionary then the BBC would only need to send a small instruction saying &lt;Insert BBC Header&gt; instead of 32KB of data. That would save multiple round-trips. And throughout the BBC News page there are smaller chunks of unchanging content that could be referenced from a dictionary.</p><p>It's not hard to imagine that for any web site there are large parts of the HTML that are the same from request to request and from user to user. Even on a very personalized site like Facebook the HTML is similar from user to user.</p><p>And as more and more applications use HTTP for APIs there's an opportunity to increase API performance through the use of shared dictionaries of JSON or XML. APIs often contain even more common, repeated parts than HTML as they are intended for machine consumption and change slowly over time (whereas the HTML of a page will change more quickly as designers update the look of a page).</p><p>Two different proposals have tried to address this in different ways: SDCH and ESI. Neither has achieved acceptance as an Internet standard, partly because of the added complexity of deploying them.</p>
    <div>
      <h4>SDCH</h4>
      <a href="#sdch">
        
      </a>
    </div>
    <p>In 2008, a group working at Google proposed a protocol for negotiating shared dictionaries of content so that a web server can compress a page in the knowledge that a web browser has chunks of the page in its cache. The proposal is known as <a href="http://en.wikipedia.org/wiki/Shared_Dictionary_Compression_Over_HTTP">SDCH</a> (Shared Dictionary Compression over HTTP). Current versions of Google Chrome use SDCH to compress Google Search results.</p><p>This can be seen in the Developer Tools in Google Chrome. Any search request will contain an HTTP header specifying that the browser accepts SDCH compressed pages:</p>
            <pre><code>Accept-Encoding: gzip,deflate,sdch</code></pre>
            <p>And if SDCH is used then the server responds indicating the dictionary that was used. If necessary Chrome will retrieve the dictionary. Since the dictionary should change infrequently it will be in local web browser cache most of the time. For example, here's a sample HTTP header seen in a real response from a Google Search:</p>
            <pre><code>Get-Dictionary: /sdch/60W93cgP.dct</code></pre>
    <p>The dictionary file simply contains HTML (and JavaScript etc.) and the compressed page contains references to parts of the dictionary file using the <a href="http://en.wikipedia.org/wiki/VCDIFF">VCDIFF</a> format specified in <a href="http://tools.ietf.org/html/rfc3284">RFC 3284</a>. The compressed page consists mostly of COPY and ADD VCDIFF functions. A COPY x, y instruction tells the browser to copy y bytes of data from position x in the dictionary (this is how common content gets compressed and expanded from the dictionary). The ADD instruction is used to insert uncompressed data (i.e. those parts of the page that are not in the dictionary).</p><p>In a Google Search the dictionary is used to locally cache infrequently changing parts of a page (such as the HTML header, navigation elements and page footer).</p><p>SDCH has not achieved widespread acceptance because of the difficulty of generating the shared dictionaries. Three problems arise: when to update the dictionary, how to update it, and how to prevent leakage of private information.</p><p>For maximum effectiveness it's desirable to produce a shared dictionary that will be useful in reducing page sizes across a large number of page views. To do this it's necessary to either implement an automatic technique that samples real web traffic and identifies common blocks of HTML, or to determine which pages are most viewed and compute dictionaries for them (perhaps based on specialised knowledge of what parts of the page are common across requests).</p><p>When automated techniques are used it's important to ensure that when sampling traffic that contains personal information (such as for a logged in user) that personal information does not end up in the dictionary.</p><p>Although SDCH is powerful when used, these dictionary generation difficulties have prevented its widespread use. The Apache mod_sdch project is inactive and the Google SDCH group has been largely inactive since 2011.</p>
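<p>The COPY/ADD instruction model described above is simple enough to sketch. The toy decoder below is not a real VCDIFF codec; it only mimics the instruction model, with invented instruction tuples standing in for the binary encoding:</p>

```python
def apply_instructions(dictionary: str, instructions) -> str:
    """Rebuild a page from a dictionary plus a list of instructions.

    Each instruction is ('COPY', pos, length), copying length characters
    starting at pos in the dictionary, or ('ADD', text), inserting literal
    content that is not present in the dictionary.
    """
    out = []
    for inst in instructions:
        if inst[0] == 'COPY':
            _, pos, length = inst
            out.append(dictionary[pos:pos + length])
        elif inst[0] == 'ADD':
            out.append(inst[1])
        else:
            raise ValueError(f'unknown instruction: {inst[0]!r}')
    return ''.join(out)

# Shared boilerplate both sides hold; only the ADD payload must be transmitted
dictionary = '<header>Site</header><footer>(c) Example</footer>'
page = apply_instructions(dictionary, [
    ('COPY', 0, 21),                   # shared header from the dictionary
    ('ADD', '<p>Fresh story</p>'),     # new content sent literally
    ('COPY', 21, 28),                  # shared footer from the dictionary
])
```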
    <div>
      <h4>ESI</h4>
      <a href="#esi">
        
      </a>
    </div>
    <p>In 2001 a consortium of companies proposed addressing both latency and common content with <a href="http://en.wikipedia.org/wiki/Edge_Side_Includes">ESI</a> (Edge Side Includes). Edge Side Includes work by having a web page creator identify unchanging parts of the page and then making these available as separate mini-pages using HTTP.</p><p>For example, if a page contains a common header and navigation, a web page author might place that in a separate nav.html file and then in a page they are authoring enter the following XML in place of the header and navigation HTML:</p>
            <pre><code>&lt;esi:include src="http://example.com/nav.html" onerror="continue"/&gt;</code></pre>
            <p>ESI is intended for use with HTML content that is delivered via a Content Delivery Network, and major CDNs were the sponsors of the original proposal.</p><p>When a user retrieves a CDN managed page that contains ESI components the <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> reconstructs the complete page from the component parts (which the CDN will either have to retrieve, or, more likely, have in cache since they change infrequently).</p><p>The CDN delivers the complete, normal HTML to the end user, but because the CDN has access nodes all over the world the latency between the end user web browser and the CDN is minimized. ESI tries to minimize the amount of data sent between the origin web server and the CDN (where the latency may be high) while being transparent to the browser.</p><p>The biggest problem with adoption of ESI is that it forces web page authors to break pages up into blocks that can be safely cached by a CDN, adding to the complexity of web page authoring. In addition, a CDN has to be used to deliver the pages as web browsers do not understand the ESI directives.</p>
    <div>
      <h3>The time dimension</h3>
      <a href="#the-time-dimension">
        
      </a>
    </div>
    <p>The SDCH and ESI approaches rely on identifying parts of pages that are known to be unchanging and can be cached either at the edge of a CDN or in a shared dictionary in a web browser.</p><p>Another approach is to consider how web pages evolve over time. It is common for web users to visit the same web pages frequently (such as news sites, online email, social media and major retailers). This may mean that a user's web browser has some previous version of the web page they are loading in its local cache. Even though that web page may be out of date it could still be used as a shared dictionary as components of it are likely to appear in the latest version of the page.</p><p>For example, a daily visit to a news web site could be speeded up if the web server were able to send only the differences between yesterday's news and today's. It's likely that most of the HTML of a page like the BBC News homepage will have remained unchanged; only the stories will be new and they will only make up a small portion of the page.</p><p>CloudFlare looked at how much dynamically generated pages change over time and found that, for example, reddit.com changes by about 2.15% over five minutes and 3.16% over an hour. The New York Times home page changes by about 0.6% over five minutes and 3% over an hour. BBC News changes by about 0.4% over five minutes and 2% over an hour. With delta compression it would be possible to turn those figures directly into a compression ratio by only sending the tiny percentage of the page that has changed. Compressing the BBC News web site to 0.4% is an enormous improvement compared to gzip's 20-25% compression ratio, meaning that 116KB would result in just 464 bytes transmitted (which would likely all fit in a single TCP packet requiring a single round trip).</p><p>This delta method is the essence of <a href="http://www.ietf.org/rfc/rfc3229.txt">RFC 3229</a>, which was written in 2002.</p>
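<p>Change fractions like those quoted above can be estimated with a plain diff. A rough sketch using Python's difflib, with two invented page snapshots (this measures shared characters, not whatever exact method was used for the figures above):</p>

```python
import difflib

def fraction_changed(old: str, new: str) -> float:
    """Rough fraction of `new` that is not shared with `old` (character level)."""
    matcher = difflib.SequenceMatcher(None, old, new)
    shared = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - shared / len(new)

# Two invented snapshots of the "same" page a day apart: only one story changed
yesterday = '<html><h1>News</h1><p>Old story A</p><p>Old story B</p></html>'
today = '<html><h1>News</h1><p>New story C</p><p>Old story B</p></html>'
```

<p>Only the changed slice would need to be transmitted by a delta-compression scheme; everything the matcher finds in common could come from the cached copy.</p>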
    <div>
      <h3>RFC 3229</h3>
      <a href="#rfc-3229">
        
      </a>
    </div>
    <p>This RFC proposes an extension to HTTP where a web browser can indicate to a server that it has a particular version of a page (using the value from the ETag HTTP header that was supplied when the page was previously downloaded). The receiving web server can then apply a delta compression technique (encoded using VCDIFF discussed above) to send only the parts that have changed since that particular version of the page.</p><p>The RFC also proposes that a web browser be able to send the identifiers of multiple versions of a single page so that the web server can choose among them. That way, if the web browser has multiple versions in cache there's an increased chance that the server will have one of the versions available to it for delta compression.</p><p>Although this technique is powerful because it greatly reduces the amount of data to be sent from a web server to browser it has not been widely deployed because of the enormous resources needed on web servers.</p><p>To be effective a web server would need to keep copies of versions of the pages it generates in order that when a request comes in it is able to perform delta compression. For a popular web site that would create a large storage burden; for a site with heavy personalization it would mean keeping a copy of the pages served to every single user. For example, Facebook has around 1 billion active users, just keeping a copy of the HTML of the last time they viewed their timeline would require 250TB of storage.</p>
    <div>
      <h3>CloudFlare's Railgun</h3>
      <a href="#cloudflares-railgun">
        
      </a>
    </div>
    <p>CloudFlare's <a href="https://www.cloudflare.com/railgun">Railgun</a> is a transparent delta compression technology that takes advantage of CloudFlare's CDN network to greatly accelerate the transmission of dynamically generated web pages from origin web servers to the CDN node nearest the end user. Unlike SDCH and ESI, it does not require any work on the part of a web site creator, and unlike RFC 3229 it does not require caching a version of each page for each end user.</p><p>Railgun consists of two components: the sender and the listener. The sender is installed at every CloudFlare data center around the world. The listener is a software component that customers install on their network.</p><p>The sender and listener establish a permanent TCP connection that's secured by TLS. This TCP connection is used for the Railgun protocol. It's an all-binary multiplexing protocol that allows multiple HTTP requests to be run simultaneously and asynchronously across the link. To a web client the Railgun system looks like a proxy server, but instead of being a server it's a wide-area link with special properties. One of those properties is that it performs compression on non-cacheable content by synchronizing page versions.</p><p>Each end of the Railgun link keeps track of the last version of a web page that's been requested. When a new request comes in for a page that Railgun has already seen, only the changes are sent across the link. The listener component makes an HTTP request to the real, origin web server for the uncacheable page, makes a comparison with the stored version and sends across the differences.</p><p>The sender then reconstructs the page from its cache and the difference sent by the other side. 
Because multiple users pass through the same Railgun link, only a single cached version of the page is needed for delta compression, as opposed to one per end user with techniques like RFC 3229.</p><p>For example, a test on a major news site sent 23,529 bytes of gzipped data which when decompressed becomes 92,516 bytes of page (so the page is compressed to 25.25% of its original size). Railgun compression between two versions of the page at a five minute interval resulted in just 266 bytes of difference data being sent (a compression to 0.29% of the original page size). The one hour difference is 2,885 bytes (a compression to 3% of the original page size). Clearly, Railgun delta compression outperforms gzip enormously.</p><p>For pages that are frequently accessed the deltas are often so small that they fit inside a single TCP packet, and because the connection between the two parts of Railgun is kept active, problems with TCP congestion avoidance are eliminated.</p>
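<p>The bookkeeping described above can be imagined as two synchronized per-URL caches. The toy model below is not the Railgun protocol; it simply reuses the previous version of a page as a zlib preset dictionary, which captures the idea that only the delta needs to cross the link once both ends hold the last version:</p>

```python
import zlib

class RailgunEnd:
    """Toy model of one end of the link: remember the last version of each
    page and use it as a preset dictionary (not the actual Railgun protocol)."""

    def __init__(self):
        self.last_version = {}     # url -> previously synchronized page bytes

    def encode(self, url: str, page: bytes) -> bytes:
        prev = self.last_version.get(url)
        comp = (zlib.compressobj(9, zlib.DEFLATED, 15, 8,
                                 zlib.Z_DEFAULT_STRATEGY, prev)
                if prev else zlib.compressobj(9))
        self.last_version[url] = page
        return comp.compress(page) + comp.flush()

    def decode(self, url: str, delta: bytes) -> bytes:
        prev = self.last_version.get(url)
        decomp = zlib.decompressobj(zdict=prev) if prev else zlib.decompressobj()
        page = decomp.decompress(delta)
        self.last_version[url] = page
        return page
```

<p>After the first exchange, each subsequent transfer is compressed against the version both ends already share, so a page that changes only slightly costs only a tiny delta.</p>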
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The use of external dictionaries of content is a powerful technique that can achieve much larger compression ratios than the self-contained gzip method. But only CloudFlare's Railgun implements delta compression in a manner that is completely transparent to end users and website owners.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">39TxvNcWsuE7mZgsb8eRZi</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why Google Went Offline Today and a Bit about How the Internet Works]]></title>
            <link>https://blog.cloudflare.com/why-google-went-offline-today-and-a-bit-about/</link>
            <pubDate>Tue, 06 Nov 2012 09:09:00 GMT</pubDate>
            <description><![CDATA[ Today, Google's services experienced a limited outage for about 27 minutes over some portions of the Internet. The reason this happened dives into the deep, dark corners of networking.  ]]></description>
            <content:encoded><![CDATA[ <p>Today, Google's services experienced a limited outage for about 27 minutes over some portions of the Internet. The reason this happened dives into the deep, dark corners of networking. I'm a network engineer at CloudFlare and I played a small part in helping ensure Google came back online. Here's a bit about what happened.</p><p>At around 6:24pm PST / 02:24 UTC (5 Nov. 2012 PST / 6 Nov. 2012 UTC), CloudFlare employees noticed that Google's services were offline. We use Google Apps for things like email so when we can't reach their servers the office notices quickly. I'm on the Network Engineering team so I jumped online to figure out if the problem was local to us or global.</p>
    <div>
      <h3>Troubleshooting</h3>
    </div>
    <p>I quickly realised that we were unable to resolve any of Google's services — or even reach 8.8.8.8, Google's public DNS server — so I started troubleshooting DNS.</p><pre>$ dig +trace google.com</pre><p>Here's the response I got when I tried to reach any of Google.com's name servers:</p><pre>google.com.        172800  IN      NS      ns2.google.com.
google.com.        172800  IN      NS      ns1.google.com.
google.com.        172800  IN      NS      ns3.google.com.
google.com.        172800  IN      NS      ns4.google.com.
;; Received 164 bytes from 192.12.94.30#53(e.gtld-servers.net) in 152 ms
;; connection timed out; no servers could be reached</pre><p>The fact that no servers could be reached meant something was wrong. Specifically, it meant that from our office network we were unable to reach any of Google's DNS servers.</p><p>I started to look at the network layer to see if that's where the problem lay.</p><pre>PING 216.239.32.10 (216.239.32.10): 56 data bytes
Request timeout for icmp_seq 0
92 bytes from 1-1-15.edge2-eqx-sin.moratelindo.co.id (202.43.176.217): Time to live exceeded</pre><p>That was curious. Normally, we shouldn't see an Indonesian ISP (Moratel) in the path to Google. I jumped on one of CloudFlare's routers to check what was going on. Meanwhile, other reports from around the globe on Twitter suggested we weren't the only ones seeing the problem.</p>
    <div>
      <h3>Internet Routing</h3>
    </div>
    <p>To understand what went wrong you need to understand a bit about how networking on the Internet works. The Internet is a collection of networks known as "Autonomous Systems" (AS). Each network has a unique number, known as an AS number, to identify it. CloudFlare's AS number is 13335; Google's is 15169. The networks are connected together by what is known as the Border Gateway Protocol (BGP). BGP is the glue of the Internet — announcing what IP addresses belong to each network and establishing the routes from one AS to another. An Internet "route" is exactly what it sounds like: a path from an IP address on one AS to an IP address on another AS.</p><p>BGP is largely a trust-based system. Networks trust each other to say which IP addresses and other networks are behind them. When you send a packet or make a request across the network, your ISP connects to its upstream providers or peers and finds the shortest path from your ISP to the destination network.</p><p>Unfortunately, if a network starts announcing that a particular IP address or network is behind it when in fact it is not, and that network is trusted by its upstreams and peers, then packets can end up misrouted. That is what was happening here.</p><p>I looked at the BGP routes for a Google IP address. The route traversed Moratel (AS23947), an Indonesian ISP. Given that I was looking at the routing from California, and Google operates data centers not far from our office, packets should never be routed via Indonesia. The most likely cause was that Moratel was announcing a network that wasn't actually behind them.</p><p>The BGP route I saw at the time was:</p><pre>tom@edge01.sfo01&gt; show route 216.239.34.10

inet.0: 422168 destinations, 422168 routes (422154 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both

216.239.34.0/24    *[BGP/170] 00:15:47, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    &gt; to 69.22.153.1 via ge-1/0/9.0</pre><p>Looking at other routes, for example to Google's public DNS, traffic was stuck on the same (incorrect) path:</p><pre>tom@edge01.sfo01&gt; show route 8.8.8.8

inet.0: 422196 destinations, 422196 routes (422182 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both

8.8.8.0/24         *[BGP/170] 00:27:02, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    &gt; to 69.22.153.1 via ge-1/0/9.0</pre>
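<p>The anomaly in that output can be spotted mechanically: the AS path should end at Google's AS (15169), and every hop before it should be transit we recognise. Here is a hedged sketch of that check (illustrative Python, not CloudFlare's actual monitoring tooling; the "expected transit" set is an assumption drawn from the path shown above):</p>

```python
# Hypothetical route-leak check: flag any transit AS in a path that we
# don't expect to sit between us and the origin network.
EXPECTED_TRANSIT = {4436, 3491}  # assumed known-good upstreams/peers (illustrative)
ORIGIN_AS = 15169                # Google's AS number

def parse_as_path(line):
    """Extract ASNs from a router line like 'AS path: 4436 3491 23947 15169 I'."""
    tokens = line.split("AS path:", 1)[1].split()
    return [int(t) for t in tokens if t.isdigit()]

def suspicious_hops(path):
    """Return the transit ASNs that shouldn't appear before the origin."""
    if not path or path[-1] != ORIGIN_AS:
        return list(path)  # path doesn't even end at the right origin: all suspect
    return [asn for asn in path[:-1] if asn not in EXPECTED_TRANSIT]
```

<p>Run against the route above, this flags Moratel's AS 23947 as the unexpected hop, which is exactly the conclusion drawn by eye.</p>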
    <div>
      <h3>Route Leakage</h3>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/22fmryxOS3Tm9oxu2HxywQ/316632dbcb404c1c21d9df929a7fe5fb/fingersyouhaveusedtodial.png.scaled500.png" />
            
            </figure><p>(Image Credit: The Simpsons)</p><p>Situations like this are referred to in the industry as "route leakage", as the route has "leaked" past normal paths. This isn't an unprecedented event. Google previously suffered a <a href="http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml">similar outage</a> when Pakistan was allegedly trying to censor a video on YouTube and the national ISP of Pakistan null-routed the service's IP addresses. Unfortunately, they leaked the null route externally. Pakistan Telecom's upstream provider, PCCW, trusted what Pakistan Telecom was sending them and the routes spread across the Internet. The effect was that YouTube was knocked offline for around two hours.</p><p>The case today was similar. Someone at Moratel likely "fat-fingered" an Internet route. PCCW, who was Moratel's upstream provider, trusted the routes Moratel was sending to them. And, quickly, the bad routes spread. It is unlikely this was malicious; rather, it was a misconfiguration or an error that exposes some of the failings of the BGP trust model.</p>
    <div>
      <h3>The Fix</h3>
    </div>
    <p>The solution was to get Moratel to stop announcing the routes they shouldn't be. A large part of being a network engineer, especially working at a large network like CloudFlare's, is having relationships with other network engineers around the world. When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem at around 2:50 UTC / 6:50pm PST. Around 3 minutes later, routing returned to normal and Google's services came back online.</p><p>Looking at peering maps, I'd estimate the outage impacted around 3–5% of the Internet's population. The heaviest impact will have been felt in Hong Kong, where PCCW is the incumbent provider. If you were in the area and unable to reach Google's services around that time, now you know why.</p>
    <div>
      <h3>Building a Better Internet</h3>
    </div>
    <p>All of this is a reminder that the Internet is a system built on trust. Today's incident shows that, even if you're as big as Google, factors outside your direct control can affect your customers' ability to reach your site, so it's important to have a network engineering team that watches routes and manages your connectivity around the clock. CloudFlare works every day to ensure our customers get the best possible routes. We look out for all the websites on our network to ensure that their traffic is always delivered as fast as possible. Just another day in our ongoing efforts to <a href="https://twitter.com/search?q=%23savetheweb">#savetheweb</a>.</p>
    <div>
      <h4>Update: Tuesday, November 6 11:00am PST</h4>
    </div>
    <p>Moratel says the issue was caused by an unexpected hardware failure that produced this abnormal condition; it was not a malicious attempt. Moratel immediately shut down its BGP peering with Google after contact was made, while the hardware failure was investigated.</p><hr /><p><i>Thanks for reading all the way to the end. If you enjoyed this post, take a second to </i><a href="http://www.cloudflare.com/overview"><i>learn more about CloudFlare</i></a><i> or </i><a href="http://crunchies2012.techcrunch.com/nominate/?MTpDbG91ZEZsYXJl"><i>nominate us for the 2012 Crunchie Award for Best Technical Innovation</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4faLB1sIG2FgoyFXgvQ6Qr</guid>
            <dc:creator>Tom Paseka</dc:creator>
        </item>
        <item>
            <title><![CDATA[A note about Kerckhoffs's Principle]]></title>
            <link>https://blog.cloudflare.com/a-note-about-kerckhoffs-principle/</link>
            <pubDate>Tue, 19 Jun 2012 15:56:00 GMT</pubDate>
            <description><![CDATA[ The other day I wrote a long post describing in detail how we used to and how we now store customer passwords. Some people were surprised that we were open about this. ]]></description>
            <content:encoded><![CDATA[ <p>Image credit: <a href="http://www.flickr.com/photos/psicologiaclinica/">psicologiaclinica</a></p><p>The other day I wrote a long post describing in detail <a href="/keeping-passwords-safe-by-staying-up-to-date">how we used to and how we now store customer passwords</a>. Some people were surprised that we were open about this, and others wanted to understand exactly what part of a security system needs to be kept secret.</p><p>The simple answer is: a security system is only secure if its details can be safely shared with the world. This is known as <a href="http://en.wikipedia.org/wiki/Kerckhoffs's_principle">Kerckhoffs's Principle</a>.</p><p>The principle is sometimes stated as "a cryptosystem should be secure even if everything about the system, except the key, is public knowledge" or, using Claude Shannon's simpler version, "the enemy knows the system".</p><p>The idea is that if any part of a cryptosystem (except the individual secret key) has to be kept secret then the cryptosystem is not secure. That's because if the simple act of disclosing some detail of the system were to make it suddenly insecure then you've got a problem on your hands.</p><p>You've somehow got to keep that detail secret and for that you'll need a cryptosystem! Given that the whole point of the cryptosystem was to keep secrets, it's useless if it needs some other system to keep itself secret.</p><p>So, the gold standard for any secret-keeping system is that all its details should be able to be made public without compromising the security of the system. The security relies on the system itself, not the secrecy of the system. (And, as a corollary, if anyone tells you they've got some supersecret encryption system they can't tell you about then it's likely rubbish.)</p><p>A great example of this is the breaking of the <a href="http://en.wikipedia.org/wiki/Enigma_machine">Nazi German Enigma cipher</a> during the Second World War.
By stealing machines, receiving information from other secret services, and reading the manuals, the Allies knew everything there was to know about how the Enigma machine worked.</p><p>Enigma's security relied not on its secrecy, but on its complexity (and on keeping the daily key a secret). Enigma was broken by attacking the mathematics behind its encryption and building special machines to exploit mathematical flaws in the encryption.</p><p>That's just as true today. The security of HTTPS, SSL and ciphers like AES or RSA relies on the complexity of the algorithm, not on keeping them secret. In fact, they are all published, detailed standards. The only secret is the key that's chosen when you connect to a secure web site (that's done automatically and randomly by your browser and the server) or when you encrypt a document using a program like GPG.</p><p>Another example is home security. Imagine if you bought a lock that stated that it must be hidden from view so that no one knew what type of lock you had installed. That wouldn't provide much reassurance that the lock was any good. The security of the lock should depend on its mechanism and on you keeping the key safe, not on you keeping the lock a secret!</p>
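<p>Applied to password storage, the principle looks like this in practice: a published algorithm, a random salt that needs no secrecy, and only the password itself kept secret. The sketch below is illustrative Python using the standard library's PBKDF2 rather than bcrypt (which requires a third-party package); the function names and cost parameter are my own, not a statement of how CloudFlare's system works.</p>

```python
import hashlib
import hmac
import os

# Kerckhoffs-style password storage: the algorithm (PBKDF2-HMAC-SHA256 here,
# bcrypt in the post) and the salt can be public; only the password is secret.
ITERATIONS = 100_000  # published cost parameter, not a secret; tune upward in practice

def hash_password(password):
    salt = os.urandom(16)  # random per-user salt, stored alongside the hash
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison
```

<p>Everything in that sketch can be disclosed without weakening it: an attacker with the salt, the digest and the code still has to brute-force the password through the slow key-derivation function.</p>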
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2a7e3lQQwJd4EQe1cvSE3A/87c92e843914277edca262a9758859bf/7190315846_1a651daaf7.jpeg.scaled500.jpg" />
            
            </figure><p>Image credit: <a href="http://www.flickr.com/photos/paulorear/">paul.orear</a></p><p>When storing passwords securely we rely on the complexity of the bcrypt algorithm. Everything about our storage mechanism is assumed to be something that can be made public. So it's safe to say that we choose a random salt, and that we use bcrypt. The salt is not a key, and it does not need to be kept secret.</p><p>More than that, we assume that in the horrible case that our password database were accessed, it would still be secure, even though the hashed passwords and salts would be available to a hacker. The security of the system relies on the security of bcrypt and nothing else.</p><p>Of course, as a practical matter we don't leave the database lying around for anyone to access. It's kept behind firewalls and securely stored. But when thinking about security it's important to consider the worst-case situation, a full disclosure of the secured information, and to rely on the algorithm's strength and nothing else.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">5jPpUkNtZivIfb5kAu7ja1</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[What Egypt Shutting Down the Internet Looks Like]]></title>
            <link>https://blog.cloudflare.com/what-egypt-shutting-down-the-internet-looks-l/</link>
            <pubDate>Sat, 29 Jan 2011 04:26:00 GMT</pubDate>
            <description><![CDATA[ CloudFlare gets quite a bit of traffic from Egypt -- the country is consistently in the top-20 originators of visitors to our network. That is, until last night when Egypt shut down the Internet. Here is a graph showing traffic to our network from Egypt.  ]]></description>
            <content:encoded><![CDATA[ <p>CloudFlare gets quite a bit of traffic from Egypt -- the country is consistently in the top-20 originators of visitors to our network. That is, until last night when Egypt <a href="http://blogs.wsj.com/digits/2011/01/28/how-egypt-killed-the-internet/">shut</a> <a href="http://gigaom.com/2011/01/28/how-egypt-switched-off-the-internet/">down</a> <a href="http://www.networkworld.com/news/2011/012811-internet-blackout-egypt.html">the</a> <a href="http://laughingsquid.com/egypt-blocks-access-to-the-internet/">Internet</a>. Here is a graph showing traffic to our network from Egypt. Just after midnight last night in Cairo virtually all traffic coming from the country stopped dead. Throughout the day today there has only been a trickle of visitors from Egypt to our network. It goes to show what a powerful and important force the Internet is, and yet how fragile our access to it can be. The thoughts of the CloudFlare team are with the citizens of Egypt. We're all hoping for a swift and peaceful end to the conflict and that we start seeing Egyptian visitors to our network and the rest of the web again very soon.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">29OKnkbWWhT5BvVPH6MvOo</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>