
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 19:49:46 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Migrating billions of records: moving our active DNS database while it’s in use]]></title>
            <link>https://blog.cloudflare.com/migrating-billions-of-records-moving-our-active-dns-database-while-in-use/</link>
            <pubDate>Tue, 29 Oct 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ DNS records have moved to a new database, bringing improved performance and reliability to all customers. ]]></description>
            <content:encoded><![CDATA[ <p>According to a survey done by <a href="https://w3techs.com/technologies/overview/dns_server"><u>W3Techs</u></a>, as of October 2024, Cloudflare is used as an <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-server-types/"><u>authoritative DNS</u></a> provider by 14.5% of all websites. As an authoritative DNS provider, we are responsible for managing and serving all the DNS records for our clients’ domains. This means we have an enormous responsibility to provide the best service possible, starting at the data plane. As such, we are constantly investing in our infrastructure to ensure the reliability and performance of our systems.</p><p><a href="https://www.cloudflare.com/learning/dns/what-is-dns/"><u>DNS</u></a> is often referred to as the phone book of the Internet, and is a key component of the Internet. If you have ever used a phone book, you know that they can become extremely large depending on the size of the physical area it covers. A <a href="https://www.cloudflare.com/en-gb/learning/dns/glossary/dns-zone/#:~:text=What%20is%20a%20DNS%20zone%20file%3F"><u>zone file</u></a> in DNS is no different from a phone book. It has a list of records that provide details about a domain, usually including critical information like what IP address(es) each hostname is associated with. For example:</p>
            <pre><code>example.com      59 IN A 198.51.100.0
blog.example.com 59 IN A 198.51.100.1
ask.example.com  59 IN A 198.51.100.2</code></pre>
            <p>It is not unusual for these zone files to reach millions of records in size, just for a single domain. The biggest single zone on Cloudflare holds roughly 4 million DNS records, but the vast majority of zones hold fewer than 100 DNS records. Given our scale according to W3Techs, you can imagine how much DNS data alone Cloudflare is responsible for. Given this volume of data, and all the complexities that come at that scale, there needs to be a very good reason to move it from one database cluster to another. </p>
    <div>
      <h2>Why migrate </h2>
      <a href="#why-migrate">
        
      </a>
    </div>
    <p>When initially measured in 2022, DNS data took up approximately 40% of the storage capacity in Cloudflare’s main database cluster (<b>cfdb</b>). This database cluster, consisting of a primary system and multiple replicas, is responsible for storing DNS zones, which are propagated to our <a href="https://www.cloudflare.com/network/"><u>data centers in over 330 cities</u></a> via our distributed KV store <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale/"><u>Quicksilver</u></a>. <b>cfdb</b> is accessed by most of Cloudflare's APIs, including the <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/create-dns-records/"><u>DNS Records API</u></a>. Today, the DNS Records API is the API most used by our customers, with each request resulting in a query to the database. As such, it’s always been important to optimize the DNS Records API and its surrounding infrastructure to ensure we can successfully serve every request that comes in.</p><p>As Cloudflare scaled, <b>cfdb</b> was becoming increasingly strained under the pressure of several services, many unrelated to DNS. During spikes of requests to our DNS systems, other Cloudflare services experienced degraded database performance. It was understood that in order to scale properly, we needed to optimize our database access and improve the systems that interact with it. However, it was evident that system-level improvements could only go so far, and the growing pains were becoming unbearable. In late 2022, the DNS team decided, with the help of 25 other teams, to detach from <b>cfdb</b> and move our DNS records data to another database cluster.</p>
    <div>
      <h2>Pre-migration</h2>
      <a href="#pre-migration">
        
      </a>
    </div>
    <p>From a DNS perspective, this migration to an improved database cluster was in the works for several years. Cloudflare initially relied on a single <a href="https://www.postgresql.org/"><u>Postgres</u></a> database cluster, <b>cfdb</b>. At Cloudflare's inception, <b>cfdb</b> was responsible for storing information about zones and accounts, and the majority of services on the Cloudflare control plane depended on it. Since around 2017, as Cloudflare grew, many services moved their data out of <b>cfdb</b> to be served by a <a href="https://en.wikipedia.org/wiki/Microservices"><u>microservice</u></a>. Unfortunately, the difficulty of these migrations is directly proportional to the number of services that depend on the data being migrated, and in this case, most services require knowledge of both zones and DNS records.</p><p>Although the term “zone” was born from the DNS point of view, it has since evolved into something more. Today, zones on Cloudflare store many different types of non-DNS related settings and help link several non-DNS related products to customers' websites. Therefore, it didn’t make sense to move both zone data and DNS record data together. This separation of two historically tightly coupled DNS concepts proved to be an incredibly challenging problem, involving many engineers and systems. In addition, it was clear that if we were going to dedicate the resources to solving this problem, we should also remove some of the legacy issues that came along with the original solution.</p><p>One of the main issues with the legacy database was that the DNS team had little control over which systems accessed exactly what data and at what rate. Moving to a new database gave us the opportunity to create a more tightly controlled interface to the DNS data. 
This was manifested as an internal DNS Records <a href="https://blog.cloudflare.com/moving-k8s-communication-to-grpc/"><u>gRPC API</u></a> which allows us to make sweeping changes to our data while only requiring a single change to the API, rather than coordinating with other systems.  For example, the DNS team can alter access logic and auditing procedures under the hood. In addition, it allows us to appropriately rate-limit and cache data depending on our needs. The move to this new API itself was no small feat, and with the help of several teams, we managed to migrate over 20 services, using 5 different programming languages, from direct database access to using our managed gRPC API. Many of these services touch very important areas such as <a href="https://developers.cloudflare.com/dns/dnssec/"><u>DNSSEC</u></a>, <a href="https://developers.cloudflare.com/ssl/"><u>TLS</u></a>, <a href="https://developers.cloudflare.com/email-routing/"><u>Email</u></a>, <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Tunnels</u></a>, <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a>, <a href="https://developers.cloudflare.com/spectrum/"><u>Spectrum</u></a>, and <a href="https://developers.cloudflare.com/r2/"><u>R2 storage</u></a>. Therefore, it was important to get it right. </p><p>One of the last issues to tackle was the logical decoupling of common DNS database functions from zone data. Many of these functions expect to be able to access both DNS record data and DNS zone data at the same time. For example, at record creation time, our API needs to check that the zone is not over its maximum record allowance. Originally this check occurred at the SQL level by verifying that the record count was lower than the record limit for the zone. However, once you remove access to the zone itself, you are no longer able to confirm this. 
Our DNS Records API also made use of SQL functions to audit record changes, which requires access to both DNS record and zone data. Luckily, over the past several years, we have migrated this functionality out of our monolithic API and into separate microservices. This allowed us to move the auditing and zone setting logic to the application level rather than the database level. Ultimately, we are still taking advantage of SQL functions in the new database cluster, but they are fully independent of any other legacy systems, and are able to take advantage of the latest Postgres version.</p><p>Now that Cloudflare DNS was mostly decoupled from the zones database, it was time to proceed with the data migration. For this, we built what would become our <b>Change Data Capture and Transfer Service (CDCTS).</b></p>
    <div>
      <h2>Requirements for the Change Data Capture and Transfer Service</h2>
      <a href="#requirements-for-the-change-data-capture-and-transfer-service">
        
      </a>
    </div>
    <p>The Database team is responsible for all Postgres clusters within Cloudflare, and was tasked with executing the data migration of two tables that store DNS data: <i>cf_rec</i> and <i>cf_archived_rec</i>, from the original <b>cfdb </b>cluster to a new cluster we called <b>dnsdb</b>. We had several key requirements that drove our design:</p><ul><li><p><b>Don’t lose data. </b>This is the number one priority when handling any sort of data. Losing data means losing trust, and it is incredibly difficult to regain that trust once it’s lost. Just as important was the ability to prove that no data had been lost; the migration process would, ideally, be easily auditable.</p></li><li><p><b>Minimize downtime</b>. We wanted a solution with less than a minute of downtime during the migration, and ideally just a few seconds of delay.</p></li></ul><p>These two requirements meant that we had to be able to migrate data changes in near real-time, meaning we needed either logical replication or a custom method to capture changes, migrate them, and apply them in a table in a separate Postgres cluster.</p><p>We first looked at Postgres logical replication using <a href="https://github.com/2ndQuadrant/pglogical"><u>pgLogical</u></a>, but had concerns about its performance and our ability to audit its correctness. Then additional requirements emerged that made a pgLogical implementation of logical replication impossible:</p><ul><li><p><b>The ability to move data must be bidirectional.</b> We had to have the ability to switch back to <b>cfdb</b> without significant downtime in case of unforeseen problems with the new implementation. 
</p></li><li><p><b>Partition the </b><b><i>cf_rec</i></b><b> table in the new database.</b> This was a long-desired improvement, and since most access to <i>cf_rec</i> is by zone_id, it was decided that <b>mod(zone_id, num_partitions)</b> would be the partition key.</p></li><li><p><b>Transferred data must remain accessible from the original database. </b>In case any functionality still needed access to the data, a foreign table pointing to <b>dnsdb</b> would be available in <b>cfdb</b>. This could be used as emergency access to avoid needing to roll back the entire migration for a single missed process.</p></li><li><p><b>Only allow writes in one database. </b>Applications should know where the primary database is, and should be blocked from writing to both databases at the same time.</p></li></ul>
    <div>
      <h2>Details about the tables being migrated</h2>
      <a href="#details-about-the-tables-being-migrated">
        
      </a>
    </div>
    <p>The primary table, <i>cf_rec</i>, stores DNS record information, and its rows are regularly inserted, updated, and deleted. At the time of the migration, this table had 1.7 billion records, and with several indexes took up 1.5 TB of disk. Typical daily usage would observe 3-5 million inserts, 1 million updates, and 3-5 million deletes.</p><p>The second table, <i>cf_archived_rec</i>, stores copies of <i>cf_rec</i> that are obsolete — this table generally only has records inserted and is never updated or deleted.  As such, it would see roughly 3-5 million inserts per day, corresponding to the records deleted from <i>cf_rec</i>. At the time of the migration, this table had roughly 4.3 billion records.</p><p>Fortunately, neither table made use of database triggers or foreign keys, which meant that we could insert/update/delete records in this table without triggering changes or worrying about dependencies on other tables.</p><p>Ultimately, both of these tables are highly active and are the source of truth for many highly critical systems at Cloudflare.</p>
    <div>
      <h2>Designing the Change Data Capture and Transfer Service</h2>
      <a href="#designing-the-change-data-capture-and-transfer-service">
        
      </a>
    </div>
    <p>There were two main parts to this database migration:</p><ol><li><p><b>Initial copy:</b> Take all the data from <b>cfdb </b>and put it in <b>dnsdb.</b></p></li><li><p><b>Change copy:</b> Take all the changes in <b>cfdb </b>since the initial copy and update <b>dnsdb</b> to reflect them. This is the more involved part of the process.</p></li></ol><p>Normally, logical replication replays every insert, update, and delete on a copy of the data in the same transaction order, making a single-threaded pipeline.  We considered using a queue-based system but again, speed and auditability were both concerns as any queue would typically replay one change at a time.  We wanted to be able to apply large sets of changes, so that after an initial dump and restore, we could quickly catch up with the changed data. For the rest of the blog, we will only speak about <i>cf_rec</i> for simplicity, but the process for <i>cf_archived_rec</i> is the same.</p><p>What we decided on was a simple change capture table. Rows from this capture table would be loaded in real-time by a database trigger, with a transfer service that could migrate and apply thousands of changed records to <b>dnsdb</b> in each batch. Lastly, we added some auditing logic on top to ensure that we could easily verify that all data was safely transferred without downtime.</p>
    <div>
      <h3>Basic model of change data capture </h3>
      <a href="#basic-model-of-change-data-capture">
        
      </a>
    </div>
    <p>For <i>cf_rec</i> to be migrated, we would create a change logging table, along with a trigger function and a table trigger to capture the new state of the record after any insert/update/delete.</p><p>The change logging table, named <i>log_cf_rec</i>, had the same columns as <i>cf_rec</i>, as well as four new columns:</p><ul><li><p><b>change_id</b>: a sequence-generated unique identifier for the record</p></li><li><p><b>action</b>: a single character indicating whether this record represents an [i]nsert, [u]pdate, or [d]elete</p></li><li><p><b>change_timestamp</b>: the date/time when the change record was created</p></li><li><p><b>change_user</b>: the database user that made the change</p></li></ul><p>A trigger was placed on the <i>cf_rec</i> table so that each insert/update would copy the new values of the record into the change table, and each delete would create a 'D' record with the primary key value.</p><p>Here is an example of the change logging where we delete, re-insert, update, and finally select from the <i>log_cf_rec</i> table. Note that the actual <i>cf_rec</i> and <i>log_cf_rec</i> tables have many more columns, but have been edited for simplicity.</p>
            <pre><code>dns_records=# DELETE FROM cf_rec WHERE rec_id = 13;

dns_records=# SELECT * FROM log_cf_rec;
 change_id | action | rec_id | zone_id | name
-----------+--------+--------+---------+------
 1         | D      | 13     |         |

dns_records=# INSERT INTO cf_rec VALUES(13,299,'cloudflare.example.com');

dns_records=# UPDATE cf_rec SET name = 'test.example.com' WHERE rec_id = 13;

dns_records=# SELECT * FROM log_cf_rec;
 change_id | action | rec_id | zone_id | name
-----------+--------+--------+---------+------
 1         | D      | 13     |         |
 2         | I      | 13     | 299     | cloudflare.example.com
 3         | U      | 13     | 299     | test.example.com</code></pre>
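            <p>The capture mechanism itself can be sketched as an ordinary row-level trigger. The code below is illustrative only: the real tables hold many more columns, and <i>change_id</i>, <i>change_timestamp</i>, and <i>change_user</i> are assumed to be populated by column defaults (a sequence, <code>now()</code>, and <code>current_user</code>):</p>
            <pre><code>-- Hypothetical sketch of the change-capture trigger; not the production code
CREATE FUNCTION fn_log_cf_rec_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        -- For deletes, record only the action and the primary key
        INSERT INTO log_cf_rec (action, rec_id) VALUES ('D', OLD.rec_id);
        RETURN OLD;
    ELSE
        -- For inserts and updates, record the new state of the row
        INSERT INTO log_cf_rec (action, rec_id, zone_id, name)
        VALUES (CASE TG_OP WHEN 'INSERT' THEN 'I' ELSE 'U' END,
                NEW.rec_id, NEW.zone_id, NEW.name);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_cf_rec
AFTER INSERT OR UPDATE OR DELETE ON cf_rec
FOR EACH ROW EXECUTE FUNCTION fn_log_cf_rec_change();</code></pre>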
            <p>In addition to <i>log_cf_rec</i>, we also introduced 2 more tables in <b>cfdb </b>and 3 more tables in <b>dnsdb:</b></p><p><b>cfdb</b></p><ol><li><p><i>transferred_log_cf_rec</i>: Responsible for auditing the batches transferred to <b>dnsdb</b>.</p></li><li><p><i>log_change_action</i>:<i> </i>Responsible for summarizing the transfer size in order to compare with the <i>log_change_action </i>in <b>dnsdb.</b></p></li></ol><p><b>dnsdb</b></p><ol><li><p><i>migrate_log_cf_rec</i>:<i> </i>Responsible for collecting batch changes in <b>dnsdb</b>, which would later be applied to <i>cf_rec </i>in <b>dnsdb</b><i>.</i></p></li><li><p><i>applied_migrate_log_cf_rec</i>:<i> </i>Responsible for auditing the batches that had been successfully applied to cf_rec in <b>dnsdb.</b></p></li><li><p><i>log_change_action</i>:<i> </i>Responsible for summarizing the transfer size in order to compare with the <i>log_change_action </i>in <b>cfdb.</b></p></li></ol>
    <div>
      <h3>Initial copy</h3>
      <a href="#initial-copy">
        
      </a>
    </div>
    <p>With change logging in place, we were now ready to do the initial copy of the tables from <b>cfdb</b> to <b>dnsdb</b>. Because we were changing the structure of the tables in the destination database and because of network timeouts, we wanted to bring the data over in small pieces and validate that it was brought over accurately, rather than doing a single multi-hour copy or <a href="https://www.postgresql.org/docs/current/app-pgdump.html"><u>pg_dump</u></a>. We also wanted to ensure a long-running read could not impact production and that the process could be paused and resumed at any time. The basic model to transfer data was a simple psql copy statement piped into another psql copy statement. No intermediate files were used.</p><p><code>psql_cfdb -c "COPY (SELECT * FROM cf_rec WHERE id BETWEEN n AND n+1000000) TO STDOUT" | </code></p><p><code>psql_dnsdb -c "COPY cf_rec FROM STDIN"</code></p><p>Prior to a batch being moved, the count of records to be moved was recorded in <b>cfdb</b>, and after each batch was moved, a count was recorded in <b>dnsdb</b> and compared to the count in <b>cfdb</b> to ensure that a network interruption or other unforeseen error did not cause data to be lost. The bash script to copy data looked like this, where we included files that could be touched to pause or end the copy (in case it caused load on production or there was an incident). Once again, the code below has been heavily simplified.</p>
            <pre><code>#!/bin/bash
# psql_cfdb and psql_dnsdb stand in for full psql invocations against each cluster
for i in "$@"; do
   # Allow user to control whether this is paused or not via pause_copy file
   while [ -f pause_copy ]; do
      sleep 1
   done
   # Allow user to end migration by creating end_copy file
   if [ ! -f end_copy ]; then
      # Copy a batch of records from cfdb to dnsdb
      psql_cfdb -c "COPY (SELECT * FROM cf_rec WHERE id BETWEEN $i AND $i + 1000000) TO STDOUT" |
         psql_dnsdb -c "COPY cf_rec FROM STDIN"
      # Get count of records from cfdb
      src_count=$(psql_cfdb -tAc "SELECT count(*) FROM cf_rec WHERE id BETWEEN $i AND $i + 1000000")
      # Get count of records from dnsdb
      dst_count=$(psql_dnsdb -tAc "SELECT count(*) FROM cf_rec WHERE id BETWEEN $i AND $i + 1000000")
      # Compare cfdb count with dnsdb count and alert if different
      if [ "$src_count" != "$dst_count" ]; then
         echo "count mismatch for batch starting at $i: $src_count vs $dst_count" >&2
      fi
   fi
done
</code></pre>
            <p><sup><i>Bash copy script</i></sup></p>
    <div>
      <h3>Change copy</h3>
      <a href="#change-copy">
        
      </a>
    </div>
    <p>Once the initial copy was completed, we needed to update <b>dnsdb</b> with any changes that had occurred in <b>cfdb</b> since the start of the initial copy. To implement this change copy, we created a function <i>fn_log_change_transfer_log_cf_rec </i>that could be passed a <i>batch_id</i> and <i>batch_size</i>, and did 5 things, all of which were executed in a single database <a href="https://www.postgresql.org/docs/current/tutorial-transactions.html"><u>transaction</u></a>:</p><ol><li><p>Select a <i>batch_size</i> of records from <i>log_cf_rec</i> in <b>cfdb</b>.</p></li><li><p>Copy the batch to <i>transferred_log_cf_rec</i> in <b>cfdb </b>to mark it as transferred.</p></li><li><p>Delete the batch from <i>log_cf_rec</i>.</p></li><li><p>Write a summary of the action to the <i>log_change_action</i> table. This will later be used to compare transferred records with <b>dnsdb</b>.</p></li><li><p>Return the batch of records.</p></li></ol><p>We then took the returned batch of records and copied them to <i>migrate_log_cf_rec </i>in <b>dnsdb</b>. We used the same bash script as above, except this time, the copy command looked like this:</p><p><code>psql_cfdb -c "COPY (SELECT * FROM </code><code><i>fn_log_change_transfer_log_cf_rec(&lt;batch_id&gt;, &lt;batch_size&gt;)</i></code><code>) TO STDOUT" | </code></p><p><code>psql_dnsdb -c "COPY migrate_log_cf_rec FROM STDIN"</code></p>
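    <p>In PL/pgSQL, such a function might look like the sketch below. The schemas here are assumptions, not the production definitions: <i>transferred_log_cf_rec</i> is taken to carry a leading batch_id column, and <i>log_change_action</i> a per-action count. The five steps map onto one data-modifying query:</p>
            <pre><code>-- Hypothetical sketch of the batch transfer function in cfdb
CREATE FUNCTION fn_log_change_transfer_log_cf_rec(p_batch_id bigint, p_batch_size int)
RETURNS SETOF log_cf_rec AS $$
BEGIN
    RETURN QUERY
    -- 1 and 3: select a batch_size of records, deleting them from log_cf_rec
    WITH batch AS (
        DELETE FROM log_cf_rec
        WHERE change_id IN (SELECT change_id FROM log_cf_rec
                            ORDER BY change_id LIMIT p_batch_size)
        RETURNING *
    ), transferred AS (
        -- 2: copy the batch to transferred_log_cf_rec to mark it as transferred
        INSERT INTO transferred_log_cf_rec SELECT p_batch_id, * FROM batch
    ), summary AS (
        -- 4: write a per-action summary for later comparison with dnsdb
        INSERT INTO log_change_action
            SELECT p_batch_id, action, count(*) FROM batch GROUP BY action
    )
    -- 5: return the batch of records to the caller
    SELECT * FROM batch;
END;
$$ LANGUAGE plpgsql;</code></pre>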
    <div>
      <h3>Applying changes in the destination database</h3>
      <a href="#applying-changes-in-the-destination-database">
        
      </a>
    </div>
    <p>Now, with a batch of data in the <i>migrate_log_cf_rec </i>table, we called a newly created function <i>log_change_apply</i> to apply and audit the changes. Once again, this was all executed within a single database transaction. The function did the following:</p><ol><li><p>Move a batch from the <i>migrate_log_cf_rec</i> table to a new temporary table.</p></li><li><p>Write the counts for the batch_id to the <i>log_change_action</i> table.</p></li><li><p>Delete from the temporary table all but the latest record for a unique id (last action). For example, an insert followed by 30 updates would have a single record left, the final update. There is no need to apply all the intermediate updates.</p></li><li><p>Delete any record from <i>cf_rec</i> that has any corresponding changes.</p></li><li><p>Insert any [i]nsert or [u]pdate records in <i>cf_rec</i>.</p></li><li><p>Copy the batch to <i>applied_migrate_log_cf_rec</i> for a full audit trail.</p></li></ol>
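    <p>A simplified PL/pgSQL sketch of this function follows. As before, the column lists are assumptions (<i>migrate_log_cf_rec</i> is taken to carry a batch_id column), not the actual implementation:</p>
            <pre><code>-- Hypothetical sketch of the batch apply function in dnsdb
CREATE FUNCTION log_change_apply(p_batch_id bigint) RETURNS void AS $$
BEGIN
    -- 1. Move the batch into a temporary table
    CREATE TEMP TABLE batch (LIKE migrate_log_cf_rec) ON COMMIT DROP;
    WITH moved AS (
        DELETE FROM migrate_log_cf_rec WHERE batch_id = p_batch_id RETURNING *)
    INSERT INTO batch SELECT * FROM moved;
    -- 2. Write the per-action counts for this batch_id for auditing
    INSERT INTO log_change_action
        SELECT p_batch_id, action, count(*) FROM batch GROUP BY action;
    -- 3. Keep only the latest change per record
    DELETE FROM batch b USING batch newer
        WHERE b.rec_id = newer.rec_id AND b.change_id &lt; newer.change_id;
    -- 4. Delete any record from cf_rec that has a corresponding change
    DELETE FROM cf_rec r USING batch b WHERE r.rec_id = b.rec_id;
    -- 5. Insert the final state of [i]nserted and [u]pdated records
    INSERT INTO cf_rec (rec_id, zone_id, name)
        SELECT rec_id, zone_id, name FROM batch WHERE action IN ('I', 'U');
    -- 6. Archive the applied batch for the audit trail
    INSERT INTO applied_migrate_log_cf_rec SELECT * FROM batch;
END;
$$ LANGUAGE plpgsql;</code></pre>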
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>There were 4 distinct phases, each of which was part of a different database transaction:</p><ol><li><p>Call <i>fn_log_change_transfer_log_cf_rec </i>in <b>cfdb </b>to get a batch of records.</p></li><li><p>Copy the batch of records to <b>dnsdb.</b></p></li><li><p>Call <i>log_change_apply </i>in <b>dnsdb </b>to apply the batch of records.</p></li><li><p>Compare the <i>log_change_action</i> table in each respective database to ensure counts match.</p></li></ol>
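    <p>Tied together, the four phases can be sketched as a loop in the same style as the copy script above. The batch size and audit query are illustrative, and <code>psql_cfdb</code>/<code>psql_dnsdb</code> again stand in for full psql invocations:</p>
            <pre><code>#!/bin/bash
batch_id=0
# Run until the end_copy file is created
while [ ! -f end_copy ]; do
   batch_id=$((batch_id + 1))
   # Phases 1 and 2: pull a batch of changes from cfdb and copy it to dnsdb
   psql_cfdb -c "COPY (SELECT * FROM fn_log_change_transfer_log_cf_rec($batch_id, 10000)) TO STDOUT" |
      psql_dnsdb -c "COPY migrate_log_cf_rec FROM STDIN"
   # Phase 3: apply the batch to cf_rec in dnsdb
   psql_dnsdb -c "SELECT log_change_apply($batch_id)"
   # Phase 4: compare the per-batch summaries in both databases
   src=$(psql_cfdb -tAc "SELECT action, count FROM log_change_action WHERE batch_id = $batch_id")
   dst=$(psql_dnsdb -tAc "SELECT action, count FROM log_change_action WHERE batch_id = $batch_id")
   if [ "$src" != "$dst" ]; then
      echo "audit mismatch for batch $batch_id" >&2
   fi
   sleep 3
done</code></pre>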
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2REIq71tc7M4jKPLZSJzS9/11f22f700300f2ad3a5ee5ca85a75480/Applying_changes_in_the_destination_database.png" />
          </figure><p>This process was run every 3 seconds for several weeks before the migration to ensure that we could keep <b>dnsdb</b> in sync with <b>cfdb</b>.</p>
    <div>
      <h2>Managing which database is live</h2>
      <a href="#managing-which-database-is-live">
        
      </a>
    </div>
    <p>The last major pre-migration task was the construction of the request locking system that would be used throughout the actual migration. The aim was to create a system through which the database could communicate with the DNS Records API, allowing the API to handle HTTP connections more gracefully. If done correctly, this could reduce downtime for DNS Records API users to nearly zero.</p><p>In order to facilitate this, a new table called <i>cf_migration_manager</i> was created. The table would be periodically polled by the DNS Records API, communicating two critical pieces of information:</p><ol><li><p><b>Which database was active.</b> Here we just used a simple A or B naming convention.</p></li><li><p><b>Whether the database was locked for writing</b>. In the event the database was locked for writing, the DNS Records API would hold HTTP requests until the lock was released by the database.</p></li></ol><p>Both pieces of information would be controlled by a migration manager script.</p><p>The benefit of migrating the 20+ internal services from direct database access to our internal DNS Records gRPC API is that we were able to control access to the database, ensuring that no one else would be writing without going through the <i>cf_migration_manager</i>.</p>
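    <p>Conceptually, the migration manager can be modeled as a single-row state table that the DNS Records API polls on an interval. The schema and query below are an illustrative sketch, not the actual implementation:</p>
            <pre><code>-- Hypothetical sketch of the migration manager state
CREATE TABLE cf_migration_manager (
    -- Which database is active: a simple A or B naming convention
    active_database text NOT NULL CHECK (active_database IN ('A', 'B')),
    -- Whether the active database is locked for writing
    write_locked    boolean NOT NULL DEFAULT false,
    updated_at      timestamptz NOT NULL DEFAULT now()
);

-- The DNS Records API periodically polls this state; while write_locked
-- is true, it holds incoming write requests until the lock is released.
SELECT active_database, write_locked FROM cf_migration_manager;</code></pre>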
    <div>
      <h2>During the migration </h2>
      <a href="#during-the-migration">
        
      </a>
    </div>
    <p>Although we aimed to complete this migration in a matter of seconds, we announced a DNS maintenance window that could last a couple of hours, just to be safe. Now that everything was set up, and both <b>cfdb</b> and <b>dnsdb</b> were roughly in sync, it was time to proceed with the migration. The steps were as follows:</p><ol><li><p>Lower the time between copies from 3s to 0.5s.</p></li><li><p>Lock <b>cfdb</b> for writes via <i>cf_migration_manager</i>. This would tell the DNS Records API to hold write connections.</p></li><li><p>Make <b>cfdb</b> read-only and migrate the last logged changes to <b>dnsdb</b>.</p></li><li><p>Enable writes to <b>dnsdb</b>.</p></li><li><p>Tell the DNS Records API via the <i>cf_migration_manager</i> that <b>dnsdb</b> is the new primary database and that write connections can proceed.</p></li></ol><p>Since we needed to ensure that the last changes were copied to <b>dnsdb</b> before enabling writing, this entire process took no more than 2 seconds. During the migration we saw a spike in API latency as a result of the migration manager locking writes and then working through a backlog of queries. However, latencies returned to normal after several minutes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6agUpD8BQVxgDupBrwtTw3/38c96f91879c6539011866821ad6f11a/image3.png" />
          </figure><p><sup><i>DNS Records API latency and requests during the migration</i></sup></p><p>Unfortunately, due to the far-reaching impact that DNS has at Cloudflare, this was not the end of the migration. There were 3 lesser-used services that had slipped through our scan of services accessing DNS records via <b>cfdb</b>. Fortunately, the setup of the foreign table meant that we could very quickly fix any residual issues by simply changing the table name.</p>
    <div>
      <h2>Post-migration</h2>
      <a href="#post-migration">
        
      </a>
    </div>
    <p>Almost immediately, as expected, we saw a steep drop in usage across <b>cfdb</b>. This freed up a lot of resources for other services to take advantage of.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Xfnbc9MZLwJB91ypItWsi/1eb21362893b31a1e3c846d1076a9f5b/image6.jpg" />
          </figure><p><sup><i><b>cfdb</b></i></sup><sup><i> usage dropped significantly after the migration period.</i></sup></p><p>Since the migration, the average <b>requests</b> per second to the DNS Records API has more than <b>doubled</b>. At the same time, our CPU usage across both <b>cfdb</b> and <b>dnsdb</b> has settled at below 10% as seen below, giving us room for spikes and future growth. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/39su35dkb5Pl8uwYfYjHLg/0eb26ced30b44efb71abb73830e01f3a/image2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5AdlLKXtD68QWCsMVLKnkt/9137beee9c941827eb57c53825ffe209/image4.png" />
          </figure><p><sup><i><b>cfdb</b></i></sup><sup><i> and </i></sup><sup><i><b>dnsdb</b></i></sup><sup><i> CPU usage now</i></sup></p><p>As a result of this improved capacity, our database-related incident rate dropped dramatically.</p><p>As for query latencies, our latency post-migration is slightly lower on average, with fewer sustained spikes above 500ms. However, the performance improvement is largely noticed during high load periods, when our database handles spikes without significant issues. Many of these spikes come as a result of clients making calls to collect a large amount of DNS records or making several changes to their zone in short bursts. Both of these actions are common use cases for large customers onboarding zones.</p><p>In addition to these improvements, the DNS team also has more granular control over <b>dnsdb</b> cluster-specific settings that can be tweaked for our needs rather than catering to all the other services. For example, we were able to make custom changes to replication lag limits to ensure that services using replicas were able to read with some amount of certainty that the data would exist in a consistent form. Measures like this reduce overall load on the primary because almost all read queries can now go to the replicas.</p><p>Although this migration was a resounding success, we are always working to improve our systems. As we grow, so do our customers, which means the need to scale never really ends. We have more exciting improvements on the roadmap, and we are looking forward to sharing more details in the future.</p><p>The DNS team at Cloudflare isn’t the only team solving challenging problems like the one above. If this sounds interesting to you, we have many more tech deep dives on our blog, and we are always looking for curious engineers to join our team — see open opportunities <a href="https://www.cloudflare.com/en-gb/careers/jobs/"><u>here</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[Tracing]]></category>
            <category><![CDATA[Quicksilver]]></category>
            <guid isPermaLink="false">24rozMdbFQ7jmUgRNMF4RU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
            <dc:creator>Corey Horton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making zone management more efficient with batch DNS record updates]]></title>
            <link>https://blog.cloudflare.com/batched-dns-changes/</link>
            <pubDate>Mon, 23 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ In response to customer demand, we now support the ability to DELETE, PATCH, PUT and POST multiple DNS records in a single API call, enabling more efficient and reliable zone management.
 ]]></description>
            <content:encoded><![CDATA[ <p>Customers that use Cloudflare to manage their DNS often need to create a whole batch of records, enable <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/"><u>proxying</u></a> on many records, update many records to point to a new target at the same time, or even delete all of their records. Historically, customers had to resort to bespoke scripts to make these changes, which came with their own set of issues. In response to customer demand, we are excited to announce support for batched API calls to the <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/create-dns-records/"><u>DNS records API</u></a> starting today. This lets customers make large changes to their zones much more efficiently than before. Users can now batch multiple changes into a single API call, using any of the four <a href="https://en.wikipedia.org/wiki/HTTP#Request_methods"><u>HTTP methods</u></a>: POST, PUT, PATCH, or DELETE.</p>
    <div>
      <h2>Efficient zone management matters</h2>
      <a href="#efficient-zone-management-matters">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/"><u>DNS records</u></a> are an essential part of most web applications and websites, and they serve many different purposes. The most common use case for a DNS record is to have a hostname point to an <a href="https://en.wikipedia.org/wiki/IPv4"><u>IPv4</u></a> address; this is called an <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>:</p><p><b>example.com</b> 59 IN A <b>198.51.100.0</b></p><p><b>blog.example.com</b> 59 IN A <b>198.51.100.1</b></p><p><b>ask.example.com</b> 59 IN A <b>198.51.100.2</b></p><p>In its simplest form, this enables Internet users to connect to websites without needing to memorize their IP address. </p><p>Often, our customers need to be able to do things like create a whole batch of records, or enable <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/"><u>proxying</u></a> on many records, or update many records to point to a new target at the same time, or even delete all of their records. Unfortunately, for most of these cases, we were asking customers to write their own custom scripts or programs to do these tasks for them, a number of which are open source and whose contents we have not reviewed. These scripts are often used to avoid needing to repeatedly make the same API calls manually. This takes time, not only for the development of the scripts, but also to simply execute all the API calls, not to mention that it can leave the zone in a bad state if some changes fail while others succeed.</p>
    <div>
      <h2>Introducing /batch</h2>
      <a href="#introducing-batch">
        
      </a>
    </div>
    <p>Starting today, everyone with a <a href="https://developers.cloudflare.com/dns/zone-setups/"><u>Cloudflare zone</u></a> will have access to this endpoint, with free tier customers getting access to 200 changes in one batch, and paid plans getting access to 3,500 changes in one batch. We have successfully tested up to 100,000 changes in one call. The API is simple: send a POST request to the <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-batch-dns-records"><u>new API endpoint</u></a> /dns_records/batch with a JSON object in the body, in the following format:</p>
            <pre><code>{
    deletes:[]Record
    patches:[]Record
    puts:[]Record
    posts:[]Record
}
</code></pre>
            <p>Each list of records []Record will follow the same requirements as the regular API, except that the record ID on deletes, patches, and puts will be required within the Record object itself. Here is a simple example:</p>
            <pre><code>{
    "deletes": [
        {
            "id": "143004ef463b464a504bde5a5be9f94a"
        },
        {
            "id": "165e9ef6f325460c9ca0eca6170a7a23"
        }
    ],
    "patches": [
        {
            "id": "16ac0161141a4e62a79c50e0341de5c6",
            "content": "192.0.2.45"
        },
        {
            "id": "6c929ea329514731bcd8384dd05e3a55",
            "name": "update.example.com",
            "proxied": true
        }
    ],
    "puts": [
        {
            "id": "ee93eec55e9e45f4ae3cb6941ffd6064",
            "content": "192.0.2.50",
            "name": "no-change.example.com",
            "proxied": false,
            "ttl": 1
        },
        {
            "id": "eab237b5a67e41319159660bc6cfd80b",
            "content": "192.0.2.45",
            "name": "no-change.example.com",
            "proxied": false,
            "ttl": 3000
        }
    ],
    "posts": [
        {
            "name": "@",
            "type": "A",
            "content": "192.0.2.45",
            "proxied": false,
            "ttl": 3000
        },
        {
            "name": "a.example.com",
            "type": "A",
            "content": "192.0.2.45",
            "proxied": true
        }
    ]
}</code></pre>
            <p>Our API will then parse this and execute these calls in the following order: </p><ol><li><p>deletes</p></li><li><p>patches</p></li><li><p>puts</p></li><li><p>posts</p></li></ol><p>Each of these respective lists will be executed in the order given. This ordering system is important because it removes the need for our clients to worry about conflicts, such as if they need to create a CNAME on the same hostname as a to-be-deleted A record, which is not allowed in <a href="https://datatracker.ietf.org/doc/html/rfc1912#section-2.4"><u>RFC 1912</u></a>. In the event that any of these individual actions fail, the entire API call will fail and return the first error it sees. The batch request will also be executed inside a single database <a href="https://en.wikipedia.org/wiki/Database_transaction"><u>transaction</u></a>, which will roll back in the event of failure.</p><p>After the batch request has been successfully executed in our database, we then propagate the changes to our edge via <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed KV store. Each of the individual record changes inside the batch request is treated as a single key-value pair, and database transactions are not supported. As such, <b>we cannot guarantee that the propagation to our edge servers will be atomic</b>. For example, if replacing a <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/subdomains-outside-cloudflare/"><u>delegation</u></a> with an A record, some resolvers may see the <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/dns-ns-record/"><u>NS</u></a> record removed before the A record is added. </p><p>The response will follow the same format as the request. 
Patches and puts that result in no changes will be placed at the end of their respective lists.</p><p>We are also introducing some new changes to the Cloudflare dashboard, allowing users to select multiple records and subsequently:</p><ol><li><p>Delete all selected records</p></li><li><p>Change the proxy status of all selected records</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZU7nvMlcH2L51IqJrS1zC/db7ac600e503a72bb0c25679d63394e7/BLOG-2495_2.png" />
          </figure><p>We plan to continue improving the dashboard to support more batch actions based on your feedback.</p>
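<p>Assembling a request body like the example above can be sanity-checked client-side before sending it. Here is a minimal, hypothetical sketch in Python (build_batch_payload is not part of any official Cloudflare SDK; it simply mirrors the body format and the ID requirement described above):</p>

```python
# Illustrative client-side sketch of assembling a /batch request body.
# build_batch_payload is a hypothetical helper, not an official client;
# the "id" requirement mirrors the API description in this post.

def build_batch_payload(deletes=None, patches=None, puts=None, posts=None):
    """Assemble the JSON body for POST /dns_records/batch."""
    deletes = deletes or []
    patches = patches or []
    puts = puts or []
    posts = posts or []
    # The record ID is required inside each record object for deletes,
    # patches, and puts; posts create new records, so they carry no ID.
    for action, records in (("deletes", deletes), ("patches", patches), ("puts", puts)):
        for record in records:
            if "id" not in record:
                raise ValueError(f"every record in '{action}' needs an 'id'")
    # The server executes the lists in a fixed order:
    # deletes, then patches, then puts, then posts.
    return {"deletes": deletes, "patches": patches, "puts": puts, "posts": posts}

payload = build_batch_payload(
    deletes=[{"id": "143004ef463b464a504bde5a5be9f94a"}],
    posts=[{"name": "a.example.com", "type": "A", "content": "192.0.2.45"}],
)
```

<p>Because the server applies the four lists in a fixed order, the client only needs to put each change in the right list; it does not have to sequence the changes itself.</p>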
    <div>
      <h2>The journey</h2>
      <a href="#the-journey">
        
      </a>
    </div>
    <p>Although on the surface this batch endpoint may seem like a fairly simple change, behind the scenes it is the culmination of a multi-year, multi-team effort. Over the past several years, we have been working hard to improve the DNS pipeline that takes our customers' records and pushes them to <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed database. As part of this effort, we have been improving our <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-list-dns-records"><u>DNS Records API</u></a> to reduce the overall latency. The DNS Records API is Cloudflare's most used API externally, serving twice as many requests as any other API at peak. In addition, the DNS Records API supports over 20 internal services, many of which touch very important areas such as DNSSEC, TLS, Email, Tunnels, Workers, Spectrum, and R2 storage. Therefore, it was important to build something that scales. </p><p>To improve API performance, we first needed to understand the complexities of the entire stack. At Cloudflare, we use <a href="https://www.jaegertracing.io/"><u>Jaeger tracing</u></a> to debug our systems. It gives us granular insights into a sample of the requests coming into our APIs. When looking at API request latency, the <a href="https://www.jaegertracing.io/docs/1.23/architecture/#span"><u>span</u></a> that stood out was the time spent on each individual database lookup. The latency here can vary anywhere from ~1ms to ~5ms. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61f3sKGUs9oWMPT9P4au6R/a91d8291b626f4bab3ac1c69adf62a5d/BLOG-2495_3.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3L3OaTb9cTKKKcIjCm1RLq/86ffd63116988025fd52105e316c5b5a/BLOG-2495_4.png" />
          </figure><p><sub><i>Jaeger trace showing variable database latency</i></sub></p><p>Given this variability in database query latency, we wanted to understand exactly what was going on within each DNS Records API request. When we first started on this journey, the breakdown of database lookups for each action was as follows:</p><table><tr><th><p><b>Action</b></p></th><th><p><b>Database Queries</b></p></th><th><p><b>Reason</b></p></th></tr><tr><td><p>POST</p></td><td><p>2 </p></td><td><p>One to write and one to read the new record.</p></td></tr><tr><td><p>PUT</p></td><td><p>3</p></td><td><p>One to collect, one to write, and one to read back the new record.</p></td></tr><tr><td><p>PATCH</p></td><td><p>3</p></td><td><p>One to collect, one to write, and one to read back the new record.</p></td></tr><tr><td><p>DELETE</p></td><td><p>2</p></td><td><p>One to read and one to delete.</p></td></tr></table><p>The reason we needed to read the newly created records on POST, PUT, and PATCH was because the record contains information filled in by the database which we cannot infer in the API. </p><p>Let’s imagine that a customer needed to edit 1,000 records. If each database lookup took 3ms to complete, that was 3ms * 3 lookups * 1,000 records = 9 seconds spent on database queries alone, not taking into account the round trip time to and from our API or any other processing latency. It’s clear that we needed to reduce the number of overall queries and ideally minimize per query latency variation. Let’s tackle the variation in latency first.</p><p>Each of these calls is not a simple INSERT, UPDATE, or DELETE, because we have functions wrapping these database calls for sanitization purposes. In order to understand the variable latency, we enlisted the help of <a href="https://www.postgresql.org/docs/current/auto-explain.html"><u>PostgreSQL’s “auto_explain”</u></a>. 
This module gives a breakdown of execution times for each statement without needing to EXPLAIN each one by hand. We used the following settings:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2myvmIREh2Q9yl30HbRus/29f085d40ba7dde34e9a46c27e3c6ba2/BLOG-2495_5.png" />
          </figure><p>A handful of queries showed durations like the one below, which took an order of magnitude longer than other queries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/557xg66x8OiHM6pcAG4svk/56157cd0e5b6d7fd47f0152798598729/BLOG-2495_6.png" />
          </figure><p>We noticed that in several locations we were doing queries like:</p><p><code>IF (EXISTS (SELECT id FROM table WHERE row_hash = __new_row_hash))</code></p><p>If you are trying to insert into very large zones, such queries could mean even longer database query times, potentially explaining the discrepancy between 1ms and 5ms in our tracing images above. Upon further investigation, we found that we already had a unique index on that exact hash. <a href="https://www.postgresql.org/docs/current/indexes-unique.html"><u>Unique indexes</u></a> in PostgreSQL enforce the uniqueness of one or more column values, which means we can safely remove those existence checks without risk of inserting duplicate rows.</p><p>The next task was to introduce database batching into our DNS Records API. In any API, external calls such as SQL queries are going to add substantial latency to the request. Database batching allows the DNS Records API to execute multiple SQL queries within a single network call, subsequently lowering the number of database round trips our system needs to make. </p><p>According to the table above, each database write also corresponded to a read after it had completed the query. This was needed to collect information like creation/modification timestamps and new IDs. To improve this, we tweaked our database functions to now return the newly created DNS record itself, removing a full round trip to the database. 
Here is the updated table:</p><table><tr><th><p><b>Action</b></p></th><th><p><b>Database Queries</b></p></th><th><p><b>Reason</b></p></th></tr><tr><td><p>POST</p></td><td><p>1</p></td><td><p>One to write.</p></td></tr><tr><td><p>PUT</p></td><td><p>2</p></td><td><p>One to read, one to write.</p></td></tr><tr><td><p>PATCH</p></td><td><p>2</p></td><td><p>One to read, one to write.</p></td></tr><tr><td><p>DELETE</p></td><td><p>2</p></td><td><p>One to read, one to delete.</p></td></tr></table><p>We have room for improvement here; however, we cannot easily reduce this further due to some restrictions around auditing and other sanitization logic.</p><p><b>Results:</b></p><table><tr><th><p><b>Action</b></p></th><th><p><b>Average database time before</b></p></th><th><p><b>Average database time after</b></p></th><th><p><b>Percentage Decrease</b></p></th></tr><tr><td><p>POST</p></td><td><p>3.38ms</p></td><td><p>0.967ms</p></td><td><p>71.4%</p></td></tr><tr><td><p>PUT</p></td><td><p>4.47ms</p></td><td><p>2.31ms</p></td><td><p>48.4%</p></td></tr><tr><td><p>PATCH</p></td><td><p>4.41ms</p></td><td><p>2.24ms</p></td><td><p>49.3%</p></td></tr><tr><td><p>DELETE</p></td><td><p>1.21ms</p></td><td><p>1.21ms</p></td><td><p>0%</p></td></tr></table><p>These are some pretty good improvements! Not only did we reduce the API latency, but we also reduced the database query load, benefiting other systems as well.</p>
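<p>The back-of-the-envelope arithmetic above generalizes to any action. A quick sketch, using the query counts from the two tables (the per-query latency here is illustrative):</p>

```python
# Model of total sequential database time per bulk edit, generalizing
# the 3 ms * 3 lookups * 1,000 records = 9 s example above. Query
# counts come from the before/after tables; latency is illustrative.

QUERIES_BEFORE = {"POST": 2, "PUT": 3, "PATCH": 3, "DELETE": 2}
QUERIES_AFTER = {"POST": 1, "PUT": 2, "PATCH": 2, "DELETE": 2}

def total_db_seconds(action, num_records, per_query_ms, queries_per_action):
    """Total database time if every query is issued one at a time."""
    return queries_per_action[action] * num_records * per_query_ms / 1000.0

# Editing 1,000 records with PUT at 3 ms per query:
before_s = total_db_seconds("PUT", 1000, 3, QUERIES_BEFORE)  # 9.0 seconds
after_s = total_db_seconds("PUT", 1000, 3, QUERIES_AFTER)    # 6.0 seconds
```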
    <div>
      <h2>Weren’t we talking about batching?</h2>
      <a href="#werent-we-talking-about-batching">
        
      </a>
    </div>
    <p>I previously mentioned that the /batch endpoint is fully atomic, making use of a single database transaction. However, a single transaction may still require multiple database network calls, and from the table above, that can add up to a significant amount of time when dealing with large batches. To optimize this, we are making use of <a href="https://pkg.go.dev/github.com/jackc/pgx/v4#Batch"><u>pgx/batch</u></a>, a Golang object that allows us to write and subsequently read multiple queries in a single network call. Here is a high-level overview of how the batch endpoint works:</p><ol><li><p>Collect all the records for the PUTs, PATCHes and DELETEs.</p></li><li><p>Apply any per-record differences as requested by the PATCHes and PUTs.</p></li><li><p>Format the batch SQL query to include each of the actions.</p></li><li><p>Execute the batch SQL query in the database.</p></li><li><p>Parse each database response and return any errors if needed.</p></li><li><p>Audit each change.</p></li></ol><p>This takes at most two database calls per batch: one to fetch, and one to write/delete. If the batch contains only POSTs, this will be further reduced to a single database call. Given all of this, we should expect to see a significant improvement in latency when making multiple changes, which we do when observing how these various endpoints perform: </p><p><i>Note: Each of these queries was run from multiple locations around the world and the median of the response times is shown here. The server responding to queries is located in Portland, Oregon, United States. Latencies are subject to change depending on geographical location.</i></p><p><b>Create only:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.55s</p></td><td><p>74.23s</p></td><td><p>757.32s</p></td><td><p>7,877.14s</p></td></tr><tr><td><p><b>Batch API - Without database batching</b></p></td><td><p>0.85s</p></td><td><p>1.47s</p></td><td><p>4.32s</p></td><td><p>16.58s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.67s</p></td><td><p>1.21s</p></td><td><p>3.09s</p></td><td><p>10.33s</p></td></tr></table><p><b>Delete only:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.28s</p></td><td><p>67.35s</p></td><td><p>658.11s</p></td><td><p>7,471.30s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.79s</p></td><td><p>1.32s</p></td><td><p>3.18s</p></td><td><p>17.49s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.66s</p></td><td><p>0.78s</p></td><td><p>1.68s</p></td><td><p>7.73s</p></td></tr></table><p><b>Create/Update/Delete:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.11s</p></td><td><p>72.41s</p></td><td><p>715.36s</p></td><td><p>7,298.17s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.79s</p></td><td><p>1.36s</p></td><td><p>3.05s</p></td><td><p>18.27s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.74s</p></td><td><p>1.06s</p></td><td><p>2.17s</p></td><td><p>8.48s</p></td></tr></table><p><b>Overall Average:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.31s</p></td><td><p>71.33s</p></td><td><p>710.26s</p></td><td><p>7,548.87s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.81s</p></td><td><p>1.38s</p></td><td><p>3.51s</p></td><td><p>17.44s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.69s</p></td><td><p>1.02s</p></td><td><p>2.31s</p></td><td><p>8.85s</p></td></tr></table><p>We can see that on average, the new batching API is significantly faster than the regular API trying to do the same actions, and it’s also nearly twice as fast as the batching API without batched database calls. At 10,000 records, the batching API is a staggering 850x faster than the regular API. As mentioned above, these numbers are likely to change for a number of different reasons, but it’s clear that making several round trips to and from the API adds substantial latency, regardless of the region.</p>
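<p>Why collapsing many statements into one network call helps so much can be shown with a toy model. This sketch only counts round trips; it is not how pgx implements batching:</p>

```python
# Toy model of unbatched vs. batched query execution. A real batching
# implementation (e.g. pgx's Batch in Go) queues statements and sends
# them to PostgreSQL in a single network call; here we simply count
# round trips to show why that matters when each trip is expensive.

class FakeConnection:
    """Counts network round trips; each trip costs round_trip_ms."""

    def __init__(self, round_trip_ms):
        self.round_trip_ms = round_trip_ms
        self.round_trips = 0

    def execute(self, statement):
        # One statement per network call.
        self.round_trips += 1
        return self.round_trip_ms

    def execute_batch(self, statements):
        # All queued statements sent in a single network call.
        self.round_trips += 1
        return self.round_trip_ms

statements = ["UPDATE dns_records SET content = $1 WHERE id = $2"] * 1000

conn = FakeConnection(round_trip_ms=50)
unbatched_ms = sum(conn.execute(s) for s in statements)  # 1,000 round trips
conn_batched = FakeConnection(round_trip_ms=50)
batched_ms = conn_batched.execute_batch(statements)      # 1 round trip
```

<p>With a 50 ms round trip, the unbatched model spends 50 seconds purely on network trips, while the batched model spends 50 ms; the ratio grows linearly with the number of statements.</p>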
    <div>
      <h2>Batch overload</h2>
      <a href="#batch-overload">
        
      </a>
    </div>
    <p>Making our API faster is awesome, but we don’t operate in an isolated environment. Each of these records needs to be processed and pushed to <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed database. If we have customers creating tens of thousands of records every 10 seconds, we need to be able to handle this downstream so that we don’t overwhelm our system. In a May 2022 blog post titled <a href="https://blog.cloudflare.com/dns-build-improvement"><i><u>How we improved DNS record build speed by more than 4,000x</u></i></a>, I noted that:</p><blockquote><p><i>We plan to introduce a batching system that will collect record changes into groups to minimize the number of queries we make to our database and Quicksilver.</i></p></blockquote><p>This task has since been completed, and our propagation pipeline is now able to batch thousands of record changes into a single database query, which can then be published to Quicksilver and propagated to our global network.</p>
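<p>The grouping described in that quote can be sketched as a simple chunking step (the batch size here is arbitrary and the real pipeline is more involved, but the core idea is the same):</p>

```python
# Sketch of collecting record changes into fixed-size groups before
# publishing them, in the spirit of the propagation batching described
# above. The batch size is arbitrary here; the real pipeline groups
# thousands of changes per database query.

def chunk(changes, batch_size):
    """Yield successive groups of at most batch_size changes."""
    for start in range(0, len(changes), batch_size):
        yield changes[start:start + batch_size]

changes = [{"record_id": i, "op": "upsert"} for i in range(2500)]
batches = list(chunk(changes, batch_size=1000))  # 3 groups: 1000, 1000, 500
```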
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We have a couple more improvements we may want to bring to the API, and we intend to keep improving the dashboard UI to make zone management even easier. <a href="https://research.rallyuxr.com/cloudflare/lp/cm0zu2ma7017j1al98l1m8a7n?channel=share&amp;studyId=cm0zu2ma4017h1al9byak79iw"><u>We would love to hear your feedback</u></a>, so please let us know what you think and if you have any suggestions for improvements.</p><p>For more details on how to use the new /batch API endpoint, head over to our <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/batch-record-changes/"><u>developer documentation</u></a> and <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-batch-dns-records"><u>API reference</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Database]]></category>
            <guid isPermaLink="false">op0CI3wllMcGjptdRb2Ce</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Connection errors in Asia Pacific region on July 9, 2023]]></title>
            <link>https://blog.cloudflare.com/connection-errors-in-asia-pacific-region-on-july-9-2023/</link>
            <pubDate>Tue, 11 Jul 2023 08:48:13 GMT</pubDate>
            <description><![CDATA[ On July 9, 2023, users in the Asia Pacific region experienced connection errors due to origin DNS resolution failures to .com and .net TLD nameservers ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4XSclbffVsXyJvs6H28PzZ/8ca3e3e580eecf4e762af00eb94eb8d4/image2-5.png" />
            
            </figure><p>On Sunday, July 9, 2023, early morning UTC time, we observed a high number of DNS resolution failures — up to 7% of all DNS queries across the Asia Pacific region — caused by invalid DNSSEC signatures from Verisign .com and .net Top Level Domain (TLD) nameservers. This resulted in connection errors for visitors of Internet properties on Cloudflare in the region.</p><p>The local instances of Verisign’s nameservers started to respond with expired DNSSEC signatures in the Asia Pacific region. In order to remediate the impact, we have rerouted upstream DNS queries towards Verisign to locations on the US west coast which are returning valid signatures.</p><p>We have already reached out to Verisign to get more information on the root cause. Until their issues have been resolved, we will keep our DNS traffic to .com and .net TLD nameservers rerouted, which might cause slightly increased latency for the first visitor to domains under .com and .net in the region.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>In order to proxy a domain’s traffic through Cloudflare’s network, there are two components involved with respect to the Domain Name System (DNS) from the perspective of a Cloudflare data center: external DNS resolution, and upstream or origin DNS resolution.</p><p>To understand this, let’s look at the domain name <code>blog.cloudflare.com</code> — which is proxied through Cloudflare’s network — and let’s assume it is configured to use <code>origin.example</code> as the origin server:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nnwNUFQmflIHHioGDISlg/250c388a2d796d0dc8139b4eddda6c05/image5-1.png" />
            
            </figure><p>Here, the external DNS resolution is the part where DNS queries to <code>blog.cloudflare.com</code> sent by public resolvers like <code>1.1.1.1</code> or <code>8.8.8.8</code> on behalf of a visitor return a set of Cloudflare Anycast IP addresses. This ensures that the visitor’s browser knows where to send an HTTPS request to load the website. Under the hood your browser performs a DNS query that looks something like this (the trailing dot indicates the <a href="https://en.wikipedia.org/wiki/DNS_root_zone">DNS root zone</a>):</p>
            <pre><code>$ dig blog.cloudflare.com. +short
104.18.28.7
104.18.29.7</code></pre>
            <p>(Your computer doesn’t actually use the dig command internally; we’ve used it to illustrate the process.) When the next-closest Cloudflare data center receives the HTTPS request for blog.cloudflare.com, it needs to fetch the content from the origin server (assuming it is not cached).</p><p>There are two ways Cloudflare can reach the origin server. If the DNS settings in Cloudflare contain IP addresses, then we can connect directly to the origin. In some cases, our customers use a CNAME, which means Cloudflare has to perform another DNS query to get the IP addresses associated with the CNAME. In the example above, <code>blog.cloudflare.com</code> has a CNAME record instructing us to look at <code>origin.example</code> for IP addresses. During the incident, only customers with CNAME records like this pointing to .com and .net domain names may have been affected.</p><p>The domain <code>origin.example</code> needs to be resolved by Cloudflare as part of the upstream or origin DNS resolution. This time, the Cloudflare data center needs to perform an outbound DNS query that looks like this:</p>
            <pre><code>$ dig origin.example. +short
192.0.2.1</code></pre>
            <p>DNS is a hierarchical protocol, so the recursive DNS resolver, which usually handles DNS resolution for whoever wants to resolve a <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain name</a>, needs to talk to several involved nameservers until it finally gets an answer from the authoritative nameservers of the domain (assuming no DNS responses are cached). This is the same process during the external DNS resolution and the origin DNS resolution. Here is an example for the origin DNS resolution:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7E1GLN7i8qGi3oB6Zentug/8b55136d3c67a79d0d9c711c428911b4/image6-1.png" />
            
            </figure><p>Inherently, DNS is a public system that was initially published without any means to validate the integrity of the DNS traffic. So in order to prevent someone from spoofing DNS responses, <a href="/dnssec-an-introduction/">DNS Security Extensions (DNSSEC)</a> were introduced as a means to authenticate that DNS responses really come from who they claim to come from. This is achieved by generating cryptographic signatures alongside existing DNS records like A, AAAA, MX, CNAME, etc. By validating a DNS record’s associated signature, it is possible to verify that a requested DNS record really comes from its authoritative nameserver and wasn’t altered en-route. If a signature cannot be validated successfully, recursive resolvers usually return an error indicating the invalid signature. This is exactly what happened on Sunday.</p>
    <div>
      <h3>Incident timeline and impact</h3>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>On Saturday, July 8, 2023, at 21:10 UTC, our logs showed the first instances of DNSSEC validation errors occurring during upstream DNS resolution from multiple Cloudflare data centers in the Asia Pacific region. It appeared that DNS responses of type NSEC3 (a DNSSEC mechanism to <a href="/black-lies/">prove non-existing DNS records</a>) from the .com and .net TLD nameservers included invalid signatures. About an hour later at 22:16 UTC, the first internal alerts went off (alerts only fire once issues persist for a certain period of time), but error rates were still low, at around 0.5% of all upstream DNS queries.</p><p>After several hours, the error rate had increased to the point where ~13% of our upstream DNS queries in affected locations were failing. This percentage continued to fluctuate over the duration of the incident, ranging between 10-15% of upstream DNS queries and roughly 5-7% of all DNS queries (external &amp; upstream resolution) to affected Cloudflare data centers in the Asia Pacific region.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/acWvj718KdxYfGx33fBZT/7ee25b63cf83ee9ff18e6734aeb1cc3e/image1-6.png" />
            
            </figure><p>Initially it appeared as though only a single upstream nameserver was having issues with DNS resolution, however upon further investigation it was discovered that the issue was more widespread. Ultimately, we were able to verify that the Verisign nameservers for .com and .net were returning expired DNSSEC signatures on a portion of DNS queries in the Asia Pacific region. Based on our tests, other nameserver locations were correctly returning valid DNSSEC signatures.</p><p>In response, we rerouted our DNS traffic to the .com and .net TLD nameserver IP addresses to Verisign’s US west locations. After this change was propagated, the issue very quickly subsided and origin resolution error rates returned to normal levels.</p><p>All times are in UTC:</p><p><b>2023-07-08 21:10</b> First instances of DNSSEC validation errors appear in our logs for origin DNS resolution.</p><p><b>2023-07-08 22:16</b> First internal alerts for Asia Pacific data centers go off indicating origin DNS resolution error on our test domain. Very few errors for other domains at this point.</p><p><b>2023-07-09 02:58</b> Error rates have increased substantially since the first instance. An incident is declared.</p><p><b>2023-07-09 03:28</b> DNSSEC validation issues seem to be isolated to a single upstream provider. It is not abnormal that a single upstream has issues that propagate back to us, and in this case our logs were predominantly showing errors from domains that resolve to this specific upstream.</p><p><b>2023-07-09 04:52</b> We realize that DNSSEC validation issues are more widespread and affect multiple .com and .net domains. Validation issues continue to be isolated to the Asia Pacific region only, and appear to be intermittent.</p><p><b>2023-07-09 05:15</b> DNS queries via popular recursive resolvers like 8.8.8.8 and 1.1.1.1 do not return invalid DNSSEC signatures at this point. 
DNS queries using the local stub resolver continue to return DNSSEC errors.</p><p><b>2023-07-09 06:24</b> Responses from .com and .net Verisign nameservers in Singapore contain expired DNSSEC signatures, but responses from Verisign TLD nameservers in other locations are fine.</p><p><b>2023-07-09 06:41</b> We contact Verisign and notify them about expired DNSSEC signatures.</p><p><b>2023-07-09 06:50</b> In order to remediate the impact, we reroute DNS traffic via IPv4 for the .com and .net Verisign nameserver IPs to US west IPs instead. This immediately leads to a substantial drop in the error rate.</p><p><b>2023-07-09 07:06</b> We also reroute DNS traffic via IPv6 for the .com and .net Verisign nameserver IPs to US west IPs. This leads to the error rate going down to zero.</p><p><b>2023-07-10 09:23</b> The rerouting is still in place, but the underlying issue with expired signatures in the Asia Pacific region has still not been resolved.</p><p><b>2023-07-10 18:23</b> Verisign gets back to us confirming that they “were serving stale data” at their local site and have resolved the issues.</p>
    <div>
      <h3>Technical description of the error and how it happened</h3>
      <a href="#technical-description-of-the-error-and-how-it-happened">
        
      </a>
    </div>
    <p>As mentioned in the introduction, the underlying cause of this incident was expired DNSSEC signatures for the .net and .com zones. Expired DNSSEC signatures will cause a DNS response to be interpreted as invalid. There are two main scenarios in which this error was observed by a user:</p><ol><li><p><a href="https://developers.cloudflare.com/dns/cname-flattening/">CNAME flattening</a> for external DNS resolution. This is when our authoritative nameservers attempt to return the IP address(es) that a CNAME record target resolves to rather than the CNAME record itself.</p></li><li><p>CNAME target lookup for origin DNS resolution. This is most commonly used when an HTTPS request is sent to a Cloudflare anycast IP address, and we need to determine what IP address to forward the request to. See <a href="https://developers.cloudflare.com/fundamentals/get-started/concepts/how-cloudflare-works/">How Cloudflare works</a> for more details.</p></li></ol><p>In both cases, behind the scenes the DNS query goes through an in-house recursive DNS resolver to look up what a hostname resolves to. The purpose of this resolver is to cache queries, optimize future queries, and provide DNSSEC validation. If the query from this resolver fails for whatever reason, our authoritative DNS system will not be able to perform the two scenarios outlined above.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qE0JFXaLHPwt3orOsNBm5/37d2d34d396d4cc4c2a22b7241e4120f/image3-1.png" />
            
            </figure><p>During the incident, the recursive resolver was failing to validate the DNSSEC signatures in DNS responses for domains ending in .com and .net. These signatures are sent in the form of an RRSIG record alongside the other DNS records they cover. Together they form a Resource Record set (RRset). Each RRSIG record contains the following fields:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aZjsnpM6WSE70sPrnkHbr/6af366bd2accafc3f06296e241ceaba5/image4.png" />
            
            </figure><p>As you can see, the main part of the payload is associated with the signature and its corresponding metadata. Each recursive resolver is responsible for checking not only the signature itself but also its expiration time. Respecting the expiration time is important in order to avoid returning responses for RRsets that have been signed by old keys, which could have been compromised by that time.</p><p>During this incident, Verisign, the authoritative operator for the .com and .net TLD zones, was occasionally returning expired signatures in its DNS responses in the Asia Pacific region. As a result, our recursive resolver was not able to validate the corresponding RRsets. Ultimately, this meant that a percentage of DNS queries returned SERVFAIL as the response code to our authoritative nameserver. This in turn caused connection issues for users trying to connect to a domain on Cloudflare, because we weren’t able to resolve the upstream targets of the affected domain names and thus didn’t know where to send proxied HTTPS requests.</p>
    <div>
      <h3>Remediation and follow-up steps</h3>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>Once we had identified the root cause, we started to look at different ways to remedy the issue. We came to the conclusion that the fastest way to work around this very regionalized issue was to stop using the local route to Verisign's nameservers. This means that, at the time of writing this, our outgoing DNS traffic towards Verisign's nameservers in the Asia Pacific region now traverses the Pacific and ends up at the US west coast, rather than being served by closer nameservers. Due to the nature of DNS and the important role of DNS caching, this has less impact than you might initially expect. Frequently looked up names will be cached after the first request, and this only needs to happen once per data center, as we share and pool the local DNS recursor caches to maximize their efficiency.</p><p>Ideally, we would have been able to fix the issue right away, as it potentially affected others in the region too, not just our customers. We will therefore work diligently to improve our incident communication channels with other providers in order to ensure that the DNS ecosystem remains robust against issues such as this. Being able to quickly get hold of other providers that can take action is vital when urgent issues like these arise.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>This incident <a href="/october-2021-facebook-outage/">once again</a> shows how impactful DNS failures are and how crucial this technology is for the Internet. We will investigate how we can improve our systems to detect and resolve issues like this more quickly and efficiently should they occur again in the future.</p><p>While Cloudflare was not the cause of this issue, and we are certain that we were not the only ones affected by this, we are still sorry for the disruption to our customers and to all the users who were unable to access Internet properties during this incident.</p><p><b>Update</b>: On Tue Jul 11 22:24:21 UTC 2023, Verisign posted an <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2023-July/022174.html">announcement</a>, providing more details:</p><blockquote><p><i>Last week, during a migration of one of our DNS resolution sites in Singapore, from one provider to another, we unexpectedly lost management access and the ability to deliver changes and DNS updates to the site. Following our standard procedure, we disabled all transit links to the affected site. Unfortunately, a peering router remained active, which was not immediately obvious to our teams due to the lack of connectivity there.</i></p></blockquote><blockquote><p><i>Over the weekend, this caused an issue that may have affected the ability of some internet users in the region to reach some .com and .net domains, as DNSSEC signatures on the site began expiring. The issue was resolved by powering off the site’s peering router, causing the anycast route announcement to be withdrawn and traffic to be directed to other sites.</i></p></blockquote><blockquote><p><i>We are updating our processes and procedures and will work to prevent such issues from recurring in the future.</i></p></blockquote><blockquote><p><i>The Singapore site is part of a highly redundant constellation of more than 200 sites that make up our global network.
This issue had no effect on the core resolution of .com and .net resolution globally. We apologize to those who may have been affected.</i></p></blockquote><p></p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">3ZlvMILKrfS2Z4IQ0qumTD</guid>
            <dc:creator>Christian Elmerot</dc:creator>
            <dc:creator>Alex Fattouche</dc:creator>
            <dc:creator>Hannes Gerhart</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we improved DNS record build speed by more than 4,000x]]></title>
            <link>https://blog.cloudflare.com/dns-build-improvement/</link>
            <pubDate>Wed, 25 May 2022 12:59:04 GMT</pubDate>
            <description><![CDATA[ How we redesigned our DNS pipeline to significantly improve DNS propagation speed across all zones. ]]></description>
            <content:encoded><![CDATA[ <p>Since my previous blog about <a href="/secondary-dns-deep-dive/">Secondary DNS</a>, <a href="https://www.cloudflare.com/dns/">Cloudflare's DNS</a> traffic has more than doubled from 15.8 trillion DNS queries per month to 38.7 trillion. Our network now spans over 270 cities in over 100 countries, interconnecting with more than 10,000 networks globally. According to <a href="https://w3techs.com/technologies/overview/dns_server">w3 stats</a>, “Cloudflare is used as a DNS server provider by 15.3% of all the websites.” This means we have an enormous responsibility to serve <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> in the fastest and most reliable way possible.</p><p>Although the response time on DNS queries is the most important performance metric, there is another metric that sometimes goes unnoticed: DNS record propagation time, which is how long it takes for changes submitted to our API to be reflected in our DNS query responses. Every millisecond counts here, as it allows customers to change configuration quickly, making their systems much more agile. Although our DNS propagation pipeline was already known to be very fast, we had identified several improvements that, if implemented, would massively improve performance. In this blog post, I’ll explain how we managed to drastically improve our DNS record propagation speed, and the impact it has on our customers.</p>
    <div>
      <h3>How DNS records are propagated</h3>
      <a href="#how-dns-records-are-propagated">
        
      </a>
    </div>
    <p>Cloudflare uses a multi-stage pipeline that takes our customers’ DNS record changes and pushes them to our global network, so they are available all over the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/621S7OYC5i9VEHUMnpruti/322ede366828e0669c16443d861608f8/image3-37.png" />
            
            </figure><p>The steps shown in the diagram above are:</p><ol><li><p>Customer makes a change to a record via our DNS Records API (or UI).</p></li><li><p>The change is persisted to the database.</p></li><li><p>The database event triggers a Kafka message which is consumed by the Zone Builder.</p></li><li><p>The Zone Builder takes the message, collects the contents of the zone from the database and pushes it to Quicksilver, our distributed KV store.</p></li><li><p>Quicksilver then propagates this information to the network.</p></li></ol><p>Of course, this is a simplified version of what is happening. In reality, our API receives thousands of requests per second. All POST/PUT/PATCH/DELETE requests ultimately result in a DNS record change. Each of these changes needs to be actioned so that the information we show through our API and in the <a href="https://dash.cloudflare.com/?to=/:account/:zone/dns">Cloudflare dashboard</a> is eventually consistent with the information we use to respond to DNS queries.</p><p>Historically, one of the largest bottlenecks in the DNS propagation pipeline was the Zone Builder, shown in step 4 above. Responsible for collecting and organizing records to be written to our global network, our Zone Builder often ate up most of the propagation time, especially for larger zones. As we continue to scale, it is important for us to remove any bottlenecks that may exist in our systems, and this was clearly identified as one such bottleneck.</p>
    <div>
      <h3>Growing pains</h3>
      <a href="#growing-pains">
        
      </a>
    </div>
    <p>When the pipeline shown above was <a href="/how-we-made-our-dns-stack-3x-faster/">first announced</a>, the Zone Builder received somewhere between 5 and 10 DNS record changes per second. Although the Zone Builder at the time was a massive improvement on the previous system, it was not going to last long given the growth that Cloudflare was and still is experiencing. Fast-forward to today, we receive on average 250 DNS record changes per second, a staggering 25x growth from when the Zone Builder was first announced.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MlSZs6szqObivtM8cC1rV/3b50129c5d7a2691e893242d1b683a0a/image4-30.png" />
            
            </figure><p>The way that the Zone Builder was initially designed was quite simple. When a zone changed, the Zone Builder would grab all the records from the database for that zone and compare them with the records stored in Quicksilver. Any differences were fixed to maintain consistency between the database and Quicksilver.</p><p>This is known as a full build. Full builds work great because each DNS record change corresponds to one zone change event. This means that multiple events can be batched and subsequently dropped if needed. For example, if a user makes 10 changes to their zone, this will result in 10 events. Since the Zone Builder grabs all the records for the zone anyway, there is no need to build the zone 10 times. We just need to build it once after the final change has been submitted.</p><p>What happens if the zone contains one million records or 10 million records? This is a very real problem, because not only is Cloudflare scaling, but our customers are scaling with us. Today, our largest zone has millions of records. Although our database is optimized for performance, even one full build containing one million records took up to <b>35 seconds</b>, largely due to database query latency. In addition, when the Zone Builder compares the zone contents with the records stored in Quicksilver, we need to fetch all the records from Quicksilver for the zone, adding time. However, the impact doesn’t stop at the single customer. This also eats up more resources from other services reading from the database and slows down the rate at which our Zone Builder can build other zones.</p>
    <div>
      <h3>Per-record build: a new build type</h3>
      <a href="#per-record-build-a-new-build-type">
        
      </a>
    </div>
    <p>Many of you might already have the solution to this problem in your head:</p><p><i>Why doesn’t the Zone Builder just query the database for the record that has changed and propagate just the single record?</i></p><p>Of course this is the correct solution, and the one we eventually ended up at. However, the road to get there was not as simple as it might seem.</p><p>Firstly, our database uses a series of functions that, at zone touch time, create a PostgreSQL Queue (PGQ) event that ultimately gets turned into a Kafka event. Initially, we had no distinction for individual record events, which meant our Zone Builder had no idea what had actually changed until it queried the database.</p><p>Next, the Zone Builder is still responsible for DNS zone settings in addition to records. Some examples of DNS zone settings include custom nameserver control and DNSSEC control. As a result, our Zone Builder needed to be aware of specific build types to ensure that they don’t step on each other. Furthermore, per-record builds cannot be batched in the same way that zone builds can because each event needs to be actioned separately.</p><p>As a result, a brand new scheduling system needed to be written. Lastly, Quicksilver interaction needed to be re-written to account for the different types of schedulers. These issues can be broken down as follows:</p><ol><li><p>Create a new Kafka event pipeline for record changes that contain information about the changed record.</p></li><li><p>Separate the Zone Builder into a new type of scheduler that implements some defined scheduler interface.</p></li><li><p>Implement the per-record scheduler to read events one by one in the correct order.</p></li><li><p>Implement the new Quicksilver interface for the per-record scheduler.</p></li></ol><p>Below is a high level diagram of how the new Zone Builder looks internally with the new scheduler types.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23Y8EYi1qubiNN16dVq62j/338ca37bd56d92d157eb09c016841f9c/image6-20.png" />
            
            </figure><p>It is critically important that we lock between these two schedulers because it would otherwise be possible for the full build scheduler to overwrite the per-record scheduler’s changes with stale data.</p><p>It is important to note that none of this per-record architecture would be possible without the use of Cloudflare’s <a href="/black-lies/">black lie approach</a> to negative answers with DNSSEC. Normally, in order to properly serve negative answers with DNSSEC, all the records within the zone must be canonically sorted. This is needed in order to maintain a list of references from the apex record through all the records in the zone. With this normal approach to negative answers, a single record that has been added to the zone requires collecting all records to determine its insertion point within this sorted list of names.</p>
    <div>
      <h3>Bugs</h3>
      <a href="#bugs">
        
      </a>
    </div>
    <p>I would love to be able to write a Cloudflare blog where everything went smoothly; however, that is never the case. Bugs happen, but we need to be ready to react to them and set ourselves up so that this specific bug cannot happen again.</p><p>In this case, the major bug we discovered was related to the cleanup of old records in Quicksilver. With the full Zone Builder, we have the luxury of knowing exactly what records exist in both the database and Quicksilver. This makes writing and cleaning up a fairly simple task.</p><p>When the per-record builds were introduced, record events such as creates, updates, and deletes all needed to be treated differently. Creates and deletes are fairly simple because you are either adding or removing a record from Quicksilver. Updates introduced an unforeseen issue due to the way that our PGQ was producing Kafka events. Record updates only contained the new record information, which meant that when the record name was changed, we had no way of knowing what to query for in Quicksilver in order to clean up the old record. This meant that any time a customer changed the name of a record in the DNS Records API, the old record would not be deleted. Ultimately, this was fixed by replacing those specific update events with both a creation and a deletion event so that the Zone Builder had the necessary information to clean up the stale records.</p><p>None of this is rocket surgery, but we spend engineering effort to continuously improve our software so that it grows with the scaling of Cloudflare. And it’s challenging to change such a fundamental low-level part of Cloudflare when millions of domains depend on us.</p>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Today, all DNS Records API record changes are treated as per-record builds by the Zone Builder. As I previously mentioned, we have not been able to get rid of full builds entirely; however, they now represent about 13% of total DNS builds. This 13% corresponds to changes made to DNS settings that require knowledge of the entire zone's contents.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mmDhQTWMR03iIWHL5oRNB/e953a2be78239e765d8986ecfa7fdf47/image1-56.png" />
            
            </figure><p>When we compare the two build types as shown below we can see that per-record builds are on average <b>150x</b> faster than full builds. The build time below includes both database query time and Quicksilver write time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48Umk7PrqhCT7J1kBSjXCg/6adab8baa50d861e5852d841357f31eb/image2-51.png" />
            
            </figure><p>From there, our records are propagated to our global network through Quicksilver.</p><p>The 150x improvement above is with respect to averages, but what about that 4000x that I mentioned at the start? As you can imagine, as the size of the zone increases, the difference between full build time and per-record build time also increases. I used a test zone of one million records and ran several per-record builds, followed by several full builds. The results are shown in the table below:</p><table><tr><td><p><b>Build Type</b></p></td><td><p><b>Build Time (ms)</b></p></td></tr><tr><td><p>Per Record #1</p></td><td><p>6</p></td></tr><tr><td><p>Per Record #2</p></td><td><p>7</p></td></tr><tr><td><p>Per Record #3</p></td><td><p>6</p></td></tr><tr><td><p>Per Record #4</p></td><td><p>8</p></td></tr><tr><td><p>Per Record #5</p></td><td><p>6</p></td></tr><tr><td><p>Full #1</p></td><td><p>34032</p></td></tr><tr><td><p>Full #2</p></td><td><p>33953</p></td></tr><tr><td><p>Full #3</p></td><td><p>34271</p></td></tr><tr><td><p>Full #4</p></td><td><p>34121</p></td></tr><tr><td><p>Full #5</p></td><td><p>34093</p></td></tr></table><p>We can see that, given five per-record builds, the build time was no more than 8ms. When running a full build however, the build time lasted on average 34 seconds. That is a build time reduction of <b>4250x</b>!</p><p>Given the full build times for both average-sized zones and large zones, it is apparent that all Cloudflare customers are benefitting from this improved performance, and the benefits only improve as the size of the zone increases. In addition, our Zone Builder uses less database and Quicksilver resources meaning other Cloudflare systems are able to operate at increased capacity.</p>
    <div>
      <h3>Next Steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>The results here have been very impactful, though we think that we can do even better. In the future, we plan to get rid of full builds altogether by replacing them with zone setting builds. Instead of fetching the zone settings in addition to all the records, the zone setting builder would just fetch the settings for the zone and propagate that to our global network via Quicksilver. Similar to the per-record builds, this is a difficult challenge due to the complexity of zone settings and the number of actors that touch it. Ultimately if this can be accomplished, we can officially retire the full builds and leave it as a reminder in our git history of the scale at which we have grown over the years.</p><p>In addition, we plan to introduce a batching system that will collect record changes into groups to minimize the number of queries we make to our database and Quicksilver.</p><p>Does solving these kinds of technical and operational challenges excite you? Cloudflare is always hiring for talented specialists and generalists within our <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering&amp;location=default">Engineering</a> and <a href="https://www.cloudflare.com/careers">other teams</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">1TgZJPuWF9cbCw5YAo3emU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Moving k8s communication to gRPC]]></title>
            <link>https://blog.cloudflare.com/moving-k8s-communication-to-grpc/</link>
            <pubDate>Sat, 20 Mar 2021 14:00:00 GMT</pubDate>
            <description><![CDATA[ How we use gRPC in combination with Kubernetes to improve the performance and usability of internal APIs. ]]></description>
            <content:encoded><![CDATA[ <p>Over the past year and a half, Cloudflare has been hard at work moving our back-end services running in our non-edge locations from bare metal solutions and Mesos Marathon to a more unified approach using <a href="https://kubernetes.io/">Kubernetes (K8s)</a>. We chose Kubernetes because it allowed us to split up our monolithic application into many different microservices with granular control of communication.</p><p>For example, a <a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">ReplicaSet</a> in Kubernetes can provide high availability by ensuring that the correct number of pods is always available. A <a href="https://kubernetes.io/docs/concepts/workloads/pods/">Pod</a> in Kubernetes is similar to a container in <a href="https://www.docker.com/">Docker</a>. Both are responsible for running the actual application. These pods can then be exposed through a Kubernetes <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> to abstract away the number of replicas by providing a single endpoint that load balances to the pods behind it. The services can then be exposed to the Internet via an <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress</a>. Lastly, a network policy can protect against unwanted communication by ensuring the correct policies are applied to the application. These policies can include L3 or L4 rules.</p><p>The diagram below shows a simple example of this setup.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33zWCqFZw2iuXfhllsHgdk/e290742e17a975a98a195bccc283297f/2-3.png" />
            
            </figure><p>Though Kubernetes does an excellent job at providing the tools for communication and traffic management, it does not help the developer decide the best way to communicate between the applications running on the pods. Throughout this blog we will look at some of the decisions we made and why we made them to discuss the pros and cons of two commonly used API architectures, REST and gRPC.</p>
    <div>
      <h3>Out with the old, in with the new</h3>
      <a href="#out-with-the-old-in-with-the-new">
        
      </a>
    </div>
    <p>When the DNS team first moved to Kubernetes, all of our pod-to-pod communication was done through REST APIs and in many cases also included Kafka. The general communication flow was as follows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GQIkngkqEYcBJNFzv1xNU/fe13bc8911a11a9c26dc5925b4fe6a19/1-5.png" />
            
            </figure><p>We use Kafka because it allows us to handle large spikes in volume without losing information. For example, during a Secondary DNS zone transfer, Service A tells Service B that the zone is ready to be published to the edge. Service B then calls Service A’s REST API, generates the zone, and pushes it to the edge. If you want more information about how this works, I wrote an entire blog post about the <a href="/secondary-dns-deep-dive/">Secondary DNS pipeline</a> at Cloudflare.</p><p>HTTP worked well for most communication between these two services. However, as we scaled up and added new endpoints, we realized that, as long as we controlled both ends of the communication, we could improve both the usability and the performance of that communication. In addition, sending large DNS zones over the network using HTTP often caused issues with sizing constraints and compression.</p><p>In contrast, gRPC can easily stream data between client and server and is commonly used in microservice architectures. These qualities made gRPC the obvious replacement for our REST APIs.</p>
    <div>
      <h3>gRPC Usability</h3>
      <a href="#grpc-usability">
        
      </a>
    </div>
    <p>Often overlooked from a developer’s perspective, HTTP client libraries are clunky and require code that defines paths, handles parameters, and deals with responses in bytes. gRPC abstracts all of this away and makes network calls feel like any other function calls defined for a struct.</p><p>The example below shows a very basic schema to set up a gRPC client/server system. Because gRPC uses <a href="https://developers.google.com/protocol-buffers">protobuf</a> for serialization, it is largely language agnostic. Once a schema is defined, the <i>protoc</i> command can be used to generate code for <a href="https://grpc.io/docs/languages/">many languages</a>.</p><p>Protocol Buffer data is structured as <i>messages</i>, with each <i>message</i> containing information stored in the form of fields. The fields are strongly typed, providing type safety, unlike JSON or XML. Two messages have been defined, <i>Hello</i> and <i>HelloResponse</i>. Next, we define a service called <i>HelloWorldHandler</i>, which contains one RPC function called <i>SayHello</i> that must be implemented if any object wants to call itself a <i>HelloWorldHandler</i>.</p><p>Simple Proto:</p>
            <pre><code>syntax = "proto3";

message Hello{
   string Name = 1;
}

message HelloResponse{}

service HelloWorldHandler {
   rpc SayHello(Hello) returns (HelloResponse){}
}</code></pre>
            <p>Once we run our <i>protoc</i> command, we are ready to write the server-side code. In order to implement the <i>HelloWorldHandler</i>, we must define a struct that implements all of the RPC functions specified in the protobuf schema above. In this case, the struct <i>Server</i> defines a function <i>SayHello</i> that takes in two parameters, a context and a <i>*pb.Hello</i>. <i>*pb.Hello</i> was previously specified in the schema and contains one field, <i>Name</i>. <i>SayHello</i> must also return a <i>*pb.HelloResponse</i>, which has been defined without fields for simplicity.</p><p>Inside the main function, we create a TCP listener, create a new gRPC server, and then register our handler as a <i>HelloWorldHandlerServer</i>. After calling <i>Serve</i> on our gRPC server, clients will be able to communicate with the server through the function <i>SayHello</i>.</p><p>Simple Server:</p>
            <pre><code>type Server struct{}

func (s *Server) SayHello(ctx context.Context, in *pb.Hello) (*pb.HelloResponse, error) {
    fmt.Printf("%s says hello\n", in.Name)
    return &amp;pb.HelloResponse{}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }
    gRPCServer := gRPC.NewServer()
    handler := Server{}
    pb.RegisterHelloWorldHandlerServer(gRPCServer, &amp;handler)
    if err := gRPCServer.Serve(lis); err != nil {
        panic(err)
    }
}</code></pre>
            <p>Finally, we need to implement the gRPC Client. First, we establish a TCP connection with the server. Then, we create a new <i>pb.HandlerClient</i>. The client is able to call the server's <i>SayHello</i> function by passing in a *<i>pb.Hello</i> object.</p><p>Simple Client:</p>
            <pre><code>conn, err := gRPC.Dial("127.0.0.1:8080", gRPC.WithInsecure())
if err != nil {
    panic(err)
}
client := pb.NewHelloWorldHandlerClient(conn)
client.SayHello(context.Background(), &amp;pb.Hello{Name: "alex"})</code></pre>
            <p>Though I have removed some code for simplicity, these <i>services</i> and <i>messages</i> can become quite complex if needed. The most important thing to understand is that when a server attempts to announce itself as a <i>HelloWorldHandlerServer</i>, it is required to implement the RPC functions as specified within the protobuf schema. This agreement between the client and server makes cross-language network calls feel like regular function calls.</p><p>In addition to the basic Unary server described above, gRPC lets you decide between four types of service methods:</p><ul><li><p><b>Unary</b> (example above): client sends a single request to the server and gets a single response back, just like a normal function call.</p></li><li><p><b>Server Streaming:</b> server returns a stream of messages in response to a client's request.</p></li><li><p><b>Client Streaming:</b> client sends a stream of messages to the server and the server replies in a single message, usually once the client has finished streaming.</p></li><li><p><b>Bi-directional Streaming:</b> the client and server can both send streams of messages to each other asynchronously.</p></li></ul>
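<p>For illustration, the schema above could hypothetically be extended with one method of each flavor; the <i>stream</i> keyword on the request and/or response type is what selects the service method type:</p>

```protobuf
service HelloWorldHandler {
   // Unary: one request, one response
   rpc SayHello(Hello) returns (HelloResponse) {}
   // Server streaming: one request, a stream of responses
   rpc SayHelloToEveryone(Hello) returns (stream HelloResponse) {}
   // Client streaming: a stream of requests, one response
   rpc SayManyHellos(stream Hello) returns (HelloResponse) {}
   // Bi-directional streaming: both sides stream asynchronously
   rpc Converse(stream Hello) returns (stream HelloResponse) {}
}
```

<p>The generated code then exposes stream objects with <i>Send</i> and <i>Recv</i> methods in place of the single request/response parameters.</p>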
    <div>
      <h3>gRPC Performance</h3>
      <a href="#grpc-performance">
        
      </a>
    </div>
    <p>Not all HTTP connections are created equal. Though Golang natively supports HTTP/2, the HTTP/2 transport must be set by the client and the server must also support HTTP/2. Before moving to gRPC, we were still using HTTP/1.1 for client connections. We could have switched to HTTP/2 for performance gains, but we would have lost some of the benefits of native protobuf compression and usability changes.</p><p>The best option available in HTTP/1.1 is pipelining. Pipelining means that although requests can share a connection, they must queue up one after the other until the request in front completes. HTTP/2 improved pipelining by using connection multiplexing. Multiplexing allows for multiple requests to be sent on the same connection and at the same time.</p><p>HTTP REST APIs generally use JSON for their request and response format. Protobuf is the native request/response format of gRPC because it has a standard schema agreed upon by the client and server during registration. In addition, protobuf is known to be significantly faster than JSON due to its serialization speeds. I’ve run some benchmarks on my laptop, source code can be found <a href="https://github.com/Fattouche/protobuf-benchmark">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23TkQ80Iruo8IYflqIuobN/f7323dd5817ca47f9b203b8b63a3a980/image1-26.png" />
            
            </figure><p>As you can see, protobuf performs better at small, medium, and large data sizes. It is faster per operation, smaller after marshalling, and scales well with input size. This becomes even more noticeable when unmarshalling very large data sets: protobuf takes 96.4ns/op while JSON takes 22647ns/op, roughly a 235x reduction in time. For large DNS zones, this efficiency makes a massive difference in the time it takes us to go from a record change in our API to serving it at the edge.</p><p>Combining the benefits of HTTP/2 and protobuf showed almost no performance change from our application’s point of view. This is likely because our pods were already so close together that our connection times were already very low. In addition, most of our gRPC calls carry small amounts of data, where the difference is negligible. One thing that we did notice, likely related to the multiplexing of HTTP/2, was greater efficiency when writing newly created, edited, or deleted records to the edge. Our latency spikes dropped in both amplitude and frequency.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PDR6bZkh8zzVVv2j6tmE5/9ce516e00daa832135964246eaf7b95c/image2-19.png" />
            
            </figure>
    <div>
      <h3>gRPC Security</h3>
      <a href="#grpc-security">
        
      </a>
    </div>
    <p>One of the best features in Kubernetes is the NetworkPolicy. This allows developers to control the ingress and egress traffic allowed to and from a set of pods at the network level.</p>
            <pre><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978</code></pre>
            <p>In this example, taken from the <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">Kubernetes docs</a>, we can see that this creates a network policy called test-network-policy. This policy controls both ingress and egress communication to or from any pod that matches the role <i>db</i> and enforces the following rules:</p><p>Ingress connections allowed:</p><ul><li><p>Any pod in the default namespace with the label “role=frontend”</p></li><li><p>Any pod in any namespace that has the label “project=myproject”</p></li><li><p>Any source IP address in 172.17.0.0/16 except for 172.17.1.0/24</p></li></ul><p>Egress connections allowed:</p><ul><li><p>Any destination IP address in 10.0.0.0/24</p></li></ul><p>NetworkPolicies do a fantastic job of protecting APIs at the network level; however, they do nothing to protect APIs at the application level. If you wanted to control which endpoints can be accessed within the API, you would need k8s to distinguish not only between pods, but also between endpoints within those pods. These concerns led us to <a href="https://grpc.io/docs/guides/auth/">per-RPC credentials</a>. Per-RPC credentials are easy to set up on top of the pre-existing gRPC code. All you need to do is add interceptors to both your stream and unary handlers.</p>
            <pre><code>func (s *Server) UnaryAuthInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    // Get the targeted function
    functionInfo := strings.Split(info.FullMethod, "/")
    function := functionInfo[len(functionInfo)-1]

    // Reject calls that arrive without credentials instead of panicking
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok || len(md.Get("username")) == 0 || len(md.Get("password")) == 0 {
        return nil, status.Error(codes.Unauthenticated, "missing credentials")
    }

    // Authenticate
    err := authenticateClient(md.Get("username")[0], md.Get("password")[0], function)
    // Blocked
    if err != nil {
        return nil, err
    }
    // Verified
    return handler(ctx, req)
}</code></pre>
            <p>In this example code snippet, we grab the requested function from the info object and the username and password from the incoming request metadata. We then authenticate the client to make sure that it has the correct rights to call that function. This interceptor will run before any of the other functions get called, which means one implementation protects all functions. The client would initialize its secure connection and send credentials like so:</p>
            <pre><code>transportCreds, err := credentials.NewClientTLSFromFile(certFile, "")
if err != nil {
    return nil, err
}
perRPCCreds := Creds{Password: grpcPassword, User: user}
conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(transportCreds), grpc.WithPerRPCCredentials(perRPCCreds))
if err != nil {
    return nil, err
}
client := pb.NewRecordHandlerClient(conn)
// Can now start using the client</code></pre>
            <p>Here the client first verifies that the server matches the certFile. This step ensures that the client does not accidentally send its password to a bad actor. Next, the client initializes the <i>perRPCCreds</i> struct with its username and password and dials the server with that information. Any time the client makes a call to an RPC-defined function, its credentials will be verified by the server.</p>
    <div>
      <h3>Next Steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>Our next step is to remove the need for many applications to access the database and ultimately DRY up our codebase by pulling all DNS-related code into a single API, accessed from one gRPC interface. This removes the potential for mistakes in individual applications and makes updating our database schema easier. It also gives us more granular control over which functions can be accessed rather than which tables can be accessed.</p><p>So far, the DNS team is very happy with the results of our gRPC migration. However, we still have a long way to go before we can move entirely away from REST. We are also patiently waiting for <a href="https://github.com/grpc/grpc/issues/19126">HTTP/3 support</a> for gRPC so that we can take advantage of those super <a href="https://en.wikipedia.org/wiki/QUIC">quic</a> speeds!</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[gRPC]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[QUIC]]></category>
            <guid isPermaLink="false">6oeh7RRYqhqnS7vtJ2BpwP</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Secondary DNS - Deep Dive]]></title>
            <link>https://blog.cloudflare.com/secondary-dns-deep-dive/</link>
            <pubDate>Tue, 15 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ The goal of Cloudflare operated Secondary DNS is to allow our customers with custom DNS solutions, be it on-premise or some other DNS provider, to be able to take advantage of Cloudflare's DNS performance and more recently, through Secondary Override, our proxying and security capabilities too. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>How Does Secondary DNS Work?</h2>
      <a href="#how-does-secondary-dns-work">
        
      </a>
    </div>
    <p>If you already understand how Secondary DNS works, please feel free to skip this section. It does not provide any Cloudflare-specific information.</p><p>Secondary DNS has many use cases across the Internet; however, traditionally, it was used as a synchronized backup for when the primary DNS server was unable to respond to queries. A more modern approach involves focusing on redundancy across many different nameservers, which in many cases announce the same anycast IP address.</p><p>Secondary DNS involves the unidirectional transfer of DNS zones from the primary to the Secondary DNS server(s). One primary can have any number of Secondary DNS servers that it must communicate with in order to keep track of any zone updates. A zone update is considered a change in the contents of a zone, which ultimately leads to a Start of Authority (SOA) serial number increase. The zone’s SOA serial is one of the key elements of Secondary DNS; it is how primary and secondary servers synchronize zones. Below is an example of what an SOA record might look like during a dig query.</p>
            <pre><code>example.com	3600	IN	SOA	ashley.ns.cloudflare.com. dns.cloudflare.com. 
2034097105  // Serial
10000 // Refresh
2400 // Retry
604800 // Expire
3600 // Minimum TTL</code></pre>
            <p>Each of the numbers is used in the following way:</p><ol><li><p>Serial - Used to keep track of the status of the zone, must be incremented at every change.</p></li><li><p>Refresh - The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change.</p></li><li><p>Retry - The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change, after previously failing to contact the primary.</p></li><li><p>Expire - The maximum number of seconds that a Secondary DNS server can serve stale information, in the event the primary cannot be contacted.</p></li><li><p>Minimum TTL - Per <a href="https://tools.ietf.org/html/rfc2308">RFC 2308</a>, the number of seconds that a DNS negative response should be cached for.</p></li></ol><p>Using the above information, the Secondary DNS server stores an SOA record for each of the zones it is tracking. When the serial increases, it knows that the zone must have changed, and that a zone transfer must be initiated.  </p>
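Because the serial is an unsigned 32-bit counter that eventually wraps around, "the serial increases" is defined by the sequence-space arithmetic of RFC 1982 rather than a plain numeric comparison. A minimal sketch in Go (the function name is illustrative, not from our codebase):

```go
package main

import "fmt"

// serialNewer reports whether SOA serial a is newer than serial b under
// RFC 1982 serial number arithmetic, which tolerates 32-bit wraparound.
// A difference of exactly 2^31 is undefined by the RFC.
func serialNewer(a, b uint32) bool {
	// Unsigned subtraction wraps modulo 2^32, so a serial that has
	// recently wrapped past zero still compares as newer.
	return a != b && a-b < 1<<31
}

func main() {
	fmt.Println(serialNewer(2034097106, 2034097105)) // plain increment
	fmt.Println(serialNewer(5, 4294967295))          // newer despite wraparound
}
```

This is why date-based serials such as 2024102901 keep working even though the counter is finite: each change only needs to be "ahead" of the previous serial within half the 32-bit space.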
    <div>
      <h2>Serial Tracking</h2>
      <a href="#serial-tracking">
        
      </a>
    </div>
    <p>Serial increases can be detected in the following ways:</p><ol><li><p>The fastest way for the Secondary DNS server to keep track of a serial change is to have the primary server NOTIFY it any time a zone has changed, using the DNS protocol as specified in <a href="https://www.ietf.org/rfc/rfc1996.txt">RFC 1996</a>. Secondary DNS servers can then instantly initiate a zone transfer.</p></li><li><p>Another way is for the Secondary DNS server to simply poll the primary every “Refresh” seconds. This isn’t as fast as the NOTIFY approach, but it is a good fallback in case the notifies have failed.</p></li></ol><p>One of the issues with the basic NOTIFY protocol is that anyone on the Internet could potentially notify the Secondary DNS server of a zone update. If an initial SOA query is not performed by the Secondary DNS server before initiating a zone transfer, this is an easy way to perform an <a href="https://www.cloudflare.com/learning/ddos/dns-amplification-ddos-attack/">amplification attack</a>. There are two common ways to prevent anyone on the Internet from being able to NOTIFY Secondary DNS servers:</p><ol><li><p>Using transaction signatures (TSIG) as per <a href="https://tools.ietf.org/html/rfc2845">RFC 2845</a>. These are to be placed as the last record in the extra records section of the DNS message. Usually the number of extra records (or ARCOUNT) should be no more than two in this case.</p></li><li><p>Using IP-based access control lists (ACLs). This increases security but also reduces flexibility in server location and IP address allocation.</p></li></ol><p>Generally NOTIFY messages are sent over UDP; however, TCP can be used in the event the primary server has reason to believe that TCP is necessary (e.g. firewall issues).</p>
    <div>
      <h2>Zone Transfers</h2>
      <a href="#zone-transfers">
        
      </a>
    </div>
    <p>In addition to serial tracking, it is important to ensure that a standard protocol is used between primary and Secondary DNS server(s), to efficiently transfer the zone. DNS zone transfer protocols do not attempt to provide confidentiality, authentication, or integrity on their own; however, the use of TSIG on top of the basic zone transfer protocols can provide integrity and authentication. As a result of <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> being a public protocol, confidentiality during the zone transfer process is generally not a concern.</p>
    <div>
      <h3>Authoritative Zone Transfer (AXFR)</h3>
      <a href="#authoritative-zone-transfer-axfr">
        
      </a>
    </div>
    <p>AXFR is the original zone transfer protocol that was specified in <a href="https://tools.ietf.org/html/rfc1034">RFC 1034</a> and <a href="https://tools.ietf.org/html/rfc1035">RFC 1035</a> and later further explained in <a href="https://tools.ietf.org/html/rfc5936">RFC 5936</a>. AXFR is done over a TCP connection because a reliable protocol is needed to ensure packets are not lost during the transfer. Using this protocol, the primary DNS server will transfer all of the zone contents to the Secondary DNS server, in one connection, regardless of the serial number. AXFR is recommended for the first zone transfer, when none of the records have been propagated yet, and IXFR is recommended after that.</p>
    <div>
      <h3>Incremental Zone Transfer (IXFR)</h3>
      <a href="#incremental-zone-transfer-ixfr">
        
      </a>
    </div>
    <p>IXFR is the more sophisticated zone transfer protocol that was specified in <a href="https://tools.ietf.org/html/rfc1995">RFC 1995</a>. Unlike the AXFR protocol, during an IXFR, the primary server will only send the secondary server the records that have changed since its current version of the zone (based on the serial number). This means that when a Secondary DNS server wants to initiate an IXFR, it sends its current serial number to the primary DNS server. The primary DNS server will then format its response based on previous versions of changes made to the zone. IXFR messages must obey the following pattern:</p><ol><li><p><b><i>Current latest SOA</i></b></p></li><li><p><b><i>Secondary server current SOA</i></b></p></li><li><p><b><i>DNS record deletions</i></b></p></li><li><p><b><i>Secondary server current SOA + changes</i></b></p></li><li><p><b><i>DNS record additions</i></b></p></li><li><p><b><i>Current latest SOA</i></b></p></li></ol><p>Steps 2 through 6 can be repeated any number of times, as each of those represents one change set of deletions and additions, ultimately leading to a new serial.</p><p>IXFR can be done over UDP or TCP, but again TCP is generally recommended to avoid packet loss.</p>
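To make the pattern concrete, here is a hedged Go sketch (all types and names invented for illustration, not our implementation) of how a secondary might apply the deletion/addition pairs of an IXFR response to its local copy of a zone:

```go
package main

import "fmt"

// record is a simplified stand-in for a full DNS resource record.
type record struct {
	Name, Type, Data string
}

// changeSet is one deletion/addition pair from an IXFR response
// (steps 2-5 in the pattern above); Serial is the serial the zone
// reaches once this change set is applied.
type changeSet struct {
	Serial    uint32
	Deletions []record
	Additions []record
}

// applyIXFR applies each change set in order: first remove the deleted
// records, then append the added ones, returning the updated zone.
func applyIXFR(zone []record, changes []changeSet) []record {
	for _, cs := range changes {
		for _, del := range cs.Deletions {
			for i, r := range zone {
				if r == del {
					zone = append(zone[:i], zone[i+1:]...)
					break
				}
			}
		}
		zone = append(zone, cs.Additions...)
	}
	return zone
}

func main() {
	zone := []record{{"example.com.", "A", "192.0.2.1"}}
	changes := []changeSet{{
		Serial:    2,
		Deletions: []record{{"example.com.", "A", "192.0.2.1"}},
		Additions: []record{{"example.com.", "A", "192.0.2.2"}},
	}}
	fmt.Println(applyIXFR(zone, changes))
}
```

A real implementation would also verify the SOA records that bracket each change set before committing anything.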
    <div>
      <h2>How Does Secondary DNS Work at Cloudflare?</h2>
      <a href="#how-does-secondary-dns-work-at-cloudflare">
        
      </a>
    </div>
    <p>The DNS team loves microservice architecture! When we initially implemented Secondary DNS at Cloudflare, it was done using <a href="https://mesosphere.github.io/marathon/">Mesos Marathon</a>. This allowed us to separate each of our services into several different marathon apps, individually scaling apps as needed. All of these services live in our core data centers. The following services were created:</p><ol><li><p>Zone Transferer - responsible for attempting IXFR, followed by AXFR if IXFR fails.</p></li><li><p>Zone Transfer Scheduler - responsible for periodically checking zone SOA serials for changes.</p></li><li><p>Rest API - responsible for registering new zones and primary nameservers.</p></li></ol><p>In addition to the marathon apps, we also had an app external to the cluster:</p><ol><li><p>Notify Listener - responsible for listening for notifies from primary servers and telling the Zone Transferer to initiate an AXFR/IXFR.</p></li></ol><p>Each of these microservices communicates with the others through <a href="https://kafka.apache.org/">Kafka</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cS9HY6kpiqdrdyD3tqQjI/55d2a781d351cc854f416dd9b1e7d4f3/image8-1.png" />
            
            </figure><p>Figure 1: Secondary DNS Microservice Architecture</p><p>Once the Zone Transferer completes the AXFR/IXFR, it passes the zone to our zone builder, which finally pushes it out to our edge at each of our <a href="https://www.cloudflare.com/network/">200 locations</a>.</p><p>Although this architecture worked great in the beginning, it left us open to many vulnerabilities and scalability issues down the road. As our Secondary DNS product became more popular, it was important that we proactively scaled and reduced the technical debt as much as possible. As with many companies in the industry, Cloudflare has recently migrated all of our core data center services to <a href="https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/">Kubernetes</a>, moving away from individually managed apps and Marathon clusters.</p><p>What this meant for Secondary DNS is that all of our Marathon-based services, as well as our NOTIFY Listener, had to be migrated to Kubernetes. Although this long migration ended up paying off, many difficult challenges arose along the way that required us to come up with unique solutions in order to have a seamless, zero-downtime migration.</p>
    <div>
      <h2>Challenges When Migrating to Kubernetes</h2>
      <a href="#challenges-when-migrating-to-kubernetes">
        
      </a>
    </div>
    <p>Although the entire DNS team agreed that Kubernetes was the way forward for Secondary DNS, it also introduced several challenges. These challenges arose from a need to properly scale up across many distributed locations while also protecting each of our individual data centers. Since our core does not rely on anycast to automatically distribute requests, introducing more customers opens us up to denial-of-service attacks.</p><p>The two main issues we ran into during the migration were:</p><ol><li><p>How do we create a distributed and reliable system that makes use of Kubernetes principles while also making sure our customers know which IPs we will be communicating from?</p></li><li><p>When opening up a public-facing UDP socket to the Internet, how do we protect ourselves while also preventing unnecessary spam towards primary nameservers?</p></li></ol>
    <div>
      <h3>Issue 1:</h3>
      <a href="#issue-1">
        
      </a>
    </div>
    <p>As was previously mentioned, one form of protection in the Secondary DNS protocol is to only allow certain IPs to initiate zone transfers. There is a fine line between primary servers allowlisting too many IPs and having to frequently update their IP ACLs. We considered several solutions:</p><ol><li><p><a href="https://github.com/nirmata/kube-static-egress-ip">Open source k8s controllers</a></p></li><li><p>Altering <a href="https://en.wikipedia.org/wiki/Network_address_translation">Network Address Translation (NAT)</a> entries</p></li><li><p>Do not use k8s for zone transfers</p></li><li><p>Allowlist all Cloudflare IPs and dynamically update</p></li><li><p>Proxy egress traffic</p></li></ol><p>Ultimately we decided to proxy our egress traffic from k8s to the DNS primary servers using static proxy addresses. <a href="https://github.com/shadowsocks/shadowsocks-libev">Shadowsocks-libev</a> was chosen as the <a href="https://en.wikipedia.org/wiki/SOCKS">SOCKS5</a> implementation because it is fast, secure, and known to scale. In addition, it can handle both UDP/TCP and IPv4/IPv6.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JcJoxa44FoxHtMrWBXCP5/62805a1a8091ba1597f33614bece420b/image1-8.png" />
            
            </figure><p>Figure 2: Shadowsocks Proxy Setup</p><p>The partnership of k8s and Shadowsocks combined with a large enough IP range brings many benefits:</p><ol><li><p>Horizontal scaling</p></li><li><p>Efficient load balancing</p></li><li><p>Primary server ACLs only need to be updated once</p></li><li><p>It allows us to make use of Kubernetes for both the Zone Transferer and the local Shadowsocks proxy</p></li><li><p>The Shadowsocks proxy can be reused by many different Cloudflare services</p></li></ol>
    <div>
      <h3>Issue 2:</h3>
      <a href="#issue-2">
        
      </a>
    </div>
    <p>The Notify Listener requires listening on static IPs for NOTIFY messages coming from primary DNS servers. This is mostly a solved problem through the use of <a href="https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typeloadbalancer">k8s services of type loadbalancer</a>; however, exposing this service directly to the Internet makes us uneasy because of its susceptibility to <a href="https://www.cloudflare.com/learning/ddos/dns-flood-ddos-attack/">attacks</a>. Fortunately, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> is one of Cloudflare's strengths, which led us to the natural solution of <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfooding</a> one of our own products, <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>.</p><p>Spectrum provides the following features to our service:</p><ol><li><p>Reverse proxying of TCP/UDP traffic</p></li><li><p>Filtering out malicious traffic</p></li><li><p>Optimal routing from edge to core data centers</p></li><li><p><a href="https://www.cisco.com/c/dam/en_us/solutions/industries/docs/gov/IPV6at_a_glance_c45-625859.pdf">Dual Stack</a> technology</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S3Ha1uEWxV2bbrzSJY9y2/84a6c819cf54b3ade09d9e36ecad8ac2/image5-3.png" />
            
            </figure><p>Figure 3: Spectrum interaction with Notify Listener</p><p>Figure 3 shows two interesting attributes of the system:</p><ol><li><p><b>Spectrum &lt;-&gt; k8s IPv4 only:</b></p></li><li><p>This is because our custom k8s load balancer currently only supports IPv4; however, Spectrum has no issue terminating the IPv6 connection and establishing a new IPv4 connection.</p></li><li><p><b>Spectrum &lt;-&gt; k8s routing decisions based of L4 protocol</b>:</p></li><li><p>This is because k8s only supports one of TCP/UDP/SCTP per service of type load balancer. Once again, spectrum has no issues proxying this correctly.</p></li></ol><p>One of the problems with using a L4 proxy in between services is that source IP addresses get changed to the source IP address of the proxy (Spectrum in this case). Not knowing the source IP address means we have no idea who sent the NOTIFY message, opening us up to attack vectors. Fortunately, Spectrum’s <a href="https://developers.cloudflare.com/spectrum/getting-started/proxy-protocol/">proxy protocol</a> feature is capable of adding custom headers to TCP/UDP packets which contain source IP/Port information.</p><p>As we are using <a href="https://github.com/miekg/dns">miekg/dns</a> for our Notify Listener, adding proxy headers to the DNS NOTIFY messages would cause failures in validation at the DNS server level. Alternatively, we were able to implement custom <a href="https://github.com/miekg/dns/blob/master/server.go#L156-L162">read and write decorators</a> that do the following:</p><ol><li><p><b>Reader:</b> Extract source address information on inbound NOTIFY messages. Place extracted information into new DNS records located in the additional section of the message.</p></li><li><p><b>Writer:</b> Remove additional records from the DNS message on outbound NOTIFY replies. 
Generate a new reply using proxy protocol headers.</p></li></ol><p>There is no way to spoof these records, because the server only permits two extra records, one of which is the optional TSIG. Any other records will be overwritten.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5pg2MCnr4MUQapXAtQRLeK/365f1bdaec1c60c88ae408b5c0dea128/image4-6.png" />
            
            </figure><p>Figure 4: Proxying Records Between Notifier and Spectrum</p><p>This custom decorator approach abstracts the proxying away from the Notify Listener through the use of the DNS protocol.</p><p>Although knowing the source IP will block a significant amount of bad traffic, NOTIFY messages can use both UDP and TCP, and UDP is prone to <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>. To ensure that the primary servers do not get spammed, we have made the following additions to the Zone Transferer:</p><ol><li><p>Always ensure that the SOA has actually been updated before initiating a zone transfer.</p></li><li><p>Only allow at most one working transfer and one scheduled transfer per zone.</p></li></ol>
    <div>
      <h2>Additional Technical Challenges</h2>
      <a href="#additional-technical-challenges">
        
      </a>
    </div>
    
    <div>
      <h3>Zone Transferer Scheduling</h3>
      <a href="#zone-transferer-scheduling">
        
      </a>
    </div>
    <p>As shown in figure 1, there are several ways of sending Kafka messages to the Zone Transferer in order to initiate a zone transfer. There is no benefit in having a large backlog of zone transfers for the same zone. Once a zone has been transferred, assuming no more changes, it does not need to be transferred again. This means that we should have at most one ongoing transfer, and one scheduled transfer, for any zone at the same time.</p><p>If we want to limit our number of scheduled messages to one per zone, this involves ignoring Kafka messages that get sent to the Zone Transferer. This is not as simple as ignoring specific messages in any random order. One of the benefits of Kafka is that it holds on to messages until the user actually decides to acknowledge them, by committing that message’s offset. Since Kafka is just a queue of messages, it has no concept of order other than first in, first out (FIFO). If a user is capable of reading from the Kafka topic concurrently, it is entirely possible for a later message to be committed before an earlier one.</p><p>Most of the time this isn’t an issue, because we know that one of the concurrent readers has read the earlier message and is processing it. There is one Kubernetes-related catch to this issue, though: pods are ephemeral. The kube master doesn’t care what your concurrent reader is doing; it will kill the pod, and it’s up to your application to handle it.</p><p>Consider the following problem:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LbPUGXSdf0uzNeI68U06Q/13295276ff697029d06c58b2e4a9df81/image2-2.png" />
            
            </figure><p>Figure 5: Kafka Partition</p><ol><li><p>Read offset 1. Start transferring zone 1.</p></li><li><p>Read offset 2. Start transferring zone 2.</p></li><li><p>Zone 2 transfer finishes. Commit offset 2, essentially also marking offset 1 as processed.</p></li><li><p>Restart pod.</p></li><li><p>Read offset 3. Start transferring zone 3.</p></li></ol><p>If these events happen, zone 1 will never be transferred. It is important that zones stay up to date with the primary servers, otherwise stale data will be served from the Secondary DNS server. The solution to this problem involves the use of a list to track which messages have been read and completely processed. In this case, when a zone transfer has finished, it does not necessarily mean that the Kafka message should be immediately committed. The solution is as follows:</p><ol><li><p>Keep a list of Kafka messages, sorted by offset.</p></li><li><p>When a transfer finishes, remove its message from the list.</p></li><li><p>If that message was the oldest in the list, commit its offset.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1g6Ku6Is2fixeHyiTIoEwC/0bb4026298db401018aefc89b027a717/image9-2.png" />
            
            </figure><p>Figure 6: Kafka Algorithm to Solve Message Loss</p><p>This solution essentially soft commits Kafka messages until we can confidently say that all older messages have been fully processed. It’s important to note that this only truly works in a distributed manner if the Kafka messages are keyed by zone ID; this ensures that the same zone will always be processed by the same Kafka consumer.</p>
    <div>
      <h2>Life of a Secondary DNS Request</h2>
      <a href="#life-of-a-secondary-dns-request">
        
      </a>
    </div>
    <p>Although Cloudflare has a <a href="https://www.cloudflare.com/network/">large global network</a>, as shown above, the zone transferring process does not take place at each of the edge datacenter locations (which would surely overwhelm many primary servers), but rather in our core data centers. In this case, how do we propagate to our edge in seconds? After transferring the zone, there are a couple more steps that need to be taken before the change can be seen at the edge.</p><ol><li><p>Zone Builder - This interacts with the Zone Transferer to build the zone according to what Cloudflare edge understands. This then writes to <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>, our super fast, distributed KV store.</p></li><li><p>Authoritative Server - This reads from Quicksilver and serves the built zone.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3INig0h7w1oSks6FwirZls/c99581c1c0019f5fc5dd2a80884dc943/image3-6.png" />
            
            </figure><p>Figure 7: End to End Secondary DNS‌‌</p>
    <div>
      <h2>What About Performance?</h2>
      <a href="#what-about-performance">
        
      </a>
    </div>
    <p>At the time of writing this post, according to <a href="http://dnsperf.com">dnsperf.com</a>, Cloudflare leads in global performance for both <a href="https://www.dnsperf.com/">Authoritative</a> and <a href="https://www.dnsperf.com/#!dns-resolvers">Resolver</a> DNS. Secondary DNS falls under the authoritative DNS category here. Let’s break down the performance of each of the different parts of the Secondary DNS pipeline, from the primary server updating its records to them being present at the Cloudflare edge.</p><ol><li><p>Primary Server to Notify Listener - Our most accurate measurement is only precise to the second, but we know UDP/TCP communication is likely much faster than that.</p></li><li><p>NOTIFY to Zone Transferer - This is negligible.</p></li><li><p>Zone Transferer to Primary Server - 99% of the time we see ~800ms as the average latency for a zone transfer.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70uQkuuFtfY5UPVrza1ZpK/ccfa5d2a7c522369d34c8b0671056167/image7-2.png" />
            
            </figure><p>Figure 8: Zone XFR latency</p><p>4. Zone Transferer to Zone Builder - 99% of the time we see ~10ms to build a zone.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cfNN2ilrDOES8QwG5ok4I/ab8e3bafc66ec05e55530d997f628a68/image11-1.png" />
            
            </figure><p>Figure 9: Zone Build time</p><p>5. Zone Builder to Quicksilver edge: 95% of the time we see less than 1s propagation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1imVuigo0GB7gLug2ISqL7/2ad64a24445b9aa75094d78437710d16/image6-2.png" />
            
            </figure><p>Figure 10: Quicksilver propagation time</p><p>End to End latency: less than 5 seconds on average. Although we have several external probes running around the world to test propagation latencies, they lack precision due to their sleep intervals, location, provider and number of zones that need to run. The actual propagation latency is likely much lower than what is shown in figure 10. Each of the different colored dots is a separate data center location around the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BjcchQWzBqyYbX6ksIYMj/0f167f037efdeaf5cece6a054b3095ab/image10.png" />
            
            </figure><p>Figure 11: End to End Latency</p><p>An additional test was performed manually to get a real-world estimate. The test had the following attributes:</p><ul><li><p>Primary server: NS1</p></li><li><p>Number of records changed: 1</p></li><li><p>Start test timer event: Change record on NS1</p></li><li><p>Stop test timer event: Observe record change at Cloudflare edge using dig</p></li><li><p>Recorded timer value: 6 seconds</p></li></ul>
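    <p>A propagation check like the manual dig test above boils down to a simple poll loop: start a timer when the record changes on the primary, and stop it when the edge answers with the new content. The sketch below is purely illustrative; <code>measure_propagation</code> and <code>FakeEdge</code> are hypothetical names, and in a real test <code>resolve</code> would shell out to <code>dig</code> against a Cloudflare edge nameserver instead of returning canned answers.</p>

```python
import time

def measure_propagation(resolve, expected, timeout=30.0, interval=0.1):
    """Poll resolve() until it returns the expected record content,
    returning the elapsed seconds, or None if the timeout is hit."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if resolve() == expected:
            return time.monotonic() - start
        time.sleep(interval)
    return None

# Toy stand-in for a real dig query: the edge starts serving the
# new IP on the third poll, simulating the zone change propagating.
class FakeEdge:
    def __init__(self):
        self.calls = 0

    def resolve(self):
        self.calls += 1
        return "203.0.113.7" if self.calls >= 3 else "192.0.2.1"

edge = FakeEdge()
elapsed = measure_propagation(edge.resolve, "203.0.113.7",
                              timeout=5.0, interval=0.01)
```

    <p>Real probes sleep between polls, which is exactly why their measurements overestimate the true propagation latency: the recorded time includes up to one full polling interval of slack.</p>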
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare serves 15.8 trillion DNS queries per month, operating within 100ms of 99% of the Internet-connected population. The goal of Cloudflare-operated Secondary DNS is to let customers with custom DNS solutions, be it on-premises or another DNS provider, take advantage of Cloudflare's DNS performance and, more recently, through <a href="/orange-clouding-with-secondary-dns/">Secondary Override</a>, our proxying and security capabilities too. Secondary DNS is currently available on the Enterprise plan; if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS, please refer to our <a href="https://support.cloudflare.com/hc/en-us/articles/360001356152-How-do-I-setup-and-manage-Secondary-DNS-">support article</a>.</p>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Growth]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2W2no7YfwWXkJXtgYtK4CU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Orange Clouding with Secondary DNS]]></title>
            <link>https://blog.cloudflare.com/orange-clouding-with-secondary-dns/</link>
            <pubDate>Thu, 20 Aug 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Secondary DNS Override is a great option for any users that want to take advantage of the Cloudflare network, without transferring all of their zones to Cloudflare DNS as a primary provider. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h3>What is secondary DNS?</h3>
      <a href="#what-is-secondary-dns">
        
      </a>
    </div>
    <p>In a traditional sense, secondary DNS servers act as a backup to the primary authoritative DNS server. When a change is made to the records on the primary server, a zone transfer occurs, synchronizing the secondary DNS servers with the primary. The secondary servers can then serve the records as if they were the primary server; however, changes can only be made by the primary server, not the secondary servers. This creates redundancy across many different servers that can be distributed as necessary.</p><p>There are many common ways to take advantage of Secondary DNS, some of which are:</p><ol><li><p>Secondary DNS as passive backup - The secondary DNS server sits idle until the primary server goes down, at which point a failover can occur and the secondary can start serving records.</p></li><li><p>Secondary DNS as active backup - The secondary DNS server works alongside the primary server to serve records.</p></li><li><p>Secondary DNS with a hidden primary - The nameserver records at the <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">registrar</a> point towards the secondary servers only, essentially treating them as the primary nameservers.</p></li></ol>
    <div>
      <h3>What is secondary DNS Override?</h3>
      <a href="#what-is-secondary-dns-override">
        
      </a>
    </div>
    <p>Secondary DNS Override builds on the Secondary DNS with a hidden primary model by allowing our customers not only to serve records as they tell us to, but also to proxy any A/AAAA/CNAME records through <a href="https://www.cloudflare.com/network/">Cloudflare's network</a>. This is similar to how Cloudflare as a primary DNS provider currently works.</p><p>Consider the following example:</p><p>example.com Cloudflare IP - 192.0.2.0<br>example.com origin IP - 203.0.113.0</p><p>In order to take advantage of Cloudflare's security and performance services, we need to make sure that the origin IP stays hidden from the Internet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YAO6PZ3BqRCD5IK5NEL1t/fc4d45f62075d3d0531734273c707e20/image1-10.png" />
            
            </figure><p>Figure 1: Secondary DNS without a hidden primary nameserver</p><p>Figure 1 shows that without a hidden primary nameserver, the resolver can choose to query either one. This opens up two issues:</p><ol><li><p>Violates <a href="https://tools.ietf.org/html/rfc1034">RFC 1034</a> and <a href="https://tools.ietf.org/html/rfc2182">RFC 2182</a> because the Cloudflare server will be responding differently than the primary nameserver.</p></li><li><p>The origin IP will be exposed to the Internet.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3WlzPk7YGcEbVgVPWRsVNP/b36c9763423f18a1559922a772a8e5ce/image2-6.png" />
            
            </figure><p>Figure 2: Secondary DNS with a hidden primary nameserver</p><p>Figure 2 shows the resolver always querying the Cloudflare Secondary DNS server.</p>
    <div>
      <h3>How does Secondary DNS Override work?</h3>
      <a href="#how-does-secondary-dns-override-work">
        
      </a>
    </div>
    <p>The Secondary DNS Override UI looks similar to the primary UI; the only difference is that records cannot be edited.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23aOCWhXUNvk3rlDKomIUH/134a8410ac677f4c9341bd4ca757225f/image3-5.png" />
            
            </figure><p>Figure 3: Secondary DNS Override Dashboard</p><p>In figure 3, all of the records have been transferred from the primary DNS server. test-orange and test-aaaa-orange have been overridden to proxy through the Cloudflare network, while test-grey and test-mx are treated as regular DNS records.</p><p>Behind the scenes, we store override records that pair with transferred records based on the name. For secondary override we don’t care about the type when overriding records, because of two things:</p><ol><li><p>According to <a href="https://tools.ietf.org/html/rfc1912#section-2.4">RFC 1912</a> you cannot have a CNAME record with the same name as any other record. (This does not apply to some DNSSEC records, see <a href="https://tools.ietf.org/html/rfc2181">RFC 2181</a>)</p></li><li><p>A and AAAA records are both address type records which should be either all proxied or all not proxied under the same name.</p></li></ol><p>This means that if you have several A and AAAA records with the name “example.com” and one of them is proxied, all of them will be proxied. The UI abstracts away the fact that we are storing additional override records through the “orange cloud” button, which, when clicked, creates an override record that applies to all A/AAAA or CNAME records with that name.</p>
    <div>
      <h3>CNAME at the Apex</h3>
      <a href="#cname-at-the-apex">
        
      </a>
    </div>
    <p>Normally, putting a CNAME at the apex of a zone is not allowed. For example:</p><p><code>example.com CNAME other-domain.com</code></p><p>This is not allowed because it means that there will be at least one other SOA record and one other NS record with the same name, disobeying RFC 1912 as mentioned above. Cloudflare can overcome this through the use of <a href="https://support.cloudflare.com/hc/en-us/articles/200169056-CNAME-Flattening-RFC-compliant-support-for-CNAME-at-the-root">CNAME Flattening</a>, which is a common technique used within the primary DNS product today. CNAME flattening allows us to return address records instead of the CNAME record when a query comes into our authoritative server.</p><p>The CNAME at the apex is the one exception to the rule, stated above, that records cannot be edited through the Secondary DNS Override UI. Users are able to create a CNAME at the apex in addition to the regular secondary DNS records; however, the same rules defined in RFC 1912 still apply. What this means is that the CNAME at the apex record can be treated as a regular DNS record or a proxied record, depending on what the user decides. Regardless of the proxy status of the CNAME at the apex record, it will override any other A/AAAA records that have been transferred from the primary DNS server.</p>
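    <p>The idea behind CNAME flattening can be sketched as follows: at query time, follow the CNAME chain and answer with the address records found at its end, in place of the CNAME itself. This is a toy illustration over a made-up <code>zone</code> dictionary, not Cloudflare's actual implementation, which also has to handle off-zone targets by resolving them recursively.</p>

```python
def flatten_cname(zone, name):
    """Follow CNAMEs starting at `name` and return the A/AAAA records
    to serve in their place. `zone` maps a name to a list of
    (type, content) tuples; `seen` cuts off any CNAME loops."""
    seen = set()
    while True:
        records = zone.get(name, [])
        cnames = [content for rtype, content in records if rtype == "CNAME"]
        if not cnames or name in seen:
            return [(t, c) for t, c in records if t in ("A", "AAAA")]
        seen.add(name)
        name = cnames[0]

# The apex CNAME from the example above: a query for example.com is
# answered with the flattened A record of other-domain.com.
zone = {
    "example.com": [("CNAME", "other-domain.com")],
    "other-domain.com": [("A", "203.0.113.0")],
}
answers = flatten_cname(zone, "example.com")
```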
    <div>
      <h3>Merging Secondary, Override and CNAME at Apex Records</h3>
      <a href="#merging-secondary-override-and-cname-at-apex-records">
        
      </a>
    </div>
    <p>At record edit time we do all of the merging of the secondary, override and CNAME at the apex records. This means that when a DNS request comes in to our authoritative server at the edge, we can still return the records in <a href="https://www.dnsperf.com/">blazing fast times</a>. The workflow is shown in figure 4.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/28Q1ApTislcuCz1JZlE57M/ce770c426fe865fbed639a06440d8412/image4-6.png" />
            
            </figure><p>Figure 4: Record Merging process</p><p>Within the zone builder the steps are as follows:</p><ol><li><p>Check if there is a CNAME at the apex; if so, override all other A/AAAA secondary records at the apex.</p></li><li><p>For each secondary record, check if there is a matching override record; if so, apply the proxy status of the override record to all secondary records with that name.</p></li><li><p>Leave all other secondary records as-is.</p></li></ol>
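    <p>The three merging steps above can be sketched in a few lines of code. The record shape and function name here are hypothetical, chosen purely to illustrate the logic; this is not Cloudflare's zone builder code.</p>

```python
def merge_records(secondary, overrides, apex, apex_cname=None):
    """Merge secondary records with override records and an optional
    CNAME at the apex, following the three steps described above.
    secondary:  list of dicts with "name", "type", "content", "proxied"
    overrides:  dict mapping a record name to its proxy status
    apex_cname: optional record dict for a CNAME at the apex
    """
    merged = []
    for rec in secondary:
        # Step 1: a CNAME at the apex overrides A/AAAA records there.
        if apex_cname and rec["name"] == apex and rec["type"] in ("A", "AAAA"):
            continue
        # Step 2: apply the override's proxy status to every secondary
        # record sharing that name.
        if rec["name"] in overrides:
            rec = {**rec, "proxied": overrides[rec["name"]]}
        # Step 3: all other secondary records pass through unchanged.
        merged.append(rec)
    if apex_cname:
        merged.append(apex_cname)
    return merged

secondary = [
    {"name": "example.com", "type": "A",
     "content": "203.0.113.0", "proxied": False},
    {"name": "www.example.com", "type": "A",
     "content": "203.0.113.1", "proxied": False},
]
apex_cname = {"name": "example.com", "type": "CNAME",
              "content": "other-domain.com", "proxied": True}
merged = merge_records(secondary, {"www.example.com": True},
                       "example.com", apex_cname)
```

    <p>Because the merge happens at record edit time rather than at query time, the edge only ever serves the already-merged zone.</p>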
    <div>
      <h3>Getting Started</h3>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Secondary DNS Override is a great option for any users that want to take advantage of the Cloudflare network without transferring all of their zones to Cloudflare DNS as a primary provider. Security and access control can be managed on the primary side, without worrying about unauthorized edits of information on the Cloudflare side.</p><p>Secondary DNS Override is currently available on the Enterprise plan; if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS Override, please refer to our <a href="https://support.cloudflare.com/hc/en-us/articles/360042169091-Understanding-Secondary-DNS-Override#:~:text=Secondary%20Override%20allows%20customers%20to,record%20at%20the%20root%20domain.">support article</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Proxying]]></category>
            <guid isPermaLink="false">3RBRtub1POeiDTrcMdBcCN</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
    </channel>
</rss>