Recent DNS Outage: Updates, Root Cause Analysis and Plan of Action

Early last week, our DNS servers were hit by a Denial of Service attack that disrupted service for several resellers. We sincerely apologize for the inconvenience and disruption this caused and, as promised, here is a debrief. Following this unfortunate incident, we conducted a complete root cause analysis. Here’s what happened and what we’re doing to deal with potential future attacks.

Updates, Root cause analysis and Plan of action:

Over the last few days we have spent time analyzing our DNS server architecture and the distributed denial-of-service (DDOS) mitigation process and capacity at our Data center (DC). This post covers what went wrong and what we are doing to fix it.

Managed DNS Architecture  

Our DNS servers are spread across 4 Data centers in the US. They are isolated both physically and at the network level, each with its own bandwidth capacity, network gear, etc. They are all hosted with Softlayer, who have always provided us with the best service in all circumstances.

Every domain registered with us gets Managed DNS service for free, with 4 Name servers configured. The example below illustrates this:

domain.com registered with us gets 4 Name servers (NS): dns1.orderbox-dns.com, dns2.orderbox-dns.com, dns3.orderbox-dns.com and dns4.orderbox-dns.com. dns1 has 4 IP addresses and is hosted at DC1 on 2 physical servers, dns2 has 4 IP addresses and is hosted at DC2 on 2 physical servers, and so on.
So, in total, we serve our DNS traffic from 4 DCs, each with 2 physical servers, which gives us a capacity of 16 Gbps of network throughput.

On each of these DNS servers we run an optimized version of PowerDNS with a capacity of 50,000 queries per second (qps). Across the 8 servers, the total theoretical capacity of our DNS cluster is around 400,000 qps.
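
For readers who want to check the numbers, here is a small sketch of the arithmetic behind the figures above. The counts come from this post; the variable names are ours and purely illustrative.

```python
# Figures quoted in this post; the names are illustrative only.
DATA_CENTERS = 4
SERVERS_PER_DC = 2
NAME_SERVERS = 4
IPS_PER_NAME_SERVER = 4
QPS_PER_SERVER = 50_000  # capacity of one optimized PowerDNS server

total_servers = DATA_CENTERS * SERVERS_PER_DC
total_ips = NAME_SERVERS * IPS_PER_NAME_SERVER
total_qps = total_servers * QPS_PER_SERVER

print(f"physical servers: {total_servers}")      # 8
print(f"DNS server IP addresses: {total_ips}")   # 16
print(f"theoretical capacity: {total_qps} qps")  # 400000 qps
```

This is a theoretical ceiling: it assumes queries are spread evenly across all 8 servers, which will not hold during an attack that concentrates on particular IP addresses.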

DDOS Mitigation Capacity 

As mentioned before, our DNS servers are hosted at Softlayer, and Softlayer’s network has been battle-tested many times during similar DDOS attacks. Each Softlayer DC is equipped with multiple 10 Gbps or 40 Gbps transit links to the internet and uses high-end networking gear. Softlayer also uses Arbor Peakflow for DDOS detection and Arbor TMS for DDOS mitigation. Each Arbor TMS system is capable of mitigating 10+ Gbps of attack traffic.

You can read more about Softlayer’s network and architecture at http://www.softlayer.com/network.

What went wrong?

Typically we see one or a few DNS server IP addresses being attacked, and they are either null routed or mitigated on the TMS system. This is pretty common: we see two or three such incidents every week, and we have always maintained our service levels during them.

During the recent attack, we received 40+ Gbps of traffic spread across all our DNS server IP addresses. The attack traffic moved from one IP address to another in rapid succession. To prevent instability on their network, Softlayer null routed our IP addresses. A null route is a rule that drops all traffic destined for our IP address at Softlayer’s upstream ISP’s network. This means that once the null route is in place, even Softlayer has no visibility into the attack traffic.
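
To make the effect of a null route concrete, here is a minimal sketch in Python. It is a toy model, not how Softlayer or its upstream ISPs implement null routing: the addresses come from the RFC 5737 documentation range, and the function names are hypothetical.

```python
import ipaddress

# Toy model of an upstream router's null-route table (hypothetical).
null_routes = set()

def add_null_route(ip):
    """Drop everything destined for this single address (a /32 route)."""
    null_routes.add(ipaddress.ip_network(f"{ip}/32"))

def is_forwarded(dst_ip):
    """Return True if the packet is passed on, False if it is dropped."""
    dst = ipaddress.ip_address(dst_ip)
    return not any(dst in route for route in null_routes)

# (destination, kind) pairs; the router only ever looks at the destination.
packets = [
    ("192.0.2.10", "attack"),
    ("192.0.2.10", "legitimate"),
    ("192.0.2.11", "legitimate"),
]

add_null_route("192.0.2.10")
delivered = [p for p in packets if is_forwarded(p[0])]
print(delivered)  # [('192.0.2.11', 'legitimate')]
```

The sketch shows the two costs described above: legitimate queries to the null-routed address are dropped along with the attack, and because nothing for that address gets past the upstream network, nobody downstream can inspect the attack traffic.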

After this, as explained in the previous post, we started removing each null route, then finding and mitigating the attack on each IP.

What’s wrong with our setup?

Problem 1: We rely solely on Softlayer’s Data centers and DDOS mitigation capabilities.
Problem 2: We are bound to the /32 static IP addresses provided by the DCs and are not using our own /24 subnets to host the DNS servers. With our own /24 subnets, we could have swung the traffic to our third-party DDOS mitigation partner, Neustar.
Problem 3: All customers’ Name servers point to the same IP addresses, so when an attack causes disruption, all customers are affected.
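
The difference between a /32 and a /24 in Problem 2 can be sketched with Python's standard ipaddress module. The addresses are from the RFC 5737 documentation ranges, and the /24 threshold is an assumption based on the common convention that networks on the public internet generally do not accept IPv4 route announcements more specific than a /24. Whether a block can really be redirected also depends on owning the address space and on the routing arrangements in place.

```python
import ipaddress

# Hypothetical examples: a single DC-assigned address and a subnet of our own.
dc_assigned = ipaddress.ip_network("192.0.2.10/32")
own_subnet = ipaddress.ip_network("198.51.100.0/24")

# Assumption: longest IPv4 prefix commonly accepted in global routing.
MAX_ANNOUNCEABLE_PREFIXLEN = 24

def can_be_announced_elsewhere(network):
    """Rough check: is the block large enough to be announced on its own?"""
    return network.prefixlen <= MAX_ANNOUNCEABLE_PREFIXLEN

for net in (dc_assigned, own_subnet):
    print(net, net.num_addresses, can_be_announced_elsewhere(net))
# 192.0.2.10/32 1 False
# 198.51.100.0/24 256 True
```

A single /32 stays tied to the provider's larger block, while a /24 of our own can be announced from a mitigation partner's network so that traffic is scrubbed there first.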

To solve these problems, we planned a new DNS architecture last quarter and have made some progress in deploying it.

More on the new architecture below:

Our new Managed DNS Architecture 

The new Managed DNS infrastructure architecture is explained below:
In phase 1, we will move the current DNS server IP addresses to our own IP subnets. This ensures we have the ability to use Neustar for DDOS mitigation when needed. All our Data centers are already protected by Neustar. Learn more about the Neustar DDOS mitigation service at https://www.neustar….ddos-protection.
In phase 2, we will start bucketing customers across different IP addresses, so that an attack on a domain in set1 will not disrupt DNS service to customers in other sets.
In phase 3, we will start introducing DNS servers in other geographical regions where we have a DC presence, and use Anycast. With Anycast, an attack originating from a particular region will only affect that region, and other regions will continue to work normally. The affected region will use Neustar DDOS mitigation to mitigate the attack.
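
The phase 2 bucketing idea can be sketched as follows. This post does not specify how customers will be assigned to sets, so the hash-based assignment, the set names beyond set1, and the IP addresses (RFC 5737 documentation ranges) are all assumptions for illustration.

```python
import hashlib

# Hypothetical sets of name server IPs; real sets and addresses will differ.
BUCKETS = {
    "set1": ["192.0.2.1", "192.0.2.2"],
    "set2": ["198.51.100.1", "198.51.100.2"],
    "set3": ["203.0.113.1", "203.0.113.2"],
}

def bucket_for(domain):
    """Deterministically map a domain to one set (assumed hashing scheme)."""
    names = sorted(BUCKETS)
    digest = hashlib.sha256(domain.strip().lower().encode("utf-8")).digest()
    return names[int.from_bytes(digest[:8], "big") % len(names)]

def is_affected(domain, attacked_bucket):
    """A domain is disrupted only if its own set is the one under attack."""
    return bucket_for(domain) == attacked_bucket

for domain in ["example.com", "example.net", "example.org"]:
    status = "affected" if is_affected(domain, "set1") else "unaffected"
    print(domain, bucket_for(domain), status)
```

Under this scheme, an attack that takes down the addresses in set1 only disrupts the domains that hash into set1; domains in the other sets keep resolving on their own addresses.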

Next Steps 

Our immediate goal is to complete Phase 1 in the next 2 weeks. This will ensure we can withstand any attack and will also eliminate the single point of failure with Softlayer. It will also enable us to use Neustar’s Anycast-based DDOS mitigation system to withstand any traffic volumes.

We will communicate any changes required from you and your customers as and when needed, so that you can all make use of the new setup.

We sincerely regret and apologize for all the inconvenience. We understand that you count on us and, to that end, we’ll continue to serve you to the best of our ability and help you build and grow your businesses to their full potential.

About Amrita

Amrita is a marketing specialist by profession who loves writing, music and animals.
