Issues on Bigrock DNS servers
Posted by Jaison N on 12 September 2016 11:51 AM

RCA - DNS - Downtime

 

[Updated: 28th September 2016 01:00 PM]

OVERVIEW

Our DNS servers experienced intermittent issues starting 2nd September 2016, caused by several types of DDoS/SYN attacks. The total cumulative downtime is estimated at about 6.15 hrs.

 

DETAILS

Our System Operations Team started receiving alerts for our Managed DNS service, which was seeing about 20-25k queries per second (QPS) on each node, i.e. 5x our normal traffic. The Support Center received multiple reports from people who could not access services because the DNS servers were unreachable. We immediately started mitigation via Neustar and examined tcpdump captures to identify any unusual pattern that would help us bring down the QPS count. We also spread our traffic across all 16 IPs and put them under mitigation, allowing only port 53, which DNS uses over both UDP and TCP. This brought the QPS count under control, and the alerts started to clear.
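As an illustration of the kind of check behind these alerts, the sketch below counts queries per second from a list of packet timestamps (for example, exported from a capture) and flags the seconds in which a node reaches a multiple of its baseline. The function names, the baseline value and the sample timestamps are assumptions made for this example; they are not taken from our monitoring setup.

```python
from collections import Counter


def queries_per_second(timestamps):
    """Group query timestamps (seconds since epoch) into one-second buckets."""
    return Counter(int(ts) for ts in timestamps)


def flag_spikes(timestamps, baseline_qps, factor=5.0):
    """Return the seconds in which the query count reached factor x baseline.

    baseline_qps and factor are hypothetical tuning values for this sketch.
    """
    counts = queries_per_second(timestamps)
    threshold = baseline_qps * factor
    return sorted(sec for sec, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Synthetic timestamps: three queries in second 0, one in second 1.
    sample = [0.1, 0.2, 0.3, 1.5]
    print(flag_spikes(sample, baseline_qps=0.5, factor=5.0))
```

In practice the baseline would come from the node's normal traffic, and the same bucketing could be applied per source IP or per query name to look for unusual patterns.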

Connectivity issues were sporadic yet repeatable. One of the major difficulties with this attack was that the attack vector kept changing, and updating the mitigation filter template took time because we had to improvise every filter in real time. Since the QPS count did not improve drastically even after Neustar dropped 960K PPS and 350 Mbps of traffic, we decided to cancel the mitigation, spread the load across all our DCs and deploy legacy mitigation at each DC. This plan worked initially but had its own pitfalls: we had to move back to Neustar quickly, as another attack would have put this temporary setup in jeopardy. Our internal team reviewed the problem and decided that adding nodes at the colo would help load-balance traffic across all available nodes (old and new) after mitigation by Neustar.
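For a rough sense of scale, the two figures for dropped traffic can be combined into an average packet size. The sketch below assumes that both figures describe the same interval and that Mbps means 10^6 bits per second; neither assumption is stated in this report, so the result is indicative only.

```python
def average_packet_bytes(bits_per_second, packets_per_second):
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


# Figures from the report: 350 Mbps and 960K PPS dropped by the mitigation.
avg = average_packet_bytes(350e6, 960e3)
print(f"{avg:.1f} bytes per packet")  # prints "45.6 bytes per packet"
```

Under those assumptions the dropped traffic averaged roughly 46 bytes per packet, i.e. a large number of very small packets.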

 

IMPROVEMENTS

We are committed to improving your experience and are making the following improvements:

 

  • Work with Neustar to handle cases where the attack vector keeps changing.

  • Work with Neustar to terminate traffic over multiple GRE tunnels instead of just one. If this can be done, DNS traffic need not all be directed to our DNS nodes in a single DC and can instead be spread across multiple locations.

  • Network stack optimizations on DNS servers, to accept more packets.

  • Cross-check the current DNS servers and verify whether any optimizations can be made to increase DNS throughput.
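One common example of the network stack tuning mentioned above is enlarging the receive buffer of a UDP socket, so that short bursts are queued rather than dropped. The sketch below is an illustration only: this report does not specify which optimizations are planned, and the function name and the 4 MiB request are assumptions chosen for the example.

```python
import socket


def udp_socket_with_buffer(requested_rcvbuf=4 * 1024 * 1024):
    """Create a UDP socket and ask the kernel for a larger receive buffer.

    Returns the socket and the buffer size the kernel actually granted,
    which can differ from the request (on Linux it is limited by the
    net.core.rmem_max setting).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


if __name__ == "__main__":
    sock, granted = udp_socket_with_buffer()
    print(f"receive buffer granted: {granted} bytes")
    sock.close()
```

Comparing the granted size with the requested size shows whether a system-wide limit would need to be raised before a larger buffer takes effect.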

 

CONCLUSION

We set out to build a highly resilient Anycast Managed DNS service backed by mitigation services provided by Neustar, and this attack was the first to cause intermittent outages in the past year. While we have lessons to learn and improvements to make, we remain confident that this is the right strategy for us.

=======================================================================

Update: 01:00 PM 

 

The load on our DNS servers has normalised and the issue has now subsided.  

DNS services should start to function normally.

Please write to the support team if you feel the issue persists. 

 

 

=======================================================================

 

We are currently facing an issue on the Bigrock DNS servers, on account of which DNS records (A, MX, PTR, etc.) cannot be fetched.

 

Name Server Details:

DNS1.BIGROCK.IN

DNS2.BIGROCK.IN

DNS3.BIGROCK.IN

DNS4.BIGROCK.IN
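To check whether one of the name servers listed above is answering, a single A-record query can be sent over UDP port 53. The sketch below builds the query by hand using only the Python standard library. The function names, the two-second timeout and the domain example.com are assumptions for illustration; replace example.com with a domain hosted on these name servers.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def server_answers(server, name, timeout=2.0):
    """Return True if the server sends back any reply matching our query ID."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and name resolution failures
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]


if __name__ == "__main__":
    for ns in ("DNS1.BIGROCK.IN", "DNS2.BIGROCK.IN",
               "DNS3.BIGROCK.IN", "DNS4.BIGROCK.IN"):
        print(ns, "answered" if server_answers(ns, "example.com") else "no reply")
```

A reply only shows that the server is reachable and responding; it does not confirm that the returned records are correct.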

 

We are working in co-ordination with the data center to mitigate this issue as soon as possible. Please follow this thread for more information.

Feel free to contact the support team in case of any further queries or concerns.