Connectivity issues in the India GPX datacenter
Posted on 10 September 2016 10:04 AM
RCA: GPX Downtime
HISTORY
We received alerts showing that all the links on one of our routers had gone down. On investigation we found hardware alarms. We tried swapping and resetting the MIC, but it didn't help. JTAC (Juniper Technical Assistance Center) identified a hardware issue and placed an RMA (Return Material Authorization). There was no downtime associated with this hardware failure, as all the traffic was automatically switched over to RTR1 via VRRP.
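As a rough illustration of why the first failure caused no downtime, the sketch below models a simplified VRRP-style election: the healthy router with the highest priority owns the virtual gateway address. This is an assumption-laden model, not our actual configuration. The name `failed-router`, the priority values, and the `healthy` flag (standing in for missed VRRP advertisements) are hypothetical, and real VRRP also handles tie-breaking by IP address and preemption, which are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int         # hypothetical VRRP priority; higher wins the election
    healthy: bool = True  # stands in for "still sending VRRP advertisements"


def elect_master(routers: List[Router]) -> Optional[Router]:
    """Return the healthy router with the highest priority, or None if all are down."""
    candidates = [r for r in routers if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)


# Hypothetical two-router VRRP group; only the name RTR1 comes from this report.
failed = Router("failed-router", priority=200)
rtr1 = Router("RTR1", priority=100)
group = [failed, rtr1]

print(elect_master(group).name)  # failed-router (both routers healthy)

failed.healthy = False           # first hardware failure
print(elect_master(group).name)  # RTR1 takes over the virtual gateway

rtr1.healthy = False             # second failure: nothing left to fail over to
print(elect_master(group))       # None
```

With only two routers in the group, redundancy covers a single failure; once the remaining router also fails, there is no candidate left to take over.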
ISSUE
The next day, we received alerts confirming that RTR1 had also gone down, affecting all the services hosted out of the GPX India data center. On troubleshooting we found that it was similar to the incident that happened on 9th September 2016. Reseating the MIC and rebooting the router did not help.
We started to work on a temporary solution, as getting the hardware replacement was not going to be possible until Tuesday 13th September 2016. We obtained a few advanced licenses from Juniper for two of the EX4200 switches (which act as an aggregation layer), moved the ISP links to those switches, and configured the BGP sessions to bring the DC online. The alerts gradually cleared and services from the India GPX data center started working around 05:00 GMT.
The back-to-back failure of two routers made us doubt that this was a hardware failure, so we asked senior JTAC engineers to check it. Upon investigation they confirmed that the issue was caused by:
Since the root cause of the issue was identified and the routers didn't have stability issues, we decided to move all the traffic back to the routers at around 01:30 GMT on 12th September 2016 (the time when we have the least traffic). This resulted in a 2-minute downtime due to route convergence on internet routers.
We are still waiting on answers from JTAC regarding some critical business assumptions and actions that could have helped resolve this incident faster.
============================================================================
Update: We have been able to restore the network connectivity and the servers seem to be working fine now.
============================================================================