Connectivity issues in the India GPX datacenter
Posted by Jaison N on 10 September 2016 10:04 AM

RCA: GPX Downtime

HISTORY
10 AUGUST 2016

We upgraded Junos from 11.4R4.4 to 13.3R9 about 30 days before the incident, because version 11.4 had reached end of life (EOL).


PRECAP
09:00 GMT 09 SEPTEMBER 2016 - RTR2 Failure

We received alerts that all the links had gone down. On investigation we found hardware alarms. We tried swapping and resetting the MIC, but it didn't help. JTAC (Juniper Technical Assistance Center) diagnosed a hardware issue and placed an RMA (Return Material Authorization). There was no downtime associated with this failure, as all the traffic was automatically switched over to RTR1 via VRRP.

 

ISSUE
03:30 GMT - 10th September - RTR1 Failure

The next day, we received alerts confirming that RTR1 had also gone down, affecting all the services hosted out of the GPX India data center. On troubleshooting, we found that the failure was similar to the incident on 9th September 2016. Reseating the MIC and rebooting the router did not help.
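The failover behaviour described above can be illustrated with a minimal sketch. It assumes RTR2 was the VRRP master before its failure and uses hypothetical priority values; real VRRP also involves advertisement timers, preemption and tie-breaking, which are left out here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int          # hypothetical VRRP priority; higher wins
    healthy: bool = True   # False once the router stops forwarding


def elect_master(routers: List[Router]) -> Optional[Router]:
    """Return the healthy router with the highest priority, or None."""
    candidates = [r for r in routers if r.healthy]
    if not candidates:
        return None  # no gateway left: services are unreachable
    return max(candidates, key=lambda r: r.priority)


# Assumed starting state: RTR2 is master, RTR1 is backup.
rtr1 = Router("RTR1", priority=100)
rtr2 = Router("RTR2", priority=200)
group = [rtr1, rtr2]

before = elect_master(group).name        # RTR2 carries the traffic

rtr2.healthy = False                     # 9th September: RTR2 fails
after_rtr2 = elect_master(group).name    # RTR1 takes over, no downtime

rtr1.healthy = False                     # 10th September: RTR1 fails
after_both = elect_master(group)         # no master remains

print(before, after_rtr2, after_both)
```

With one router down the group still has a master, which is why the first failure caused no downtime. Once both are down there is nothing left to elect, which matches the outage on the 10th.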


TEMPORARY SOLUTION
05:30 GMT - 10th September 2016

We started to work on a temporary solution, as the hardware replacement would not be available until Tuesday 13th September 2016. We obtained a few advanced licenses from Juniper for two of the EX4200 switches (which act as an aggregation layer), moved the ISP links to them, and configured the BGP sessions to bring the DC online.

Gradually all the alerts cleared, and services from the India GPX data center started working around 05:00 GMT.


HARDWARE FAILURE RCA
08:30 GMT - 10th September 2016

Two routers failing back to back made us doubt that this was a hardware failure, so we asked senior JTAC engineers to investigate. They confirmed that the issue was caused as follows:

  1. When the Juniper routers were delivered, the MIC cards were installed in a restricted slot.

  2. After the upgrade, licensing for these slots changed from an honour system to a restrictive one in the new Junos release.

  3. This caused Junos to disable those interfaces after 30 days.
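As a quick check of this timeline, the sketch below adds the 30-day period to the upgrade date given under HISTORY. It assumes the period is counted in whole calendar days from 10 August 2016; the exact time of the upgrade is not stated, and the function and constant names are illustrative only.

```python
from datetime import date, timedelta

UPGRADE_DATE = date(2016, 8, 10)    # Junos upgrade date, from HISTORY
GRACE_PERIOD = timedelta(days=30)   # "after 30 days", from the JTAC findings


def grace_period_end(upgrade_date: date, grace: timedelta = GRACE_PERIOD) -> date:
    """Return the calendar date on which the assumed grace period runs out."""
    return upgrade_date + grace


expiry = grace_period_end(UPGRADE_DATE)
print(expiry.isoformat())
```

Under these assumptions the period ends on 9 September 2016, the day RTR2 failed, with RTR1 following early the next day.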


ROUTERS ONLINE

01:30 GMT - 12th September 2016

Since the root cause had been identified and the routers had no stability issues, we decided to move all the traffic back to the routers at around 01:30 GMT on 12th September 2016 (the time when we have the least traffic). This resulted in a two-minute downtime due to route convergence in internet routers.


CONCLUSION

We are still waiting for answers from JTAC on some critical business assumptions and on actions that could have helped resolve this incident faster.

 

============================================================================

Update:

Our Networking Team has been able to correct the connectivity issues. Please reach out to our Customer Support Team if you experience any additional interruptions going forward.

============================================================================

Update:

We have been able to restore the network connectivity and the servers seem to be working fine now.

We are monitoring for any further issues and will keep you updated.

============================================================================

Hello All,

We are facing network issues with some of our servers, resulting in intermittent issues with websites and emails. Our network team is working on getting this fixed as soon as possible.

We will keep this thread updated with the latest developments. We apologize for any inconvenience this has caused.