Please Read: Control Panel Database Outage RCA and Maintenance Update
Posted by Chhaya P on 11 May 2016 12:58 PM

Summary of the Control Panel database outage:

What happened?

  • At around 01:00 AM GMT (06:30 AM IST) on May 9th 2016, latency on the Orderbox (Control Panel) platform started increasing and some requests started to time out.
  • At the same time, the firewall at our corporate head offices malfunctioned, and for a period of time we could not access the servers or the various control panels. Because of this, we initially believed that our Data Centre was under a DDoS attack, and we spent some time investigating that.
  • Our technical staff at other locations were able to confirm that the Data Centre was not under attack, and we then started isolating the real issue.
  • On debugging with logs and various metric dashboards, we found that certain SQL queries were taking too long to complete and were thereby blocking other queries. This caused database connections to pile up on the application servers until all available database connections to the system were exhausted.
  • To resolve the issue, we killed the offending queries and optimised the tables. We then put up an emergency maintenance notice and restarted the web apps to clear the blocked connections.
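
To illustrate the failure mode described above, here is a minimal Python sketch of how long-running queries might be picked out of a snapshot of running queries, and how a full connection pool could be detected. It is not the tooling we used: the RunningQuery record, the 60-second threshold, the pool size and the sample queries are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    """One row of a hypothetical 'currently running queries' snapshot."""
    query_id: int
    seconds_running: float
    sql: str


def find_offending_queries(queries, max_seconds=60.0):
    """Return queries running longer than max_seconds, longest first."""
    slow = [q for q in queries if q.seconds_running > max_seconds]
    return sorted(slow, key=lambda q: q.seconds_running, reverse=True)


def pool_is_exhausted(active_connections, pool_size):
    """True when no free connection is left for new requests."""
    return active_connections >= pool_size


# Made-up snapshot: two slow queries hold their connections while
# short queries queue up behind them.
snapshot = [
    RunningQuery(101, 4.0, "SELECT ... FROM orders WHERE id = ?"),
    RunningQuery(102, 950.0, "SELECT ... FROM orders JOIN ..."),
    RunningQuery(103, 310.0, "UPDATE domains SET ..."),
]

offenders = find_offending_queries(snapshot, max_seconds=60.0)
for q in offenders:
    # In a real system this is where an operator would review the query
    # and, if appropriate, kill it using the database's own tooling.
    print(f"query {q.query_id} has been running for {q.seconds_running:.0f}s")

# Assumed pool size of 200 with every connection in use.
print("pool exhausted:", pool_is_exhausted(active_connections=200, pool_size=200))
```

Sorting the candidates by running time puts the most likely blockers first, which is usually the order in which an operator would want to review them.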

Next steps:

To prevent such issues from happening again, we will take the following actions:

  • Investigate the offending queries and tables, identify why latency increased suddenly, and take relevant preventive measures.
  • Review the redundancy measures for access to our Data Centres in case of corporate network issues, and implement the necessary measures.
  • Fail over to a new database server to rule out hardware issues.

Maintenance window:

  • As mentioned above, to rule out any hardware issues we will be performing a failover from the current primary database server to a standby. The schedule is as follows:

           Date: May 22nd 2016
           Start Time: 02:30 AM GMT (08:00 AM IST)
           Duration: 1 hour

  • After the failover, we will also run stress tests / burn-ins on the server that is currently the primary, to identify any hardware issues.
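
For readers in other time zones, the window above can be converted with a short Python sketch like the one below. It assumes only the schedule stated in this notice (start at 02:30 AM GMT on May 22nd 2016, duration 1 hour) and the fixed IST offset of +5:30. The variable and function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# GMT is treated as UTC here; IST is a fixed +05:30 offset.
GMT = timezone.utc
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Schedule taken from the notice above.
start_gmt = datetime(2016, 5, 22, 2, 30, tzinfo=GMT)
end_gmt = start_gmt + timedelta(hours=1)


def in_zone(moment, tz):
    """Format an aware datetime in the given time zone as YYYY-MM-DD HH:MM."""
    return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M")


print("Start (IST):", in_zone(start_gmt, IST))
print("End   (IST):", in_zone(end_gmt, IST))
```

To check the window in another zone, replace IST with a timezone object built from that zone's offset.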