Austin Outage issue Posted by on 25 April 2014 12:33 PM
[Updates 12:16 I.S.T. July 31, 2014] I want to take a few minutes to highlight some of the recent updates & milestones in what has been a long and complex restoration effort over the past few months. Many of these efforts have been described in our earlier posts, and I am recapping them here briefly.
Next Steps
Dushyanth Harinath
==================================================================
Updates on the email outage at our Austin data center
[Updates 18:20 I.S.T. June 16, 2014] The email recovery process is proceeding in the manner that I had highlighted earlier, and as of this writing we have completed the restoration for a large number of email accounts as part of our Phase 1 efforts. The recovery process works by incrementally syncing any additional emails that we recover to your 'restored_emails' folder, and this process will continue each day until we have determined that no more data can be recovered. This recovery effort continues to be a significant effort and priority for my teams and myself personally. Please stay tuned for further updates. In the meantime, please feel free to connect with our support teams if you have any additional questions. Thank you.
[Updates 12:00 I.S.T. June 2, 2014] As detailed in my last post, our team has been in the process of restoring data in two distinct phases which use different technology/processes. As of this update, I am happy to report that the first phase of data recovery is nearing completion. Large amounts of data (multiple TBs) have now been processed as part of the Phase 1 data restoration project. At this time, our team is running a secondary process to sync this recovered data with each affected user's email account. Given the large volume of data, we expect this process to continue for the next several days. Emails that are being restored are being placed in a distinct folder called 'restored_emails' in your account. We continue to correspond with individual users via email during this time to keep them abreast of progress for their accounts.
While we work on completing this first phase, the recovery team is simultaneously working on an additional advanced recovery process ('Phase 2') which is a much more intensive and low-level bit-by-bit restoration that may allow us to recover additional data that was not recovered during the first phase. Please stay tuned for further updates on how this second phase is progressing. In the meantime, please feel free to connect with our support teams if you have any additional questions. Thank you.
[Updates 12:00 I.S.T. May 22, 2014]
As mentioned in our last post, we are working on recovering data in phases. Phase 1 of the recovery process is currently in progress and we have started to recover emails across affected accounts. Phase 1 will continue for at least a couple of additional weeks. Our customer facing teams are reaching out to customers via emails to inform them of the status of recovery for each account and we have done this across a large number of affected users already. This communication is going out every day now as we continue to restore additional data.
While the Phase 1 recovery process is running, the data recovery team is working on building a Phase 2 recovery program (which is a variation of the Phase 1 program but utilizes additional techniques to support even lower level bit-by-bit data recovery) simultaneously. With the Phase 2 process we expect to recover additional data that was not recovered during the Phase 1 process.
Our engineering teams continue to be fully engaged in this activity and are leaving no stone unturned. We are fully aware that we are now into the 4th week of this recovery process. However, given the complexity of the effort, we are expecting this to take several weeks more. We thank you again for your continued patience. This remains a very critical issue for us and will continue to be until we exhaust each and every avenue available to us to recover your data. Thank you.
[Updates 18:30 I.S.T. May 21, 2014] In the aftermath of the email storage outage on 24th April, we had set up interim storage devices so that mail services could be resumed for affected accounts. In the meantime, we have worked to create a more hardened storage infrastructure. We will now be migrating the affected accounts from the temporary servers to this infrastructure. The maintenance is planned to be conducted as below :-
Date :- 22nd May 2014
We regret the inconvenience this may cause; thank you for your continued patience. If you have any queries regarding this, feel free to reach out to our support teams.
[Updates 10:50 I.S.T. May 17, 2014] Summary Email restoration process updates
Email restoration process
The effort our engineering teams have put in over the last few weeks is starting to yield some concrete results. Over the past few weeks, the engineering team has built a multi-phased automated recovery and restore process for the storage cluster. The first phase of the recovery process is to recover files that are undamaged or are partially damaged. At this time, we are recovering in excess of 500 GB of data a day and we expect that this phase will complete in about 2 weeks under current conditions and estimates. Once we conclude this first phase, we will be launching future phases to recover additional types of data using different means. Over the last few days we have managed to restore selected emails for a small number of user accounts as part of this first phase of operations. It is important to note that the restored emails in this phase may not be the complete set of emails in these email accounts. Once we complete the first phase of operations highlighted above, my team will continue the process in subsequent phases to restore additional data, if possible. As explained in our previous posts, we are unable to predict or place any guarantees on the quantum or percentage of recovery for specific users. However, we remain committed to recovering all the email data that we possibly can through this multi-phased process. We have started to reach out to users individually where required to keep them apprised of this progress.
Looking ahead
I spoke about revisiting some of our systems and processes in light of this outage in my previous post. We take a lot of pride in the faith you put in our services and we engineer all our products to meet the highest availability & safety standards - right from selecting the best in class data centers, hardware, ISPs, DDOS mitigation, storage systems - to building processes that allow our teams to efficiently operate and manage our systems.
Post this outage, my team and I have conducted a detailed review of each of these systems we have in place and also conducted meticulous planning for additional scenarios to boost our preparedness for a wide range of possible issues. Our primary objectives as part of these sessions have been to review and analyze processes and systems:
Our systems have always been designed with these objectives in mind. We have always built our systems to very high standards and provided you with strong services and SLAs across a range of products for well over a decade. However, this detailed review has highlighted a few specific areas where we can improve, and my team and I are working rapidly to resolve these gaps.
FAQ
I am posting here some additional FAQs for users whose emails may have been recovered partially or not at all and who have received an indication to this effect.
Next steps In conclusion, I want to reiterate the following.
Updates on the email outage at our Austin data center [Updates 12:26 I.S.T. May 14, 2014] Summary
- Post-mortem findings
- Details on the email restoration effort
- Frequently asked questions
- Next steps
Post-mortem for the outage
Our senior tech and management team has concluded a detailed root-cause analysis for this outage, and here are the findings. At around 2 PM IST on Thursday, April 24th, we were in the process of provisioning a new storage cluster at one of our global data centers. As part of this process, an aggregate (which is a collection of multiple disks spanning RAID groups) holding the production data and backup snapshot volumes for some of our email users was rendered inoperable while attempting to build a new aggregate. We noticed the services on this storage cluster failing almost immediately and this was highlighted by our network and systems operations team, which operates 24x7 across all our facilities and services. We immediately halted the offending deletion/new aggregate creation processes as soon as we detected the issue. Our first goal at this point was to restore email services, which we did as soon as possible, by migrating all users on this storage cluster to a different cluster. We simultaneously started work with a team of experts to re-build the storage cluster and to bring it back online - and those efforts continue as of date, with more updates in the following sections. In response to this incident, our senior technical team immediately put into place a series of stringent change control measures and oversight, which went over and above the systems already in place, to ensure that no further opportunity exists for an outage caused by a similar event in the future.
The email restoration process
At around 10 PM IST on Thursday, April 24th, we initiated a process of bringing the storage cluster back online by constituting an advanced team dedicated to this purpose. This process involved reconstruction of individual files across the storage cluster from the ground up.
This restoration process, as we have highlighted earlier, is very time consuming since it requires us to reconstruct the storage cluster and all of its components from the ground up. A team of engineers has been working on this effort non-stop from the day of the outage. The restoration effort involves a multitude of software, scripts, hardware systems, and manual inspection processes to carefully rebuild the cluster. We were advised at the very onset that this process will take several weeks and our communication to you has been consistent with this. A secondary task force composed of our senior most engineers and managers has been reviewing progress every day and continues to do so. As our engineers have progressed further on this restoration effort, they have also advised us that certain files on the aggregate are not recoverable. They estimate that at least 25% of the underlying data (not to be confused with users/mailboxes) on the storage cluster is not recoverable. We do not have the ability at this time to tell you what this means for your individual mailbox or your account. However, this restoration effort continues and will continue until we are certain that we have done everything in our power to restore emails for each and every affected user - though we cannot guarantee the end results - and we apologize for presenting you with this uncertainty. Our goal has been and will continue to be to share meaningful updates as soon as possible.
Frequently asked questions
Over the past few weeks, we have had conversations with a number of you about this outage. We have created the following FAQ based on the questions we have heard most often. We look forward to addressing any additional questions over email / phone.
What about backups? Don't you have any?
We spoke about this earlier in our post from April 26th. Our backup strategy was to create periodic snapshots of the email data.
Given that these snapshots were stored on the same storage cluster, which is now offline, we have no access to them at this time.
Why is this taking so much time?
The team needs this time for the following reasons. They are working on building the meta-data of mailbox files for each individual account on this storage cluster. They are writing scripts/tools to assist with both the manual and automated restoration of files on the cluster. They are running tests to validate that the restored data is valid and usable. All of these processes require building a series of complex software systems and specialized hardware to operate on those systems. The team also has to double back often and try out new approaches when certain efforts do not yield the desired results. We remain committed to allowing this team, which we believe is composed of some of the best engineers in the industry, time to complete this effort.
Is my data lost? Will you ever be able to recover it?
We cannot and don’t want to promise specific outcomes given that the restoration process has not progressed to a stage that will allow us to do this. At some point over the next few weeks (and we don’t know the exact date since this is a complex effort akin to a reasonably large software/hardware engineering project) we will know definitively the status of each and every affected email account. At that point we will communicate specifically with you to talk about the results that our engineering team has been able to generate. At this time, the goal for this team is to re-build and re-construct the cluster and we request your patience while they continue to do that.
These answers are not sufficient - you need to tell me more
We are sorry if these answers appear to be insufficient. This forum post reflects the most recent update we have at this time.
Please realize that our front line support team and our senior managers are all working hard to create the best possible outcome they can under the circumstances. We would be glad to answer a different question if you call or raise a ticket with us. What are you doing to prevent this or something like this from happening again? We realize that this incident impacts your trust in us and our services. When we learned of this outage, our senior most technical engineers immediately put in place a series of measures to ensure that a similar outage or issue would not happen again on our systems. We also commenced a detailed deep dive into our systems and processes which goes far beyond this particular incident with the goal of demonstrating the rigor and confidence with which we build and deliver our services to you. We do not take this responsibility lightly. We know you deserve more detailed understanding of the work we are doing in this area and we will share this with you over the coming weeks. Next steps In conclusion, we want to reiterate the following.
Thank You.
[Updates 13:35 I.S.T. May 9, 2014] Our engineering teams are continuing the restoration effort. As highlighted earlier, this process will take several weeks since it is a complex engineering effort involving multiple engineers, locations, hardware and software resources. Given our goal of providing meaningful information to you as it becomes available, we will post on this forum as soon as we have significant information to share. In the meantime, we remain available to discuss this with you at all times through our regular support channels, i.e. via calls on 91-22-30797979 and on tickets at https://support.bigrock.com.
================================
Updates on the email outage at our Austin data center
[Updates 11:20 I.S.T. May 8, 2014] At this time we have no additional information to provide over and above our update from yesterday. Efforts continue unabated and the primary purpose of sending this update is to continue our engagement with you every 24h as promised and to reassure you that we continue to drive this issue with the highest priority.
================================
[Updates 9:00 I.S.T. May 7, 2014] Our engineering teams are continuing to attempt email restoration through a number of approaches at this point - both manual as well as automated. As highlighted earlier, a large part of the effort to date has been towards building processes, deploying hardware, and developing software and scripts to drive this project further. Our engineering team continues to emphasize that while they are making progress, the expected timeline for completing the recovery effort is several weeks, and even at the conclusion of those efforts we have no guarantees on the recoverability of emails for specific accounts. Over the next few days, our engineers have indicated that they may be able to partially recover some emails for certain users as they make their initial pass through their system, and we will continue to communicate with those users as and when that happens. In the meantime, we appreciate very much your patience and understanding while our teams continue their efforts.
=========================
[Updates 09:10 I.S.T. May 6, 2014]
We have no significant updates to present today related to the restoration process. Our tech teams are attempting a number of restoration methods that are intensive in terms of people, hardware, and time. These efforts are currently underway 24x7 across our global teams as indicated earlier. We will continue to update this thread through this process. =========================
[Updates 11:30 I.S.T. May 5, 2014] - We have restored IMAP services for all customers that are affected by this outage. You will now be able to access your emails from multiple devices.
=========================
[Updates 17:00 I.S.T. May 4, 2014] - At this time we have no additional information to provide over and above our update from yesterday. Efforts continue unabated and the primary purpose of sending this update is to continue our engagement with you every 24h as promised and to reassure you that we continue to drive this issue with the highest priority. - This forum will continue to be our primary mechanism for providing updates as soon as available. In the meantime we truly appreciate your patience. Thank you. =========================
[Updates 17:00 I.S.T. May 3, 2014] - As highlighted earlier our engineering team continues their effort of creating a framework to allow the restoration activities to continue in their second week. They are making fair progress towards this goal. =========================
[Updates 20:00 I.S.T. May 2, 2014]
Thank you as always for your patience. We realize that this has been a challenging week. We are doing our absolute best and our senior technology & management teams are monitoring this process very closely. Our next update will be in 24H or earlier in case of any additional developments =========================
[Updates 20:11 I.S.T. May 1, 2014] Restoration efforts continue around the clock as indicated earlier and unfortunately we have no further updates to share at this time other than that which was shared previously. We will post any additional information on timelines or progress on this forum as it becomes available. Thank you for your continued patience and understanding.
=========================
[Update 13:05 I.S.T. 01/05/2014] Our Engineering team continues work on email recovery and to bring the storage cluster online; we are starting to see some limited progress. This process is very time-intensive, and we continue to work 24x7 on it. As we communicated earlier, it is likely to take several weeks to complete this process.
Thank you for your continued patience and understanding.
=========================
[Update 21:00 I.S.T. 29/04/2014] Summary
Updates on recovery process
The process to bring back the storage cluster continues at this time. We have started to see some progress from the engineering team around recovering critical pieces of meta-data that will aid in rebuilding the storage cluster, identifying its contents, and starting the process of recovering the emails contained therein. Please note that this process is highly time-intensive since it requires rebuilding individual files distributed across multiple and redundant disks & re-constructing the inbox structure for all affected users piece-by-piece. As highlighted in yesterday's update, this will take several weeks to complete. While our engineering team remains hopeful that we will be able to restore a portion of the data, we are unable to share specifics on exactly how much or what is recoverable at this time. It is our commitment to continue updating you throughout this process.
Activities in the next 24H
Here is the broad set of activities our various teams are working on over the next 24H
Timeline for next update
We will send out the next update in the next 24H (or earlier in case of any other important developments).
=========================
[Update : 09:00 I.S.T. 29/04/2014] How did this happen? What are you doing to get my mail back? Are my messages gone forever? It is possible that some of your messages may be unrecoverable; however, at this time we have not completed our work on the restoration. We will continue working on our recovery efforts, with an eye towards restoring as many messages for as many customers as quickly as possible.
Should I be worried that this is going to happen again?
=========================
[Update : 23:14 I.S.T. 27/04/2014] Apologies for the delay in updating the forum post.
=========================
Updates on the email outage at our Austin data center [Update : 22:00 I.S.T. 26/04/2014] Why is bringing this storage cluster online taking so long?
=========================
Updates on the email outage at our Austin data center [Update : 13:25 I.S.T. 26/04/2014] As previously mentioned, our Engineers are still working round the clock to try and restore all mails from the storage units. We still do not have any ETA on this but we will make sure we update this forum with relevant information as soon as we get any update from them. We appreciate your patience and co-operation for the same.
=========================
Updates on the email outage at our Austin data center [Update : 09:15 I.S.T. 26/04/2014] While we are working to restore access to all services, we have taken some additional proactive steps to ensure that users who are currently using IMAP and have older emails stored (downloaded through their local clients such as Outlook or Thunderbird) continue to have them saved through this outage. Detailed information pertaining to this change is given below: What are we changing?
We are disabling the IMAP functionality for a subset of domains that were affected by the email issue.
Why are we disabling IMAP?
While the restoration of the historic data is still in progress, we are trying to ensure that the users who have a local copy of their emails do not get affected when they try to sync their email accounts via an email client like Thunderbird or Outlook. Once IMAP is disabled, there won't be any changes to the emails that are currently synced.
I don't want to use webmail. How can I continue to use my mail client?
We strongly recommend you use webmail to access your emails. However, if you wish to use your email client, you may have to use POP for your email account while keeping your previous IMAP settings intact. Here are the detailed steps explaining how to use POP for your email account.
Important Note:
Please note that you should NOT remove the existing IMAP account. You have to create a new POP account in the same email application (Outlook, Thunderbird, etc.).
Thank you for your continued patience and support.
=========================
[Update : 20:00 I.S.T. 25/04/2014]
Here is a broad time line and summary for this outage:
- At around 14:50 IST on April 24th we detected that one of the storage units which serves part of our email infrastructure became non-operational causing email services for a subset of our customers to not function.
- We immediately started a process of restoring access to this storage unit and our engineers worked dedicatedly over the next few hours to restore functionality to affected email accounts.
- As of 22:00 IST on April 24th access to all email accounts affected by this issue was restored. All affected accounts were at that time able to send and receive email via webmail and through external clients.
- Email that was queued up during this outage was relayed to individual email accounts after email services were restored and should be working without pause at this time.
- As of this update, we are still working to restore this storage unit to a fully operational state. This process is taking longer than we originally anticipated.
- We currently have some of the best storage unit experts in the industry working round the clock to bring the unit back online. We are currently being advised that this process can take several days to complete.
What is the current impact? Is anything still non-functional?
Many customers that were affected by this downtime are currently unable to view emails received prior to this outage, specifically, those emails that were stored on the storage unit that suffered the outage and were being accessed either via webmail or IMAP. If, however, you were downloading your emails to a local email client such as Outlook or Thunderbird using the POP protocol, then there is no further impact to your service due to this issue.
When will you fix this issue? What can I expect in the interim?
Our goal is to restore access to your stored emails at the earliest possible time and we are sparing no effort towards this. We care deeply about ensuring that you have best-in-class services and we apologize deeply for the inconvenience caused to you by this outage. We are now being advised that bringing up the affected storage unit is, unfortunately, not a matter of hours but of days. In the interim we will endeavor to keep you updated as best we can with any and all meaningful updates because we do care about getting you back up and running fully. As much as it pains us, we cannot offer a concrete ETA at this time but will continue to update you regularly.
What happened? Can you explain in detail what really went wrong?
One of the storage units that stores email for a subset of our customers went offline as a result of a set of pre-planned activities we were working on to bring a new storage cluster online. We build our systems with multiple layers of fail-safes and safeguards, and we are still in the process of doing a post-mortem on what we could have done better to prevent this from happening. We promise you a full account of this when we complete this investigation, along with a detailed summary of steps we will take to prevent a similar event in the future.
What can I do if I have additional questions?
While we realize you will still have questions, we hope this post adds to your understanding of this outage. We promise to be here to work with you and provide transparent & meaningful updates as soon as we are able. This post will be the primary medium of communication with you and as such we are directing our contact center agents to refer you here so we can have clear and consistent communication with you. However, we will continue to be available on 91-22-30797979 and on tickets at https://support.bigrock.com should we be able to offer any additional assistance during this time. Thank you for your continued patience and support.
=========================
Apologies for the delay in updating the forum post. We have been working round the clock to restore access to mails received prior to this outage. Newer emails should be sent/received without any issues.
=========================
[Update : 2130 I.S.T.] Currently most of the mail accounts will be accessible and you should be able to send and receive mails. The emails that were queued up on our inbounds since the start of the outage are being released slowly and all such mails should be delivered in another hour.
=========================
[Update : 1800 I.S.T.] At around 14:00 IST / 08:30 GMT today, one of our storage units which hosts email accounts went down. We are currently working with our storage vendors to bring it back up. The impact of this is that accounts hosted on this unit are not able to access their emails. In the meantime, we are setting up an additional channel so that new emails (received after 14:00 IST) can be received. Once the storage unit is fully functional and we complete analysis, we will restore access to old emails.
=====================================
[Update : 1530 I.S.T.] The problems have been identified as issues with a storage unit. Our System Administration Team is hard at work getting this fixed. Stay tuned to this post for more updates.
=====================================
Reported issue : 1450 I.S.T. We are currently facing intermittent issues with mail service hosted on our Austin mail servers and Enterprise Email. If your domain’s MX records are pointing to the below MX records, your services might be affected :-
or
You can check your domain's MX records using the link - http://www.mxtoolbox.com. You might face errors while logging in to your webmail account. Our system admin team is already working on this and is trying to fix it as soon as possible. We sincerely regret the inconvenience caused.
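As a rough illustration of the MX check described above, the Python sketch below compares a domain's MX hosts against a list of affected hosts. It rests on several assumptions: the host names in `AFFECTED_MX` are placeholders to be replaced with the MX records listed for the affected service, the function names are hypothetical, and the input text is assumed to look like the output of a lookup such as `dig +short MX yourdomain.com` (one "priority hostname" pair per line).

```python
# Placeholder host names only; substitute the MX records listed for the
# affected service. These values are NOT the real affected records.
AFFECTED_MX = {"mx1.affected.example", "mx2.affected.example"}


def normalize(host: str) -> str:
    """Lower-case a hostname and drop surrounding whitespace and any trailing dot."""
    return host.strip().rstrip(".").lower()


def parse_mx_lines(text: str) -> list[str]:
    """Extract hostnames from lines shaped like '10 mx1.example.com.'."""
    hosts = []
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "priority hostname" lines.
        if len(parts) == 2 and parts[0].isdigit():
            hosts.append(normalize(parts[1]))
    return hosts


def is_affected(domain_mx: list[str], affected_mx: set[str]) -> bool:
    """Return True if any of the domain's MX hosts is in the affected set."""
    return any(normalize(host) in affected_mx for host in domain_mx)


if __name__ == "__main__":
    # Sample lookup output, made up for illustration.
    sample = "10 mx1.affected.example.\n20 mx2.other.example.\n"
    print(is_affected(parse_mx_lines(sample), AFFECTED_MX))
```

A domain whose MX records match none of the listed hosts would not be served by the affected mail servers, so the check returns False for it.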