On Saturday 27th May, one of the busiest travel days of the year, BA’s IT systems went offline, effectively stopping the UK side of the business in its tracks and causing almost all flights from Heathrow and Gatwick to be cancelled.
Below is a summary of what happened and of the ongoing investigations.
What is clear in all of this is that BA believed they had everything covered for Continuity, DR and Backup. However, until an event occurred that seemed simple enough and was probably planned for (an instant power failure followed by an unplanned return of power), the behaviour of their systems in response to it was untested. By planned but untested we mean: have you ever simply pulled the plug on one side of a whole live IT system, at one of the busiest times for the business, to see if the redundant side works and business continues? Your upper management probably believe they are covered on this, but have they ever given approval to do it? Controlled, planned power downs and controlled, planned system failovers, probably; but simply pulling the plug, probably not, because of the potential damage to hardware and corruption of software.
So what happened?
There is a continuing internal investigation into the issue, and an independent review has also been commissioned. It is understood that BA has over 200 critical systems that need to interact to ensure each passenger and their baggage get on the right flight and that, when they arrive at their destination, the country they are arriving in knows about it. Each of these 200 critical systems will be made up of multiple servers and software, all of which need to interact, from ticketing systems to baggage handling to on-board meals. The entire BA UK IT infrastructure is said to span more than 500 cabinets in six halls across three different data centres, two of which are believed to act in an Active:Active configuration, with the third used for offsite backup. It is also known that BA has been moving jobs from the UK to be outsourced offshore with India’s Tata Consultancy Services (TCS).
Just to put this into perspective for your business: do you use AWS, Azure or similar? The size and redundant design of the BA systems will most likely be very similar; there are only so many ways to design something like this. So, you may be reading this thinking this issue could only affect companies the size of BA, but if you use a Cloud for hosting systems then you are probably vulnerable to the same event, however unlikely it may be. The key is ensuring you hold your data somewhere separate (BA had their third Data Centre, which is Air-Gapped), so that if the worst did happen you would know you had a copy of your data somewhere you could get hold of it. Alternatively, make sure your DR plan fails over to a Cloud/Infrastructure separate from your Primary.
So how did a giant such as BA get affected so badly? Bad investment, no DR plan, poor management, sending jobs offshore, the cleaner unplugging the power to plug the vac in? The current explanation is that the outage was caused by a power surge in one of the UK datacentres, which took that site offline; why this would affect another site is unclear. It is believed that this surge was due to a contractor at the datacentre inadvertently switching off a UPS. This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries... After a few minutes of this shutdown, power was turned back on in an unplanned and uncontrolled fashion, which caused physical damage to the systems and significantly exacerbated the problem.
So what about the second data centre in the Active:Active setup? How does a power surge affect a completely different site? This will be the interesting part of the investigation. Did the unexpected behaviour of the first site, which lost power, cause corruption of the data in the second site? Did the return of power cause confusion in the failover automation? Ultimately, were the offsite backups the only reason the systems came back online? And lastly, did the outsourcing of IT offshore affect the speed of recovery, for better or worse? All of this is currently unknown but will no doubt come out in the internal or independent enquiry, or both.
Do you need to do anything?
At the very minimum, ensure you can always access your data by following the 3-2-1 backup rule. This includes hosted clouds such as Azure or AWS: do you have a separate source of access to your data if they went offline? If you want to go a bit further, ensure you can fail your IT systems over as per your DR plan, and test this. As with BA, the unexpected may happen even where you thought you were covered; there may always be something. The more you test and plan, the better you can react if the unexpected does strike.
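The backup rule above is commonly read as: keep at least three copies of your data, on at least two different types of media, with at least one copy offsite. As a rough illustration only, the check can be sketched in a few lines of Python. The BackupCopy type, the satisfies_321 function and the example inventory are hypothetical names made up for this sketch; they do not describe BA's setup or any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of your data (hypothetical model, for illustration only)."""
    name: str      # where the copy lives, e.g. "production SAN"
    media: str     # type of storage, e.g. "disk", "tape", "cloud"
    offsite: bool  # True if held away from the primary site


def satisfies_321(copies):
    """Return True if the copies meet the 3-2-1 rule as commonly stated:
    at least 3 copies, on at least 2 media types, at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Example inventory (assumed values): live data, a local backup,
# and a separate offsite copy.
inventory = [
    BackupCopy("production storage", "disk", offsite=False),
    BackupCopy("local backup appliance", "disk", offsite=False),
    BackupCopy("offsite backup provider", "cloud", offsite=True),
]

print(satisfies_321(inventory))
```

A check like this only confirms that the copies exist on paper. It says nothing about whether you can actually restore from them, which is why testing your recovery still matters.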
How can FCS help?
As well as the above, you could look at Email Security software to help filter emails before they reach the end user and your systems. Our Backup service is Air-Gapped from your systems, so if the worst does happen you can be sure your data is safe and recoverable offsite with us. Lastly, most organisations cannot afford to be without their key systems for very long. In the recent WannaCry Ransomware attack, the NHS had to revert to pen and paper in some areas for a while. We can provide a Disaster Recovery solution for some or all of your IT systems. A simple dashboard gives you the ability to recover your systems offsite in our data centres, quickly and simply.
FCS. Make us your first resource, not your last resort…