The importance of Disaster Recovery and High Availability
The recent British Airways IT failure has brought to light one of the key questions IT Managers and CTOs are likely to be very used to asking: with so much at stake, why don’t businesses take their data centre backups more seriously?
It’s not that anyone believes their bosses don’t think data is important (or, at least, we hope not); it’s more that the focus today is on malicious data hacks, so Disaster Recovery (DR) and High Availability (HA) aren’t given the budget they need. While the BA failure was a terrible event for all involved, the silver lining is that we can expect businesses to give their own data security and backup systems greater consideration.
On paper, High Availability seems like everything budgeteers dislike – lots and lots of redundant elements. But that’s the whole point of HA; all the hardware and software is deployed in a multi-node environment (for the sake of simplicity, think of a node as a big computer or server). This often means having two or more of everything, with the duplicates geographically dispersed (to protect against a site problem, like a flood or fire).
It’s all about making sure that, through any event, your servers are still operational with as little downtime as possible. The benchmark most often quoted for system availability is known as the “five 9s” – 99.999% availability – where the servers are almost completely available whatever the eventuality (assuming a global blackout doesn’t happen, or the Earth doesn’t trip and fall into the Sun).
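To put the “five 9s” into perspective, it helps to turn the percentage into an actual downtime allowance. The short Python sketch below does that arithmetic; the function name and the 365-day year are our own assumptions for illustration, not part of any formal standard.

```python
# Rough downtime budget for a given availability figure.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowance, in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for label, pct in [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

On those assumptions, five 9s leaves a little over five minutes of downtime across an entire year, which is why it takes so much redundancy to achieve.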
The most common HA environments feature two redundant front ends (‘front ends’ are the servers that run the applications, databases, or programs) with a highly redundant storage backend (storage has less chance of failing, but still needs multiple power supplies, sources, drives etc.). The connections between the front and backend are all redundant too, as are the power supplies and power sources supplying everything. The more thorough your layers of redundancy, the safer your data will be.
With all the HA and redundancy going on, it’s less likely that a recovery will be required. However, no matter how well you protect yourself from data loss and downtime, there’s still a chance – even if it’s that remote 0.001% – that something can go wrong. Sometimes, it’s out of your control. But that makes having a DR system in place even more essential. Sure, 0.001% might not seem particularly hazardous, but bear in mind that the aforementioned British Airways disaster is expected to cost around half a billion pounds if you tally the compensation, damage to reputation, and the 4% drop in share price of their parent company, IAG. We all sleep better when there isn’t a 0.001% chance we’ll lose £500 million before morning.
Disaster Recovery is what happens when the HA environment fails. Assuming Earth hasn’t fallen into the Sun, the world needs to keep moving after a data loss disaster, and your DR site is what picks up the pieces and carries on. All the programs, databases, applications, and hardware at the DR site can be recovered in order to resume business as usual. The speed of your data recovery and the point in time from which you want to recover your data, usually set as a recovery time objective (RTO) and a recovery point objective (RPO), will depend on your budget.
Budgets are decided, at the most basic level, on income/available capital minus expenses. Most businesses would include insurance as a direct expense. When it comes to IT, budgets are often tight due to business leaders seeing IT as an indirect expense rather than a driving force for their business. An insufficient budget can lead to the inability to provide effective HA or DR – essentially meaning no insurance for your IT functions. Or, the HA or DR environments might exist, but not at the sophisticated level your business needs.
If HA fails, the DR should kick in. You can either automate this failover (the more expensive option), do a manual failover (less expensive), or implement a recovery from backup (the cheapest, but slowest, option). What you need to do is look at your business needs; ask yourself if you can handle having no sales database for 10 minutes, let alone 10 hours. How much money are you likely to lose if systems are not available for an hour, day, or week? If there is a disaster in your system, who is going to take the rap for it? Factor these things into your decisions about how robust your environment should be. To make these calculations easier, you can use the DSP Disaster Recovery Calculator.
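If you would rather rough out the numbers yourself first, the trade-off between the three options can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: every figure below (revenue per hour, recovery times, running costs, one outage a year) is made up, and the class and function names are hypothetical, so swap in your own estimates.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    recovery_hours: float  # hypothetical time to get back to business as usual
    annual_cost: float     # hypothetical yearly cost of running this option


def outage_loss(revenue_per_hour: float, hours_down: float) -> float:
    """Revenue lost while systems are unavailable."""
    return revenue_per_hour * hours_down


def total_annual_cost(option: RecoveryOption,
                      revenue_per_hour: float,
                      outages_per_year: float) -> float:
    """Running cost of the option plus the expected loss from outages."""
    loss = outage_loss(revenue_per_hour, option.recovery_hours)
    return option.annual_cost + loss * outages_per_year


# Illustrative figures only; none of these come from a real business.
options = [
    RecoveryOption("Automated failover", recovery_hours=0.25, annual_cost=60_000),
    RecoveryOption("Manual failover", recovery_hours=2, annual_cost=30_000),
    RecoveryOption("Recovery from backup", recovery_hours=10, annual_cost=10_000),
]

for option in options:
    cost = total_annual_cost(option, revenue_per_hour=20_000, outages_per_year=1)
    print(f"{option.name}: roughly £{cost:,.0f} per year")
```

With these invented numbers the cheapest option to run turns out to be the most expensive overall once a single ten-hour outage is counted. Your own figures may point the other way, which is exactly why the sums are worth doing.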
One example of an automated failover is Database Recovery As A Service (DBRaaS). This is a cost-effective solution for disaster recovery that automatically copies your data to a secure cloud server. With data centres located in the UK and globally, you’ll never need to wait long to recover your data. It’s one of the most affordable ways available to have peace of mind.
Finally: test, test, and test again. The failures at British Airways should never happen to a modern business. If BA had HA and a DR in place (if they didn’t, it would have been bordering on reckless criminality), then we can assume it wasn’t properly tested. While they are claiming a power surge caused the data crash, it’s more likely that the cascade effect of one critical system failing (due, perhaps, to a power surge, a violent knock, a barrage of arrows, or someone reversing the polarity) led to a situation where they could not re-sync the systems easily. With proper testing, these sorts of error could have been highlighted and prevented beforehand.
A business’s data is extremely valuable and the idea of losing it might keep those in the IT department on edge if they don’t have the proper budget to implement High Availability and Disaster Recovery. HA is nothing to be laughed at, and you need a reliable DR in your house, rather than, say, Dr Evil or the Dr from the game, Operation.