Resilience may be called all kinds of things – fault tolerance, redundancy, data preparedness. However, before you start making your IT system blast-proof, it's a good idea to be sure of what it means to your organisation, as this post outlines.
At its core, resilience is the practice of keeping applications running in the presence of system degradations and failures. Approaching it in a formal, quantitative way will reveal limits of your systems that other methods wouldn't – limits you don't want to discover when live data or the bottom line is at stake. It's also something of an arms race: the more systems we use, the more complicated the operation becomes and the more resilience it needs.
That's certainly true in the cloud, where data and applications overlap across your organisation and the internet as a whole, but it applies equally right down in the cores of processors. America's Lawrence Livermore National Laboratory notes that the next generation of supercomputing will be so complex – with so many processes running concurrently – that resilience will have to be built into applications rather than just hardware.
And as data scientists Al Geist of Oak Ridge National Laboratory and Franck Cappello of Argonne National Laboratory wrote recently, faults will be the norm: 'The major challenge in resilience is that faults in extreme scale systems will be continuous rather than an exception event'.
The trick to systems resilience is to be proactive and predictive rather than reactive – pinpointing and patching a fault after it happens costs more than planning ahead does, and not just in financial terms. If you run an ecommerce site, for example, an outage will cost you a lot of trust and goodwill among customers.
Treat systems resilience the way you treat the acquisition and deployment of the systems themselves. Formalising the process will teach you a lot about how far you can stretch your resilience and where you need to establish protection. After doing your sums, you might decide you're willing to sacrifice 20% of a particular system's performance because the cost of maintaining it at 100% isn't justified by the financial performance of the business units it supports.
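That kind of sum can be surprisingly simple. Here is a toy version in Python – every figure is a made-up placeholder, not from the article – comparing the expected annual cost of downtime against the annual cost of maintaining full redundancy:

```python
# Hypothetical figures for illustration only.
HOURLY_REVENUE = 2_000          # revenue attributed to the system per hour (assumed)
EXPECTED_OUTAGE_HOURS = 10      # estimated unplanned downtime per year (assumed)
FULL_REDUNDANCY_COST = 50_000   # annual cost of a fully mirrored setup (assumed)

# Expected annual loss if you accept the outages rather than prevent them.
expected_downtime_cost = HOURLY_REVENUE * EXPECTED_OUTAGE_HOURS

if FULL_REDUNDANCY_COST > expected_downtime_cost:
    print("Full redundancy costs more than the outages it prevents")
else:
    print("Full redundancy pays for itself")
```

With these particular numbers, accepting some downtime is the cheaper option – which is exactly the sort of trade-off the formal exercise is meant to surface.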
Replicating or mirroring systems is a longstanding favourite defence against faults – whether that means buying two identical PCs or running two exact copies of the same dataset in the cloud, so that if the live copy is corrupted you can swap in a clean one immediately. One financial provider, Seattle Bank, recently talked about its experience of using a hybrid cloud approach to provide redundant data.
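The mirroring idea above can be sketched in a few lines of Python. This is an illustration only – the data, the corruption check and the failover logic are all stand-ins, not any particular product's behaviour:

```python
import copy

live = {"orders": [101, 102, 103]}
replica = copy.deepcopy(live)   # maintained as an exact, clean copy

def is_corrupt(data):
    # Stand-in integrity check; real systems use checksums or audits.
    return "orders" not in data

# Simulate corruption of the live copy, then fail over to the replica.
live.pop("orders")
if is_corrupt(live):
    live = copy.deepcopy(replica)   # promote the clean copy immediately
```

The essential point is that the clean copy already exists before the fault occurs – the swap itself is trivial.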
In the cloud era the problems that can arise from data or system faults are almost limitless. Applications are often multi-tenanted, use shared platforms, compete for processor resources and bandwidth, communicate over a patchy internet and might run on ageing hardware, so resilience takes a far more engineered approach. As this study by Microsoft outlines, processes can be retried, rerouted, leader transactions applied and much more.
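One of those recovery patterns – retrying a flaky call, backing off a little longer each time – can be sketched in plain Python. The failing service here is simulated; in practice the call would be crossing that patchy network:

```python
import time

def call_flaky_service(attempts_needed, state={"calls": 0}):
    # Simulated remote call that fails transiently before succeeding.
    state["calls"] += 1
    if state["calls"] < attempts_needed:
        raise ConnectionError("transient fault")
    return "ok"

def retry(fn, max_attempts=5, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                  # out of retries: surface the fault
            time.sleep(base_delay * 2 ** attempt)      # exponential backoff

result = retry(lambda: call_flaky_service(attempts_needed=3))
```

The backoff matters: immediate retries against a struggling service tend to make the degradation worse, which is the opposite of resilience.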
Aside from the obvious advantages of running processes in the cloud, research has shown that next-generation computer hardware simply might not be fault tolerant enough for tomorrow's needs. A traditional approach like checkpoint/restart won't work at exascale because the time it needs may exceed the mean time to failure of even the fastest supercomputers.
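For readers unfamiliar with the technique, here is a bare-bones illustration of checkpoint/restart in Python – the file name, step counts and interval are all invented for the example. Progress is persisted periodically so a restarted job resumes from the last checkpoint instead of from scratch:

```python
import json
import os

CHECKPOINT = "job.ckpt"   # illustrative file name

def run_job(total_steps=10, interval=3):
    # Resume from the last checkpoint if one exists.
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... real work would happen here ...
        if (step + 1) % interval == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump({"step": step + 1}, f)   # persist progress
    return total_steps

run_job()
```

The exascale problem described above is that writing those checkpoints takes so long, and faults arrive so often, that the job can fail again before it has even finished saving its state.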
Application-level resilience protects against faults at a lower cost, so if you're looking only at system continuity and protection, above a certain point it makes sense to scrap your hardware altogether and host and process everything in the cloud.
If you work in IT there's nothing to be said about backup you don't already know, but disaster recovery (DR) is an area where many organisations still fall down. As we discussed in a previous blog, do your contingency planning. When (not if) disaster strikes, you'll have a formal policy in place, every member of staff will know their part, and disruption to the business will be minimal.
Just like in every other area of IT, all of the above can be handed off to an expert in what's called 'resilience engineering'. Dedicated providers will put your systems to the test under extreme conditions to see how far they stretch before breaking, and can advise you on cost-effective ways of hardening your infrastructure and data to handle anything that comes up. And because complexity, cost and the need for protection all rise steeply as systems grow, bringing in a third-party expert becomes an even more valuable proposition the bigger you get.