“Most people don’t plan to fail. They fail to plan.”
John L. Beckley, founder of The Economics Press, Inc.
It is inevitable – systems will fail. If you bet against that truth when designing your systems and business practices, you will lose. Badly. Your systems will fail. Count on it.
Hardware will fail.
Software will fail.
People will fail.
Communications will fail.
Now that we’ve accepted that sad truth, the question arises: what do we do when it happens? And how can we minimize the impact when it does?
Enter downtime management.
Downtime management is a set of activities, procedures and plans that minimize service unavailability. These are critical for continuous performance, and they include conducting a detailed risk analysis and setting a regular maintenance schedule (patching, updating and upgrading).
Simply put, with downtime management we plan to fail, but fail gracefully.
How do we do it? There are several aspects – we may call them strategic – to ensuring an optimal workflow: being reasonable, being preventive, being prepared and being continual. Let’s see what lies beneath each of them.
Being reasonable in downtime management
Downtime management is a complex business process and it starts with defining service level objectives (SLOs).
We always advise our clients that they should be reasonable and objective.
When defining an SLO, estimate the percentage of time the service can afford to be unavailable. Be realistic and do not expect 100% uptime – that is simply not feasible. We also advise our clients not to overestimate: if their business can sustain 99% uptime, they should not plan for 99.99% uptime. Getting closer to the holy grail of 100% uptime increases costs exponentially.
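To make those percentages concrete, it helps to translate an uptime target into the downtime budget it actually allows per year. A quick back-of-the-envelope sketch:

```python
# Convert an uptime target into the downtime it permits per period.
# A quick sanity check before committing to an SLO.

def allowed_downtime_hours(uptime_pct: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at the given uptime percentage."""
    return period_hours * (1 - uptime_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {allowed_downtime_hours(target):.2f} h/year")
```

At 99% uptime you may be down roughly 87.6 hours a year; at 99.99% the budget shrinks to under an hour – which is why each extra nine costs so much more.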
Being preventive in downtime management
The fact that systems will fail at some point does not mean we should not actively try to prevent failure.
Of course, it all starts with the design.
We always plan to prevent hardware failures as much as possible by designing a fault-tolerant, no single-point-of-failure (SPOF), geographically diverse solution (multiple data centres, multiple availability zones, multi-region, multi-cloud). We take into consideration correlated and cascade failures, large-scale outages and positive feedback failures. Also, we make capacity plans based on systems running in a degraded state. If this is not feasible, we plan for graceful degradation of services and swift recovery.
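The multi-region idea above can be sketched in a few lines: try regions in priority order and degrade gracefully if none respond. This is a minimal illustration, not our actual implementation; the region names and the `fetch` callable are hypothetical placeholders.

```python
# Minimal sketch of region failover: try each region in priority order
# and degrade gracefully if all fail. Region names are hypothetical.

REGIONS = ["eu-west", "eu-central", "us-east"]

def fetch_with_failover(fetch, regions=REGIONS):
    """Call fetch(region) against each region until one succeeds.

    Returns (result, degraded); degraded is True when the primary
    region was unavailable or when every region failed.
    """
    for i, region in enumerate(regions):
        try:
            return fetch(region), i > 0
        except ConnectionError:
            continue  # correlated failures: fall through to the next region
    # All regions down: degrade gracefully instead of crashing.
    return None, True
```

The `degraded` flag is what lets the rest of the system switch to a reduced service level instead of failing outright.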
We advise our clients to develop software solutions that are modular so that we can design separately for each module. Every element of the system must be monitored since that will provide us with early warnings of possible failure and log and audit trails which will help in finding the problem and devising a solution.
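Per-module monitoring of the kind described above can be as simple as tracking recent health-check results and flagging modules whose failure rate trends upward – the early warning the text mentions. A toy sketch, with illustrative module names and thresholds:

```python
# Toy per-module health monitor: keep a sliding window of check results
# and flag modules whose recent failure rate crosses a warning threshold.

from collections import deque

class ModuleMonitor:
    def __init__(self, window: int = 10, warn_ratio: float = 0.3):
        self.window = window          # how many recent checks to keep
        self.warn_ratio = warn_ratio  # failure ratio that triggers a warning
        self.history = {}             # module name -> deque of bools

    def record(self, module: str, healthy: bool) -> None:
        self.history.setdefault(module, deque(maxlen=self.window)).append(healthy)

    def warnings(self):
        """Modules with a full window whose failure rate exceeds the threshold."""
        return [
            m for m, h in self.history.items()
            if len(h) == h.maxlen and h.count(False) / len(h) >= self.warn_ratio
        ]
```

A real setup would feed this from health-check endpoints and wire the warnings into alerting, but the principle – watch every element, warn before it fails – is the same.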
Testing each module separately and using canaries to anticipate capacity problems are also part of what we do during the implementation phase to prevent software errors and capacity-related failures.
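A canary check typically routes a small fraction of traffic to the new version and compares its error rate against the stable one before rolling out fully. A simplified sketch of that comparison, with illustrative thresholds:

```python
# Simplified canary verdict: pass if the canary's error rate is no more
# than max_ratio times the stable error rate, once enough requests have
# been observed. Thresholds here are illustrative, not recommendations.

def canary_healthy(canary_errors, canary_total, stable_errors, stable_total,
                   max_ratio=2.0, min_requests=100):
    if canary_total < min_requests:
        return None  # not enough data to decide yet
    canary_rate = canary_errors / canary_total
    stable_rate = max(stable_errors / stable_total, 1e-9)  # avoid divide-by-zero
    return canary_rate <= max_ratio * stable_rate
```

If the canary fails the check, the rollout stops and traffic shifts back to the stable version – failure is caught on a small slice of users instead of all of them.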
Being prepared in downtime management
The best time to prepare is before the failure happens. Preparedness starts at the design phase, where we create procedures for dealing with outages and failures. We also define procedures for planned downtime – the process of shutting down parts of the system in a way that does not violate the desired SLO. We use planned downtime for upgrades and testing.
These procedures should be reviewed and improved constantly, and thoroughly tested by scheduling simulated outages in order to train people and test emergency procedures. We take special care in defining correct communication procedures where we try to prohibit assumptions, as these can be a big problem when dealing with unplanned outages.
Being continual in downtime management
Once the system enters the production phase, downtime management does not end. It is a continual process in which we constantly analyze logs to catch signals of possible failures and then act on them. However, failures will happen. In such cases, our job is to recover the system as fast as we can, but also to perform a post-mortem analysis of the failure and deduce the next steps to prevent it in the future. This also means holding ‘lessons-learned’ meetings, improving our procedures and, if necessary, rethinking and redesigning the troubled part of the system.
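The log-analysis step can be illustrated with a tiny scanner that counts warning- and error-level lines per component and surfaces components trending toward failure. The log format and threshold here are hypothetical:

```python
# Tiny log-scanning sketch: count WARN/ERROR lines per component and
# surface components that cross a threshold. Log format is hypothetical,
# e.g. "[ERROR] auth-service: timeout connecting to db".

import re
from collections import Counter

LINE_RE = re.compile(r"^\[(?P<level>\w+)\] (?P<component>[\w-]+):")

def failure_signals(lines, threshold=3):
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") in {"WARN", "ERROR"}:
            counts[m.group("component")] += 1
    return [component for component, n in counts.items() if n >= threshold]
```

In production this would run continuously over a log pipeline rather than a list of strings, but the goal is the same: spot the signal before it becomes an outage.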
Making sure there are no hiccups, or at least reducing the risk to a minimum, has a major impact on your business’ credibility and reputation. This is why downtime management is an integral part of our cloud management services, which serve to provide the best outcome for our customers. And theirs.