In 2016, IHS Markit surveyed 400 companies and found downtime was costing them a collective $700 billion per year. How do you estimate your own cost?
A number like that makes for sensational headlines, but it's hard to wrap your head around. Per-company figures are more down-to-earth: Gartner cites $5,600 per minute (about $300,000 per hour) in its estimates. We ran the numbers for the top US e-commerce sites to see what an hour of downtime costs them. For example, Amazon.com would lose $13.22 million for a single hour of downtime.
But what's your cost? This post will help you get an idea.
During the Downtime
As soon as your service degrades or crashes, you start losing money, so first tally the most obvious cost: the revenue you forfeit every moment you're down. Call it R.
If you make money from ads, R is lost ad revenue. (By one 2015 estimate, Facebook was then apt to lose $1.7 million per hour.) If you run an e-commerce store, it's the number of lost sales times the average sale amount. If you're a ride-hailing service, it's the number of failed hails times the expected average fare.
Then there's E, the cost of lost employee productivity. In the IHS Markit survey, an incredible 78% of that $700 billion was from E; just 17% was from R.
From the first moment of downtime, it's all hands on deck: the engineers drop everything and hole up in a room together; the support team struggles to tamp down swelling ticket and phone queues; and the executives, if it's bad enough, work with PR to start apologizing to stakeholders. Add to E the hours spent by every affected employee, multiplied by their hourly pay plus benefits.
The Aftermath
Unfortunately, the costs keep accruing after your service is back up. At a minimum, engineers need to find the root cause and design safeguards against future outages (which should be easier if you have an SEV Management Program). So keep adding to E for any employees dealing with the aftermath.
Next, if you're a B2B company, your customers probably lost some revenue, too. (Amazon's S3 outage last year cost its customers around $150 million.) Add anything you owe customers to another figure, C. If you have a service-level agreement (SLA) and you breached it, prepare to pay up. Also add to C any money that, while not contractually due, you pay as penance. If you run an airline and strand tens of thousands of customers, you may be buying hotel rooms or comping flights.
What's the Damage?
The total cost of downtime (COD), then, is easy to calculate:
COD = R + E + C
If you incurred other significant miscellaneous costs (for outside consultants, for the recovery of lost data, etc.), compile those into one more figure, M, and add it to the equation: COD = R + E + C + M.
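As a rough sketch, the tally above can be written out in a few lines of Python. The class name, the helper function, and every figure below are hypothetical, chosen only to illustrate the arithmetic; substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Cost components of a single outage, all in the same currency."""
    lost_revenue: float                 # R: revenue forfeited while down
    employee_cost: float                # E: lost employee productivity
    customer_compensation: float = 0.0  # C: SLA credits and goodwill payments
    miscellaneous: float = 0.0          # M: consultants, data recovery, etc.

    def total(self) -> float:
        """COD = R + E + C + M."""
        return (self.lost_revenue + self.employee_cost
                + self.customer_compensation + self.miscellaneous)


def staff_cost(hours_and_rates):
    """Sum hours * hourly pay-plus-benefits over each group of affected staff."""
    return sum(hours * rate for hours, rate in hours_and_rates)


# Hypothetical figures: 40 engineer-hours at $100/hour and
# 25 support-hours at $50/hour, during and after the outage.
e = staff_cost([(40, 100.0), (25, 50.0)])

outage = OutageCost(
    lost_revenue=60_000.0,
    employee_cost=e,
    customer_compensation=10_000.0,
    miscellaneous=2_000.0,
)
print(f"COD = ${outage.total():,.2f}")
```

Keeping the four components separate, rather than storing only the sum, makes it easy to see which one dominates for your business.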
Now that you've got a ballpark number for one outage, how do you estimate yearly downtime cost? That's not so straightforward; not every outage is equal. But if you divide COD by the number of hours in that outage, and you know roughly how many hours you were down in the past year, just multiply those two figures to approximate yearly cost. Make sure the example outage you picked to calculate COD was neither especially impactful (e.g., occurring on Black Friday) nor insignificant; use an outage of average severity.
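The yearly approximation is a simple rate calculation. Here is a minimal sketch, assuming a hypothetical function name and made-up inputs, and assuming the sample outage is of average severity:

```python
def yearly_downtime_cost(outage_cod, outage_hours, yearly_downtime_hours):
    """Approximate yearly cost as (COD per hour of a typical outage) * hours down per year."""
    if outage_hours <= 0:
        raise ValueError("outage_hours must be positive")
    cost_per_hour = outage_cod / outage_hours
    return cost_per_hour * yearly_downtime_hours


# Hypothetical inputs: a 3-hour outage that cost $90,000 in total,
# and roughly 20 hours of downtime over the past year.
estimate = yearly_downtime_cost(90_000.0, 3.0, 20.0)
print(f"Estimated yearly cost: ${estimate:,.2f}")
```

This is a linear extrapolation from one outage, so treat the result as a ballpark rather than a forecast.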
Hidden Damage of Downtime
For the yearly cost of downtime, how many SREs could you set to the task of preventing it? You'll never eradicate downtime, even if you put every penny of those costs towards preventive efforts, yet minimizing it is still worthwhile. Why? Because downtime has other, incalculable costs.
How many would-be customers read about your last outage and decided not to sign up? You cannot know, but it isn't zero. How many existing customers churned? A drop in your NPS may give you an idea (here's how a raft of outages dropped Telstra's score), but suffice it to say, customers won't put up with one outage after another.
Maybe the most pernicious cost of downtime, especially if it's chronic, is its drain on employee morale. And employees may not keep it to themselves. Word gets out.
SRE Bob: Hey Alice, how's $NEW_JOB going?
SRE Alice: Meh. The on-call is miserable. Engineers deploy code without testing it.
Bob: Really?
Alice: Yeah. But we're hiring a lot. You should come have lunch and see the office!
Bob: Ok, maybe!
Bob, of course, is only being polite.
Reducing Downtime with Chaos Engineering
Hopefully, unlike Alice's company, 1) you test your code thoroughly, and 2) your software engineers and SREs work in harmony. But these practices alone won't minimize your downtime.
Modern software architectures are more distributed than ever. Long gone are the days of applications running in one server rack, or even one datacenter. At first blush, distributed applications would seem to be more reliable, and in many ways, they are. But they're also incredibly complex, often glued together by a mixed bag of third-party services running halfway around the world. (In the IHS Markit survey, network interruptions were the number one cause of downtime.)
Beyond code test coverage and DevOps culture, mature teams practice Chaos Engineering. They proactively run Gamedays to unearth weaknesses in their architecture before those weaknesses cause downtime. But they know chaos doesn't mean all madness and no method: they thoughtfully plan chaos experiments rather than kill services with reckless abandon. Whether you're a veteran or a newbie in the Chaos Engineering community, come say hello on Slack!