When Facebook, Instagram, and WhatsApp all went dark on Oct. 4, Courtney Nash recognized the outage for what it was: a potential disaster. The hours-long outage didn’t just stop the global flow of dog videos and vacation updates from your uncle in Wyoming; it effectively shut down the Internet for a large portion of the world. For billions of people, those services are the Internet, and with all of them offline, communications in many countries simply disappeared.
“When these systems like Facebook or WhatsApp go down, it’s almost life-threatening for a lot of people,” said Nash, a researcher who studies the way that large systems fail and what the consequences of those failures are.
Even though it lasted only about six hours, a failure like the one at Facebook, which was caused by an erroneous command pushed to the company’s global backbone network, can have massive downstream effects and trigger cascading failures across the globe. The effects can last for days in some cases, but once the incident is resolved most of the world moves on to the next shiny new failure. Many large platform providers and site owners that experience outages, security incidents, and other issues publish after-action reports or post mortems laying out the root cause and how the issue was resolved. Some of those are written in plain English, while others are highly technical. But regardless of format, those reports usually live on the affected company’s website, anyone who’s interested has to hunt them down, and doing any kind of trend or common-cause analysis across them is difficult.
Nash wanted to fix that. So for the last year she has been collecting reports of all kinds of outages, security incidents, data loss incidents, connection problems, and many other types of problems and building out a comprehensive, searchable database. The result is the Verica Open Incident Database (VOID), which comprises reports on nearly 2,000 software-related incidents culled from around the web.
“I was starting to study failure modes for Kubernetes and other systems. No one had ever collected all of these reports together in one place before. There are security related ones, but it doesn’t exist for availability. They’re all scattered and some are intentionally obfuscated,” Nash said.
“I wanted to pull it all together for other researchers and press and analysts. The whole goal is to share this and make things safer and more reliable for everyone.”
One of the accepted methods for figuring out what happened in an incident is a root cause analysis (RCA). Digging into the details and teasing out exactly what caused an outage or security incident is standard practice, but Nash said RCAs can lead organizations to focus too much on a single root cause at the expense of understanding how the system actually failed. Only about 25 percent of the reports in the VOID include a public RCA, and Nash would like to see that number drop further.
“If you’re doing the root cause analysis, that can lead to some unhealthy behaviors in companies. We don’t have a lot of research on that,” she said.
Another way that organizations often look at the impact of outages or incidents is the mean time to resolve (MTTR) metric. Of the more than 1,800 reports in the VOID database, more than half of them were resolved in less than two hours. Is that good? Does that measurement mean anything? What does MTTR even tell you? Nash isn’t sure.
“On a purely statistical basis, if you don’t have a good distribution, the mean and median don’t tell you anything. Mean isn’t a good indicator,” she said.
“Also, how do you know if your time to resolve is good, and what to do about it if it isn’t? Or if it is good, what are the processes that made it that way?”
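Nash’s statistical point is easy to see with a toy example. Incident durations tend to be heavily skewed, and a handful of marathon outages can drag the mean far above what a typical incident looks like. The short sketch below uses made-up durations, not actual VOID data, to illustrate why:

```python
import statistics

# Hypothetical incident durations in minutes. These are illustrative
# values, not data from the VOID. Most incidents resolve quickly,
# but a couple of long-tail outages skew the distribution.
durations = [12, 25, 30, 41, 55, 70, 90, 110, 480, 2160]

print(f"mean:   {statistics.mean(durations):.1f} min")    # 307.3
print(f"median: {statistics.median(durations):.1f} min")  # 62.5

# The two outliers pull the mean roughly five times above the median,
# so a "mean time to resolve" figure says little about what a typical
# incident looks like.
```

In a distribution like this, reporting MTTR mostly tracks the outliers, which is Nash’s point: without knowing the shape of the distribution, the mean is not a meaningful summary.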
Along with the confirmed outages and other incidents, Nash also was interested in seeing how well the industry was doing at documenting and analyzing incidents that almost happened, but didn’t. Many industries, notably the airlines, routinely study near misses and publish analyses of them. But this isn’t yet common practice in the software or security industries. Nash was able to find just seven near-miss reports.
“The airline industry is adamant about this. When nothing bad happens, you have less finger pointing and blame, and you find more of the business and social pressures that can cause incidents,” Nash said. “When you pin it on humans, you create a situation where humans are reluctant to speak about incidents. As an industry, we’re actually pretty good at a lot of this stuff. And that’s not the message that gets out there.”
Nash plans to publish two “chunky” reports per year from the VOID data and is hoping to attract more data partners. Right now the data collection is done manually, something she wants to automate eventually. For now, Nash hopes organizations will take advantage of the VOID and use it to inform their own practices.
“I harbor this deep-seated hypothesis that companies that analyze this kind of data tend to be better at these activities in the long run,” she said.