Building Better Reports With a Data Pipeline
Recently, we showed you how we enhanced the Duo Dashboard and introduced several new and exciting reports. We've also presented some details about how we've built some of the new visualization tools that power these reports.
Let's dive a bit deeper into the data-driven systems that help make these new reports possible.
Where We Started
Historically, Duo has kept all of the data you see in the Admin Panel reports in a MySQL database. As a relational database, MySQL does a wonderful job of handling our customers' entity data, and helps us answer questions about the current state, like: What is the name of the customer with ID 12345? Or, what are all the devices associated with the user with ID 23456?
However, we were also using MySQL to store our event data, which is used to help answer questions about what happened, like: How many distinct users authenticated on February 16th? Or, how many users failed to authenticate because their devices were out of date in the month of March?
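The difference between those two kinds of questions can be made concrete with a small sketch. The schema and values below are hypothetical (sqlite3 stands in for MySQL so the example runs anywhere), but the contrast is the real one: entity questions are key lookups, while event questions aggregate over many rows.

```python
import sqlite3

# Hypothetical, simplified schema -- a stand-in for the real tables,
# using an in-memory sqlite3 database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE authlog (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        day TEXT,
        result TEXT
    );
    INSERT INTO customers VALUES (12345, 'Acme Corp');
    INSERT INTO authlog VALUES
        (1, 23456, '2018-02-16', 'success'),
        (2, 23457, '2018-02-16', 'success'),
        (3, 23456, '2018-02-17', 'failure');
""")

# Entity question ("current state"): answered by a primary-key lookup.
name = conn.execute(
    "SELECT name FROM customers WHERE id = ?", (12345,)
).fetchone()[0]

# Event question ("what happened"): requires scanning and aggregating
# event rows -- cheap here, expensive at billions of records without
# the right indexes.
count = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM authlog "
    "WHERE day = '2018-02-16' AND result = 'success'"
).fetchone()[0]

print(name, count)  # -> Acme Corp 2
```

At toy scale both queries are instant; the pain only shows up when the event table grows by orders of magnitude while the entity tables stay small.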
The Authentication Log
Our largest customer data set is called the Authentication Log ("Authlog" for short). At the time of this writing, there are over six billion records in the Authlog across all of Duo! While the Authlog is spread out among many MySQL databases, this is still no small amount of data.
It's important for both Duo and our customers that we have a complete history of every time a user authenticates, so we do not allow an authentication to proceed until we know that we have saved a record of it.
It's also important to Duo and our customers that users can authenticate as quickly as possible. These two factors pose a big challenge when it comes to both capturing new Authlog data as well as indexing existing Authlog data to provide compelling reports.
These challenges only compounded as the number of records in the Authlog continued to grow.
First Pass With Elasticsearch
Over time, Duo needed to provide better reporting facilities in the Admin Panel, but without strong indexing of the Authlog we struggled to answer even basic questions about our customers' event data.
We decided that MySQL was no longer the right tool to store event data, and so we set out to build a new data storage system that would better suit the use cases for us and for our customers.
We chose a database called Elasticsearch, which is designed with searching and data aggregation in mind, and can scale to the amount of data we need to index, with room to spare.
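As an illustration of the kind of question Elasticsearch makes easy, here is a sketch of an aggregation query. The field names and index layout are hypothetical, but the query shapes (a range filter plus a cardinality aggregation) are standard parts of the Elasticsearch query DSL.

```python
import json

# Hypothetical query: "how many distinct users authenticated on
# February 16th?" Field names here are made up for illustration.
query = {
    "query": {
        "range": {"timestamp": {"gte": "2018-02-16", "lt": "2018-02-17"}}
    },
    "aggs": {
        # cardinality is Elasticsearch's approximate distinct count
        "distinct_users": {"cardinality": {"field": "user_id"}}
    },
    "size": 0,  # we only want the aggregation, not the matching documents
}

# With the official Python client this would be sent as, roughly:
#   es.search(index="authlog", body=query)
print(json.dumps(query, indent=2))
```

Questions like this, which were painful against an under-indexed MySQL table, become a single request against a purpose-built index.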
Our first pass at this system introduced Elasticsearch into our production systems, with a process that every so often would pull the newest Authlog records out of MySQL and write them to Elasticsearch.
This preserved the write performance we had with MySQL, while still letting us heavily index our Authlog data and giving the Admin Panel a flexible system for searching and aggregating it.
The tradeoff was reduced consistency, as there would be a short delay (usually a few minutes) before a new Authlog record would appear in the Admin Panel, which we felt was acceptable.
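The sync process described above can be sketched as a checkpointed batch job. Everything here is a simplified stand-in (in-memory lists instead of MySQL and Elasticsearch, and hypothetical function names), not Duo's actual implementation, but it shows the shape of the loop and where the consistency delay comes from.

```python
# In-memory stand-ins for the two data stores.
mysql_authlog = [
    {"id": 1, "user": "alice", "result": "success"},
    {"id": 2, "user": "bob", "result": "failure"},
    {"id": 3, "user": "alice", "result": "success"},
]
elasticsearch_index = []

def fetch_since(last_id, batch_size=100):
    """Pull Authlog rows written after our checkpoint."""
    return [r for r in mysql_authlog if r["id"] > last_id][:batch_size]

def bulk_index(records):
    """Write a batch of records to the search index."""
    elasticsearch_index.extend(records)

def sync_once(last_id):
    """One pass of the periodic job; returns the new checkpoint."""
    batch = fetch_since(last_id)
    if batch:
        bulk_index(batch)
        last_id = batch[-1]["id"]
    return last_id

checkpoint = 0
checkpoint = sync_once(checkpoint)  # indexes all three records
```

Because new rows only reach the index on the next pass of the job, a freshly written authentication is invisible to search until then, which is exactly the few-minute delay the Admin Panel accepted.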
Second Pass With Apache Kafka
With Elasticsearch, we were able to generate new and informative reports in the Admin Panel. But the process we built for getting data from MySQL into Elasticsearch was very specific to those two systems and to Authlog data. Ultimately, we needed to connect customer event data to many places in near real time.
We needed a system that stores event data and then, in turn, streams it to many subscribers. We chose Apache Kafka to fill this need.
Kafka has several properties that made it an attractive choice for us. Kafka is a distributed application, meaning it runs on more than one machine, allowing it to replicate data to several different places at once and protect itself against hardware or software faults. Being a distributed application also allows us to add more machines into a cluster as our needs continue to grow. Finally, plenty of other companies use Kafka, giving us more confidence in it as a stable technology.
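The property that matters most for the pipeline is the publish/subscribe fan-out: one append-only event log, read independently by many consumers at their own pace. The toy class below illustrates that pattern in plain Python; it is a conceptual sketch, not Kafka client code, and all names are invented for illustration.

```python
from collections import defaultdict

class Topic:
    """A toy, single-partition event log, Kafka-style."""

    def __init__(self):
        self.log = []                    # append-only event log
        self.offsets = defaultdict(int)  # each consumer tracks its own position

    def publish(self, event):
        self.log.append(event)

    def poll(self, consumer):
        """Return events the named consumer has not yet seen."""
        start = self.offsets[consumer]
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

authlog_events = Topic()
authlog_events.publish({"user": "alice", "result": "success"})

# Independent subscribers each read the full stream at their own pace;
# adding a new consumer requires no change to the producer.
for_search = authlog_events.poll("elasticsearch-indexer")
for_ml = authlog_events.poll("ml-pipeline")
```

This decoupling is what the MySQL-to-Elasticsearch process lacked: with a shared log in the middle, new destinations subscribe to the stream instead of requiring another purpose-built copy job.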
With Elasticsearch and Kafka in place, we have the foundation of a new and exciting Data Pipeline service. In the future, we plan to connect new systems and databases, like machine learning pipelines or a webhook mechanism, to the firehose of customer event data. This will enable us to provide new insights to our customers and identify new ways to protect them from breaches.