Amazon Web Services (AWS) records a history of the API calls made in an account through a service named CloudTrail, which stores these logs in the data storage service S3. These logs are important for auditing what has happened in an AWS account. They can be used to understand errors that have occurred, review historical usage so that tighter IAM policies can be implemented (such as with CloudTracker), test ideas for new detection rules, investigate incidents, and more. To search these logs, you can download them and use grep or jq, but that can be slow. You could ingest them into a log analytics platform, but that can be expensive and difficult to maintain, and it requires you to consider resource consumption.
How it works
The cloudtrail-partitioner is based on work by Alex Smolen in his blog post Partitioning CloudTrail Logs in Athena. Our contribution is to make that work easier to run by incorporating it into a CDK (Cloud Development Kit) app and adding functionality to incorporate new regions and logs from new accounts automatically. Athena is a serverless AWS service that allows you to use a SQL interface to query data stored in S3 buckets.
When using Athena, you need to define a table that describes where your data is located and its format. You can additionally define “partitions”, which are based on the folder path structure, to limit the amount of data read. This is useful because the Athena pricing model is based on the amount of data read, so by defining which files should be looked at you can reduce your costs. In my experience, querying less data also results in the queries running faster.
The file path used by CloudTrail logs includes the year, month, and day. As these values change every day, you’ll need to create new partitions daily. AWS also periodically adds new regions, which are also part of the file path, so you’ll need to create new partitions to account for those as well. Finally, your company may add new AWS accounts, for which you’ll have to create new Athena tables. We built the cloudtrail-partitioner to perform all of these tasks automatically.
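To make the relationship between the file path and the partitions concrete, here is a minimal Python sketch that builds the kind of statement that registers one partition. It assumes the standard account-level CloudTrail layout (AWSLogs/<account>/CloudTrail/<region>/<year>/<month>/<day>/) and partition columns named region, year, month, and day. The bucket name and the function names are hypothetical, and this is not necessarily how the cloudtrail-partitioner builds its statements.

```python
from datetime import date


def partition_location(bucket, account_id, region, day):
    """Return the S3 folder holding one region's CloudTrail logs for one day.

    Assumes the standard account-level trail layout with no key prefix.
    """
    return f"s3://{bucket}/AWSLogs/{account_id}/CloudTrail/{region}/{day:%Y/%m/%d}/"


def add_partition_statement(table, bucket, account_id, region, day):
    """Build an Athena DDL statement that registers one region/day partition.

    The partition column names (region, year, month, day) are assumptions.
    """
    location = partition_location(bucket, account_id, region, day)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (region='{region}', year='{day:%Y}', "
        f"month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical bucket name; the account ID matches the placeholder table name.
    print(
        add_partition_statement(
            "cloudtrail_000000000000",
            "example-cloudtrail-bucket",
            "000000000000",
            "us-east-1",
            date(2019, 9, 30),
        )
    )
```

One statement like this is needed for every combination of account, region, and day, which is why doing it by hand quickly becomes tedious.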
To use the cloudtrail-partitioner, you’ll first need to edit a configuration file to define the S3 bucket that contains the logs and an SNS topic to send any errors to. We then recommend you run the cloudtrail-partitioner manually, which helps ensure things are set up correctly, allows you to use the Athena tables immediately, and creates partitions for the past 90 days by default. After you deploy the CDK app, a Lambda will be created that runs on a nightly schedule to create the new partitions. It figures out what CloudTrail logs you have, whether they are configured by the account or via AWS Organizations.
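As a rough illustration of what the initial backfill has to cover, the Python sketch below lists the days in a 90-day window. The function name is hypothetical, and counting today as one of the 90 days is an assumption rather than a description of the tool’s exact behavior.

```python
from datetime import date, timedelta


def days_to_partition(today, days_back=90):
    """Return the last `days_back` dates ending with `today`, oldest first.

    Assumption: today counts as one of the days in the window.
    """
    return [today - timedelta(days=offset) for offset in range(days_back - 1, -1, -1)]


if __name__ == "__main__":
    window = days_to_partition(date.today())
    print(f"{len(window)} days, from {window[0]} to {window[-1]}")
```

Each of these days needs a partition for every region in every account, whereas the nightly run only has to add the newest day.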
Using the Athena tables
Tables are created for each AWS account, with names that look like cloudtrail_000000000000. You can query those directly, or if you want to run a query across all account logs, a view is created named “cloudtrail”. Queries that filter on the partitions, such as the region and the date, read only the matching log files.
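As a sketch of a query that takes advantage of the partitions, the Python helper below builds a SQL string that you could paste into the Athena console. The partition column names (region, year, month, day) and the selected columns (eventtime, eventname, useridentity.arn) are assumptions about the table schema, and the helper name is hypothetical.

```python
def partitioned_query(table, region, year, month, day, limit=5):
    """Build a query restricted to one region and one day.

    Values are interpolated directly into the SQL text, so only pass
    trusted input. Column names are assumptions about the schema.
    """
    return (
        "SELECT eventtime, eventname, useridentity.arn "
        f"FROM {table} "
        f"WHERE region = '{region}' "
        f"AND year = '{year}' AND month = '{month}' AND day = '{day}' "
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(partitioned_query("cloudtrail_000000000000", "us-east-1", "2019", "09", "30"))
```

Because every partition column appears in the WHERE clause, Athena only needs to read a single day of logs from a single region.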
A more advanced query can be used to count errors by user across all accounts. This can be useful for finding applications that aren’t working correctly, or for identifying compromised applications that are attempting API calls they aren’t allowed to make.
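A query along those lines might look like the following sketch, which again builds the SQL text in Python. It runs against the “cloudtrail” view and assumes the columns useridentity.arn and errorcode plus the year, month, and day partition columns. The helper name is hypothetical.

```python
def error_counts_query(year, month, day, limit=50):
    """Build a query that counts errors per user and error code for one day.

    Uses the "cloudtrail" view so that all accounts are included.
    Column names are assumptions about the schema.
    """
    return (
        "SELECT useridentity.arn, errorcode, count(*) AS error_count "
        "FROM cloudtrail "
        "WHERE errorcode IS NOT NULL "
        f"AND year = '{year}' AND month = '{month}' AND day = '{day}' "
        "GROUP BY useridentity.arn, errorcode "
        "ORDER BY error_count DESC "
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(error_counts_query("2019", "09", "30"))
```

Restricting the query to a single day keeps the amount of data read small. Widening the date filter trades cost for coverage.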
Using Athena can be a cost-effective and low-maintenance way to give your teams an easy means of querying their CloudTrail logs using SQL. This solution makes it easy to set up the required tables and maintain the partitions, while following the best practices of infrastructure as code, least privilege, and monitoring for errors.
Try it out for yourself at https://github.com/duo-labs/cloudtrail-partitioner