Prepping for a Tsunami: Scaling with Amazon CloudFront
On June 25th, Duo Security’s website handled one of the largest traffic spikes we've ever seen: a 1400% increase. A well-received blog post was the culprit. Thanks to a great DevOps team and a little advance warning, we now have a simple solution that will scale our web presence for years to come.
Our initial implementation was built for our day-to-day load, but we weren't sure it could handle the influx we were expecting. With the help of beeswithmachineguns, we discovered that the EC2 instances, already tuned with built-in caching, could handle an extremely large amount of traffic. Adding more instances would happily cover what ended up being a 14x increase. But, there was a possible weak point: we previously offloaded SSL to Amazon’s Elastic Load Balancer (ELB).
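The post doesn't include our exact test commands, but a typical beeswithmachineguns session looks roughly like the sketch below. The server count, security group, keypair name, request numbers, and target URL are all illustrative, not our actual settings.

```shell
# Spin up 4 EC2 "bees" using an existing keypair and security group
# (all values here are hypothetical).
bees up -s 4 -g public -k my-keypair

# Hurl 100,000 requests, 250 at a time, at a staging target
bees attack -n 100000 -c 250 -u https://staging.example.com/

# Tear the bees back down when finished
bees down
```

Since the bees are just small EC2 instances running ApacheBench, they're cheap to spin up for an afternoon of testing and easy to discard afterward.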
The initial tests were far from great. We were getting less than 1/10th of the throughput our instances were ready and willing to serve. We assumed the SSL offloading was causing the huge dent until we noticed anomalies in testing: throughput seemed to slowly increase on its own. While ELB is 'elastic,' the elasticity comes gradually, roughly 50% more capacity every 5 minutes.
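Some quick back-of-the-envelope math shows why that gradual growth wasn't going to cut it. Assuming the roughly 50%-per-5-minutes growth we observed, and the 14x traffic increase we ended up seeing, an unwarmed ELB would need over half an hour to catch up:

```python
import math

# Back-of-the-envelope sketch, assuming ELB capacity grows ~50% every
# 5 minutes (the behavior we observed) and traffic jumps to 14x baseline.
growth_per_interval = 1.5   # +50% capacity per interval
interval_minutes = 5
target_multiple = 14        # traffic grew to ~14x baseline

# Number of 5-minute growth intervals for capacity to reach 14x baseline:
intervals = math.ceil(math.log(target_multiple) / math.log(growth_per_interval))
total_minutes = intervals * interval_minutes
print(intervals, total_minutes)  # → 7 35
```

Seven growth intervals, about 35 minutes of dropped connections. For a traffic spike that arrives all at once with a popular blog post, that's far too slow.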
Good to know, but we were expecting an all-out invasion.
Not to worry: Amazon allows you to pre-warm your load balancers. After filing a request with pie-in-the-sky estimates and getting assurance it was in place, we began what we thought would be our backup plan.
Onward to CloudFront
The duosecurity.com website is only available via HTTPS. In fact, the only non-HTTPS traffic we serve is a redirect back to HTTPS. For the sake of our public-facing site, we don’t want to use SNI (Server Name Indication) and frighten potential customers on older clients with possible certificate issues.
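The post doesn't show our actual server configuration, but a minimal nginx sketch of that HTTP-to-HTTPS redirect looks like this (the server name is ours; the rest is a generic pattern, not our exact config):

```nginx
# Hypothetical sketch: send all plain-HTTP traffic to HTTPS.
server {
    listen 80;
    server_name www.duosecurity.com;
    return 301 https://$host$request_uri;
}
```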
Thankfully, Amazon now provides the ability to use custom SSL certificates in CloudFront, their CDN (Content Delivery Network) solution. With this, we could potentially scale our site enormously with no extra management, get globally consistent page load times, and keep our current certificate.
The initial configuration seemed to be simple. Point CloudFront at our balancer, update DNS to direct clients to CloudFront, and we would be done.
Well... It Wasn’t Quite That Simple.
It wasn't immediately obvious, but you need to request specific access from Amazon just to enable the functionality. Even then, it still isn’t enabled until a certificate is actually added.
We set up a test hostname and fired up CloudFront. Sure enough, our website was now being served through Amazon's CDN. Awesome!
We decided to flip the site over midday before the launch, as we were getting close to the wire. What's the worst that could happen?
502's. 502's Everywhere.
Alarms going off. Every query threw a 502 error. We immediately flipped DNS back to the balancer and went back to the drawing board. Note to customers: Our service was unaffected by this!
The errors implied a certificate problem, but everything looked correct. As we debated pulling the plug on the project, we decided to file a last-ditch AWS support ticket.
T-Minus Eight Hours.
AWS called with a solution. Thanks Jarrod at AWS Support in Australia!
Now, it’s a bit confusing to describe what happened, so here are some illustrations we hope will help explain things.
Initially, we used a CNAME (Canonical Name) record pointed to our ELBs.
To test, we put CloudFront in front of that existing configuration under a secondary test hostname.
We couldn’t simply point the www record at CloudFront, since CloudFront’s origin was still www; doing so would have created a loop.
So, we pointed CloudFront at the ELBs directly. This configuration was the source of the 502s.
CloudFront strictly validates SSL on its connection to the origin. It failed because the certificate the ELBs served up was for www.duosecurity.com, not for the amazonaws.com hostname CloudFront was connecting to. This would not have been an issue if we weren't forcing SSL to the origin.
The workaround is relatively simple: we added a secondary hostname, covered by a valid certificate, for CloudFront to use as its origin.
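Put together, the record layout ends up looking roughly like the sketch below. The CloudFront distribution name, ELB hostname, and the `origin` label are hypothetical; the key point is that the origin hostname both resolves to the ELBs and matches a certificate they can serve (e.g., a wildcard).

```
; Before: clients hit the ELB directly
www.duosecurity.com.     CNAME  example-elb.us-east-1.elb.amazonaws.com.

; After: clients hit CloudFront; CloudFront's origin is a secondary
; hostname that still resolves to the ELBs and matches a certificate
www.duosecurity.com.     CNAME  d1234abcd.cloudfront.net.
origin.duosecurity.com.  CNAME  example-elb.us-east-1.elb.amazonaws.com.
```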
The following day, we moved 1400% more data than we ever have before. Thanks to CloudFront, everything was perfectly stable with no external complaints!
But Wait, There's More!
While it was stable, changes didn’t seem to be propagating as fast as they should have.
As we were heavily focused on getting the solution in place, we neglected to set a lower default TTL (time-to-live). Twenty-four hours is way too long to wait for a minor website layout tweak.
When we attempted to change it, we saw a bizarre behavior. The UI would remove the value and change the default behavior. Per AWS support:
The value of the customized minimum TTL will take the value you set up, but when you try to update the behavior of the distribution you will always see the option "Use Origin Cache Headers" selected as it will always be honored.
It appears Amazon has since addressed this issue.
Even though we managed to verify a default TTL and invalidate the popular queries, the Age header was still growing exceptionally large. To solve this, we enabled Cache-Control headers on our instances. After invalidating the most popular queries again, everything was updating as expected!
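The post doesn't show the exact change, but in nginx, setting Cache-Control headers for CloudFront (and browsers) to honor might look like this sketch. The path and the 5-minute lifetime are illustrative choices, not our actual values:

```nginx
# Hypothetical sketch: bound how long CloudFront caches our responses.
location /static/ {
    add_header Cache-Control "public, max-age=300";
}
```

With an explicit max-age from the origin, the edge caches expire content on our schedule instead of falling back to a long default TTL.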
Well, almost everything...
Caches, Caches All the Way Down.
Our original infrastructure utilized ELBs backed by a number of instances, with caching in both the web server and the CMS itself. Once CloudFront was enabled, the frequency and number of queries reaching the origin dropped dramatically.
The lack of traffic amplified an underlying issue with running our CMS behind a balancer: dynamic content was generated and cached on page load, yet the ELB directed queries round-robin.
This previously went unnoticed due to the amount of traffic that would hit an ELB directly.
A CloudFront edge node would query an ELB for a page, and the instance that served it would generate and locally cache the dynamic content. The follow-up query for that dynamic content would hit the ELB, get round-robined to another instance that didn’t have the content, and return a 404.
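The failure mode is easy to reproduce in miniature. This toy simulation (not our actual stack; instance names and paths are made up) models per-instance caches behind a round-robin balancer:

```python
import itertools

class Instance:
    """Toy web node with a purely local cache of generated content."""
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def get_page(self, path):
        # Loading a page generates and caches its dynamic content locally.
        self.cache[path + "/dynamic.js"] = "generated"
        return "200 OK"

    def get_asset(self, path):
        # The asset exists only if THIS instance generated it.
        return "200 OK" if path in self.cache else "404 Not Found"

instances = [Instance("web-1"), Instance("web-2")]
balancer = itertools.cycle(instances)  # ELB-style round-robin

page_status = next(balancer).get_page("/blog/post")             # web-1 generates
asset_status = next(balancer).get_asset("/blog/post/dynamic.js")  # web-2 misses
print(page_status, asset_status)  # → 200 OK 404 Not Found
```

With heavy direct traffic, every instance quickly generates every page, so the race is invisible; once CloudFront absorbed most requests, the cold per-instance caches surfaced as 404s.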
To keep things as simple as possible, we pre-generated the dynamic content by crawling our staging host. Our release mechanism now propagates the content out as well.
After a reasonably small amount of work, we managed to go from a simple infrastructure to a globally stable solution via CloudFront with SSL in just a few days. What was initially planned as a trial and a fallback has now become the core solution!