A Thrilling Adventure: What a Site Reliability Engineer (SRE) Intern Has to Say
“Hello, I wanted to inform you that you’ve built a great product. It looks amazing but is of no use. Half of the time, I can’t even access it.”
“Your product loads as fast as a sloth.”
This is definitely not what you want to hear from your customers after developing a great product. If you want to be prepared for this situation, hire rockstar Site Reliability Engineers (SREs) – the team I’m an intern on here at Duo Security.
Before you look up what SREs are and what they do, let me try to explain the fairly abstract basics of SRE in a simpler, more concrete way:
Site Reliability Engineer (SRE) in Layman’s Terms:
A Site Reliability Engineering team, a concept that originated at Google, is responsible for making sure that the product is reliably available to customers at any given time.
You might now wonder how a product built by smart engineers could ever fail to be available.
There are a bunch of reasons. Sometimes nature has a hand; for example, the data centers that house the servers running your application code might fall victim to natural disasters. Or sometimes an event could create a traffic spike that limits availability.
Imagine Beyoncé going live on Facebook with her latest song while millions of fans around the world access their Facebook accounts at the same time, hoping not to miss it. The web servers at Facebook that serve you your personalized content when you log in to ‘facebook.com’ might crash under the sheer volume of users. Thankfully, in reality, Facebook SREs keep these types of problems at bay.
At times like these, Site Reliability Engineers pull laser-sharp troubleshooting skills out of their ultra awesome bag of technical tricks to restore these services and keep them running.
A Quick Look Into the Work of a Site Reliability Engineer:
Previously, system administrators were the only superheroes tasked with keeping the IT infrastructure – hardware, software and network components – robustly functional. Times have changed: computers have become far more capable, and organizations are embracing cloud computing services at an ever faster pace!
IT operational tasks that used to be performed manually can now often be handled by software. Site Reliability Engineers (aka the software engineers who marry software and computer systems knowledge) now take care of these operational tasks.
What Do Site Reliability Engineers Do?
To achieve highly reliable working systems, Site Reliability Engineers:
Focus on automating (e.g., automating the provisioning of new servers)
Scale the systems
Forecast demand for capacity planning
Monitor the systems (logs are to SREs what medical reports are to doctors, except that here the history belongs to a computer system)
Remain on call: whatever the problem, they stay alert around the clock, armed with energy and ready to tackle any outage
Keep their coding skills sharp: Yes! They work on innovative software projects too
These are just the highlights of the higher-level work a Site Reliability Engineer carries out. One more thing worth mentioning about SREs is that they are revered for their expertise in large-scale production systems. Isn’t that doubly awesome?
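To make the capacity planning item above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration (the function names, the 5% monthly growth, the 200 requests per second a single server can handle, the 30% headroom); it is not how Duo or any particular SRE team actually forecasts demand.

```python
import math


def forecast_peak(current_peak_rps: float, monthly_growth: float, months_ahead: int) -> float:
    """Project peak requests per second, assuming steady compound monthly growth."""
    return current_peak_rps * (1 + monthly_growth) ** months_ahead


def servers_required(peak_rps: float, rps_per_server: float, headroom: float = 0.3) -> int:
    """Count the servers needed so that peak traffic leaves `headroom` spare on each one."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    usable_rps = rps_per_server * (1 - headroom)  # capacity we are willing to use per server
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 requests/second today, growing 5% a month.
    peak_next_year = forecast_peak(1000, 0.05, 12)
    print(f"Projected peak in 12 months: {peak_next_year:.0f} requests/second")
    print(f"Servers to provision: {servers_required(peak_next_year, 200)}")
```

Real forecasts usually lean on historical traffic data and seasonality rather than a single growth rate, but the shape of the reasoning is the same: estimate the peak, leave some headroom, round up.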
My Internship Experience at Duo
My decision to join the Site Reliability Engineering team mainly stemmed from my innate desire to learn how large-scale production systems work, how the infrastructure of a Software-as-a-Service (SaaS) product is designed, and the hidden, complex mechanics that let Duo customers access the product seamlessly. I believe knowledge of a product, its infrastructure and its reliability is instrumental for building a great product – divine knowledge not easily amassed at universities.
Why was acquiring basic knowledge about a product’s infrastructure a great investment for me?
Every piece of code requires computing resources for its execution, and large software products made up of thousands of lines of code require even more computing resources, which means more complex infrastructure: more servers and so on.
Duo is a SaaS multi-factor authentication (MFA) provider. It protects more than 20,000 customers, including some of the world’s top titans like Facebook, Lyft, University of Michigan and Zillow, receives 700+ million authentications every month and integrates with customers’ growing IT infrastructure, all of which makes establishing scalable and robust infrastructure its No. 1 priority.
Hence, learning about Duo’s infrastructure exposed me to the beautiful landscape of large-scale production systems. I now know how code works in a production environment made up of many systems, which is very different from code running locally on a single computer!
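For a rough sense of scale, the 700+ million authentications a month mentioned above can be turned into an average per-second rate. This is only a back-of-the-envelope sketch: the 30-day month is my assumption, the function name is hypothetical, and real traffic is bursty, so peaks will sit well above this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(events_per_month: float, days_in_month: int = 30) -> float:
    """Convert a monthly event count into an average events-per-second rate."""
    if days_in_month <= 0:
        raise ValueError("days_in_month must be positive")
    return events_per_month / (days_in_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate = average_per_second(700_000_000)
    print(f"Average: about {rate:.0f} authentications per second")
```

That works out to roughly 270 authentications every second on average, around the clock, which is a good reminder of why a single machine running code locally is such a different world.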
What’s more? The architectural skills you learn are transferable!
For instance, later down the road, if you make a switch to machine learning and the organization positions you to build machine learning infrastructure, you could then easily chant “Cotton candy, my infrastructural skills are super handy!”
What do I think now that I did not think before?
In a world where adding more features and building new products are reckoned as the only “business driving factors”, with the majority of effort expended in that direction, one must always wear a reliability-oriented thinking cap, because the reliability of a product matters. Customers will only be able to trust your services if they work, not momentarily, but consistently, not slowly, but rapidly!
While programming, it is critical to keep virtues like reliability, scalability and efficiency in the back of your mind. A great software engineer must focus not only on writing efficient code but also on using hardware resources such as CPU, memory and storage efficiently, especially when it comes to a large-scale product.
We're hiring and looking for interns! If your passion is collaborating with inspiring teammates, and creating and supporting products that make a difference, we want to hear from you. Check out our open positions!