In the beginning there were software engineers and operations engineers. The software people had a vision of what the customer wanted. And their job was to realise that vision with eloquent code and dazzling new product features.
This job has always been crucial to the success and even survival of any software company: its brand needs to be out there in front of the competition, and the only way to do this is to wow the market with new app features created by first-rate coders.
But while the software engineers were developing their bright new vision, the ops team were working under greyer skies.
They knew that regardless of how luminescent the commercial vision, sometimes new features just don’t work at launch and sometimes never work at all.
So there’s always a threat that feature launches could disrupt the operation and take away from the site’s reliability; far from wowing customers, innovations like this can generate huge waves of customer dissatisfaction.
While working at Google, Ben Treynor (VP, Site Reliability Engineering) came up with the Site Reliability Engineer role as a solution to the perennial conflict between development and operations teams.
The profile of the Site Reliability Engineer is hard to define. But it’s a role that’s nonetheless necessary for global online companies like Google or Facebook that need to be switched on all the time.
An SRE can be a software engineer with a good understanding of IT systems engineering, or someone with a professional systems engineering background who can also code, even if they haven’t done so as part of any previous role.
What An SRE Does
In an interview with The Site Reliability Blog, published last year, Treynor attempted the following definition of the job: “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”
The SRE’s remit is similar to the ops engineer’s. It includes “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning”, but along with these comes the responsibility for creating new software and technology to make things better for their environments.
At Google, the SRE team owns the site’s speed and reliability. Whatever happens, these two features can’t be compromised. The website has to stay up all day, every day, and maintain top speed regardless of user traffic or the load on resources.
Facebook’s Mark Schonbach said in a social media post that his SRE team always has an eye on the future. The team develops automated tools that manage server resources, as well as tracking site issues and long-term trends to stop glitches and minor problems developing into threats to performance. And they develop new features that enhance the site’s reliability.
The SRE Environment
What sort of environment does a Site Reliability Engineer work in? When he set up his SRE team, Ben Treynor created incentive mechanisms working between the software engineering and SRE teams that are designed to get everyone lined up in the same direction.
The error budget is an example of an incentive mechanism. It works like this: every operation in the real world has a margin of error, and that margin is reconceived as a budget, the error budget. The software developers can spend it how they like, but of course they can’t go over budget.
As software people, they want to spend it on the risks inherent in rapid product feature launches. This gives them an incentive to tighten up in other areas and save as much of their budget as possible for the things they want to do. So they focus on writing much more reliable software, which compels them to police their own code.
The site reliability engineering team monitors the error budget, and as long as it isn’t exceeded they’re content to let the developers get on with things.
If the budget gets blown, the SREs stop all feature launches until the error margin is corrected. One result is less friction between the developers and the SRE team, because the SREs step in only when the developers overspend their error budget.
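To make the mechanism concrete, here is a minimal sketch of how an error budget check might look in Python. It is only an illustration: the `ErrorBudget` class, the 99.9% reliability target and the request counts are all assumptions made up for this example, not figures or tooling described by Treynor or Google.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one measurement period."""

    target: float          # assumed reliability target, e.g. 0.999
    total_requests: int    # requests served in the period
    failed_requests: int   # requests that failed in the period

    @property
    def allowed_failures(self) -> float:
        # The margin of error, expressed as a number of requests.
        return (1 - self.target) * self.total_requests

    @property
    def remaining(self) -> float:
        # How much of the budget the developers still have left to spend.
        return self.allowed_failures - self.failed_requests

    def launches_allowed(self) -> bool:
        # Feature launches continue only while the budget is not overspent.
        return self.remaining >= 0


# Illustrative numbers only: a 99.9% target over one million requests
# leaves room for roughly 1,000 failed requests.
healthy = ErrorBudget(target=0.999, total_requests=1_000_000, failed_requests=400)
blown = ErrorBudget(target=0.999, total_requests=1_000_000, failed_requests=1_500)

for name, budget in (("healthy", healthy), ("blown", blown)):
    status = "launches continue" if budget.launches_allowed() else "launches frozen"
    print(f"{name}: {status}")
```

In this sketch the developers decide how the 1,000 or so permitted failures get spent, and the SRE side only needs to look at `launches_allowed()` to know whether to step in.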
There’s another incentivising system that gets developers focused on the quality of their code.
Engineers in the SRE team are free to transfer to other teams. So if they’re working on operations that are made difficult by unstable code, the Site Reliability Engineers will begin to leave, and once the team headcount falls below a given threshold, responsibility for the ops is handed to the development team.
In most organisations, accountability between business functions is necessary, but it carries the danger of relations becoming adversarial. The Site Reliability Engineer role is a way of restoring the balance and ensuring that, while accountability continues, relations are ultimately productive.
Locally, Dublin is becoming a hub for SREs. Its concentration of engineering talent means there’s a rich pool to draw from, and many of the globally scaling online companies based in Dublin, like Google, Yelp, TripAdvisor, Udemy, Salesforce.com, Dropbox and Microsoft, have SRE teams.