Imagine a workplace where no one manages the IT infrastructure and operations. Things would quickly become messy and hard to handle. Luckily, there are teams whose job is to keep infrastructure and operations under control, so software engineers can focus on coding rather than on deployment processes.
That is what SRE and DevOps do. But what differentiates SRE from DevOps in a software development environment? This post will help you understand the four most important differences between the two.
If you happen to know what site reliability engineering (SRE) is, you might be wondering how it relates to DevOps. Well, let’s not beat around the bush. There’s no “versus”; there’s only a different approach to delivering better software faster.
In this post, I’ll break down each approach and show where DevOps and SRE differ. You’ll notice that SRE has an opinionated approach for how to run production systems, whereas DevOps focuses more broadly on people, processes, and tools—in that order of importance.
Let’s start by setting the foundation for what DevOps and SRE are.
I’m not going to spend too much time on definitions, but I’ll use them throughout the post to remark on the differences between DevOps and SRE. Of the many definitions of DevOps, I prefer this one from Gene Kim:
DevOps is [the] set of cultural norms and technology practices that [enables] the fast flow of planned work from, among others, development, through tests into operations while preserving world-class reliability, operation and security. DevOps isn’t about what you do, but what your outcomes are.
So DevOps is mainly a cultural shift inside the organization, not a group, person, or position. What’s essential in DevOps are the outcomes at the finish line. The “what” and “how” of it all isn’t important. That’s why the DevOps model CALMS outlines a set of principles that every DevOps initiative should consider using, and it starts with culture.
Now, let’s continue with SRE.
SRE stands for “site reliability engineering,” a term coined by Google. A few years ago, Google engineers wrote a book to explain how they run and operate their systems in production. Then, they wrote a second book on practical ways to implement SRE. Both books are now available for free.
Google’s definition of SRE is quite simple:
SRE is what happens when you ask a software engineer to design an operations team.
Therefore, SREs are operations folks with strong development backgrounds, and they apply engineering practices to solve common problems when running systems in production. SREs are responsible for making systems available, resilient, efficient, and compliant with the organization’s policies (like change management).
Now, let’s get into the details of where SRE differs from DevOps.
Removing Silos in the Organization
The DevOps movement started in order to eliminate the silo between developers and operators. Developers want to deploy the features they just coded as soon as possible. Operations folks would rather slow deployments down to keep systems available.
How does DevOps solve this problem? Besides the CALMS framework, there are also principles from the three ways of DevOps, which aim to break down the silos between developers and operators.
The DevOps Handbook best explains these three ways of DevOps and gives you a few ideas to apply, including practices like infrastructure as code, configuration management, and working in small batches.
SRE also removes silos. The difference is that instead of only finding ways to optimize flow between teams, SREs get their hands dirty. Being where the action is gives SREs better context for supporting systems in production. SREs integrate into the team as consultants, helping developers create more reliable systems.
What’s most important here is that SREs share ownership of running systems in production with developers. For instance, SREs and developers use almost the same set of tools. Everyone has the same perspective when working with production.
Measuring a Successful Implementation
DevOps metrics focus mainly on how quickly and frequently deployments happen and how often they go wrong. According to the 2017 report from Puppet and DORA (DevOps Research and Assessment), the key metrics in DevOps are deployment frequency, the lead time from code commit to release, the share of deployments that fail, and the time it takes to recover from a failure.
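To make those four metrics concrete, here is a minimal Python sketch of how a team might compute them from its own deployment history. The `Deployment` record and the `delivery_metrics` function are hypothetical names I made up for illustration; they don't come from the report or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (a hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize the four delivery metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Recovery time is measured from the failed deployment to restoration.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recover": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

How you define a "failed" deployment or the start of the lead-time clock is a team decision; the sketch only shows that the arithmetic itself is simple once those events are recorded.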
Feedback loops help DevOps continuously improve the quality of systems, and they open the door to experimentation. A DevOps culture encourages the team to deliver software more quickly, with better quality after each release.
SRE also depends on metrics to improve systems, but from the reliability perspective. The foundations for SREs are the service-level indicator (SLI), service-level objective (SLO), and service-level agreement (SLA). The SLI measures how the system actually behaves, the SLO sets the target for that measurement, and the SLA is the commitment made to customers. Together they show how reliable the system is, and SREs use them to decide whether a change is released to production.
In SRE, speed and quality are products of reliable systems, and SREs focus on those types of metrics. These metrics are the foundation for an error budget, which allows the team to make better decisions: for instance, to focus on fixing a problem that affects reliability rather than shipping new features.
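Here is a minimal sketch of how an availability SLI and an SLO turn into an error budget. I'm assuming a request-based SLI (good requests divided by total requests), which is only one possible choice, and the function names are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is used up,
    and a negative value means the SLO has been missed.
    """
    budget = 1.0 - slo    # how much unreliability the SLO allows
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 1.0 - sli     # how much unreliability was observed
    return 1.0 - spent / budget
```

With an SLO of 99.9%, a service that answered 999,500 out of 1,000,000 requests correctly has spent about half of its budget, so there is still room to ship. Once the remaining budget reaches zero, reliability work takes priority over new features.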
Pursuing CI/CD Practices
DevOps is a huge advocate for automation; I’d say that after culture, automation is the second most crucial aspect. In DevOps, the message is to automate as much as possible and make the releases boring. Many activities happen after a developer commits the code, and most of these activities can—and should—be automated.
For example, you can automate the application’s build so that it runs on a build machine after everybody’s work has been integrated. Or you can automate the process of deploying application changes, which is (or should be) the same every time. DevOps pursues CI/CD to increase the velocity and quality of the systems.
SRE pursues CI/CD for a different reason: to reduce the cost of failure. For SREs, all the tedious and repetitive tasks that are common in operations—like deployments, application restarts, or backups—aren’t appealing. For that reason, SREs reserve a certain amount of time (for example, Google reserves 50%) for reducing the operational work or toil.
SRE uses the same practices from DevOps, such as canary releases, blue/green deployments, and infrastructure as code. But SRE does so with the purpose of doing other more appealing things, like evolving the architecture or implementing new technologies.
Accepting Failure as Normal
DevOps fosters a blameless culture because every time something goes wrong, it’s a learning opportunity. Accepting that failures will continue to happen is the first step. Instead of putting too much effort into building systems that never fail, a DevOps culture finds ways to tolerate faults. Netflix is the most prominent advocate of this culture, with its Simian Army.
Netflix is continuously bringing part of its system down so that it’s just regular business when a real fault comes. If a set of servers goes down in a zone, Netflix automates the process of recreating servers in a different zone. And they practice it in a production environment all the time.
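As a toy illustration of that idea (this is not Netflix's tooling, just an in-memory model with hypothetical names), the sketch below removes a random server from a fleet and then recreates it in a different zone. A real setup would call a cloud provider's API where this code edits a dictionary.

```python
import random


def kill_random_server(fleet: dict[str, list[str]], rng: random.Random) -> tuple[str, str]:
    """Remove one randomly chosen server and return (zone, server)."""
    zone = rng.choice([z for z, servers in fleet.items() if servers])
    server = rng.choice(fleet[zone])
    fleet[zone].remove(server)
    return zone, server


def recreate_elsewhere(fleet: dict[str, list[str]], failed_zone: str, server: str) -> str:
    """Recreate the lost server in the least-loaded zone other than the failed one."""
    candidates = [z for z in fleet if z != failed_zone]
    if not candidates:
        raise ValueError("no other zone available")
    target = min(candidates, key=lambda z: len(fleet[z]))
    fleet[target].append(f"{server}-replacement")
    return target
```

The point is less the code than the habit: the failure is injected on purpose, and the recovery path runs automatically, so it gets exercised all the time.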
This doesn’t mean that you’ll stop testing in non-production environments, though. Teams will still try to automate testing as much as possible to spot bugs before going live.
SRE holds a blameless postmortem every time a failure in the system happens. The idea of blameless postmortems is to identify what caused the fault, then find ways to avoid having the same failure happen again in the same way.
SRE also accepts failures, but they put numbers to it—they call it the error budget. After defining the SLI, SLO, and SLA, SRE determines how much failure is acceptable (the budget), because it’s expensive to be 100% available. And in some cases, it’s not possible.
Therefore, SREs determine how long it would be acceptable for the system to be down. For example, say the site can be down for about 43 minutes every month, which means the uptime is 99.9%. If the system has been down more than the allowed budget that month, releases are paused until the next month.
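The arithmetic behind that example is short enough to sketch. I'm assuming a 30-day month and hypothetical function names; the release gate at the end mirrors the "pause releases" rule described above.

```python
from datetime import timedelta


def allowed_downtime(slo: float, period: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime the SLO tolerates over the period (the error budget as time)."""
    return period * (1.0 - slo)


def releases_allowed(downtime_so_far: timedelta, slo: float,
                     period: timedelta = timedelta(days=30)) -> bool:
    """Releases continue only while downtime stays within the budget."""
    return downtime_so_far <= allowed_downtime(slo, period)
```

For a 99.9% SLO, `allowed_downtime(0.999)` comes out at roughly 43 minutes for a 30-day month, which is where the number in the example comes from. A team that has already been down for 50 minutes that month would get `False` from `releases_allowed` and hold its releases.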
DevOps and SRE Don’t Compete With Each Other
I very much like the way Google relates SRE with DevOps by using the following phrase:
class SRE implements DevOps
SRE and DevOps don’t compete with each other. SRE is the name Google chose for the way they do DevOps, and they were doing it before the term DevOps was coined. There are slight differences, but as happens when a class implements an interface, the implementation might choose to override or extend the base behavior.
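If you read that phrase literally, it might look like the Python sketch below. The method names are my own shorthand for themes covered in this post, not anything Google defines: the abstract class says what needs to happen, and the concrete class says how.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': what to achieve, without saying how."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def accept_failure(self) -> str: ...

    @abstractmethod
    def measure(self) -> str: ...


class SRE(DevOps):
    """One opinionated implementation of that interface."""

    def reduce_silos(self) -> str:
        return "share ownership of production with developers"

    def accept_failure(self) -> str:
        return "define an error budget and hold blameless postmortems"

    def measure(self) -> str:
        return "track SLIs against SLOs"
```

Trying to instantiate `DevOps` directly raises a `TypeError`, which fits the metaphor: you can't "do DevOps" without choosing some concrete way of doing it.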
I’d say the main difference is that DevOps is a culture that broadly defines a way of doing things. Maybe that’s why there are so many definitions of DevOps and many case studies from companies of different sizes and industries. By contrast, SRE has an opinionated way of doing things, but that’s because it was born when Google published their explanation of how they run systems in production.
What Are Some of the Similarities Between SRE and DevOps?
Without a doubt, SRE and DevOps have something very important in common: both aim to make the organization’s workflow better. How is that done? Both are methodologies for making sure that production operations work as expected.
While SRE focuses on how something can be done and DevOps on what can be done, they share a goal: better results from complex distributed systems.
Both DevOps engineers and site reliability engineers write code, but in big companies they focus on code for automation and on making sure that code is solid before it is deployed to production. Both therefore understand coding at some level, and they use that experience to solve IT infrastructure and operations problems in an organization.
My two cents? Study each movement, and pick the practices and principles that work for your organization today. Tomorrow? Well, tomorrow things will change, and you might need to adopt new principles from DevOps and SRE.
This post was written by Christian Meléndez. Christian is a technologist who started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.