Think SRE: look at projects through the eyes of an SRE engineer

Reviews of Slurm Kubernetes used to include the phrase “Kubernetes turned out to be easier than I thought.” We no longer hear it: the myth that k8s is complex is gone. Kubernetes has moved into the category of tools that are easy to learn and hard to master.


We want to do the same with SRE: show that SRE is easier and more understandable than it sounds, and shift the paradigm so that people see a project through the eyes of an SRE engineer.


As always at the start, there are many unknowns in the equation. And as always at the start, the most interesting will go first.



On February 3-5 we will host Slurm SRE in Moscow. A ticket to the three-day intensive costs 60 thousand. What does a participant get for the money?


When I tell friends and colleagues about SRE, I run into healthy skepticism:


  • This is the first I have heard of SRE; it sounds like some kind of alchemy.
  • Implementing SRE is difficult; it is for giants like Google.
  • It is expensive and slow; nobody will give us the time or allocate a budget.
  • What you describe is too good to be true.

I want to work through these objections.


It's time to find out what SRE is.


At the slogan level, SRE is one of the implementations of DevOps. It appeared 10+ years ago at Google but only recently began to reach the “regular” market, primarily thanks to the book Site Reliability Engineering, released by Google in 2016.


The connection between SRE and DevOps is well described in this video:



The trouble is that slogans say nothing. So it is DevOps, so it is an implementation: just another "for everything good and against everything bad".


You can read the book (and it is worth reading). But the reader ends up in the position of someone learning karate from drawings: the book describes the concept without applying it to reality. A teacher, by contrast, guides your hand along a specific path and points out mistakes as you go.


The price includes a fast, in-depth overview of the SRE approach and tools.


Implementing SRE is easier than it sounds


At Slurm we will get hands-on with SRE: we will choose metrics, set up their measurement and alerts, run into incidents, resolve and analyze them, and rebuild the project according to all the SRE canons.


That is, we will give step-by-step instructions that you can implement at your own company after returning from the intensive.


Actually, that is not quite true. We will give you not instructions but a working example from which you can draw plenty of ideas and solutions.


The price includes a reference example for your own implementation.


The main difficulty is that you will have to convince those who were not at Slurm. So ideally the whole team should come to the intensive together, which is why we give big discounts for groups.


It would be good to come to Slurm with the CTO at the head. Bringing the CEO is useful too, which leads to the next section ...


... how to convince top management that SRE is useful and necessary.


Usually there is a conflict of tasks between the CEO (top management), the CTO (IT management), developers, and operations.


I deliberately do not say “conflict of interest”: it is precisely a conflict of tasks.


The CEO needs financial results. The CTO needs a clear, manageable, and maximally comfortable situation: clear tasks with clear business value, met deadlines, a sane stack, more features, and fewer failures. Developers need to ship more features, and operations needs to ensure availability (which plainly conflicts with "more features").


SRE says that all participants in the process share a single task: the user's happiness. The user is happy when there is a healthy balance between new features and the reliability of the service. A happy user pays more. To manage user happiness, you need specialized tools.


Moreover, because SRE is built on metrics, it lets you translate financial indicators into target values for various metrics, and those, in turn, into tasks for DevOps teams.


“Lets you translate” is an exaggeration on my part. Having these metrics lets you find the relationship between the state of the metrics and the financial indicators. That is a separate task, large but tractable.


There is a project called DORA (DevOps Research and Assessment) that publishes annual studies on the business value and ROI of DevOps and its subclass, SRE. We are currently translating the latest report into Russian. It contains evaluation formulas that can be applied to your company with a certain degree of accuracy.


Summary: SRE lets the business manage financial performance by setting metric targets, and the DevOps team, looking at the current metrics, clearly understands what to do to benefit financial performance the most. What CEO would turn down such a tool?


Obtaining resources for SRE implementation is quite realistic.


The course price includes a set of arguments in favor of switching to SRE and DevOps.


And even in small companies there is a place for SRE.


SRE is divided into tools, culture and organizational structure.


Some tools, such as a service mesh, are needed for large and complex projects. But retries, backoff, failure injection, and graceful degradation can be implemented in small projects too, and they pay off hugely.
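To show how little code these techniques need in a small project, here is a minimal sketch of a retry with exponential backoff that degrades gracefully to a fallback value. The function name `call_with_retry` and all of its parameters are hypothetical and chosen for illustration; this is not code from the course.

```python
import random
import time


def call_with_retry(func, attempts=3, base_delay=0.1, max_delay=2.0,
                    fallback=None, sleep=time.sleep):
    """Call func(); on failure, retry with exponential backoff and jitter.

    If every attempt fails, return `fallback` instead of raising,
    which is a simple form of graceful degradation.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                break  # no attempts left, fall through to the fallback
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))
    return fallback
```

A caller might wrap a request to a recommendations service and pass a cached or empty result as `fallback`, so that the page still renders when the dependency is down.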


Culture is also useful in any company. A classic administrator setting up Prometheus will follow the standard routine: enable monitoring of memory and disk usage and the other familiar checks. An SRE engineer will first discuss the key business-process indicators with the business and only then set up monitoring for them. It is immediately clear that an SRE engineering culture is useful even in micro-startups.


But a dedicated organizational structure is probably unnecessary, and even harmful, in small companies. When all employees are generalists, there is no need to forcibly carve out SRE teams.


Everything we describe is already working


The course was created by people who implemented SRE in their teams long ago and have lived in this paradigm ever since. Ivan Kruglov and Ben Tyler are both Principal Developers at Booking.com. Eugene Varavva is a generalist developer at Google. Eduard Medvedev is CTO at Tungsten Labs and grew into that role from SRE engineer.


Eduard is holding a webinar, “SRE: hype or the future?”, on December 12 at 11:00.


About the program


As for the program: I am already getting feedback from experts that it is not battle-ready, that it is too broad and in places illogical. That is true.


In fact, what we have is a framework for the program, a set of ideas we want to cover. Two months of hard work lie ahead, and as we prepare, the program will be refined: we will remove what is unnecessary and make the rest more specific.


But already in its current form, the program clearly shows the direction in which we are working.


Slurm SRE program

Topic #1: Basic principles and methods of SRE


  • What does it take to become an SRE?
  • DevOps vs SRE
  • Why developers value SREs and are very sad when a project has none
  • SLI, SLO and SLA
  • Error budget and its role in SRE
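The relationship between SLI, SLO, and error budget listed above can be illustrated with a few lines of arithmetic. The sketch below assumes a request-based availability SLO measured over a single window; the function name, the field names, and the example numbers are all hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize an availability SLO over one measurement window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    All names here are illustrative assumptions, not a standard API.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failed requests the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget still unspent; a negative value means it is overspent.
    budget_left = 1.0 - failed_requests / allowed_failures

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_left": budget_left,
    }


if __name__ == "__main__":
    # Hypothetical window: one million requests, 250 of them failed.
    print(error_budget_report(0.999, 1_000_000, 250))
```

A team can use the remaining budget as a release signal: while budget is left, new features can ship; once it is spent, the priority shifts to reliability work.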

Topic #2: Design of distributed systems


  • Application Architecture and Functionality
  • Non-Abstract Large System Design
  • Operability / Design for failure
  • gRPC or REST
  • Versioning and Backward Compatibility

Topic #3: How SRE accepts a project


  • Best Practices from SRE
  • Project acceptance checklist
  • Logging, metrics, tracing
  • Taking CI/CD into our own hands

Topic #4: Design and launch of a distributed system


  • Reverse engineering - how does the system work?
  • Agreeing on SLIs and SLOs
  • Capacity planning practice
  • Sending traffic to the application: our users start "using" it
  • Launch Prometheus, Grafana, Elastic

Topic #5: Monitoring, Observability and Alerting


  • Monitoring vs. Observability
  • Set up monitoring and alerts with Prometheus
  • Practical monitoring of SLI and SLO
  • Symptoms vs. Causes
  • Black-box vs. white-box monitoring
  • Distributed application and server availability monitoring
  • The 4 golden signals (anomaly detection)

Topic #6: The practice of testing the reliability of systems


  • Work under pressure
  • Failure injection
  • Chaos Monkey

Topic #7: Incident response in practice


  • Stress management algorithm
  • Interaction between incident participants
  • Post mortem
  • Knowledge sharing
  • Culture formation
  • Fault monitoring
  • Running a blameless debriefing

Topic #8: Load Management Practice


  • Load balancing
  • Application Fault Tolerance: retry, timeout, failure injection, circuit breaker
  • DDoS (generating load) + cascading failures
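Of the fault-tolerance patterns named in this topic, the circuit breaker is the least obvious, so here is a minimal single-threaded sketch. The class name, its parameters, and the use of `RuntimeError` for rejected calls are illustrative assumptions; a production implementation would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed.

    A hypothetical, minimal sketch; not taken from any specific library.
    """

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to a struggling service.
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast like this is one way to keep a single slow dependency from turning into a cascading failure, since callers stop queuing up behind it.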

Topic #9: Incident Response


  • Debriefing
  • On-Call Practice
  • Different types of failures (testing, configuration changes, hardware failures)
  • Incident Management Protocols

Topic #10: Diagnosis and problem solving


  • Logging
  • Debugging
  • Analysis and debugging practice on our application

Topic #11: Testing the reliability of systems


  • Stress Testing
  • Configuration testing
  • Performance testing
  • Canary release

Topic #12: Independent work and review


Is all of the above worth the money?


P.S. What does the Kubernetes hub have to do with this?


All the hands-on work is done on Kubernetes. Those who know Kubernetes have a direct path to becoming SRE engineers. Those who do not are welcome at our Kubernetes courses.


Registration for Slurm SRE

Source: https://habr.com/ru/post/479378/

