Chapter 1. Identifying Your SRE Training Needs
Providing training and education for site reliability engineers is universally important to set them up for success in your organization. However, the specific training needs of each engineer vary depending on several factors:
- The maturity of your organization in adopting SRE principles, practices, and culture
- The knowledge those individuals have about your organization and infrastructure
- The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE model
These dimensions make up a matrix (see Figure 1-1) that describes different use cases for SRE education. The optimum training solution for your SREs varies, depending on the specific use case. In the sections that follow, we define each of the key dimensions.
Organizational Maturity
Organizational maturity is considered low if your organization has not yet adopted SRE principles, practices, and culture.1 Organizational maturity is considered high if you have a well-established SRE team, or if SRE principles, practices, and culture are widely understood, implemented, and embraced. An organization with high SRE maturity is expected to have the following:
- Well-documented and user-centric service-level objectives (SLOs): a target level of reliability that should ideally be correlated with customer happiness.
- Error budgets: a budget for failure. The error budget is the difference between perfection and your SLO, allowing teams to move as fast as possible, as long as the budget is not exhausted, but with defined actions that will be taken to improve reliability if the production service falls short.
- A blameless postmortem culture: recognition that things will go wrong and human errors are really systems problems.
- A low tolerance for toil. According to Site Reliability Engineering (O’Reilly), “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
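The error-budget item in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, not a prescribed implementation: it assumes a request-based SLO expressed as a fraction (for example, 0.999), and the function names `error_budget` and `budget_remaining` are hypothetical names chosen for illustration.

```python
def error_budget(slo: float) -> float:
    """Return the error budget: the difference between perfection (1.0) and the SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def budget_remaining(slo: float, total_events: int, failed_events: int) -> float:
    """Return the unspent fraction of the budget (assumed request-based SLO).

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the service has fallen short of its SLO.
    """
    allowed_failures = error_budget(slo) * total_events
    if allowed_failures == 0:
        # A 100% SLO (or no traffic) leaves no room for any failure.
        return 0.0 if failed_events == 0 else float("-inf")
    return 1.0 - failed_events / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over one million requests allows
    # roughly 1,000 failures; 250 failures spend about a quarter of that.
    print(f"budget: {error_budget(0.999):.4f}")
    print(f"remaining: {budget_remaining(0.999, 1_000_000, 250):.2f}")
```

While the remaining fraction is positive, the team can keep shipping at full speed; once it reaches zero, the predefined reliability actions apply.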
Organizational Familiarity
High organizational familiarity means that an engineer has worked for your company for a considerable length of time (at least a year). Low organizational familiarity means that an engineer is new to the company. Organizational familiarity determines how open or resistant an individual engineer might be to training content and what types of content are most important to consider (Figures 1-1 and 2-3).
SRE Experience
High SRE experience means that an engineer has worked as an SRE at your company, or elsewhere, for a number of years and understands the core SRE principles, practices, and culture outlined in Site Reliability Engineering. New university graduates are an example of engineers with generally low SRE experience because SRE concepts are not often taught in school. Experienced product development software engineers, system administrators, and others making a career change into SRE are also considered to have low SRE experience when evaluating SRE training needs.
Note that the training solution you choose also depends on the following:
- The size of your organization
- The speed at which your organization is growing
- The resources your team has to spend on training
Now that we’ve defined a framework that describes the types of students you might encounter, let’s describe the skills these students might need or want to develop.
Types of Skills to Develop
Apart from “obvious” related subjects that SREs need to learn (such as the technical infrastructure and practical troubleshooting skills when on-call), there are other peripheral subjects related to SRE that help employees become better SREs. In the sections that follow, we use some Google-related examples of training that drive the development of specific skills.
Skills That Support a Career Shift Toward SRE
According to Ben Treynor Sloss, vice president of 24x7 at Google, “an SRE’s job is to apply software engineering skills to operations problems.” This means that we expect SREs to spend a lot of time on software engineering. However, Google also hires highly qualified system administrators who’ve been in a more Ops-oriented function, with a bit of scripting experience, but no true software engineering experience. For these people, offer (software) engineering training, or pay these folks to take an external course at a university or online.
Sometimes, an organization makes a decision to change the use of a certain technology or move to a different technology, which then requires extra education. Here’s an example. At Google, as in the broader world of software engineering, we’re seeing more new projects being created in the Go programming language, particularly in SRE. Therefore, many SREs need to learn Go, and an education effort is needed to quickly ramp people up in it. In general, requirements change and SREs are expected to change with them, so it’s important to provide education and learning resources for SREs.
Finally, there’s a skill that’s difficult to acquire outside of companies with large complex systems: Non-Abstract Large System Design (NALSD). This is a critical skill for SREs, as described in the Site Reliability Workbook. In NALSD, we consider how to design large systems for reliability, resilience, and efficiency. NALSD is not only used when building a completely new system, but also when systems need to be changed due to changing requirements or growth. For example, a global service was not sharded initially, because when it was first designed, the number of users was small. However, as the user base grew, that growth forced a redesign of the global service so that it was sharded. It is important for SREs to demonstrate an appreciation and awareness of future scalability traps and why simplicity is critical for smooth operability and disaster recovery. Focus on building experience and judgment, not simply more algorithms.
The SRE approach and skill set are useful for non-SRE developers in a company, as well. It’s useful for the developer community to learn about SRE principles and practices, including large system design. This helps the developers build more resilient software. The training material for the developers is probably less extensive than a complete onboarding curriculum for SRE. The people involved with SRE education are the ones best suited to supply this material.
Troubleshooting Skills
It’s important for an SRE to keep their troubleshooting skills sharp. Therefore, SREs should regularly be on-call for the service they support. Too much on-call, however, can burn them out. Too little, on the other hand, can cause them to lose familiarity with the service and troubleshooting processes. While on-call, they’ll encounter outages that need to be resolved. SREs tend to like solving puzzles, have an inquisitive nature, wonder why things are the way they are, and follow an analytical approach to solving problems. During troubleshooting, it’s important that SREs follow a scientific method: formulate one or more hypotheses and then rule some out. We at Google teach our SREs that troubleshooting is a series of failures, and it’s OK to go through the process of not figuring out the problem, especially if it helps them rule things out. It’s very important that SREs who are on-call know that they are not alone—they might be the first responder, but they are backed up by a large group of engineers whom they can ask for help.
Good SREs try to see the bigger picture—they try to find correlations with other outages. This (potentially) helps find the root causes of multiple incidents. This contrasts with solving only the immediate problem at hand. Running regular “Wheel of Misfortune” sessions also sharpens SREs’ troubleshooting skills.
Training That Supports a Culture Shift and Promotes Trust
For SRE training at Google, we pay a lot of attention to the culture of trust between developers and SREs, and between different SRE teams. SREs and developers both share ownership of the service and user experience. Users receive the best service when we balance launching new features with reliability. To achieve this, we need a healthy relationship between SREs and the developers they work with. Through interviews with different teams, we’ve found that communication, agreement, and trust are paramount to healthy SRE–developer relationships, and the best functioning teams are those in which it’s barely known who is an SRE and who is a developer—who you are is defined by what you do, not your job title.
A good relationship with a service’s development team is important, but other teams are often relevant as well. The security team, for example, might find a major leak that requires a change to be rolled out on short notice (in a reliability-safe way). Or the privacy team might discover a product for which Personally Identifiable Information (PII) is not erased or anonymized in a timely fashion, so that a Spanner database needs to be cleaned up. Here, too, it’s important for students to learn how to work with other teams and respect that their requirements might sometimes be at odds with the goals of SRE.
Because SREs often must communicate with many different teams, it’s important that SREs communicate effectively. When we train our SREs, we create cohorts in which the student encounters many other students who will work in other teams. This way, after training they will already know people in other offices and teams who can help them as the need arises. We encourage students to build a network and often see that the mailing lists we create for these cohorts are used by the students for a long time afterward. The students feel comfortable using these mailing lists to ask questions because they already know one another.
Incident Management Training and the Corresponding Soft Skills
In major incidents, the on-call person who was initially paged ropes in more people to help. This requires careful coordination and communication. At Google, we follow the Incident Management at Google (IMAG) protocol, which is a flexible framework based on the Incident Command System (ICS) used by firefighters and medics. IMAG defines how to organize an emergency response by establishing a hierarchical structure with clear roles, tasks, and communication channels. It establishes a standard, consistent way to handle emergencies, and organizes an effective response. Implementing incident management training is a good idea so that new SREs understand not just the technical troubleshooting elements of responding when something goes wrong but also the command and communication framework that is in use in the organization.
Soft skills are also important during an incident. Usually, when people go on-call for the first time, soft skills are lower on the agenda. Soft skills include things like explicit and clear communication, time and task management, and record keeping. In practice, mastering these skills is as important for timely incident resolution as mastering the technical knowledge. Therefore, consider developing training that teaches students how to spot hidden assumptions in incident communication that commonly cause misunderstandings with other people on-call; delegate tasks effectively with explicit communications; and think one step ahead by considering what would happen if they carried out a certain action.
Finally, no matter how skilled and knowledgeable the on-caller is, there comes a time when they feel overwhelmed by the problem and don’t know what to do. It might seem unorthodox, but including training on human factors in incident management helps students understand how their body works when under stress—cold sweat, trembling, difficulty concentrating, loss of motivation, feeling tired, fatigued, exhausted, and ultimately perhaps, when stress levels get high enough, freezing and not being able to do anything. Such training helps students understand how to monitor themselves for these symptoms, recognize the danger of the last phase, escalate in time, and hand off the incident to someone else before the last phase actually occurs.
An Introduction to SRE Training Techniques
We’ve discussed a variety of topics that you might want to cover in your SRE training. We now discuss ways to deliver that training. There are many techniques for equipping SREs with critical skills, especially when they are new to an organization and ramping up to become productive in supporting specific systems. These techniques vary widely in sophistication and level of effort required on the part of those delivering the training. Figure 1-2 shows training techniques with regard to the level of effort to apply that technique.
Sink or Swim
On the “low effort” end of the spectrum, there is the “sink or swim” model in which onboarding consists of telling a student to figure things out on their own. Throw your new person into the job on Day 1 with the expectation that they will learn by doing, without a specific framework for ramp-up. Because there are no guiding principles or guardrails showcasing what an SRE new to the team needs to know, “sink or swim” could also be described as “grokking SRE the hard way.” Although “sink or swim” is a low investment approach, it’s not a very inclusive approach, and it does not aim to set every new member of the organization up for success.
Why isn’t “sink or swim” inclusive? As we discuss more in the section on theories of instructional design and adult learning, different people learn best using different learning modalities. Self-directed learning is just one modality. Others include lectures and hands-on exercises. “Sink or swim” leaves students guessing about what they should be focusing on, provides no guidance on what the learning objectives are, and generally leads to a higher level of stress and imposter syndrome.2
Self-Study
One step up from “sink or swim” on the spectrum of techniques for training SREs is to provide self-study materials. These materials can be documents, videos, or exercises. Typically, the SRE receives a checklist of things that are useful to know, with associated resources linked to the checklist. Later items on the checklist might build on the knowledge learned from previous items. Even though self-study is better than “sink or swim” because SREs are at least given some guidance on materials and/or curriculum, there are some downsides to self-study material that can be frustrating or overwhelming. The SRE consuming the material may feel isolated because they are left to learn on their own (albeit in a guided way), without an explicit channel for asking questions or getting support when they become stuck.
There is also a risk that an SRE encounters out-of-date or deprecated material. This is particularly problematic for students, and it can occur if no one is actively curating the self-study checklist. The student does not realize that some material is deprecated or out of date. We have seen examples where an experienced SRE walks by a student’s desk, notices that they are watching a video recommended in the student checklist and says, “Oh, that thing has been deprecated for years. I wouldn’t bother watching that.” The student then feels like they have wasted their time, which leads to high degrees of frustration. It also contributes to a general lack of trust in the self-study materials.
Another downside of self-study training materials is that they can be more difficult to maintain, especially video formats. In this case, experience with video editing software is required or completely new recordings need to be made at some frequency to ensure that self-study training materials are kept fresh and up to date.
Buddy System
Training SREs, especially new SREs, can be enhanced by providing one-on-one mentoring and shadowing opportunities. Well-maintained self-study materials combined with a mentor who is an explicit point of contact for answering questions help the new SRE have confidence in the training materials and not feel like they have no guidance and support. Shadowing an experienced team member and then having the experienced person reverse-shadow the student when the time approaches for the student to go on-call is a useful training technique and a variation of the buddy system. The buddy system also fosters experienced team members’ confidence in the skills and abilities of the new person on the team.
Ad Hoc Classes
Ad hoc, in-person classes or whiteboard sessions are another approach to training SREs. Because this approach entails a live person giving a class, it requires more ongoing effort than self-study options. This can be particularly burdensome for small teams with few potential ad hoc instructors. However, this approach provides a useful structure for students, and an opportunity to have questions answered. Members of the team might maintain ad hoc slide decks on different aspects of the organization’s infrastructure that they deliver as needed. Less formally, whiteboard sessions in which an experienced team member draws a system diagram that outlines key elements of the infrastructure, key dependencies, and how they work require less overhead.
As an added bonus, have someone new on the team teach back what they’ve learned about the system from their own exploration, combined with self-study and whiteboard sessions from experienced team members. The team as a whole often learns from this approach. Oftentimes, the new member of the team learns something about the system or some recent change that even experienced team members didn’t know. The “teach back” approach ensures that the entire team has the most up-to-date mental map of how the systems they support work in practice. Teaching is the best way to learn (see “Teaching to Learn”) and is an important feedback mechanism to ensure that the student understood the material.
Systematic Training Program
If your organization is large enough or growing fast enough, it makes sense to invest in a systematic training program to ramp up SREs on different topics. Creating an SRE training program ensures reliability and consistency in the ramp-up experience throughout your organization. Investing in a systematic training program that brings people together in person is also important for organizations driving a culture shift to SRE. Culture must be modeled in person; this is difficult to do with self-study formats. An organization trying to adopt SRE using a lower-touch training approach such as self-study might find this to be counterproductive. If possible, it’s better that the training is done in person because that sends a signal that the organization really cares about the change and the development of its employees, leading to a higher probability of success.
For large organizations, program operations become more important. Program operations are the “how” of the training program. Let’s draw an analogy to software development for which the “what” is the product features and the “how” is deploying to production in a reliable way to meet the needs of users. In the case of training, the “what” is the training content itself and the “how” is deploying it in a consistent and reliable way that meets the needs of students. Just as SRE focuses on the “how” of software development, we discuss how to apply SRE principles to training in Chapter 5.
A systematic training program allows an organization to build cohorts of new SREs. By putting people through the program together, people feel that “I’m not in this alone.” This helps fight imposter syndrome and builds the confidence of new SREs.
A formal training program for SREs should be systematic, not just in operations but also in class materials. Consider building a centrally curated curriculum. We discuss more about how to build and curate an SRE training curriculum in Chapter 4 and Chapter 5.
SRE training can be in-person (at least to start with) and then move to video or video conference. Each approach involves trade-offs between effort and effectiveness. It’s easier to obtain cycles for learning from engineers when they are new. The longer an SRE is with the organization, the more demands there are on their time, so prioritize in-person training as much as possible while people are new to the team. Impatient managers are also a concern. If you run an in-person training program, you might get push-back from managers who want their new team members to get started on the team as soon as possible. The risk of manager impatience and push-back increases with time, as evidenced by lower completion rates and higher cancellation rates the longer an engineer has been in the organization. For example, at Google, we achieve 99+% coverage of our target audience in an orientation program delivered in the second week on the job, whereas completion rates for classes related to incident management and getting ready to go on-call, which are delivered a few months after the new engineers start, drop to 50%.
With a formal training program, it’s important to keep in mind inclusivity, especially if travel is involved. For example, limiting training to one week, with the option for people to travel on Monday and go home on Friday, shows consideration for SREs with family or other personal obligations. In fact, in some countries, business travel must be limited to working hours.
Although distributed training (e.g., by video conference) can be appealing because it requires less time from both the students and developers of the training, it’s important to be aware that attendance and engagement decline for distributed training, compared to an in-person training model. Distributed training is not zero cost: there is the cost of logistics (meeting room bookings, getting the training on people’s calendars, recruiting instructors). The main savings are on student travel time and in organizing travel, if that is centrally managed. However, doing training in a distributed way means, in our experience, that students are more likely to become distracted and not pay as close attention, or not show up at all.
Teaching to Learn
Teaching is, in fact, the best way to learn.3,4 Take advantage of volunteer instructors and draw on former students to teach new students. This approach helps build a strong team and community and keeps people involved in education across the life cycle of an SRE.
It’s very costly to hire full-time instructors, especially when the topics being taught are very technical and require in-depth knowledge. Hiring full-time instructors basically cannibalizes engineers who could be working to run your infrastructure. Instead, consider crowd-sourcing instructors. Volunteer instructors spend at most a few hours a week (in the case of an extremely large and rapidly growing organization) paying it forward to help others ramp up on selected topics. For the volunteer instructor approach to work, incentives are important. For example, consider recognizing volunteers at a company or department all-hands meeting or distributing limited edition corporate swag.5 Of course, there is also the innate incentive that if an experienced team member helps a new person ramp up, that person will be ready sooner to share the on-call and project load required to support their services.
Even better, teaching and knowledge sharing should be explicitly called out in the SRE role description. These community contributions should be taken into consideration when awarding raises and promotions. Being explicit about the importance of teaching shows that the company is serious about making the training program a success.
In a nutshell, if you are part of a small organization with limited resources and are growing slowly, focus on supported self-study techniques. If you are part of a larger organization, in-person classes using volunteer instructors are more effective. If you are large and growing rapidly, invest in a full-fledged training program with consideration for how the training is delivered in addition to what is taught. Sink or swim is never a good option and doesn’t set new members of the team up for success.
Conclusion
In this chapter, we talked about identifying your SRE training needs. We introduced the Organizational Maturity Matrix and discussed what type of skills to develop. We also introduced some SRE training techniques and which approach might work best for your organization.
1 For purposes of this discussion, SRE principles, practices, and culture are taken as the key elements laid out in Site Reliability Engineering: How Google Runs Production Systems (O’Reilly).
2 Imposter syndrome is a psychological pattern in which an individual doubts their accomplishments and has a persistent fear of being exposed as a “fraud.”
3 Koh, A. W. L., Lee, S. C., & Lim, S. W. H. (2018). The learning benefits of teaching: A retrieval practice hypothesis. Applied Cognitive Psychology, 32(3), 401–410, https://oreil.ly/Qlb1h.
4 Duran, D. (2017). Learning-by-teaching. Evidence and implications as a pedagogical mechanism. Innovations in Education & Teaching International, 54(5), 476–484, https://oreil.ly/YO_W5.
5 Swag is a common Silicon Valley term for promotional merchandise branded with a corporate or team logo.