Chapter 4. Instructional Design Principles

Creating training that is engaging and that sticks is challenging in any business environment, because business trainings tend to suffer in three ways:

  • Lack of immediate application

  • Information overload

  • No clear, actionable learning objectives

The consequence is that employees feel that they have learned something but can’t quite identify what it was, nor can they show you anything substantial to prove that the training was worth their time. One way to combat a few of these challenges is to apply instructional design principles to the design of your training program. Here, we highlight approaches that you can use, no matter how large or small your organization is.

Instructional design is the practice of creating consistent and reliable materials, experiences, and activities that result in the acquisition and application of skills. Another way to think of instructional design is as a framework. Although there are many techniques that instructional designers use to develop training, here we highlight some of the instructional design principles used by Google’s SRE EDU team. These principles have helped us build training materials, experiences, and activities so that our students gain the ability to confidently hit the ground running during their first few weeks at Google.

Identifying Training Needs

Before you design any training program, you need to establish what problems you want to solve with training. In the case of SRE EDU, we faced two problems. The first was the steep learning curve in SRE. The learning curve was thought of as just a part of the job, which resulted in lots of smaller initiatives to help lower it. This led to onboarding that was inconsistent, nonexistent, or narrowly targeted toward certain services. The ad hoc onboarding programs that did exist varied widely by team and by region. The second problem was standardizing what and how SREs were trained, through a global, scalable, and dedicated effort.

Here are some questions to ask yourself and/or your training-minded team:

  • What problem are you trying to solve?

  • What are the goals of the training?

  • Who are the people who can make this happen? Who are the key influencers in your organization? Who are the stakeholders who can back you?

  • What are the root causes of the problems you are trying to solve?

  • How will training help mitigate identified problems?

If you answer these questions fairly thoroughly, you will have a foundation on which to build. Spend a significant amount of time here. Understanding and enumerating the problems guides you in building out the rest of the training content and, to a larger extent, the training program.

Build Your Learner Profile

Ideally, you already know your audience. However, we’ve found that our audience is often bigger than we think. Therefore, it helps to identify who you are training and to describe what they do.

Here are some questions to ask yourself and/or your training-minded team:

  • What skills/knowledge/activities/behaviors do you want people to start doing?

  • What skills/knowledge/activities/behaviors do you want people to stop doing?

  • What is the current working environment of the audience you are training?

  • What resources/training is the target audience currently using?

  • Who are the people you are planning to train? Are they new to the company? Industry experts? Recent college graduates?

During this activity, you’ll discover a long list of items that you should curate based on who you are training. For example, if your primary target is new employees, you’ll want to decide on the most important skills new employees need to hit the ground running. A major pitfall in business training, as mentioned earlier, is information overload. When everything is important, nothing is. For SRE EDU, we negotiated, discussed, and decided on the key things we wanted our students to be able to do at the end of the training that had the biggest payoff for the teams they were joining.

Finding these answers should not be done in a silo. Get out into your organization and start interviewing. Talk to leaders, managers, individual contributors, and others. SRE EDU met with staff to identify these things, and what came out of those meetings proved very valuable to the development of the training. Following is a sample of resources we discovered:

  • Ramp-up strategies in individual teams

  • Best practices that teams had developed

  • Resources and materials that would be useful to a larger group

  • Graphs and data analytics dashboards that have broad use cases for training

  • Tools and system use cases that were common across multiple areas

Consider Your Culture

The intersection between culture and training is reflected in attitudes, practices, environments, and beliefs about how one acquires and uses knowledge in the workplace. What we mean by this is that how you train is just as important as what you train. With that in mind, having a functional knowledge of the culture of the workplace becomes paramount to effective training. There are two main principles to culture that SRE EDU focuses on:

  • The SRE culture (the way we do things and why) must be explained, demonstrated, and reinforced throughout all of the training.

  • SRE EDU is a vehicle for early introduction of these cultural norms, by those who are living the culture.

Throughout the entire curriculum, we weave in cultural classes that introduce cultural topics to students. As an example, we have a class called “The Role of the SRE”. In this class, we explain Google’s mission and how we as SREs fit into that mission. We use real examples to demonstrate who we are, what we do, and how we do it. This class sets a foundation: a knowledge base to work from, a vocabulary to build on, and an immediate identification with something tangible in a globally distributed organization. It answers a common question from our students: who am I and why am I here?

Culture is the most fragile part of training. Authenticity in what you say and how you train culture matters. If what you train is not reflective of the real organizational culture and practice, you have effectively lied to your students and set them up to fail. When that trust is broken, it’s difficult to get it back. It is entirely reasonable to present the culture as it is while communicating where the culture should be. For example, we noticed that developer and SRE relationships in some teams were not as strong as we would have liked. After following our process of identifying the training needs, root-causing problems, and interviewing other teams, we developed a culture class centered on what healthy developer and SRE relationships look like. We acknowledged that not all teams are working with each other as effectively as they should, while stating that we need to change this within the culture, and here is how you can help.

Where did we get this information from? From within the culture, through teams that measurably worked well together. Interviews with these teams highlighted a common theme to work from—and teach. Each of these teams had excellent strategies for how they communicate, align, and agree on tasks, goals, and work. These teams built and maintained trust with one another. The authenticity of this class is enhanced by students who might have come from other industries and can attest to and/or provide additional insight into what we are teaching.

Storytelling

Another aspect of how we teach culture is through storytelling. Storytelling is a cultural or social activity of sharing narratives that can be used for education, entertainment, or preservation. What makes storytelling so compelling as an instructional tool is that recalling (or retelling) a story keeps its lessons alive. When constructing the training program, we started by asking ourselves: what is the story we want to tell? We went through a simple exercise of looking at what makes stories such an important part of culture. We identified the key things SREs had learned in the past that we wanted to bring to the present, to positively influence our future.

We framed training around telling a story. One story that framed our training is that things break, and they break in interesting and novel ways. You learn how these things happen, but more important, you learn strategies to combat them. What makes a story worthwhile is that it can be retold by the listeners; there are visuals, word cues, vocabulary, and actions that make the stories memorable while imparting lessons.

Let’s see how well word cues work to get you to guess the story or event:

  • Glass slippers, midnight, fairy godmother, prince, stepsisters (“Cinderella”)

  • King Triton, Sebastian, Flounder, Ursula, Ariel (The Little Mermaid)

  • “I have a dream...”, Lincoln Memorial, Civil Rights March (Martin Luther King, Jr. speech)

So how does this work with SRE training? During the training, we build up a story around the many interesting and novel ways things break. These are reiterated through storytelling by those who were there to experience them, or who had the knowledge passed down well enough to retell it. The important part of this is that we impart the lessons we learned to our students.

Here’s an example of the word cues used in a class we teach around causes of outages: nature, humans, power, cooling, bad config pushes.

Each of these words is tied to stories we tell, real-world examples of these things happening, and in some cases, humorous outages we would never have expected. Students remember these things when they move into their teams and have a model of reliability to plan for. We further this experience through our hands-on breakage activities, which give students the ability to use real tools and real systems to solve real breakages and learn through the process.

In our orientation program, the students’ first breakage activity happens on their second day of training. Throughout the experience, we stop and see what students have discovered. They tell us the beginnings of the narrative or story behind what’s happening. We do this a few times during the experience, which continues to build the story. By the end of that first breakage, after they’ve solved the problem, we ask two questions:

  • Can you succinctly tell me what the problem was and how you triaged, mitigated, and resolved the problem?

  • What went well and what could have gone better?

In every instance, students are able to tell the story back to the instructor, identifying the tools and systems they used to triage the problem, explaining why the problem was happening, and detailing what they did to mitigate the problem and resolve the breakage. They further help to tell the story by adding their own personal ideas around how things went.

Why is this important? A primary principle of SRE is that we learn from failures. Most of our failures have valuable lessons that can be applied to system design, monitoring philosophies, code health, and/or engineering practices. By turning these lessons learned into stories, the memory of actions, words, and visuals becomes a method for recall in future work. These storytelling activities are also a part of the SRE culture at Google. They happen in various events throughout the SRE organization, whether through Wheel of Misfortune exercises (see Chapter 15 of Site Reliability Engineering), Ops Reviews,1 lunch conversations, and so on. Retelling these stories and lessons learned, much like world cultures do, keeps our collective memory of history, practice, and culture alive.

Build the Vocabulary

Think back to your very first meeting at your organization. What do you remember? How many words, three-letter acronyms, and bits of jargon did you understand? All of these are the vocabulary of both the company and the culture. They play an important role in identity and in establishing a common understanding of meaning within a company. Google is no different. Throughout the training, we give students a common vocabulary for interacting with the rest of the company.

Consider Your Learners

Closely related to building your learner profile is considering the learners attending your training. Who are the people you are planning to train? Building a profile of your target audience gives you an idea of how to design your training. Things to consider about your learners include the following:

Roles

How many different roles is your training going to target, and how widely varying are those roles?

Geography

Where are your learners located?

Population

How many people will you train?

Prerequisite knowledge

What should students already know?

Travel

Do people need to travel for training?

We identify all of this early because it influences the design choices we make when developing training content. Of course, these are all logistical questions. There are other aspects of the learner that we have to consider as well.

Adult Learners

Training programs often overlook that adult learning is different from how children or teenagers learn.2 How adults learn is a question of motivation. Adults choose to learn things for different reasons, mostly because they see value in learning something that is of interest to them. Adults have a readiness to learn because of that value. Adult learners in corporations also come with some level of industry or educational experience to build from. There is an entire field of study related to this, called andragogy (as opposed to pedagogy). SRE EDU has worked within three main premises of adult learning:

  • Learning is experiential—adults learn what they do.

  • Adult learners transfer what they learn from one context to another.

  • Adult learners must identify why the content they are learning is important.

Adult learners also deal with fears, uncertainties, and doubts about their abilities. Fear shows up in the form of questioning whether they made the right choice to be an SRE. Uncertainty shows up in questioning whether they have the skills to do the job, or how they will be viewed by colleagues. Finally, doubt shows in the form of imposter syndrome. We’ve seen this in our daily work lives, and some of us have personally experienced these feelings. In our training, we considered each of these cases in the choices we made when designing our curriculum—from the content depth to the activities.

Learning Modalities

SRE EDU has various learning modalities, depending on the content. When choosing a training mode, think of what your students will be doing and learning. Different people learn best through different modes of learning. Different training topics also lend themselves to different modalities.

Instructor-Led Training

SRE EDU uses instructor-led training for our orientation program. It’s the first training students go through. We’ve seen that having classes in person is the most effective way to introduce our students to the SRE world. Culture classes work well here because of how we teach culture through storytelling and vocabulary building. We also have a lot of hands-on activities that are better executed through instructor/facilitator observation. Observation by experts in their field becomes important for in-the-moment evaluation of how well students are progressing, especially new hires.

Reasons for choosing instructor-led training:

  • Your students benefit from expert advice/Q&A from a diverse group of people in your organization.

  • New skills and knowledge get reinforced in real time.

  • Peer-learning and networking are an important part of your company culture.

  • Real-time feedback/observation is crucial to your training.

  • You have exercises that would benefit from facilitation by an expert.

Instructor-led training does require a higher upfront investment in preparing your students and prospective trainers. We’ve found it vitally important to set expectations with our students early and be upfront about the experience. They should know what is required of them, how long the training will be, where they should be and at what time, and what to expect throughout the experience. This led us to create a short introduction to the entire program that kicks off their training. You’ll also want to prepare your students by letting them know where things like restrooms and break areas are. For instructors, we prepared items such as lesson plans, slides, and help guides. We also run a Train the Trainer session for our new trainers. We discussed Train the Trainer in the context of the hands-on activities earlier, in Chapter 3.

Self-Paced Training

Self-paced training shows up in various forms. It could be a video, a document, or some form of e-learning module. The point is that self-paced training is contingent on the student guiding the pace of learning. SRE EDU’s primary use of self-paced training is in videos. SREs have lots of knowledge about many things, and in SRE at Google, we have a culture of sharing knowledge. Once each year, SRE EDU hosts an education program in which any SRE can build and teach a class for their fellow SREs and other interested parties. We record all of these talks and post them to a page. The content varies widely, but this allows us to broaden our training topics by using our own team members. It also allows us to quickly create a library of courses for varying skill levels. These videos are available for anyone to watch, whenever they want to. Students control their own pace of learning.

Another aspect of self-paced training is scale. Our SREs are globally distributed. Getting SREs into one place for training is not always efficient, and generally difficult. Think about visas, travel, hotels, and transportation, and then multiply that by the number of SREs you have. The costs alone are prohibitive for a large organization. SRE EDU has made self-paced study a primary mode of training for our continuing education efforts.

Here are some reasons for choosing self-paced training:

  • A globally distributed team

  • Remote employees

  • Content with high reusability and little change

  • Activities that are best done solo

  • Target audience is experienced staff

  • Less prescriptive training

  • A population too small to justify an in-person class

Some examples of the types of courses that tend to work well with self-paced training include account setup courses, tutorial classes, programming classes, and product overview classes. This is by no means exhaustive, just examples that we’ve seen work well.

Mentoring and Shadowing

Mentoring and shadowing is a training mode we encourage individual teams to use. It’s a good way to get our employees accustomed to how their team operates. For teams that are highly specialized, this type of training often gets new hires up to speed faster. The interactions are more personal and can be tailored to individual needs. We use this technique in the SRE EDU program when bringing in new instructors. When we have new people who want to teach one of our existing instructor-led classes, we ask that they shadow a class with an experienced instructor first.

Shadowing gives our new instructors an example of how a class is run. They see how topics are taught and what examples or stories are used; they can ask the instructor questions, learn what types of interaction happen between students and instructors, and watch how classroom management is done. By having a shadow/mentor process, we generally ensure that the content is taught consistently, and we share responsibility for that consistency.

Here are some reasons for choosing mentoring and shadowing:

  • Onboarding new hires into the practices and norms of a team

  • A team has use cases or practices for a highly specialized tool

  • You have just a few people with specific knowledge you want to pass on

Create Learning Objectives

Learning objectives are generally defined as outcomes or actions that you want your learners to be able to do when they complete the training. To help with this, we use a model or framework for building learning objectives. We frame all of our classes and activities with “Students-Should-Be-Able-To’s” in mind. Learning objectives in SRE EDU use the following formula:

<DO SOMETHING> <USING SOMETHING> <AT SOME LEVEL (if applicable)>

At the end of this training, a student should be able to do the following:

  • Use <job tool> to identify how much memory a job is using

  • Interpret a graph in <monitoring tool> to identify the health of a service

  • Move traffic away from (i.e., drain) a cluster, using <drain tool> in five minutes

In these short examples, notice a few things. We use action verbs to establish the behaviors that we want to observe or measure. Action verbs imply an activity, something that someone is doing. We avoid words like “understand,” “know,” and “be aware of,” because we can’t measure those effectively. The objectives also contain the conditions under which the learner performs the tasks; in some cases, that is a system or a tool. Finally, if a task is time-bound or accuracy-dependent, we assign a value for the level a student needs to hit in order to perform the task successfully.
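Because the formula is so regular, it can even be captured as data. Below is a minimal, hypothetical sketch (the field names, verb list, and check are our own illustration, not an SRE EDU tool) showing how objectives written to the <DO SOMETHING> <USING SOMETHING> <AT SOME LEVEL> pattern could be stored and sanity-checked for measurable action verbs:

```python
from dataclasses import dataclass
from typing import Optional

# Observable action verbs; "understand", "know", and "be aware of" are
# deliberately excluded because they can't be measured effectively.
ACTION_VERBS = {"use", "interpret", "move", "identify", "mitigate", "resolve"}

@dataclass
class LearningObjective:
    action: str                   # <DO SOMETHING>, led by an action verb
    condition: str                # <USING SOMETHING>: the tool or system
    level: Optional[str] = None   # <AT SOME LEVEL>, if applicable

    def is_measurable(self) -> bool:
        # Rough check: the objective starts with an observable action verb.
        return self.action.split()[0].lower() in ACTION_VERBS

# The drain example from above, expressed as data:
drain = LearningObjective(
    action="move traffic away from (i.e., drain) a cluster",
    condition="using <drain tool>",
    level="in five minutes",
)
assert drain.is_measurable()
```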

Concrete learning objectives (behaviors) lead to better measurability and observability of those behaviors. At this point, the data that you gathered while identifying the problem and training needs becomes important. During that phase, we asked: what are the skills/knowledge/activities/behaviors that you want people to do? Notice we asked to do. This is important, because now you have a framework for developing learning objectives.

Learning objectives that are clearly defined and measurable help designers develop training content and materials that meet those objectives. It’s also a clear way to present the intentions of the training to stakeholders, and how you’ll measure the effectiveness of training. We know training is working as intended when students demonstrate what the learning objectives detail. Because we frame them as actionable, we observe in real time students meeting or not meeting those objectives.

Designing Training Content

Training materials take many forms. The common ones include slides, workbooks, handouts, lesson plans, etc. Instructional designers tend to design with at least one of these in mind. However, the word design is more than just the tangible materials that students or instructors use. In this section, we highlight a few of those less-tangible items and how we designed training content with them in mind.

ADDIE Model

One of the most common models for creating training content is the ADDIE model. Instructional designers tend to be well versed in this model for development. ADDIE stands for Analyze, Design, Develop, Implement, and Evaluate. It is a staged process for how to design training as well as things to consider while designing.

Even if you don’t have an instructional designer, you can employ the ADDIE model on your own. This is not an exhaustive study of the ADDIE model, but we want to keep the material in the realm of practical use, so we cover some key concepts here. Note that the ADDIE model generally proceeds linearly through its stages, but you can adapt the model to your training needs. Let’s talk about the ADDIE model in detail next.

Analyze

In the analysis stage of the ADDIE model, instructional designers do a fairly exhaustive inventory of training needs. This includes identifying the problems training intends to solve, and defining learning objectives, behavioral outcomes, delivery options (learning modalities), and timelines. Analysis deserves a large share of development time, because getting the analysis right saves time and resources as you progress further.

Design

The design phase is primarily focused on the learning objectives. Through the learning objectives, you determine evaluation strategies and exercises, and you identify the content needed (such as instructional guides and media) and the student experience. During the design phase of our more hands-on curriculum, SRE EDU created a design document similar to software design documents. We did this because we wanted a single source of truth for what the training would look like and how the training would progress.

Design documents get very detailed and definitely do not make for good leadership discussions. However, they are a useful tool for instructional designers and teams to comment, revise, and collaborate on the training content plans. In the design document, we also created our timeline and launch strategy. Start with a basic idea of what you are planning to do. You can see a more detailed design document in Appendix A.

Begin with an overview of your training. Think of this as your elevator pitch: a succinct description of the problem that you are trying to solve, and why and how this training solves it. Create a learner profile. Identify who your intended audience is and why. Things to consider here are the roles or job positions your training targets, where those people are located, and how many people you plan to train.

Review the content that you are planning to train and give detailed information about each course. Include the timing of each class (60 minutes, 30 minutes), your learning objectives, and an outline of the content. If you have source material that you are referencing, document it so that you have a repository for all your source materials. If you are using subject matter experts, document who they are. Just as you identified who your learners are, you also want to identify your possible trainers. List who will teach the courses. You do not need to identify specific individuals at this stage; instead, identify the skills you are looking for, what knowledge and skills the prospective trainer should have, and what teams these individuals should come from.

Another thing that you want to identify early is risk. Risks are things you are aware of that could derail or delay your training project. When identifying risks, talk to others who have gone through a similar project. The insight that others have is invaluable when designing training programs. One risk that Google SRE EDU found early was that some designs we had for our hands-on activities were not possible because the systems would not allow them. That meant we had to work with our partner teams to figure out hands-on activities that were possible.

Success metrics are closely tied to risks. Success metrics are measurable data points that you use to determine how well your training is performing. One way to do this is to look at how you evaluate your training (see “Evaluating Training Outcomes” later in this chapter). Split success metrics into one set for the overall program and another to drive go/no-go decisions for launch. Next, document a launch plan.

Launch plans should include criteria for launch, and the decision to launch should meet those criteria. The launch plan should also have a general plan for training your trainers, piloting your classes, and a plan for keeping your training fresh (updated, accurate, and reflective of reality). It should also include any administrative tasks, and a training schedule for students. Closely related to launch plans are communications plans.

Communications plans are often overlooked, causing a last-minute panic to get the information out to the intended audience. By planning for communications early, you can draft, review, and test your communications strategy early and get feedback to improve before launch. Think through how you’ll communicate to your audience. If you plan on using emails, draft them beforehand and get feedback.

Finally, documenting sign-off partners and timelines is helpful for keeping track of how the development phase gets done. Sign-off partners are individuals who are key decision makers and influencers in your organization who help implement your training. This list of people also identifies key stakeholders with whom you want to keep in constant communication during your design phase.

Timelines at this stage can be a little difficult to determine with complete accuracy. Instead of focusing on dates, pay attention to the amount of time tasks might take. Refine and update those tasks as you progress. When developing timelines, give yourself some padding because unexpected things always come up. Having that extra padding is a way to mitigate risks to your schedule or development cycles.

Storyboarding during the design phase is also useful. Storyboarding is walking through the training ideas visually in a systematic or linear approach. Think in terms of a movie. You sketch plot points, interactions, action scenes, and so on. Various elements in the training get sketched out. After that’s done, you step back and look at the flow. You find missing points and gaps in the training that need to be connected. More important, you play around visually with the flow. What would happen if I moved one particular aspect of training somewhere else? How does that work? What content support would this change need? One word of caution: storyboarding works well if you are planning a training that has a linear or step-based process, but applying it to nonlinear or branching processes can inadvertently make your training linear, or seem linear, to your students. For example, take monitoring training. We introduce monitoring via a philosophy course, but students (later in their career) might choose to go further into monitoring via other training topics and tools related to monitoring. There is no sequential path to follow.

Develop

The next stage, development, is where instructional designers begin to assemble the training. We don’t recommend making a large number of changes to your design plans during this stage. The development stage includes material creation, integration of any technology pieces, and the building of activities, surveys or evaluative instruments, and graphics. We take what was learned in the analysis phase, the learning objectives, and the decisions made in the design phase, and turn them into learning activities or experiences. This varies depending on the modality of training you chose.

Implement

Implementation deals with timelines and launching. It’s all about getting the content, materials, rooms, trainers, and everything else you’ve planned ready for use. Implementation has a feeling of finality. In SRE EDU, we used implementation to start our pilot plan. One of the criticisms of the ADDIE model is its inflexibility around iterating and changing midstream. To combat this, we planned for multiple pilots of the content, gathering feedback and incorporating it back into the curriculum before a final launch. Pilot plans vary depending on the size and type of training you are building. Hands-on training requires a pilot that tests not just your content and learners, but also the systems/tools you are using. Think of what you are building, and plan for pilots ahead of time. Launching and then discovering major problems you don’t have the time or resources to fix is frustrating.

Evaluate

Evaluate is the last step in ADDIE. Often, evaluation is considered the final step in training. Ideally, you want evaluations to happen throughout your training, which is where your learning objectives come into play. SRE EDU creates points of evaluation through the activities we build, to give students and instructors an idea of how they are progressing. This can be through storytelling, activities with a definite outcome we expect all students to experience, or hands-on practice.

Final evaluation in the form of surveys is another way SRE EDU has gathered feedback. In the beginning, we surveyed our students just once, post-training. Now we gather feedback at various points, pre- and post-training (see “Evaluating Training Outcomes” later in this chapter). The idea behind this evaluation is to use the feedback to improve content and stay in tune with the experiences of the SREs as time progresses.

We’ve covered some basic ideas around the ADDIE model. There are, of course, other things we’ve learned from developing SRE EDU that we feel would be helpful, outside of the instructional design models that we use.

Modular Design

When designing a curriculum, we’ve found that designing for modularity is important for flexibility. Much as when designing systems, we want to reduce single points of failure. Tools, technologies, priorities, goals, and even leadership change over time, so tying your training to any of these gives you points of failure when they change, and can cause unnecessary toil as the years pass. Therefore, SRE EDU has adopted a modular style of development based on reducing single points of failure, modularizing so that we can incorporate the best practices and tools of the moment.

Let’s take our monitoring course as an example. Our class focuses on monitoring philosophies rather than teaching the entirety of a specific monitoring tool. This enables us to teach the principles of monitoring first, and what we mean by monitoring. Then, we apply those principles to the current tool of choice. The principles (generally) should not change for quite some time, and the strategies for what and how we monitor should be stable. We generally leave those in place and keep those principles as a core part of the class.

The hands-on, practical application of those principles can then be placed in a module within the class, which can be replaced as tools, best practices, and so on change. This happened a few years ago, when we transitioned from our legacy monitoring tool to a new tool. The principles still applied. What changed was how we incorporated the new features, layouts, dashboards, and so on. The ability to pull and replace was much easier as we kept the training experience modular, by design.
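As a rough sketch of this structure (the course and module names here are illustrative, not real SRE EDU artifacts), the stable principles form the core of a class, while the tool-specific, hands-on module is a field that can be swapped out when the tooling changes:

```python
from dataclasses import dataclass

@dataclass
class HandsOnModule:
    tool: str              # current tool of choice; expected to change over time
    exercises: list[str]   # tool-specific labs that apply the stable principles

@dataclass
class Course:
    title: str
    principles: list[str]    # stable core content; rarely changes
    hands_on: HandsOnModule  # replaceable without touching the principles

monitoring = Course(
    title="Monitoring Philosophy",
    principles=["why we monitor", "symptoms versus causes", "alerting strategy"],
    hands_on=HandsOnModule(tool="legacy-monitoring-tool",
                           exercises=["read a latency graph"]),
)

# When the organization migrates tools, only the module is replaced:
monitoring.hands_on = HandsOnModule(tool="new-monitoring-tool",
                                    exercises=["build a dashboard"])
```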

No single points of failure

A single point of failure is a part of an overall system that, should it fail, causes the entire system to stop working. For each of our hands-on activities, there are at least three points or clues that can be used to help solve the problem. We do this because we don’t want students to just discover a single point of failure and then find the solution from that. If there is only one clue or one way to discover the problem and students miss it, they effectively cannot get to a resolution. When designing hands-on training, plan for how students discover problems, and eliminate single points of failure.
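One way to enforce this during design reviews is to record each breakage’s independent discovery paths explicitly. A minimal sketch, assuming a hypothetical review check rather than SRE EDU’s actual tooling:

```python
breakage = {
    "name": "backend crashes on a specific hashtag search",
    "clues": [
        "error-rate spike on the service's monitoring dashboard",
        "crash-loop messages in the backend job logs",
        "SLO burn visible on the service overview page",
    ],
}

# Design-review check: a breakage must be discoverable in several
# independent ways, so missing one clue never blocks resolution.
assert len(breakage["clues"]) >= 3, "breakage has a single point of failure"
```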

Designing for modularity comes back to your learning objectives. Modular design is about designing around your learning objectives, not about content coverage. If you have content that is not covered, don’t add it to a course just because it fits the topic. Verify that the content meets your learning objectives. Keep your learning objectives at the forefront while you arrange and classify all the content. Identify which items are principles, which are tool-specific, which are best practices, which are patterns or antipatterns, and so on. Identify what content changes frequently and distill the general principles in that content. Based on what you identify, relate the items back to your learning objectives. After you have that categorization complete, you begin to see how each topic fits with your overall training program.

Train the Trainers

For instructor-led courses, preparing your trainers is important. It gives them two things:

  • Confidence to teach the class, especially hands-on activities

  • Ability to set the larger context of the class(es)

SRE EDU orientation is a collection of classes developed and cadenced in a way that builds skills with our learners. Because our instructors are SREs, they might teach as few as one of our classes per session. New trainers might look at their class in a vacuum, meaning that the class stands alone, and they might teach it that way. SRE EDU developed Train the Trainer to address this. Each class builds on and connects to the others to support our hands-on activities. Because of this, we wanted our trainers to develop teaching strategies using the overall curriculum story (things break in interesting ways / we teach you how this happens and methods to combat that). The overall curriculum story acts as a guide for the trainers.

When creating courses, think about your trainers. During Train the Trainers, we cover the following:

  • Overview of the curriculum (purpose and overview of the classes)

  • Schedule that students follow to go through the classes

  • Rundown of each class—its pitfalls, gotchas, lessons learned from pilots, and so on

  • Learning/teaching strategies

  • Practice time for hands-on activities

Some of our learning objectives for the Train the Trainer course are as follows:

  • Demonstrate facilitation skills used in a hands-on scenario, using the teach-back method in class

  • Identify how the overall curriculum supports the hands-on activities, using the Triage, Mitigate, and Resolve model

  • Use the lesson plans, one-pagers, and/or guides to plan your teaching strategies

When trainers identify the relationship between the courses they teach, the objectives and purpose of the curriculum, and the activities, trainers have an easier time teaching. To help with some of this, we also create lesson plans that highlight talking points. It’s important to note that these are not scripts. We don’t want our instructors to read verbatim from the lesson plan; that defeats the purpose of classes. What we really want is for the SREs who are teaching the class to share their domain knowledge and experiences, within the constraints of the program. The lesson plan helps our trainers identify the purpose of the slide, video, and other training materials, and gives them some ideas on what could be said, not what they should say. Trainers should bring their own stories and experience to the class.

Pilot, Pilot, Pilot

Much like launching a service or software program, we want to canary test our training. When we launch new training, we do multiple pilots and staged rollouts. Piloting is the process of testing everything in your training. You are identifying how well your training materials work and how well the activities work. The benefit of pilots is finding those unanticipated bugs in your training. Piloting is particularly useful for observing students’ reactions, getting a pulse on what the class flow is like, and seeing whether your timing is correct. We let our students know up front that they are participating in a pilot.

We also give students time at the end to give feedback. We ask questions such as: What worked? What could be better? What was not clear? What would you do differently? All of these questions help guide our iterations and design, before a final launch. Because SREs at Google are also globally distributed, we pilot in different locations. We learn a lot about what gets lost in translation. An example of this is when an American trainer made an analogy to a rodeo for one of our team members in Zürich. A look of bewilderment washed over the student’s face because he did not know what a rodeo was. Had this training gone out to others with that analogy included, a large portion of our teams would have spent their time using Google Search to figure out what a rodeo was instead of focusing on the class.

One pilot for medium and large training programs is not enough. Doing multiple pilots saves you time by avoiding the toil of launching and iterating at the same time. SRE EDU always has a pilot plan for each training. We pilot in multiple locations, with different audiences, to get as much feedback as we can. We plan a few days to a week of update time, revise the content based on feedback, and then run another pilot. Generally, for large programs, three pilots give you a lot of information to work with and will address most, if not all, major concerns.

Making Training Hands-On

SRE EDU Orientation started as a collection of classes that were deemed important to all SREs at Google during their first two weeks on the job. The topics ranged from culture classes to general courses on tools used by a majority of SREs. These classes were put together into a collection of courses that became the first iteration of SRE EDU Orientation. The classes were lecture heavy, with little to no hands-on experience for students to apply what they were learning.

The most frequent feedback we received early on was to make the training more hands-on. What the students were telling us is that they wanted a way to immediately apply learned skills. Complex and oftentimes abstract principles stick better through hands-on activities. It’s one thing to talk to students about draining a cluster; it’s quite another when a student can drain a cluster themselves, view the monitoring data to validate that the drain worked, and analyze how they know that it worked. We investigated further what problems we were trying to solve. It wasn’t enough to create something that was a gimmick or fake. We wanted to create something real, using real production services and the exact same tools that experienced SREs use at Google. We wanted to give an experience that was as real as possible. For SRE EDU to move from lecture-based training to hands-on training, we started—again—with defining learning objectives, and with the idea that we should break real things and fix them using real tools.

We defined a few learning objectives for the hands-on activities:

  • Use the page-receiving device to acknowledge a page within two minutes of paging

  • Use the monitoring tool to investigate what “things” are affected by the outage

  • Use learned skills to identify the problem from each breakage

  • Mitigate the identified problem using taught tools and processes

  • Resolve the issue and validate the resolution using monitoring graphs and dashboards

From these learning objectives, we created a service in production for the specific purpose of being broken. It turned out that this was very difficult to do at Google, because our systems tend to prevent breakages/outages from happening in the first place. We were, however, able to create four breakages that we use in the class.

The Breakage Service

As mentioned earlier in Chapter 3, “Current state of SRE orientation at Google”, we created a very simple but functional photo upload service called the Breakage Service, which contains many parts of a real, scalable service. We wanted to mimic a real production service as much as possible. The frontends and backends of the service are distributed over multiple datacenters, and we also have some load generators running to make sure the service looks “alive.” The Breakage Service has a storage component, complete with quota and redundancy, and is housed in multiple datacenters, in multiple regions. It also has a user-facing frontend, monitoring, code, and network components, a pager queue, an on-call rotation, and so on. All of this runs on real Google production systems, and just like any SRE team, the SRE EDU team gets paged if dependent systems go down.

To give students an experience similar to a real SRE team’s, we don’t have just one installation of the service (which we call a stack)—we actually have multiple complete instances. These stacks were built using Google’s frameworks, which enforce SRE best practices, for example, for continuous integration/continuous deployment (CI/CD) and for monitoring and alerting. When students work with the Breakage Service, they have full ownership of their stack. They work in groups of three, and can bring down jobs, erase storage if they want to, and manipulate the service just like any other production service. They do this without causing harm to other students or “real” Google production jobs.

After the students get used to the service, we hand them a mobile phone that acts as a pager and instruct the system to break. The students then receive a page. We designed the curriculum such that each class is followed by a breakage exercise related to that class. That way, the students are able to practice working with the tools used when handling an outage, and they follow Google’s incident management procedures to Triage, Mitigate, and Resolve (TMR) the problem.

Part of an SRE’s job is troubleshooting and finding root causes. Therefore, the breakage service we built has some deliberate weaknesses engineered into it, to simulate outages and give the new SREs troubleshooting exercise. For example, we have a deliberate bug in the backend’s “search by hashtag” code: when a specific hashtag triggers it, the binary crashes. When we “turn on” the breakage, we instruct the load generators to change the mix of hashtags they request the service to look for. That way, more random crashes occur and eat away at the service’s Service-Level Objectives (SLOs; see Chapter 4 of Site Reliability Engineering).
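The real service’s code is internal, but the pattern is simple enough to sketch. In this hypothetical version (the tag name and function shapes are our own assumptions), the planted bug crashes the handler on one poison hashtag, and “turning on” the breakage just changes the load generator’s request mix:

```python
import random

POISON_TAG = "#trigger"  # hypothetical hashtag that trips the planted bug

def search_by_hashtag(tag: str) -> list:
    """Backend handler carrying the deliberate, latent weakness."""
    if tag == POISON_TAG:
        raise RuntimeError("simulated crash in hashtag search")
    return []  # normal path: look up matching photos (elided)

def load_generator(breakage_on: bool, requests: int = 1000) -> int:
    """Changing the hashtag mix is what 'turns on' the breakage."""
    tags = ["#cat", "#dog", "#sunset"]
    if breakage_on:
        tags.append(POISON_TAG)  # poison tag now appears in the mix
    crashes = 0
    for _ in range(requests):
        try:
            search_by_hashtag(random.choice(tags))
        except RuntimeError:
            crashes += 1  # in production, each crash eats into the SLO
    return crashes
```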

The breakage service is one important aspect of our training. The service, when paired with the curriculum, became a powerful teaching tool. To take this service and design around it, the SRE EDU team went back to instructional design principles to learn how best to implement it. We came away with two key principles for how hands-on training becomes effective in practice: play to learn and collaboration among learners.

Play to learn

One of the ways hands-on learning works to reinforce and expand on skills is through play. When we developed the breakages and training, we wanted learners to feel that they were playing with the service. Students are given the space and time to explore, learn through failure, apply the scientific method to a problem space, and work through TMR. As mentioned earlier, adult learners in business environments are often stressed by being new to a company, by pressure to “pass” a class, and by confidence concerns. We wanted to limit this as best we could. Hence, the breakage system feels more like play, especially in the first encounter.

Collaboration among learners

Playing is generally a social activity. Think of sports, board games, parties, and so on. Therefore, we wanted to make sure that learning in this environment maintained a social aspect. Our hands-on activities are done in groups of three. We believe that this collaboration is both reflective of the work environment and a healthier way to learn new skills. We break students into groups based on their self-reported industry experience. We make sure to mix the industry experience level within teams. No one team should have all the industry veterans, nor all the newbies.

During our pilots, we tested various group sizes and settled on three as the ideal. The reason is that with too large a team, you have individuals who won’t participate as much; with too few, getting stuck becomes easier because fewer ideas are shared. This breakdown also helps with imposter syndrome. When teams comprise individuals from different levels and types of experience (newbies, industry veterans, and internal transfers; refer back to Figure 2-3), teaching happens in all sorts of ways between the team members. When a newbie hears an industry veteran say, “I don’t know,” they realize that it’s okay to not know everything, even after 10 or more years of experience.

Scaffolding: Build, Stretch, and Reach

The hands-on activities are built to scaffold the learning experience. In other words, we start with very strong hand-holding to build skills and confidence. Then, as we progress through the breakages, we do less and less hand-holding, until the final breakage, when the vast majority of students are progressing on their own.

We use a model for each of the breakage scenarios that students might not consciously feel (but our trainers do). We do this with the understanding that in each progressive scenario, we want our students to build, stretch, and reach their skills.

In Table 4-1, we show four of our breakage scenarios and the teaching style we use for each. In Breakage 0, the instructor is driving the entire time (build). They run through all the steps from start to finish, from pager to resolution. This demonstrated behavior is a model we want students to build from. In Breakage 1, the instructor allows students to drive, from pager to resolution, but provides a lot of guidance. We remind students of the tools they’ve learned and ways to navigate them, suggest looking at certain monitoring views, and level-set frequently.

Table 4-1. Hands-on activity teaching stylesa

Breakage 0: Sage on the Stage
  • Drives entirely
  • Follow along with me
  • Build skills

Breakage 1: Coach
  • Gets you started
  • Collective level-setting
  • Reminds students of tools/systems
  • More interaction with teams
  • Stretch skills

Breakage 2: Guide on the Side
  • Less interaction with teams
  • Answers more tool/service use questions
  • Collective level-setting
  • Stretch skills

Breakage 3: Consultant
  • Answers questions from individual teams
  • Less collective level-setting
  • Watching teams more than interacting
  • Reach skills

a Teaching styles obtained from the article “From ‘Sage on the Stage’ to ‘Guide on the Side’: A Good Start”.

Level-setting is when we take a pause in the breakage, and it happens throughout each breakage. During each pause, we ask students a few questions as a group, such as these:

  • What tools have you used so far?

  • How did you discover what you found?

We ask questions in this way so that we don’t have students spoiling the activity, while allowing students who might not be quite as far along to try out some of the suggestions other students have provided. Level-setting allows instructors to get a pulse on the class, and encourages students to teach one another as a larger group.

In Breakage 2, we move to less hand holding, but still answer questions as students have them (stretch). We continue to do level-setting. Finally, by the time students get to Breakage 3, they are operating independently (reach). Often, some students finish early and can help other students, if needed. The cadence of training was built this way to help students confidently apply the learned skills that they will use in their day-to-day work.

We let students know that the hands-on activities are there to help them build confidence in what they’ve learned and apply it in a safe environment (see “Build confidence” in Chapter 2). We want students to make all their mistakes here, with the activities, and learn from those mistakes. We preface every hands-on activity, every time, with a brief conversation highlighting that troubleshooting is a series of failures, and it’s okay to go through a process of not solving the problem, especially if it helps you rule things out—just like the scientific method. We also create moments of discovery—called eureka moments—throughout the triage and mitigation steps, and have our SREs simulate working very closely with their developer teams by role-playing SRE-to-Developer communication directly in the training. Our hands-on trainers act out the role of SRE or Developer, to give students practice in escalating issues to a developer or SRE.

Evaluating Training Outcomes

As mentioned earlier, evaluating training is a process that happens throughout the experience—not just at the end. During training, structure your learning objectives to help assess how students are progressing through the courses. Because the learning objectives are based on things students are doing (action verbs), you observe those in real time. That is one data point you use to evaluate your students’ progress and training outcomes.

When We Evaluate

We also want to know how our students perceive their training outcomes. Therefore, we ask them, multiple times. We like self-reported data because it gives us an idea of how the training affects the growth of students, as they perceive it. We survey our students at different times: before their training, after their training, 30 days after, and six months after. We also survey the hiring managers who receive our students. We ask managers the same questions and get their take on their employees’ skills post-training. As with the students, we survey managers 30 days and six months after they receive their employee. Finally, we also survey our instructors about their training experience. We use all this data for the explicit purpose of improving our program. We are not evaluating the students here; it’s all about how the program is working or not working, and it gives us data points to justify making changes as needed. We explicitly tell our students that.

We evaluate at different points because as students ramp into their job roles, their ability to recall and apply concepts becomes real. Sometimes, what they thought they could do was not as solid as they thought, which we capture through feedback. Often, they realize they can apply and recall things very well, signaling that we are teaching the right things. In some cases, we discover that we need to improve or add something to the training that was not there before.

How We Evaluate

Surveys often take the form of a “rate yourself on a scale” model. In our early iteration of SRE EDU Orientation, we used a point-scale model for our questions. We discovered that the signals it developed were too noisy. If you have a seven-point scale, you might ask someone a question like, “How well can you identify the health of your service by using monitoring graphs?” They might rate themselves a five. However, what does that mean? Does that mean yes? No? Maybe? The result was that we made presumptions about students’ responses and found no actionable results.

So, instead, we adopted a more measurable “scale” for our surveys. We partnered with Google’s broader engineering training team to adopt the same verbiage around survey data. Each question we ask is tied back to skills that we know are an important part of the job, and to the learning objectives. The scale reflects actions that students can think about and answer against. Our scale is as follows:

  1. No idea where to start

  2. Could do it with someone’s help

  3. Could read documentation and figure it out myself

  4. Could do it mostly from memory

  5. Could teach others mostly from memory

For example, we ask students to rate their ability to use monitoring (and alerting) to make rational decisions about the production service that they support. Students choose from the five choices. Notice also that the five student response options are action-based, much like the learning objectives. Having the responses worded this way helps identify a level of proficiency for each surveyed task. One thing to note: SRE EDU believes that teaching is a function of learning (see Chapter 1, “Teaching to Learn”). We know how well someone understands something if they can pass that information on to others. Because of this belief, our highest rating is that a person could teach others from memory. We built that principle into our survey response scale. Also, note that none of the options are necessarily bad. Having a student read documents to solve a problem is not a bad thing; it implies they know where to look and how to use a company’s knowledge base of resources to find answers. Having a student get help from another is also not bad. It means they have an idea of who to talk to, whether that other person is on their team or not.

Now, if a majority of new SREs choose “no idea where to start” in the post–SRE EDU Orientation survey, we know we have a problem. There is something in our monitoring class that we need to change, so we start looking at the content, the trainer involved, and the hands-on activity session’s feedback. In other words, based on both individual and collective responses, we have an idea of what to do next to improve the training.
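A minimal sketch of how such survey data might be aggregated (the threshold, skill names, and helper function are illustrative assumptions, not SRE EDU’s actual pipeline):

```python
from collections import Counter

# SCALE documents the survey's action-based wording for each rating.
SCALE = {
    1: "No idea where to start",
    2: "Could do it with someone's help",
    3: "Could read documentation and figure it out myself",
    4: "Could do it mostly from memory",
    5: "Could teach others mostly from memory",
}

def flag_problem_skills(responses: dict[str, list[int]],
                        threshold: float = 0.5) -> list[str]:
    """Flag surveyed skills where a majority answered 'no idea where to start'."""
    flagged = []
    for skill, ratings in responses.items():
        counts = Counter(ratings)
        if counts[1] / len(ratings) >= threshold:
            flagged.append(skill)
    return flagged

post_training = {
    "use monitoring to assess service health": [1, 1, 1, 2, 3],
    "drain a cluster with <drain tool>": [3, 4, 4, 5, 2],
}
print(flag_problem_skills(post_training))
# -> ['use monitoring to assess service health']
```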

Of course, there are many ways to evaluate training, and we’ve shared only the one that the Google SRE EDU team has chosen. Evaluating training outcomes is something you want to think about early in your training development, and tie back to your learning objectives. Also, think about how you present this data to your colleagues. In our case, we have regular meetings with our stakeholders to show the data and guide decisions for our entire program. It’s much easier to make decisions and changes based on data and evidence than on presumption.

Instructional Design Principles at Scale

The preceding discussions relate to large, hands-on training programs like Google’s SRE EDU, and they are appropriate for large company trainings; small and medium-sized companies might not have the ability to do all of this. Google is a large corporation, with resources that many companies do not have. For smaller companies and SRE teams, our recommendation is to focus on four main things:

  • Create learning objectives

  • Follow a learning design model (ADDIE or another)

  • Create a few simple activities that allow students to “play” and apply learned skills

  • Create an evaluation method that measures how well your training is meeting objectives

Learning objectives are critical to a training program and cannot be ignored. Defining your learning objectives is the first step in designing training. Always refer back to your learning objectives when designing training. When cornered by decisions or problems in your design, ask yourself, “Does this meet my learning objective?” If it doesn’t, either the learning objective needs to change or you discard or change your design idea.

We also recommend following a learning design model like ADDIE because it gives you a framework and basic instructional design tools to develop your training in a generally linear fashion. Following a model helps you define your workflow in an organized way and keeps you moving forward. The cost of following a model is low, and how you organize your tasks falls in line with the different stages of the model.

Hands-on activities should be relevant to the job your trainee will do. It can be something simple but impactful. For example, if you are trying to teach your students mitigation techniques for a problem, use the existing outage information you have and turn it into a case study or a game. Have students identify where things went wrong and why. Have students describe different ways in which they could mitigate the problem and the pros/cons of each. Use your learning objectives to align your activities with expected behavioral outcomes.

Creating an evaluation method tells you how your learning program and students are doing. Use your learning objectives to build the evaluation model, and implement a rating system to give you actionable results. In other words, based on what your students report, you should be able to identify areas to improve your training.

These instructional design tips are meant to guide you. This is not the only way to develop or design your training, but it is one way we found that worked for SRE EDU.

Conclusion

In this chapter, we explored instructional design principles and how to create training that sticks with SREs. We talked about identifying training needs, building your learner profile, and specifics on how to design training content. We discovered the advantages of hands-on training, and the importance of piloting and evaluating your training. In the next chapter, we take evaluation a step further by learning how to “SRE” your SRE training program.

1 Ops Review is an activity where participants meet weekly to discuss all the stuff that went wrong in production. The emphasis is on learning by talking through failures openly.

2 Stephen Paul Forrest III & Tim O. Peterson. (2006). It’s Called Andragogy. Academy of Management Learning & Education, 5(1), 113. https://oreil.ly/Ami2h
Roumell, E. A. (2019). Priming Adult Learners for Learning Transfer: Beyond Content and Delivery. Adult Learning, 30(1), 15. https://oreil.ly/ANNeJ
