Chapter 3. Case Studies

Let’s now look at how SRE training has been done in practice. We discuss the training activities in place at organizations along a spectrum, from very large to very small. We use Google as an example of a large SRE organization; for medium-sized and smaller organizations, we look at other companies.

Training in a Large Organization

Google’s SRE training program provides a case study of one possible way to implement such a program at a large organization.

When Google renamed its “production team” to Site Reliability Engineering in 2003, the team members were experienced software engineers tasked with “keeping Google running.” These software engineers had deep knowledge of the systems Google was using. Google ran a limited number of different systems at the time, so it was more or less possible to know most of their internals.

As Google grew, and systems grew increasingly specialized, we needed more Site Reliability Engineers (SREs). Instead of transferring experienced Google software engineers into SRE, Google began directly hiring SREs. Although Google had a handful of classes to train new software engineering hires, we didn’t have any SRE-specific training. The newly hired engineers joining SRE had to “grok SRE the hard way.”

In 2014, a couple of SREs began discussing the great difficulty of onboarding new SREs. Google SRE founded a team specifically geared toward education for SREs. Initially, this team concentrated mostly on new hires. Over time, the ...
