Chapter 1. What Is Infrastructure as Code?
If you work in a team that builds and runs IT infrastructure, then cloud and infrastructure automation technology should help you deliver more value in less time, and to do it more reliably. But in practice, it drives ever-increasing size, complexity, and diversity of things to manage.
These technologies are especially relevant as organizations become digital. “Digital” is how people in business attire say that software systems are essential to what the organization does.1 The move to digital increases the pressure on you to do more and to do it faster. You need to add and support more services. More business activities. More employees. More customers, suppliers, and other stakeholders.
Cloud and automation tools help by making it far easier to add and change infrastructure. But many teams struggle to find enough time to keep up with the infrastructure they already have. Making it easier to create even more stuff to manage is unhelpful. As one of my clients told me, “Using cloud knocked down the walls that kept our tire fire contained.”2
Many people respond to the threat of unbounded chaos by tightening their change management processes. They hope that they can prevent chaos by limiting and controlling changes. So they wrap the cloud in chains.
There are two problems with this. One is that it removes the benefits of using cloud technology; the other is that users want the benefits of cloud technology. So users bypass the people who are trying to limit the chaos. In the worst cases, people completely ignore risk management, deciding it’s not relevant in the brave new world of cloud. They embrace cowboy IT, which adds different problems.3
The premise of this book is that you can exploit cloud and automation technology to make changes easily, safely, quickly, and responsibly. These benefits don’t come out of the box with automation tools or cloud platforms. They depend on the way you use this technology.
DevOps and Infrastructure as Code
DevOps is a movement to reduce barriers and friction between organizational silos—development, operations, and other stakeholders involved in planning, building, and running software. Although technology is the most visible, and in some ways simplest face of DevOps, it’s culture, people, and processes that have the most impact on flow and effectiveness. Technology and engineering practices like Infrastructure as Code should be used to support efforts to bridge gaps and improve collaboration.
In this chapter, I explain that modern, dynamic infrastructure requires a “Cloud Age” mindset. This mindset is fundamentally different from the traditional “Iron Age” approach we used with static pre-cloud systems. I define three core practices for implementing Infrastructure as Code: define everything as code, continuously test and deliver everything as you work, and build your system from small, loosely coupled pieces.
Also in this chapter, I describe the reasoning behind the Cloud Age approach to infrastructure. This approach discards the false dichotomy of trading speed for quality. Instead, we use speed as a way to improve quality, and we use quality to enable delivery at speed.
From the Iron Age to the Cloud Age
Cloud Age technologies make it faster to provision and change infrastructure than traditional, Iron Age technologies (Table 1-1).
| Iron Age | Cloud Age |
|---|---|
| Physical hardware | Virtualized resources |
| Provisioning takes weeks | Provisioning takes minutes |
| Manual processes | Automated processes |
However, these technologies don’t necessarily make it easier to manage and grow your systems. Moving a system with technical debt onto unbounded cloud infrastructure accelerates the chaos.
Maybe you could use well-proven, traditional governance models to control the speed and chaos that newer technologies unleash. Thorough, up-front design, rigorous change review, and strictly segregated responsibilities will impose order!
Unfortunately, these models optimize for the Iron Age, where changes are slow and expensive. They add extra work up front, hoping to reduce the time spent making changes later. This arguably makes sense when making changes later is slow and expensive. But cloud makes changes cheap and fast. You should exploit this speed to learn and improve your system continuously. Iron Age ways of working are a massive tax on learning and improvement.
Rather than using slow-moving Iron Age processes with fast-moving Cloud Age technology, adopt a new mindset. Exploit faster-paced technology to reduce risk and improve quality. Doing this requires a fundamental change of approach and new ways of thinking about change and risk (Table 1-2).
| Iron Age | Cloud Age |
|---|---|
| Cost of change is high | Cost of change is low |
| Changes represent failure (changes must be “managed,” “controlled”) | Changes represent learning and improvement |
| Reduce opportunities to fail | Maximize speed of improvement |
| Deliver in large batches, test at the end | Deliver small changes, test continuously |
| Long release cycles | Short release cycles |
| Monolithic architectures (fewer, larger moving parts) | Microservices architectures (more, smaller parts) |
| GUI-driven or physical configuration | Configuration as Code |
Infrastructure as Code is a Cloud Age approach to managing systems that embraces continuous change for high reliability and quality.
Infrastructure as Code
Infrastructure as Code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. You make changes to code, then use automation to test and apply those changes to your systems.
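To make this concrete, here is a minimal Python sketch of the idea: you express the desired state of your infrastructure as data in code, and an automated step computes and applies the changes needed to converge reality on it. The resource definitions and the `plan` function here are hypothetical illustrations of the pattern, not any real tool’s API.

```python
# A declarative definition of infrastructure, expressed as plain data.
# In practice this would live in version control alongside application code.
desired = {
    "web-server": {"type": "vm", "size": "small"},
    "app-db": {"type": "database", "engine": "postgres"},
}

def plan(desired, actual):
    """Diff the coded definition against the actual environment and
    return the changes needed to converge. Real tools follow this same
    plan-then-apply shape: create what's missing, update what drifted,
    and destroy what's no longer defined."""
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_delete = {k: v for k, v in actual.items() if k not in desired}
    to_update = {
        k: v for k, v in desired.items()
        if k in actual and actual[k] != v
    }
    return {"create": to_create, "update": to_update, "delete": to_delete}

# Starting from an empty environment, everything in the code is created.
changes = plan(desired, actual={})
```

Because the automation derives changes from the code rather than from ad hoc commands, running it repeatedly against an environment that already matches the definition produces no changes at all.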
Throughout this book, I explain how to use Agile engineering practices such as Test Driven Development (TDD), Continuous Integration (CI), and Continuous Delivery (CD) to make changing infrastructure fast and safe. I also describe how modern software design can create resilient, well-maintained infrastructure. These practices and design approaches reinforce each other. Well-designed infrastructure is easier to test and deliver. Automated testing and delivery drive simpler and cleaner design.
Benefits of Infrastructure as Code
To summarize, organizations adopting Infrastructure as Code to manage dynamic infrastructure hope to achieve benefits, including:
- Using IT infrastructure as an enabler for rapid delivery of value
- Reducing the effort and risk of making changes to infrastructure
- Enabling users of infrastructure to get the resources they need, when they need them
- Providing common tooling across development, operations, and other stakeholders
- Creating systems that are reliable, secure, and cost-effective
- Making governance, security, and compliance controls visible
- Improving the speed of troubleshooting and resolving failures
Use Infrastructure as Code to Optimize for Change
Given that changes are the biggest risk to a production system, continuous change is inevitable, and making changes is the only way to improve a system, it makes sense to optimize your capability to make changes both rapidly and reliably.4 Research from the Accelerate State of DevOps Report backs this up. Making changes frequently and reliably is correlated to organizational success.5
There are several objections I hear when I recommend a team implement automation to optimize for change. I believe these come from misunderstandings of how you can and should use automation.
Objection: “We don’t make changes often enough to justify automating them”
We want to think that we build a system, and then it’s “done.” In this view, we don’t make many changes, so automating changes is a waste of time.
In reality, very few systems stop changing, at least not before they are retired. Some people assume that their current level of change is temporary. Others create heavyweight change request processes to discourage people from asking for changes. These people are in denial. Most teams that are supporting actively used systems handle a continuous stream of changes.
Consider these common examples of infrastructure changes:
- An essential new application feature requires you to add a new database.
- A new application feature needs you to upgrade the application server.
- Usage levels grow faster than expected. You need more servers, new clusters, and expanded network and storage capacity.
- Performance profiling shows that the current application deployment architecture is limiting performance. You need to redeploy the applications across different application servers. Doing this requires changes to the clustering and network architecture.
- There is a newly announced security vulnerability in system packages for your OS. You need to patch dozens of production servers.
- You need to update servers running a deprecated version of the OS and critical packages.
- Your web servers experience intermittent failures. You need to make a series of configuration changes to diagnose the problem. Then you need to update a module to resolve the issue.
- You find a configuration change that improves the performance of your database.
A fundamental truth of the Cloud Age is: Stability comes from making changes.
Unpatched systems are not stable; they are vulnerable. If you can’t fix issues as soon as you discover them, your system is not stable. If you can’t recover from failure quickly, your system is not stable. If the changes you do make involve considerable downtime, your system is not stable. If changes frequently fail, your system is not stable.
Objection: “We should build first and automate later”
Getting started with Infrastructure as Code involves a steep learning curve. Setting up the tools, services, and working practices to automate infrastructure delivery is loads of work, especially if you’re also adopting a new infrastructure platform. The value of this work is hard to demonstrate before you start building and deploying services with it. Even then, the value may not be apparent to people who don’t work directly with the infrastructure.
Stakeholders often pressure infrastructure teams to build new cloud-hosted systems quickly, by hand, and worry about automating it later.
There are three reasons why automating afterward is a bad idea:
- Automation should enable faster delivery, even for new things. Implementing automation after most of the work has been done sacrifices many of the benefits.
- Automation makes it easier to write automated tests for what you build. And it makes it easier to quickly fix and rebuild when you find problems. Doing this as a part of the build process helps you to build better infrastructure.
- Automating an existing system is very hard. Automation is a part of a system’s design and implementation. To add automation to a system built without it, you need to change the design and implementation of that system significantly. This is also true for automated testing and deployment.
Cloud infrastructure built without automation becomes a write-off sooner than you expect. The cost of manually maintaining and fixing the system can escalate quickly. If the service it runs is successful, stakeholders will pressure you to expand and add features rather than stopping to rebuild.
The same is true when you build a system as an experiment. Once you have a proof of concept up and running, there is pressure to move on to the next thing, rather than to go back and build it right. And in truth, automation should be a part of the experiment. If you intend to use automation to manage your infrastructure, you need to understand how this will work, so it should be part of your proof of concept.
The solution is to build your system incrementally, automating as you go. Ensure you deliver a steady stream of value, while also building the capability to do so continuously.
Objection: “We must choose between speed and quality”
It’s natural to think that you can only move fast by skimping on quality, and that you can only get quality by moving slowly. You might see this as a continuum, as shown in Figure 1-1.
However, the Accelerate research I mentioned earlier (see “Use Infrastructure as Code to Optimize for Change”) shows otherwise:
These results demonstrate that there is no tradeoff between improving performance and achieving higher levels of stability and quality. Rather, high performers do better at all of these measures. This is precisely what the Agile and Lean movements predict, but much dogma in our industry still rests on the false assumption that moving faster means trading off against other performance goals, rather than enabling and reinforcing them.
Nicole Forsgren, PhD, Accelerate
In short, organizations can’t choose between being good at change or being good at stability. They tend to either be good at both or bad at both.
I prefer to see quality and speed as a quadrant rather than a continuum,6 as shown in Figure 1-2.
This quadrant model shows why trying to choose between speed and quality leads to being mediocre at both:
- Lower-right quadrant: Prioritize speed over quality. This is the “move fast and break things” philosophy. Teams that optimize for speed and sacrifice quality build messy, fragile systems. They slide into the lower-left quadrant because their shoddy systems slow them down. Many startups that have been working this way for a while complain about losing their “mojo.” Simple changes that they would have whipped out quickly in the old days now take days or weeks because the system is a tangled mess.
- Upper-left quadrant: Prioritize quality over speed. Also known as, “We’re doing serious and important things, so we have to do things properly.” Then deadline pressures drive “workarounds.” Heavyweight processes create barriers to improvement, so technical debt grows along with lists of “known issues.” These teams slump into the lower-left quadrant. They end up with low-quality systems because it’s too hard to improve them. They add more processes in response to failures. These processes make it even harder to make improvements and increase fragility and risk, leading to more failures and more process. Many people working in organizations that work this way assume this is normal,7 especially those who work in risk-sensitive industries.8
The upper-right quadrant is the goal of modern approaches like Lean, Agile, and DevOps. Being able to move quickly and maintain a high level of quality may seem like a fantasy. However, the Accelerate research proves that many teams do achieve this. So this quadrant is where you find “high performers.”
The Four Key Metrics
DORA’s Accelerate research team identifies four key metrics for software delivery and operational performance.9 Its research surveys various measures and has found that these four have the strongest correlation to how well an organization meets its goals:
- Delivery lead time: The elapsed time it takes to implement, test, and deliver changes to the production system
- Deployment frequency: How often you deploy changes to production systems
- Change fail percentage: The percentage of changes that either cause an impaired service or need immediate correction, such as a rollback or emergency fix
- Mean Time to Restore (MTTR): How long it takes to restore service when there is an unplanned outage or impairment
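To illustrate what these four definitions mean in practice, this sketch computes each metric from a set of deployment records. The record format is an invention for the example, not a standard schema; real teams would derive these numbers from their pipeline and incident tooling.

```python
# Hypothetical deployment records for one reporting period.
# Each record notes how long the change took to reach production,
# whether it failed, and how long restoring service took if it did.
deployments = [
    {"lead_time_hours": 4, "failed": False, "restore_hours": None},
    {"lead_time_hours": 6, "failed": True,  "restore_hours": 1.0},
    {"lead_time_hours": 2, "failed": False, "restore_hours": None},
    {"lead_time_hours": 8, "failed": True,  "restore_hours": 3.0},
]

# Deployment frequency: deploys in the period.
deployment_frequency = len(deployments)

# Delivery lead time: mean elapsed time from implementation to production.
delivery_lead_time = sum(d["lead_time_hours"] for d in deployments) / len(deployments)

# Change fail percentage: share of changes needing immediate correction.
failures = [d for d in deployments if d["failed"]]
change_fail_percentage = 100 * len(failures) / len(deployments)

# MTTR: mean time to restore service across failed changes.
mttr = sum(d["restore_hours"] for d in failures) / len(failures)
```

With these four sample records, the team deployed 4 times, averaged a 5-hour lead time, had a 50% change fail rate, and restored service in 2 hours on average.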
Organizations that perform well against their goals—whether that’s revenue, share price, or other criteria—also perform well against these four metrics, and vice versa. The ideas in this book aim to help your team, and your organization, perform well on these metrics. Three core practices for Infrastructure as Code can help you to achieve this.
Three Core Practices for Infrastructure as Code
The Cloud Age concept exploits the dynamic nature of modern infrastructure and application platforms to make changes frequently and reliably. Infrastructure as Code is an approach to building infrastructure that embraces continuous change for high reliability and quality. So how can your team do this?
There are three core practices for implementing Infrastructure as Code:
- Define everything as code
- Continuously test and deliver all work in progress
- Build small, simple pieces that you can change independently
I’ll summarize each of these now, to set the context for further discussion. Later, I’ll devote a chapter to the principles for implementing each of these practices.
Core Practice: Define Everything as Code
Defining all your stuff “as code” is a core practice for making changes rapidly and reliably. There are a few reasons why this helps:
- Reusability: If you define a thing as code, you can create many instances of it. You can repair and rebuild your things quickly, and other people can build identical instances of the thing.
- Consistency: Things built from code are built the same way every time. This makes system behavior predictable, makes testing more reliable, and enables continuous testing and delivery.
- Transparency: Everyone can see how the thing is built by looking at the code. People can review the code and suggest improvements. They can learn things to use in other code, gain insight to use when troubleshooting, and review and audit for compliance.
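Reusability and consistency are easiest to see when a definition is parameterized code. This minimal Python sketch (the `web_server_stack` function and its fields are hypothetical) builds multiple environments from a single template, so they differ only where they are meant to differ.

```python
def web_server_stack(env, instance_count):
    """Build the definition for one environment from a single template.
    Every environment created this way is structurally identical."""
    return {
        "name": f"web-{env}",
        "instances": instance_count,
        "network": f"{env}-vpc",
    }

# The same code produces consistent test and production environments,
# differing only in the parameters that should differ.
test_env = web_server_stack("test", instance_count=1)
prod_env = web_server_stack("prod", instance_count=3)
```

Because both environments come from one definition, a fix or improvement made to the template reaches every instance the next time it is applied, and anyone reading the code can see exactly how each environment is built.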
I’ll expand on concepts and implementation principles for defining things as code in Chapter 4.
Core Practice: Continuously Test and Deliver All Work in Progress
Effective infrastructure teams are rigorous about testing. They use automation to deploy and test each component of their system, and integrate all the work everyone has in progress. They test as they work, rather than waiting until they’ve finished.
The idea is to build quality in rather than trying to test quality in.
One part of this that people often overlook is that it involves integrating and testing all work in progress. On many teams, people work on code in separate branches and only integrate when they finish. According to the Accelerate research, however, teams get better results when everyone integrates their work at least daily. CI involves merging and testing everyone’s code throughout development. CD takes this further, keeping the merged code always production-ready.
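Testing as you work can start with automated checks that run against every proposed change to definition code, before anything reaches a real environment. This sketch is an invented illustration: the `validate` function and its policy rules are hypothetical examples of the kind of checks a team might run in CI.

```python
def validate(definition):
    """An automated check run on every change to infrastructure code.
    Returns a list of policy violations; an empty list means the
    change is safe to deliver. The rules are invented examples."""
    errors = []
    for name, resource in definition.items():
        if "type" not in resource:
            errors.append(f"{name}: missing resource type")
        if resource.get("public") and resource.get("type") == "database":
            errors.append(f"{name}: databases must not be public")
    return errors

# A change that violates policy is caught before delivery, not after.
proposed = {"orders-db": {"type": "database", "public": True}}
assert validate(proposed) == ["orders-db: databases must not be public"]
```

Running checks like this on every integration, rather than at the end of a piece of work, is what keeps the merged code continuously production-ready.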
I’ll go into more detail on how to continuously test and deliver infrastructure code in Chapter 8.
Core Practice: Build Small, Simple Pieces That You Can Change Independently
Teams struggle when their systems are large and tightly coupled. The larger a system is, the harder it is to change, and the easier it is to break.
When you look at the codebase of a high-performing team, you see the difference. The system is composed of small, simple pieces. Each piece is easy to understand and has clearly defined interfaces. The team can easily change each component on its own, and can deploy and test each component in isolation.
I dig more deeply into implementation principles for this core practice in Chapter 15.
Conclusion
To get the value of cloud and infrastructure automation, you need a Cloud Age mindset. This means exploiting speed to improve quality, and building quality in to gain speed. Automating your infrastructure takes work, especially when you’re learning how to do it. But doing it helps you to make changes, including building the system in the first place.
I’ve described the parts of a typical infrastructure system, as these provide the foundations for chapters explaining how to implement Infrastructure as Code.
Finally, I defined three core practices for Infrastructure as Code: defining everything as code, continuously testing and delivering, and building small pieces.
1 This is as opposed to what many of the same people said a few years ago, which was that software was “not part of our core business.” After following this advice and outsourcing IT, organizations realized they were being overtaken by those run by people who see better software as a way to compete, rather than as a cost to cut.
2 According to Wikipedia, a tire fire has two forms: “Fast-burning events, leading to almost immediate loss of control, and slow-burning pyrolysis which can continue for over a decade.”
3 By “cowboy IT,” I mean people building IT systems without any particular method or consideration for future consequences. Often, people who have never supported production systems take the quickest path to get things working without considering security, maintenance, performance, and other operability concerns.
4 According to Gene Kim, George Spafford, and Kevin Behr in The Visible Ops Handbook (IT Process Institute), changes cause 80% of unplanned outages.
5 Reports from the Accelerate research are available in the annual State of DevOps Report, and in the book Accelerate by Dr. Nicole Forsgren, Jez Humble, Gene Kim (IT Revolution Press).
6 Yes, I do work at a consultancy, why do you ask?
7 This is an example of “Normalization of Deviance,” which means people get used to working in ways that increase risk. Diane Vaughan defined this term in The Challenger Launch Decision (University Of Chicago Press).
8 It’s ironic (and scary) that so many people in industries like finance, government, and health care consider fragile IT systems—and processes that obstruct improving them—to be normal, and even desirable.
9 DORA, now part of Google, is the team behind the Accelerate State of DevOps Report.