Ghosts in the machines
The secret to successful infrastructure automation is people.
“The trouble with automation is that it often gives us what we don’t need at the cost of what we do.” —Nicholas Carr, The Glass Cage: Automation and Us
Virtualization and cloud hosting platforms have pervasively decoupled infrastructure from its underlying hardware over the past decade. This has led to a massive shift towards what many are calling dynamic infrastructure, wherein infrastructure and the tools and services used to manage it are treated as code, allowing operations teams to adopt software approaches that have dramatically changed how they operate. But with automation comes a great deal of fear, uncertainty and doubt.
Common (mis)perceptions of automation tend to pop up at the extremes: it will either liberate your people from ever having to worry about mundane tasks and details, running intelligently in the background, or it will make SysAdmins irrelevant and eventually replace all IT jobs (and beyond). Of course, the truth lies somewhere in between, and it relies on a fundamental rethinking of the relationship between humans and automation.
Dynamic infrastructure offers a host of benefits to infrastructure teams and the other groups in an organization that they support. Bottom line, it moves them from the “Department of No” — the team that constantly denied requests for changes due to concerns over stability — to the “Department of Yes, And…” Using this approach, IT infrastructure teams can quickly respond to changes demanded by either customers or their own organization, and when things do break, they can recover from failure more quickly, too. Automation provides increased reproducibility, efficiency, and reliability to certain tasks and processes.
Automation also purportedly brings with it the promise of freeing up time for those who’ve implemented the automation to focus on other, possibly higher-level activities. The operative word here is purportedly: the prospect of more free time is often cited as a key benefit of automation, and yet it is often far from the truth. In fact, this misperception has its own name: the Substitution Myth.
The Substitution Myth
The Substitution Myth is the belief that machines can take on tasks previously done by humans — that automation can be substituted for human action without any larger impact on the system in which that action or task occurs, except to increase output. In particular, it assumes that automation needs no human oversight once implemented. This myth tends to surface in two distinct forms: leftovers or compensation.
Managing your leftovers
One approach to automation is to have machines do all the things that could possibly be automated, and have humans deal with whatever is “left over.” This implies, however, that whoever designs the automated tasks can fully and completely specify them up front, so that they run along without ever requiring intervention. Anyone who has ever been involved in developing a product knows how unrealistic this is. The result is that whoever was supposed to be dealing with the leftovers is now also dealing with all the unanticipated things the automation doesn’t do, handling those unintended leftovers before ever getting to the intended ones.
Divide and compensate
The other philosophy is to say that computers are very good at some things, and humans are good at others: give the machines the former, the humans the latter, and off you go. This approach to functional allocation is heavily influenced by the Fitts list and the MABA-MABA (Men Are Better At, Machines Are Better At) framing it inspired, which originated in Paul Fitts’ 1951 work on air-navigation systems and was carried forward in research on automated flight systems in the ’70s and ’80s. It feels intuitively right: computers are better at plugging through repetitive information very quickly; humans are better at context-specific, qualitative judgments about that information. The problem with this approach is that day-to-day tasks in the world we all operate in are not so neatly separable; we are constantly moving between a variety of tasks, some more easily done by machines and some better left to humans.
From substitution to cooperation
In either case — giving humans the leftovers, or dividing the work so that machines and humans compensate for each other — humans and computers are treated as separable. Hence the idea that one could substitute for the other, notably machines for humans. But try the mental exercise of imagining all the systems you interact with daily as actual co-workers — it gets somewhat terrifying when you consider the role the cloud has played in expanding any organization’s infrastructure, but it’s not that far from reality. Adding people to your team can increase output, but it also brings additional communication and coordination requirements. Just as your co-workers affect how your work proceeds on a daily basis, so too do the machines you work with. Yet we rarely regard them that way.
“…adding or expanding the machine’s role changes the cooperative architecture, changing the human’s role often in profound ways. Creating partially autonomous machine agents is, in part, like adding a new team member.” —Nadine Sarter
When it comes to how we regard computers, the compensation approach in particular has been most prevalent in airline automation, which remains heavily influenced by the Fitts list and other functional-allocation principles. A rather extensive body of research has shown that automated flight management systems (FMS) produce a variety of what Nadine Sarter termed “automation surprises,” or unexpected problems with human-automation interaction. Sarter contends that these surprises — many of which arise during safety-critical moments like takeoff and landing — result in large part “from the introduction of automated systems that need to engage in, but were not designed for, cooperative activities with humans.”
Viewing computers as cooperating with humans, and vice versa, is a big shift, and certainly not one embraced by most people’s mental models of machines. People and computers must be team players, and automation must be designed to best enable their work together.
Joint Cognitive Systems
“Our image… is that of a machine alone rapt in thought and action. But the reality is that automated subtasks exist in a larger context of interconnected tasks and multiple actors. Introducing automated and intelligent agents into a larger system changes the composition of the distributed system… and shifts the human’s role within that cooperative ensemble.” —David Woods
People are neither masters of machines, nor subservient to their machine-learning outcomes — we cannot, and should not, separate the two. We are actors, together, in a very complex system. David Woods calls this a “joint cognitive system” (JCS), a vastly different model for evaluating the role of automation in our daily work. In Woods’ words, a cognitive system “is goal directed; it uses knowledge about itself and its environment to monitor, plan, and modify its actions in the pursuit of goals.” These are the hallmarks of human cognition, and modern advances in computation have clearly brought machines into this realm as well. What is most relevant is the emphasis on systems — a change in any part of the system impacts the system as a whole. That is, automation influences the system in which you co-exist with machines and the tasks you are both responsible for. Sidney Dekker takes this notion a bit further, suggesting that automation does not simply replace human work with machine work; it also “changes the task it was meant to support.”
From a systems-thinking perspective, then, automation even changes your own cognitive models of the systems in which you operate, whether you are aware of it or not. And awareness is key. Machines are not aware of the systems or rules by which they operate, and by that token they cannot adapt or improvise — they follow a set of pre-scripted rules (albeit possibly quite complex ones) and need human intervention whenever they must operate outside those rules. But humans can improvise, and we do so regularly. Despite the bad rap that humans get when things go wrong (just read any article on a plane or train crash and hunt for “human error”), it is people’s fundamental ability to adapt, change, and flex that keeps our systems running properly the vast majority of the time. This degree of autonomy — and its fundamental difference from scripted automation — is what differentiates brittle joint cognitive systems from the type that Woods et al. aspire to.
Automated vs. Autonomous: Kale at Etsy
Differentiating between autonomy and automation is important. Automation entails a scripted execution of preconceived actions, whereas autonomous processes allow for, and plan on, some degree of human interaction and decision making. The two words are often used interchangeably, but the human role in autonomy is a critical differentiator between systems that flex and those that break in much more unpleasant ways. Let’s look at an example of a joint cognitive system in modern web operations that embraces autonomy.
Like many modern web companies, Etsy generates an almost inconceivable amount of data just about its own systems. Kale is their approach to detecting, amid all of that data, the events that might point to actual problems with their systems. The first part is called Skyline, which is purely algorithmic anomaly detection — it is automated in the purest sense, running along in the background, flagging anomalous metrics. Skyline takes no actions; it just serves up the anomalies for humans to look at: “Hey, this is weird.”
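To make that division of labor concrete, here is a minimal, purely illustrative sketch of the kind of check an anomaly detector like Skyline performs. The real Skyline runs a consensus of several statistical tests over incoming metrics; the function below (a hypothetical `is_anomalous`, not Skyline’s actual API) boils the idea down to a single three-sigma-style test against a trailing window.

```python
import statistics

def is_anomalous(series, window=60, threshold=3.0):
    """Flag the newest datapoint if it sits more than `threshold` standard
    deviations from the mean of the trailing window. A deliberately
    simplified stand-in for the consensus of tests a tool like Skyline runs."""
    if len(series) < window + 1:
        return False  # not enough history to judge against
    history = series[-(window + 1):-1]  # trailing window, excluding the newest point
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False  # a perfectly flat metric gives us nothing to compare
    return abs(series[-1] - mean) / stdev > threshold

# A metric that hums along around 100, then suddenly spikes:
metric = [100 + (i % 5) for i in range(120)] + [250]
print(is_anomalous(metric))  # True -- "Hey, this is weird."
```

The particular math matters less than the boundary it illustrates: the code only says “this looks weird” and leaves interpretation, and any action, to a person.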
However, when dealing with complex, distributed systems, anomalies often don’t come alone, hence the other half of Kale, called Oculus. The key to Oculus is that it is human-centered. It is essentially a search engine for graphs. But humans choose which graphs to give Oculus — either from Skyline or some other source entirely — and it in turn looks for others that are similar. This allows a human to then make decisions and take action; in this case, the human has autonomy, which is a very important distinction. The automation supports human decisions and actions: in Kale’s case, diagnosis-driven work. To steal a metaphor directly from John Allspaw, while traditional views of automation have tended to reflect the relationship between HAL 9000 and Dave, a more evolved (and historically successful) model of automation looks much more like Jarvis and Iron Man.
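To round out the picture of Kale’s human-centered half: Oculus itself fingerprints graph shapes and searches them via Elasticsearch, refining matches with dynamic time warping. The sketch below is only a toy version of the same idea, a brute-force nearest-neighbor search over normalized series with made-up metric names, to show what a “search engine for graphs” looks like when the queries come from a human.

```python
import math

def znormalize(series):
    """Rescale a series to zero mean and unit variance so that shape,
    not absolute magnitude, drives the comparison."""
    mean = sum(series) / len(series)
    variance = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(variance) or 1.0  # guard against flat series
    return [(x - mean) / std for x in series]

def most_similar(query, catalog, top_n=3):
    """Rank the named series in `catalog` by how closely their normalized
    shape matches the normalized query series (plain Euclidean distance)."""
    q = znormalize(query)
    scored = []
    for name, series in catalog.items():
        s = znormalize(series)
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))
        scored.append((distance, name))
    return [name for _, name in sorted(scored)[:top_n]]

# A human hands over one suspicious graph; the search returns look-alikes
# for that human to reason about.
anomaly = [1, 1, 1, 5, 9, 9, 2, 1]
catalog = {
    "api.latency":    [2, 2, 2, 6, 10, 10, 3, 2],  # same shape, shifted up
    "db.connections": [4, 4, 4, 4, 4, 4, 4, 4],    # flat, unrelated
    "cache.misses":   [1, 2, 1, 2, 1, 2, 1, 2],    # noisy, unrelated
}
print(most_similar(anomaly, catalog, top_n=1))  # ['api.latency']
```

The design choice worth noticing is where the loop closes: the machine proposes candidates, but a person decides which query to run and what the matches mean.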
Design for cooperation
“…for any non-trivial level of automation to be successful, the key requirement is to design for fluent, coordinated interaction between the human and machine elements of the system. In other words, automation and intelligent systems must be designed to participate in team play (Malin et al., 1991; Malin, 1999).” —Christoffersen and Woods, 2002
Let’s take that Iron Man metaphor and kick it around for one final point. Yes, it’s science fiction, but the real heroes behind Iron Man’s ability to solve problems on the fly and improvise are the interfaces he uses. When Tony Stark is trying to figure something out, he asks Jarvis for various displays and data — they aren’t being pushed on him, and he’s not screaming and pulling his hair out about Jarvis doing weird things he doesn’t understand. There’s a slick Hollywood Minority Report veneer on it all (if only our IT tools really looked like that…) but the intent is true to an autonomous decision-making scenario pairing human and machine. And it is the intent and design of those interfaces that really matters. The bottom line: the user experience of automation-related tools (especially those intended to diagnose infrastructure problems) is as important, if not more so, than whatever advanced math and special secret engineering sauce underlie them. People and machines use tools — design those tools for both.
This post is part of our ongoing exploration into end-to-end optimization.