6
THINKING IN PROMISES
It’s easy to recognize the transformative impact of network-based businesses on society without understanding just how differently they are organized inside their walls.
During the years following my 1998 Open Source Summit, I’d developed a “stump speech” about the driving principles of open source software, hacker culture, and the Internet. One slide highlighted my vision for what made the bazaar of open source development, or a permissionless network like the Internet, so powerful:
• An architecture of participation means that your users help to extend your platform.
• Low barriers to experimentation mean that the system is “hacker friendly” for maximum innovation.
• Interoperability means that one component or service can be swapped out if a better one comes along.
• “Lock-in” comes because others depend on the benefit from your services, not because you’re completely in control.
I also talked about how these platforms are born and how they evolve. First, hackers and enthusiasts explore the potential of a new technology. Entrepreneurs are attracted to it and, in their quest to build a business, make things easier for ordinary users. Dominant players develop a platform, raising barriers to entry. Progress stagnates; hackers and entrepreneurs move on, looking for new frontiers. But sometimes (just sometimes) the industry builds a healthy ecosystem, in which hackers, entrepreneurs, and platforms play a creative game of “leapfrog.” No one gets complete lock-in, and everyone has to improve in order to stay competitive.
That was followed up by a slide titled “History Lesson,” which ended on the talk’s punch line: “A platform strategy beats an application strategy every time!”
A PLATFORM BEATS AN APPLICATION EVERY TIME
Jeff Bezos heard me give this talk at my Emerging Technology Conference (ETech) and in 2003 asked me to deliver a version of it to a small group of developers at Amazon.
I had previously gone to Seattle in March 2001 to pitch Jeff on the idea that Amazon ought to be offering web services access to its data. For market research purposes, O’Reilly was “spidering” Amazon every three hours in order to download data on prices, ranks, page counts, and reviews of our books and those of our competitors. Web spidering seemed wasteful to me, since we had to download far more data than we needed and then extract just the bits we wanted. I was convinced that Amazon’s vast product catalog was a perfect example of the kind of rich data that ought to be programmatically accessible via a web services API in the next-generation “Internet operating system” I was evangelizing.
Jeff was intrigued by the idea, and soon discovered that a skunkworks web services project, initiated by Amazon engineer Rob Frederick, was already under way. He discovered also that there were many other small companies like us that were spidering Amazon and building unauthorized interfaces to their data. Rather than trying to shut us down, he brought us all in to learn from each other and to help inform Amazon’s strategy.
I vividly remember Jeff’s disappointment in my talk at this internal Amazon developers’ conference. He jumped up in the back of the room when I was done and said, “You didn’t say the bit about a platform beating an application every time!” I didn’t make that mistake when I gave another version of the talk at an Amazon all-hands meeting in May 2003.
The first-generation web services that the e-commerce giant rolled out in 2003 were all about access to their in-house product catalog and its underlying data, and had little to do with the infrastructure services that were launched in 2006 under the name Amazon Web Services (or AWS) and that sparked the great industry transformation that is now called “cloud computing.” Those services came about for entirely different reasons, but I like to think that I planted the seeds of the idea with Jeff that if Amazon was to prosper in the years ahead, it had to become far more than just an e-commerce application. It had to become a platform.
In that marvelous way he has of taking any idea and thinking it all the way through, Jeff took the idea of platform much further than I had imagined. As Jeff described in a short 2008 interview with Om Malik: “Four years ago is when it started, and we had enough complexity inside Amazon that we were finding we were spending too much time on fine-grained coordination between our network engineering groups and our applications programming groups. Basically what we decided to do is build a [set of APIs] between those two layers so that you could just do coarse-grained coordination between those two groups.” (That is, “small pieces loosely joined.”)
This is important: Amazon Web Services was the answer to a problem in organizational design. Jeff understood, as every network-enabled business needs to understand in the twenty-first century, that, as HR consultant Josh Bersin once said to me, “Doing digital isn’t the same as being digital.”
In the digital era, an online service and the organization that produces and manages it must become inseparable.
How Jeff took the idea of Amazon as a platform out of the realm of software and into organizational design ought to be taught in every business school.
The story was told by former Amazon engineer Steve Yegge in a post that he wrote for colleagues at Google, but which ended up being accidentally made public and went viral among Internet developers. It is known as “Stevey’s Platform Rant.” In it, Yegge describes a memo that he claimed Jeff Bezos wrote “back around 2002 I think, plus or minus a year.” As Yegge described it:
His Big Mandate went something along these lines:
1. All teams will henceforth expose their data and functionality through service interfaces.
2. Teams must communicate with each other through these interfaces.
3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols—doesn’t matter. Bezos doesn’t care.
5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6. Anyone who doesn’t do this will be fired.
Jeff’s first key insight was that Amazon could never turn itself into a platform unless it was itself built from the ground up using the same APIs that it would offer to external developers.
And sure enough, over the next few years, Amazon redesigned its application to rely on a comprehensive set of fundamental services—storage, computation, queuing, and eventually many more—that its own internal developers accessed via standardized application programming interfaces. By 2006, these services were robust and scalable enough, and the interfaces clearly enough defined, that they could be offered to Amazon’s customers.
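To make the mandate concrete, here is a minimal sketch in Python of what "communicate only through service interfaces" might look like. Everything in it is hypothetical: the names CatalogService, CatalogTeamService, and lookup_price, the item IDs, and the prices are invented for illustration and are not Amazon's actual interfaces. An in-process call that exchanges JSON strings stands in for the network call the memo requires.

```python
import json
from typing import Optional, Protocol


class CatalogService(Protocol):
    """The published interface: the only thing other teams may depend on."""

    def handle(self, request: str) -> str:
        """Accept a JSON request string and return a JSON response string."""
        ...


class CatalogTeamService:
    """Hypothetical catalog team's implementation of its own interface."""

    def __init__(self) -> None:
        # Private data store. Other teams never read this directly.
        self._prices = {"book-123": 29.99, "book-456": 14.5}

    def handle(self, request: str) -> str:
        query = json.loads(request)
        item_id = query["item_id"]
        if item_id not in self._prices:
            return json.dumps({"ok": False, "error": "unknown item"})
        return json.dumps(
            {"ok": True, "item_id": item_id, "price": self._prices[item_id]}
        )


def lookup_price(service: CatalogService, item_id: str) -> Optional[float]:
    """A caller on another team: it sees only serialized requests and responses."""
    response = json.loads(service.handle(json.dumps({"item_id": item_id})))
    return response["price"] if response["ok"] else None


service = CatalogTeamService()
print(lookup_price(service, "book-123"))
print(lookup_price(service, "no-such-item"))
```

Because the caller depends only on the request and response format, the catalog team in this sketch could change its storage without breaking anyone. The same interface could also, in principle, be exposed to outside developers, which is the point of the memo's fifth rule.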
Uptake was swift. Amazon’s low pricing and high capacity swept the market, radically lowering the barriers to entry for startups to experiment with new ideas, providing the stability and performance of top-notch Internet infrastructure for a fraction of the cost of building it yourself. The long Internet boom of the past decade can be traced back to Amazon’s strategic decision to rebuild its own infrastructure and then open that infrastructure to the world. It isn’t just startups either. Huge companies like Netflix host their services on top of AWS. It is now a $12 billion a year business.
Microsoft, Google, and many others have been playing catch-up in cloud computing, but they were late to the game. Amazon had one big advantage that Jeff explained to me not long after Amazon’s cloud computing offerings were formally introduced in 2006: “I started out as a retailer. That’s a really low-margin business. There’s no way this can be worse for me. Microsoft and Google have very high margins. This is always going to be a worse business for them.” By the time Microsoft and Google realized just how big a business cloud computing would be, they were far behind.
SOFTWARE AS ORGANIZATIONAL STRUCTURE
But perhaps the deepest insight about the nature of networked organizations comes from the way that Amazon structured itself internally to match the service-oriented design of its platform. Amazon Chief Technology Officer Werner Vogels described it in a 2006 blog post: “Services do not only represent a software structure but also the organizational structure. The services have a strong ownership model, which combined with the small team size is intended to make it very easy to innovate. In some sense you can see these services as small startups within the walls of a bigger company. Each of these services require a strong focus on who their customers are, regardless whether they are externally or internally.”
Work is done by small teams. (Amazon famously describes these as “two-pizza teams,” that is, teams small enough to be fed by two pizzas.) These teams work independently, starting with a high-level description of what they are trying to accomplish. Any project at Amazon is designed via a “working backwards” process. That is, the company, famous for its focus on the customer, starts with a press release that describes what the finished product does and why. (If it’s an internal-only service or product, the “customer” might be another internal team.) Then they write a “Frequently Asked Questions” document. They create mock-ups and other ways of defining the customer experience. They go so far as to write an actual user manual, describing how to use the product. Only then is the actual product green-lighted. Development is still iterative, informed by additional data from actual users as the product is built and tested, but the promise of the final product is where everything starts.
This is an example of what computer science and management theorist Mark Burgess calls “Thinking in Promises.” He wrote: “If you imagine a cookbook, each page usually starts with a promise of what the outcome will look like (in the form of a seductive picture), and then it includes a suggested recipe for making it. It does not merely throw a recipe at you, forcing you through the steps to discover the outcome on trust. It sets your expectations first. In computer programming, and in management, we are not always so helpful.”
Of course, writing the press release, or providing the picture that shows you the result of following the recipe, is only one part of what it takes to build a promise-centered organization. You have to work backward from the promise made to the customer to the promises that the parts of the organization need to make to one another in order to fulfill it. The small teams are also a part of this approach, as is the design of a single, clearly defined “fitness function” for each team (the one thing it promises to deliver, which can be measured and continuously improved).
At an Amazon management off-site, Jeff Bezos once famously responded to a suggestion that the company needed to improve communication between teams. “No, communication is terrible!” he said. The reason can be explained by the old joke: “One person sits and drinks. Two people clink and drink. The more people you add, the higher the ratio of clinking to drinking.” What you want is a situation where people “clink” only with the people doing shared work, not with everyone they touch. This is simple math. Communication gets worse as team size grows.
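One common way to spell out the "simple math" is to count pairs: among n people there are n(n-1)/2 possible one-to-one conversations. The short Python sketch below works through that arithmetic. The head count of 48 and the team size of eight are assumptions chosen for illustration, not figures from Amazon.

```python
def pairwise_channels(people: int) -> int:
    """Count the distinct pairs among `people` who might need to coordinate."""
    return people * (people - 1) // 2


# Assumption for illustration: 48 people, organized either as one big group
# or as six teams of eight (a plausible "two-pizza" size).
one_big_group = pairwise_channels(48)

teams, team_size = 6, 8
within_teams = teams * pairwise_channels(team_size)
# One structured interface per pair of teams, instead of person-to-person links.
between_teams = pairwise_channels(teams)
structured = within_teams + between_teams

print(one_big_group)  # 1128
print(structured)     # 183
```

Under these assumptions, the same 48 people have 1,128 possible pairings when everyone talks to everyone, and 183 when close communication stays inside teams and the teams deal with each other through a single defined interface.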
There’s a bit of a paradox here. Jeff was really asking for more effective, close communication within teams, coupled with highly structured communication between teams, mirroring the highly structured communication that makes it possible for modern Internet applications to work so well. He was arguing against the kind of backdoor communication that leads to messy workarounds that, over time, break under their own weight.
It is in this context that you can also understand why Jeff banned PowerPoint and insisted that all proposals and related presentations be made by written memos setting out the argument and evidence without relying on the artificial, misleading simplification of nested hierarchies. As Bill Janeway said to me, it seems that Jeff “wanted rich discussion up front leading to Decision Time and then highly structured communication during Execution Time.”
Promise theory, as Burgess outlines it, is a framework for understanding how independent actors make promises to each other—the essence of that highly structured communication. Those actors can be software modules promising to respond in a certain way to an API call, or small teams promising to deliver a particular result. Burgess writes: “Imagine a set of principles that could help you to understand how parts combine to become a whole, and how each part sees the whole from its own perspective. If such principles were any good, it shouldn’t matter whether we’re talking about humans in a team, birds in a flock, computers in a datacenter, or cogs in a Swiss watch. A theory of cooperation ought to be pretty universal, so we could apply it both to technology and to the workplace.”
That may sound terribly inhumane to some readers: to design an organization in such a way that people can be compared to cogs in a machine. But in fact, it’s entirely the opposite. It is the traditional command-and-control organization, in which people are told what to do and expected to comply without necessarily understanding why or what the desired outcome is, that ends up being inhumane. Kim Rachmeler, for many years the head of Amazon customer service, said to me that when a team defines the interface that allows others to access the services it builds and provides, “the satisfaction of the people accessing the services is entirely in their hands.” Because this creates a tight feedback loop between the group and its customers, you can leave the implementation up to the creativity and the skill of the team building each function.
Kim explained to me that “writing the press release first is a mechanism to make customer obsession concrete.” As are two-pizza teams producing services with hardened APIs. “Amazon does a better job of creating these kinds of mechanisms for its corporate values than any other company I’ve seen,” Kim added. “And it starts from first principles (values) more than other companies as well.”
Music streaming service Spotify is another company exploring the intersection of online service design and organizational design. Its organizational culture has also been quite influential. In its animated explainer videos, Spotify plots organizational culture along two axes: alignment and autonomy. A traditional organization has high alignment but low autonomy, because managers tell people what to do and how to do it. In the kind of organization parodied by the comic strip Dilbert, neither the manager nor the workers know why they are doing what they are doing. This is a low-alignment/low-autonomy organization. A modern technology engineering organization (or an entire organization like Amazon or Spotify) seeks to have high alignment and high autonomy. Everyone knows what the goal is, but they are empowered to find their own way to do it.
This approach was also part of the revolution in warfare developed by General Stanley McChrystal in Afghanistan in response to the rapidly changing conditions on the ground there. In a presentation I heard General McChrystal give at the New York Times New Work Summit in the summer of 2016, he said, “I tell people, ‘Don’t follow my orders. Follow the orders I would have given you if I were there and knew what you know.’” That is, understand our shared objective, and use your best judgment about how to achieve it.
My nephew-in-law, Peter Kromhout, who served as a US Army Infantry captain in Afghanistan, confirmed McChrystal’s approach. “Before McChrystal, we were given a mission. We’d land and discover new intelligence that had to be acted on quickly, and we’d radio back to base for new instructions,” he said. “By the time we got an answer, things might have changed again. After the McChrystal Doctrine was introduced, we’d land, see that the mission had changed, and radio back to tell them what we were doing about it.”
This outcome-focused, outside-in approach means that, effectively, a team is promising a result, not how they will achieve it. As in Afghanistan, high autonomy is required by the rapidly changing conditions of a fast-growing Internet service.
High autonomy also provides a way of resolving unforeseen conflicts between separate teams. In his book The Amazon Way, former Amazon VP John Rossman describes how the company adopted an idea from Japanese lean manufacturing, “the Andon Cord.” Any employee who encounters a problem on the assembly line in a Toyota factory can pull a cord that stops production and lights up a large sign (“Andon”) to call for management attention. Once the Andon Cord was in place at Amazon, Rossman writes, “When customers began complaining about a problem with a product, customer care simply took that product down from the website and sent a message to the retail group that said, in effect, ‘Fix the defect or you can’t sell this product.’”
“Amazon’s version of the Andon Cord started with an experience Jeff had with Customer Connection, one of the mechanisms Amazon uses to make customer obsession concrete,” Kim Rachmeler told me. At the time, every level 7 or above manager had to spend some time working in customer service every two years, including Jeff. As part of this program, Amazon would pair the managers with a CS rep and have them answer a couple of phone calls.
Jeff took a call, Kim remembers. “‘Hello, this is Jeff Bezos, how may I help you?’ The customer didn’t pick up on whom they were talking to and launched into their problem. It seems that they had received a table and the top was damaged. Jeff (with the rep’s help) sent a replacement. When they hung up the call, the rep said something important: ‘That table always arrives damaged.’ It seems that the packaging was insufficient and so shipping usually caused problems with it. Jeff saw instantly that CS reps had knowledge that would be useful to retail but it was siloed. So he suggested the use of a mechanism like the Andon Cord—which was eventually implemented.”
The Andon Cord illustrates a key principle of promise-oriented systems. It sends a simple, unambiguous signal to other groups, and stands in stark contrast to a traditional management process. While each group has its own fitness function, which it is expected to relentlessly optimize, these functions can conflict; the fitness function of any group may be checked by that of another. The art of management is to shape these functions so that they drive the entire company in the direction it wants to go, which represents an overall fitness function for the organization.
High-autonomy technical cultures have developed a technique, the stand-up meeting, in which people and groups who must work together toward a common goal review the status of their promises to each other.
In a dysfunctional organization, introducing stand-ups is a great way of understanding what has gone wrong, and of introducing new, targeted communications protocols.
Mikey Dickerson, one of the Google engineers recruited by the White House to rescue the failed healthcare.gov website in the fall of 2013 and who later became the director of the new United States Digital Service, told me the story of the hundred days’ worth of stand-up meetings he held to knit the government contractors who’d built the failed site into a functioning organization that could actually make it work. It went something like this:
“Joe, you promised to get three new servers up by this morning. What’s the status?” “I haven’t gotten the security clearance yet from Mike.” “Mike, what’s the holdup?” “I never got a request for any security clearance from Joe.” “What do you mean, Mike? I have it right here.” “Listen, Joe. I’ve got the list of all my tickets [job requests] right here, and there’s nothing from you!”
It was only then that “Joe” and “Mike,” who worked for different contractors (there were more than thirty-three companies involved in the original healthcare.gov effort, working under sixty different contracts), discovered that they weren’t using the same issue-tracking system. Teams were literally sending requests for work to be done by other teams into the void. Because everybody had hidden dependencies on everyone else, work was at a standstill, with everyone waiting on results from everyone else before they could proceed.
Whether it’s through web services and APIs, or through tools like issue-tracking systems, a promise-oriented model works to increase autonomy because each autonomous agent defines and is made accountable for keeping well-documented promises.
YOU ARE ALL INSIDE THE APPLICATION
The change in software development from a model in which the goal was to produce an artifact to one in which software development was a process of continuous improvement was also a process of organizational discovery. (The artifact might be, say, the “gold master” of the next release of Microsoft Windows, which was the target of years of development and would be duplicated onto millions of CD-ROMs and distributed to tens of thousands of retailers and corporate customers on the same day.)
I still remember the wonder with which Mark Lucovsky, who’d been a senior engineering leader at Microsoft, described how different his process became when he moved to Google. “I make a change and roll it out live to millions of people at once.” Mark was describing a profound transformation in how software development works in the age of the cloud. There are no more gold masters. Today software has become a process of constant, more or less incremental improvements. From the point of view of the company offering an online service, software has gone from being a thing to a process, and ultimately, a series of business workflows. The design of those workflows has to be optimized not just for the creators of the software but for the people who will keep them running day-to-day.
The key idea is that a company is now a hybrid organism, made up of people and machines. I had made this point too in my 2003 Amazon all-hands talk. I’d told the story of von Kempelen’s Mechanical Turk, the chess-playing automaton that toured Europe in the late eighteenth and early nineteenth centuries, astonishing (and defeating) such luminaries as Napoleon and Benjamin Franklin. The supposed automaton actually had a chess master hidden inside, with a set of lenses to see the board and a set of levers to move the hands of the automaton. I thought this was a marvelous metaphor for the new generation of web applications.
As I spoke to the staff at Amazon, I reminded them that the application wasn’t just software, but contained an ever-changing river of content produced by their network of suppliers, enhanced by reviews, ratings, and other contributions from their vast network of customers. Those inputs were then formatted, curated, and extended by their own staff in the form of editorial reviews, designs, and programming. And that dynamic river of content was managed, day in and day out, by all of the people who worked for Amazon. I remember saying “All of you—programmers, designers, writers, product managers, product buyers, customer service reps—are inside the application.”
(For a long time, I wondered whether my telling of this story might have inspired Amazon to create the Amazon Mechanical Turk service, which uses a crowdsourced network of workers to perform small tasks that are hard for computers to do. However, while the service was launched in 2005, the patent for it was filed in 2001 though not issued until 2007, so at best I might have inspired the name. The name given to it in the patent diagrams is “Junta.”)
My insight that, on the Internet, programmers were “inside the application” had unfolded gradually over time. It first came to me when I was trying to understand why the Perl programming language had become so important in the early days of the web.
One conversation particularly sticks in my mind. I’d asked Jeffrey Friedl, author of the book Mastering Regular Expressions, which I’d published in 1997, just what it was that he did with Perl in his day job at Yahoo! “I spend my days writing regular expressions to match up news stories with ticker symbols so that we can display them on the appropriate pages of finance.yahoo.com,” he told me. (Regular expressions are like wildcards on steroids—a programming language feature that makes it possible to match any string of text using what appears to the uninitiated to be a magical incantation.) It immediately became clear to me that Jeffrey himself was as much a part of finance.yahoo.com as the Perl scripts he wrote, because he couldn’t just write them once and walk away. Due to the dynamic nature of the content the website was trying to reflect, he needed to keep changing his programs every day.
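For readers who have never seen one, here is a rough sketch of the kind of matching Jeffrey described. It is written in Python rather than Perl, and the patterns, the ticker table, and the function name tickers_for are all invented for illustration; the text does not record the expressions actually used at Yahoo!.

```python
import re
from typing import List

# Hypothetical table: ticker symbol -> pattern for mentions of the company.
TICKER_PATTERNS = {
    "AMZN": re.compile(r"\bAmazon(?:\.com)?\b", re.IGNORECASE),
    "MSFT": re.compile(r"\bMicrosoft\b", re.IGNORECASE),
    "YHOO": re.compile(r"\bYahoo!?", re.IGNORECASE),
}


def tickers_for(headline: str) -> List[str]:
    """Return the tickers whose pattern matches the headline, in sorted order."""
    return sorted(
        ticker
        for ticker, pattern in TICKER_PATTERNS.items()
        if pattern.search(headline)
    )


print(tickers_for("Amazon.com opens web services to developers"))
print(tickers_for("Microsoft and Yahoo! discuss search deal"))
print(tickers_for("Fires spread in the Amazon rainforest"))
```

The third headline is matched to AMZN even though it has nothing to do with the company. Mismatches like that, arriving with each day's news, suggest why patterns of this kind could not be written once and left alone.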
By the time I spoke at Amazon in 2003, I’d extended this insight to understand that all of the employees inside the company, and all of the participants in the extended network, from suppliers to customers rating and reviewing products, were part of the application.
But it wasn’t until 2006, when companies like Amazon and Microsoft were beginning to understand the possibilities of cloud computing, that another key element came into focus. I’d had a conversation with Debra Chrapaty, then VP of operations for the Microsoft Network. Her insightful comment perfectly encapsulated the change: “In the future, being a developer on someone’s platform will mean being hosted on their infrastructure.” For example, she talked about the competitive advantage she was developing by locating her data centers where power was cheap.
The post I wrote following our conversation was titled “Operations: The New Secret Sauce.” It connected deeply with Jesse Robbins, then Amazon’s “Master of Disaster,” whose job was to disrupt the operations of other groups in order to force them to become more resilient. He told me that he and many of his colleagues had printed out the post and hung it on the walls of their cubes. “It’s the first time anyone ever said we were important.”
The next year, Jesse, together with Steve Souders from Yahoo!, Andy Oram of O’Reilly Media, and Artur Bergman, the CTO of Wikia, asked for a meeting with me. “We need a gathering place for our tribe,” Jesse told me. I happily complied. We organized a summit to host the leaders of the emerging field of web operations, and soon thereafter launched our Velocity Conference to host the growing number of professionals who worked behind the scenes to make Internet sites run faster and more effectively. The Velocity Conference brought together a community working on a new discipline that came to be called DevOps, a portmanteau word combining software development and operations. (The term was coined a few months after the first Velocity Conference by Patrick Debois and Andrew “Clay” Shafer, who ran a series of what they called “DevOps Days” in Belgium.)
The primary insight of DevOps is that there were traditionally two separate groups responsible for the technical infrastructure of modern web applications: the developers who build the software, and the IT operations staff who manage the servers and network infrastructure on which it runs. And those two groups typically didn’t talk to each other, leading to unforeseen problems once the software was actually deployed at scale.
DevOps is a way of seeing the entire software life cycle as analogous to the lean manufacturing processes that Toyota had identified for manufacturing. DevOps takes the software life cycle and workflow of an Internet application and turns it into the workflow of the organization, building in measurement, identifying key choke points, and clarifying the network of essential communication.
In an appendix to The Phoenix Project, a novelized tutorial on DevOps created by Gene Kim, Kevin Behr, and George Spafford as homage to The Goal, the famous novel about the principles of lean manufacturing, Gene Kim notes that speed is one of the key competitive advantages that DevOps brings to an organization. A typical enterprise might deploy new software once every nine months, with a lead time of months or quarters. At companies like Amazon and Google, there are thousands of tiny deployments a day, with a lead time of minutes. Many of these deployments are of experimental features that may be rolled back or further modified. The capability to roll something back easily makes failure cheap and pushes decision making further down into the organization.
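The link between easy rollback and cheap failure can be shown with a toy model. The Python sketch below is only an illustration under stated assumptions: the Deployer class, its methods, and the health-check callback are hypothetical names, and nothing here describes how Amazon or Google actually deploy software.

```python
from typing import Callable, List


class Deployer:
    """Toy deployment history in which every release can be undone."""

    def __init__(self, initial_version: str) -> None:
        self._history: List[str] = [initial_version]

    @property
    def live(self) -> str:
        """The version currently serving users."""
        return self._history[-1]

    def deploy(self, version: str, healthy: Callable[[str], bool]) -> bool:
        """Roll out `version`; if the health check fails, roll back at once."""
        self._history.append(version)
        if not healthy(version):
            self.rollback()
            return False
        return True

    def rollback(self) -> None:
        """Return to the previous version, keeping at least the initial one."""
        if len(self._history) > 1:
            self._history.pop()


deployer = Deployer("v1")
deployer.deploy("v2", healthy=lambda version: True)
deployer.deploy("v3-experiment", healthy=lambda version: False)
print(deployer.live)  # v2
```

In this sketch a failed experiment costs one automatic step back to the last good version, so the decision to try it does not need to travel up a management chain first.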
Much of this work is completely automated. Hal Varian calls this “computer kaizen,” referring to the Japanese term for continuous improvement. “Just as mass production changed the way products were assembled and continuous improvement changed how manufacturing was done,” he writes, “continuous experimentation . . . improve[s] the way we optimize business processes in our organizations.”
But DevOps also brings higher reliability and better responsiveness to customers. Gene Kim characterizes what happens in a high-performance DevOps organization: “Instead of upstream Development groups causing chaos for those in the downstream work centers (e.g., QA, IT operations, and Infosec), Development is spending twenty percent of its time helping ensure that work flows smoothly through the entire value stream, speeding up automated tests, improving deployment infrastructure, and ensuring that all applications create useful production telemetry.” He echoes the theme that it isn’t just a technical but an organizational practice: “Everyone in the value stream shares a culture that not only values each other’s time and contributions but also relentlessly injects pressure into the system of work to enable organizational learning and improvement.”
The practices of DevOps have continued to evolve. Google calls its version of the discipline “Site Reliability Engineering” (SRE). As described by Benjamin Treynor Sloss, who coined the term, “SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.”
He makes the case that a traditional operations group has to scale linearly along with traffic to the service it supports. “Without constant engineering,” he says, “operations load increases and teams will need more people just to keep pace with the workload.” In the SRE approach, by contrast, the humans inside the machine who keep it going augment themselves by constantly teaching the machine how to duplicate what they do, at ever-increasing scale.
What we see, then, in the modern networked organization is not just a radical change in the external relationships between a company, its suppliers, and its customers, but a radical change in how the workers inside the company are organized, and how they are partnered with the software and machines they are building and operating.
As the principles of Internet-scale applications and services interpenetrate the real world, every company needs to transform itself to take advantage of techniques that were pioneered in the digital realm. This is not the work of a moment, but an ongoing exploration. At that 2003 all-day Amazon all-hands meeting, Jeff Bezos gave the opening talk. It was called “It’s Still Day 1.” He described the history of electricity, with vivid historical photographs of the nests of wires coming down from light sockets in the ceiling to power new kinds of electric devices. The standardized power plug hadn’t yet been invented. He showed how factories still drove the machines on their assembly lines with huge centralized motors, with belts and pulleys carrying the power, just as they had in the age of steam, not yet having realized that they could bring electricity directly to small motors where the work was actually happening.
Often, when new technology is first deployed, it amplifies the worst features of the old way of doing business. Only gradually do individuals and organizations realize, through a cascading network of innovations, how to put new technology properly to work.
Jeff was right. It’s still day one, even now. But searching out the possibilities of the future isn’t a job just for inventors of software or machines. It is the job of every business to ask itself what technology makes possible today not just for its customers, but for how the business itself is organized to serve them. That is also the job for other institutions, such as government.