Chapter 1. Distributed Systems
We’ll begin our journey through serverless by talking about distributed systems. Before we jump into definitions and examples, what do you need to know about distributed systems to be effective with serverless? When you develop an application, you have to make a large number of assumptions. Some may be as simple as knowing that one step will occur after another. Others may be far more complex. Distributed systems will tear apart all your assumptions about the environment in which your code will run and how it will operate. When you develop for a single computer, many of the harsh realities of the physical world are abstracted away. As soon as you start building a system that resides on multiple computers, all of those realities suddenly surface—though they might not be obvious.
This chapter offers a broad overview so you can better understand what you have signed up for.
If you do not have experience developing backend systems, my goal is to explain what has changed about your world. But even if you do, you will find value here: distributed systems can bring out pessimism and cynicism in even the most experienced software engineers and system administrators. We’ll talk about what can go wrong and what you can do about it.
What Is a Distributed System?
A distributed system is any system where the individual components are separated and communicate over a network. A distributed system can be part of a larger or smaller distributed system. The internet is one giant distributed system. Your cell phone provider operates a giant distributed system in order to connect you to an even bigger one. Their system contains wireless gear, network gear, and applications such as billing and customer information.
When working with apps, we usually expect determinism: given a specific input, the output, and the states and sequences used to achieve it, will always be the same. The reality of distributed systems, however, is nondeterminism. As the complexity of your application grows, it becomes difficult to predict its state at any given point. Assume that all parts of the system will be unreliable in either obvious or nonobvious ways; it is your job to build a reliable system from these unreliable components.
Your application does not live or process logic in a single place. If you have a browser-based application or a mobile application, the second you put a line of code in any other place, such as a function in the cloud, your system is distributed. Generally, components of a distributed system are asynchronous, meaning they pass off a task and do not wait directly for the result. But many important operations, such as the processing of credit card transactions, will be accessed synchronously, as they should block the progress of the calling task until completion of the vital transaction.
A serverless system is inherently distributed. Given that a serverless function is by its very nature stateless, if your application involves any form of state, it is going to have to be distributed. But aren’t all modern applications distributed? While the answer is likely yes, it is definitely true for applications that have ambitions to grow and become more complex.
If you are building a simple web app with a client frontend, a monolithic backend, and a database (also known as a three-tiered web application), then you are building a distributed system. However, many developers neglect this fact when thinking about how the application stores its state in a database, and they run into problems as a result. At some point in scaling up their system, they will likely face an issue caused by the application servers, which are easy to scale horizontally, connecting to their database, which scales only vertically. This issue could range anywhere from needing more resources for the database to simply needing to update the database configuration to allow additional connections. But those developers who forget they are working on a distributed system will have all of the problems of one, without any of the common patterns to minimize issues.
Serverless shifts many responsibilities to your cloud provider. However, as the software practitioner writing the business logic, there are still things you need to know and understand to make promises to your stakeholders in regard to the reliability, availability, and scalability of your software.
Why Do We Want a Distributed System?
Do you need a solution that handles what happens when someone introduces a bug that causes your database to lock, preventing your main user-facing system from operating? Well, that’s actually a strength of distributed systems because you can kill the failing service, and all of the work expected to be done by it will be delayed but not lost as it queues up. If your user registration code ran the email sending code directly, you would be down completely. Designing so that one failure does not cascade and directly cause another is the subject of Chapters 4 and 11, but the resources listed in “Further Reading” cover these concepts in much more depth.
Any application intended to scale must be a distributed system. Otherwise, you will be limited to the compute and storage of one computer, and your users must visit this computer and use it in person, only one at a time. There are many advantages of distributed systems, but there is no choice to be made. You are building a distributed system. You must learn the disadvantages of doing so to best limit their impact on your operations.
The Harsh Realities of Distributed Systems
Nothing about the network can be trusted. And in a distributed system, messages must be passed over the network. In Designing Data-Intensive Applications (O’Reilly), Martin Kleppmann expands on this interconnected relationship between your code and the source of so many problems, the network:
A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network. A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.
Networks seem to be pretty reliable, but every now and then you have to hit the refresh button or wait. In a distributed system, your system has to deal with automating that refresh. If one system goes down, and all of the other systems start attacking it with requests when it is already failing to keep up, what happens?
There are far fewer things to consider when two pieces of code run in the same stack. Asynchronicity can create a lot of unintended effects, especially when the programmer doesn't expect them. Now add the unreliability of a network on top of that.
To illustrate these issues, let's look at a common application. This application has a user registration process. New registrations go into a task queue to perform some operations, such as sending a welcome email. The developer made a smart choice to decouple the user's registration from the back-of-the-house logic, such as sending an email. If the application were suffering from issues with sending email, that should not block the user from registering successfully. Other actions in the system may also cause a task to get queued up that will send some form of notification. Seems simple, right? Let's get to it.
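To make the example concrete, here is a minimal Python sketch of a registration handler handing off the welcome email instead of sending it inline. The db and queue clients and their methods are hypothetical stand-ins, not a specific framework's API.

```python
import json
import uuid


def register_user(request, db, queue):
    """Create the user record, then hand off the welcome email to a queue.

    `db` and `queue` are hypothetical clients; the point is that the
    registration path never calls the email service directly.
    """
    user = {
        "id": str(uuid.uuid4()),
        "email": request["email"],
    }
    db.insert("users", user)

    # Enqueue the back-of-the-house work. If email sending is broken,
    # registration still succeeds and the task simply waits in the queue.
    queue.send_message(
        queue_name="welcome-email",
        body=json.dumps({"user_id": user["id"], "email": user["email"]}),
    )
    return {"status": "registered", "user_id": user["id"]}
```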
The Physical World
In the aftermath of Hurricane Sandy in 2012, a group of operational engineers found themselves in a precarious situation. The power was out in Lower Manhattan. The data center had generators and diesel fuel on hand, but the diesel pump had failed due to flooding; the pump was in the basement, and the generators were on the roof. Heroically, the engineers mounted a bucket brigade to bring diesel fuel, in 5-gallon buckets, up 17 flights of stairs, in the dark.
The physical world itself is nowhere near perfect. Just because your organization does not own the servers, or can't even touch or see them, does not mean you will not be affected by a fire, a power disruption, or another disaster, natural or otherwise. The companies relying on that particular data center were spared by the heroism of the bucket brigade, blissfully unaware of their servers' potential to be cut off at any moment. When you host in the cloud, you may not be responsible for carrying a bucket, but you still have to deliver your application to end users despite the circumstances. The cloud solves this problem as well as current technology allows by offering multiple availability zones, which generally come for free with serverless compute but must be accounted for in your data persistence and other services.
The physical world can be the root cause of many other failures we will encounter when working in the cloud:
- Network issues: Someone may have tripped on a cable and pulled it out of its socket.
- Clock issues: The physical hardware on the server responsible for keeping track of the time, a crystal, could be defective.
- Unresponsive node: There could be a fire.
Calling attention to this allows us to drastically simplify the rest of the issues we will face and focus more on the impact of these issues as you design your systems.
Missing Messages
Have you ever sent an email only to later find it stuck in the drafts or the outbox?
There is no guarantee when you make a request over a network that it will be delivered or processed. This is one of the many things we take for granted when working on software that will run locally. These issues are the simple reality of computing that has been abstracted away from engineers enough that people have forgotten their existence. Networks can have congestion just like your local interstate during rush hour. The modern network in a cloud computing environment is a distributed system itself. The easiest way to observe this is when using a mobile network. We have all had experiences with apps that hang because they expect an instantaneous response from a remote computing system. How would this affect your code if it were some kind of live or real-time game? If your response gets too delayed, it could even be rejected by the remote system as anticheating logic. Messages go missing, show up late, or even show up at the wrong destination. The wires can’t be trusted.
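To make this posture concrete, here is a minimal sketch (assuming the third-party requests library) of how every network call should behave: set a timeout, expect failures, and retry with backoff rather than hanging forever. The URL, attempt count, and timeout are placeholders, not a recommendation for any particular service.

```python
import time

import requests
from requests.exceptions import RequestException


def fetch_with_retry(url: str, attempts: int = 3, timeout: float = 2.0):
    """Call a remote service without trusting the wire.

    Every request gets a timeout, and failures are retried with a short
    exponential backoff before giving up and letting the caller decide.
    """
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=timeout)
        except RequestException:
            if attempt == attempts - 1:
                raise  # out of retries
            time.sleep(2 ** attempt)  # 1s, 2s, ... simple backoff
```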
Unreliable Clocks
How important is it for your system to know what time it is? What happens if you set your iPhone back to a date before iPhones existed? Or what if you set it to a time 30 years into the future? Either way, there is a good chance it won't boot. Apple has never confirmed the issue, but it has been attributed to the way Unix systems count time from January 1, 1970, which makes that date a timestamp of 0. Remember that the engineers working on the iPhone likely did not expect users to set their date so far in the past, but they permitted users to do so. This has caused unexpected bugs, even for Apple.
Servers set their system clocks automatically using the Network Time Protocol (NTP). While relying on your system clock seems like a sure thing, there are potential issues. Google published a paper on its internal Spanner database that details how it deals with time for this critical system. When its nodes were set to poll every 30 seconds, the system clock drifted by as much as 7 ms. That drift may not be an issue for you, and both Google and Amazon offer enhanced synchronization based on GPS and atomic clocks for hypersensitive systems such as stock trading, but the common commodity system clock has other quirks. When your clock drifts, it will eventually be corrected in a way that can alter the effect of your time-sensitive code. Multiple CPU cores have different references for the current time, and logic living inside a virtualized system in the cloud has an extra layer of separation from the reality of time passing in the outside world. Your code may experience jumps in time, forward or backward, at any moment. That is why it is important to use a monotonic clock when measuring the passage of time: a monotonic clock is one that is guaranteed only to move forward.
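As a small illustration, Python's standard library exposes both kinds of clocks: time.time() reads the wall clock, which can jump when NTP corrects it, while time.monotonic() is guaranteed only to move forward and is the right choice for measuring durations.

```python
import time


def do_work():
    # Placeholder for the operation being timed.
    time.sleep(0.1)


# Wall-clock time: fine for recording *when* something happened,
# but it can jump forward or backward if NTP adjusts the clock.
started_at = time.time()

# Monotonic time: guaranteed only to move forward, so it is safe
# for measuring *how long* something took.
t0 = time.monotonic()
do_work()
elapsed = time.monotonic() - t0

print(f"work took {elapsed:.3f}s, wall-clock start was {started_at}")
```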
In addition to the clock being susceptible to changing more than a second in any given second, there is no way to guarantee that all of your nodes’ clocks will be set to the same time. They are subject to the same issues of network reliability we have already discussed. As with all issues you will face, there will be a trade-off in the importance of an aspect of your system to the use case and amount of engineering resources available. Building a social network for pets? Those seconds may not be worth your trouble. Building a high-frequency trading system? You may have to utilize hardware atomic clocks set by GPS, as those microseconds can cost megabucks.
The current time as it appears to your business logic can unexpectedly jump forward. Serverless functions, as with other forms of cloud compute, run your code in a virtualized or isolated way. The software that provides this virtualization or isolation can distort time in a number of ways. One distortion occurs when your code competes for shared resources: a thread may be paused, put to sleep, and then suddenly reactivated with no understanding of the time that has passed in the outside world. Similar pauses can be caused by memory swaps, garbage collection, or even synchronously waiting on a resource that is accessed over the network. Keep this in mind when attempting to squeeze out more performance by using threads or subprocesses to perform additional work in your system.
These realities can manifest as issues where, for example, you can’t reliably know which change happened first in a series of events in a queue. In reality, when dealing with modern distributed systems, there is an expectation that your system may run in multiple different geographies. In this case, we have already learned that events can and will come out of order, and there is no real way to determine the order without some form of locking, which can be expensive and bring its own issues to bear. But if you need that kind of knowledge in your system, you won’t have any other choice. Even then you can and will be wrong about which task deserves to have the lock first. You have to handle that in software. No service will be offered in the short term that will solve this for you. Even if they start offering “consensus as a service” or something similar, you will still have to understand the trade-offs and issues around their use when implementing your business logic.
Cascading Failures
Let’s say that you, the developer of the application in this example, did a great job loosely coupling the two components provided. If the user registration system goes down, the email system won’t really mind at all. In fact, if the email system is serverless, it won’t even run (how efficient!). If the email system goes down, the user registration system stays up. Or so you might think. What happens if your task-queuing system becomes full and no longer accepts new tasks, and now your users can’t sign up? This is how issues compound, or cascade, to cause other issues.
In this example, when one system (sending mail) failed long and hard enough, the outage caused another system to fail. When your system is composed of dominoes, space them to avoid a chain reaction when one falls. No matter how slow it is (it could have been an entire weekend before the queue filled up), a resilient system will be engineered to avoid this issue. You may not be able to afford such resilience in your current project, but you must be mindful of it.
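One way to keep the dominoes spaced is to make the registration path expect the queue to misbehave. The sketch below uses a hypothetical queue client and error type; the point is that a full or unavailable email queue degrades the welcome email rather than taking sign-ups down with it.

```python
import json
import logging

logger = logging.getLogger("registration")


class QueueFullError(Exception):
    """Hypothetical error raised when the task queue rejects new work."""


def enqueue_welcome_email(queue, user):
    """Best-effort enqueue: a full or unavailable queue must not
    block registration itself."""
    try:
        queue.send_message(
            queue_name="welcome-email",
            body=json.dumps({"user_id": user["id"]}),
        )
    except QueueFullError:
        # The welcome email is delayed (or dropped), but the sign-up succeeds.
        # Alert on this so the backlog is investigated before it grows.
        logger.error("welcome-email queue full; deferring email for %s", user["id"])
```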
Unexpected Ordering
Have you ever shipped a new version of your code that included an additional timestamp field, only to find that somehow inserts are still being committed without one? When operating in a distributed system, there is no guarantee for the order of execution of logic split across multiple nodes. But how could your deployed changes not take effect? Simple: the old version of the code is running somewhere. It could be a task server that is faithfully chugging along while refusing to respond to requests for it to shut down so that it can be replaced with the new version of that code.
Meanwhile, there is another change waiting to be pushed to production that includes some kind of mandatory field on registration, such as a first name, as well as including that name in the welcome email. You have a large day of new sign-ups, and this code makes it out to production. Instantly, people stop getting welcome emails, and you now have a big headache—what went wrong? Synchronicity was assumed.
There were some number of welcome emails waiting to be sent out when the new code hit production. The new code expected every queued task to include the user's name, something those older records don't have. This particular issue can also occur due to network latency.
Idempotency
Idempotency is the idea that a certain task repeated more than once will have the same outcome. It is somewhat easy to build a system that will perform a given task at least once, but much more difficult, if not impossible, to build one that is guaranteed to perform a given task once and only once.
However your system sends email, whether speaking SMTP directly to your users' mail exchanger or using a third-party API, it's not hard to imagine a situation where an email was successfully sent but a failure is reported. This happens more than you would imagine once you start to scale and chaos takes full hold. You try to send the email, and it gets sent, but right before the other side responds with success, the network connection is severed, and you never get that successful response. So, as a responsible developer, you have designed the system to try again. This time it works. But the task that was attempted twice was also completed twice, and as a result the user got two welcome emails.
This is enough of an edge case that you may not want to over-optimize your system to always send exactly one welcome email, and you may not be able to without also having access to your user's mailbox. But even then, what if they delete the message before you check? You will send it over and over again. A single node can never really know the truth of the outside world, because it relies on the network to learn about that truth, and by the time it gets a response, that truth may be stale.
Once you accept that, you can design for it.
It is important to dig into the design to see how these things will fail, but just as important to deprioritize the rare case in which a user gets two welcome emails. Even if it impacts all users in a given day, you will be fine. But what if the task is to send $20 from User A to User B? Or since we are focused on registration, giving User A credit for referring User B? If that job gets into a queue and keeps failing and being retried, you may have a real issue on your hands. It is best to design your tasks to be idempotent—the outcome is the same no matter how many times the action is repeated.
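One common mitigation, sketched below, is to attach an idempotency key to each task and claim it in a shared store before performing the side effect. The store and its put_if_absent method are assumptions for illustration (a conditional write in DynamoDB or a similar store would play this role). Note the remaining trade-off: claiming the key before doing the work means a crash in between drops the credit, while claiming it afterward can duplicate the credit.

```python
def apply_referral_credit(task, store, ledger):
    """Apply a referral credit at most once per idempotency key.

    `store` is a hypothetical key-value client with an atomic
    put-if-absent operation; `ledger` applies the actual credit.
    """
    key = f"referral-credit:{task['referrer_id']}:{task['referred_id']}"

    # Claim the key atomically. If another attempt already claimed it,
    # treat this run as a duplicate and do nothing.
    if not store.put_if_absent(key, "done"):
        return "duplicate, skipped"

    ledger.add_credit(user_id=task["referrer_id"], amount=20)
    return "credit applied"
```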
What Am I Responsible For?
When you use an offering like Amazon's Simple Queue Service (SQS) or Google's Pub/Sub, you do not have to worry about keeping it running. You do have to know the limitations of these offerings (how long a message can wait without being read before it gets expunged), and you have to design your systems to handle a failure or outage of these services. It is best to know as much as possible about how these systems work if you want to understand how, when, and why they will fail, as well as the impact on anything you build that relies on them. Additionally, it is great to see how reliable and robust systems were implemented and designed. Before using any new system, read about its intended use cases and limitations, and watch a video from the cloud provider about how the system was implemented.1
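As a concrete example of where the line falls, here is a minimal boto3 sketch that sends a task to SQS. AWS keeps the queue running, but the placeholder queue URL, the queue's retention period, and what to do when the call fails are still yours to own.

```python
import json

import boto3
from botocore.exceptions import ClientError

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/welcome-email"  # placeholder


def enqueue(payload: dict) -> bool:
    """Send one task to SQS; return False if the send fails.

    AWS operates the queue, but handling this call's failure
    (retry, buffer locally, alert) is still your responsibility.
    """
    try:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
        return True
    except ClientError:
        return False
```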
When dealing with serverless compute, you don't need to directly manage clocks, networks, and other pain points, but you may have to configure some of them (networks) and learn to build in best practices around others (clocks).
What Do You Need to Consider When Designing a Distributed System?
Imagine a student asked to solve a math problem. Seems straightforward enough. Even if they have to slide off the cover of a graphing calculator to solve that problem, they will do it synchronously, one step at a time. Imagine a room full of students. How would it work if the students were paired up and had to solve problems in twos? What if only one of the students could read and write from the problem sheet, and the other one was only allowed to use the calculator, and neither was allowed to do any reading or writing from the answer sheet or any other piece of paper? How would that complicate things? This is one way to visualize how the components of your distributed system must orchestrate work in a larger cohesive system.
It is almost a guarantee that each component of your system will fail. Partial failures are particularly difficult to deal with because they break the determinism of the system. Total failures are generally easier to detect, and it’s necessary to keep components isolated from each other so they don’t cause failures to spread like wildfire through your system.
Loose Coupling (or Decoupling)
One of the most important factors of a well-designed distributed system is that its components are loosely coupled. This means that individual components can be changed independently of each other with hopefully no negative repercussions. This is achieved by defining APIs for each service to bind to, while the implementation details are abstracted away and hidden from the consuming service. This enables teams to operate independently and focus on the details that matter for their areas of concern. You may see this concept also referred to as being fully decoupled. Load balancers do this, isolating your logic from the unpredictable requests of users.
Design your system to be loosely coupled. Find your failure points and figure out how to avoid cascading failures. Do not let different components of your system interfere with each other's operations or attach to private integration points, such as a shared database.2 Teams can still view, learn from, and submit revisions to each other's code, but do not allow any circumvention of the aforementioned APIs. If two services share a database, they are not actually separate services. Even if they operate on different tables, one component can easily bring down the other because of this tight coupling and reliance on shared components. We will discuss this concept further in Chapter 4.
Fault Tolerance
You must build leeway into your system to handle faults. The more faults your systems can tolerate, the less your users, and your engineering organization, will be forced to tolerate. Have you ever been on a team that seems to be fighting production fires all of the time? Depending on the scale of the system, letting your application control your time and schedule is a conscious choice that your team makes every day, by not communicating the importance of shoring up your systems for production traffic and of paying down technical debt: the built-up amount of important work that was deferred, usually in exchange for a short-term gain.
An important part of tolerating faults is having some idea of whether a node is up and operational. This is where the health check comes in. A health check is a simple API on part of the system that responds to a request to let another system know that it is indeed functioning. Some implement this as a static response, but if the component requires access to other systems, such as a database, you may want to verify that the component can connect to the database and successfully execute a simple query before responding that the component itself is up.
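A minimal sketch of such a health check, assuming a Flask app; the Database class stands in for whatever client your component really uses, and the route name and query are illustrative.

```python
from flask import Flask, jsonify

app = Flask(__name__)


class Database:
    """Stand-in for your real database client."""

    def execute(self, query):
        # Replace with a real connection check, e.g. a driver-level ping.
        return True


db = Database()


@app.route("/health")
def health():
    """Report whether this component and its critical dependency are up."""
    try:
        db.execute("SELECT 1")  # any cheap query proves connectivity
    except Exception:
        # The process is running, but it cannot do useful work right now.
        return jsonify({"status": "unhealthy"}), 503
    return jsonify({"status": "ok"}), 200


if __name__ == "__main__":
    app.run()
```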
You must build in monitoring and alerting to be aware of the operation of your system at any given time (see Chapter 7), and have plans for dealing with failure (see Chapter 11).
Generating Unique (Primary) Keys
Loose coupling should be the rule for any integration point. When designing how your data will be stored, you may rely on your database to create unique identifiers for data being stored. But when you store something to a bucket system such as S3, you are forced to make your own identifier. Why is that?
Generating distinct identifiers, known as distributed IDs, using an existing implementation such as Twitter's Snowflake is considered a best practice. It can prevent issues with coupling to a specific database, as was the case for Twitter when it introduced Snowflake in 2010. Using distributed IDs also benefits relational databases, because operations don't have to wait on an insertion to generate a primary key. Without distributed IDs, when you perform an insert, you have to wait for it to return the primary key before you can create or update other linked objects, and this waiting can cascade through a complicated transaction. The operation will be much simpler if performed in one transaction with IDs you generate yourself. The same is true for complex microservices as well.
An example distributed ID consists of a combination of the time (usually so items can be sorted) and some form of entropy to reduce the likelihood of a collision of duplicated IDs to an infinitesimally small chance. These IDs allow you to sort based on the order of creation, although within a given timestamp it is impossible to know the order in which items were created. Given how much we have already discussed the inaccuracies of system clocks, you shouldn’t over-trust the accuracy of any timestamp, especially when debugging.
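A minimal sketch of the timestamp-plus-entropy idea (the bit layout is illustrative, not Snowflake's): pack milliseconds since a chosen epoch into the high bits and random bits below them, so IDs sort roughly by creation time while collisions remain vanishingly unlikely.

```python
import secrets
import time

CUSTOM_EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary choice


def distributed_id() -> int:
    """Build an ID from ~41 bits of milliseconds plus 22 random bits."""
    millis = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    entropy = secrets.randbits(22)
    return (millis << 22) | entropy


# IDs generated later sort higher, but within the same millisecond the
# ordering is random: exactly the caveat described above.
print(distributed_id())
```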
Planning for Idempotency
One way to attack idempotency is to design certain actions to be repeated as many times as needed to be successful. For example, I was designing a system, and the organization I work for decided that being multiregion was important for all of our systems. The downstream effect was that my system would notify another system that something had happened, and that system was properly designed to deduplicate those notifications. My initial thought on how to run the system in multiple regions was to simply run it in multiple regions. It would cost twice as much and would be twice as much work, but the request would be met with minimum effort. Once it actually came time to implement multiregion support, we of course designed and deployed an optimized version. In fact, we were able to deduplicate the messages ourselves, but we did not have to worry about guaranteeing deduplication.
Two-Phase Changes
A two-phase change occurs when a change is broken up into two separate parts (phases) in order to be safely deployed to production. In a distributed system, certain changes (such as data migrations) must be done in two parts. In the first phase, you update the code to handle both the state before the change and the state after it. Then, once the code has been updated to handle either situation, you can safely push the new situation into existence. In the earlier example of a new field being introduced, with the email logic relying on that new field, it was assumed that no new users could be registered without the field, so it would not be an issue. But that change did not account for tasks that were in transit, in queues, or even live requests that happened during the deployment. There are a number of ways to solve issues like this, but it is a great excuse to introduce you to the concept of two-phase changes, or migrations. If you break that new feature into two different changes, you can release them sequentially to avoid this issue. You could deploy the new field, and after letting that change settle for an adequate amount of time, you could release the second change. In this case it would also be wise to ensure that the email process does not fail when a field that previously did not exist is missing; with that safeguard, you could push out the change in one deployment, but keep this pattern in mind for other use cases around changing the structure of your database.
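A sketch of the two phases for the welcome-email example, assuming the task payload is a plain dictionary and a hypothetical mailer client: phase one ships worker code that tolerates both payload shapes, and only after every in-flight task was produced by the new registration code does phase two require the field outright.

```python
# Phase 1: deploy worker code that handles both old and new task payloads.
# Tasks queued before the change have no "first_name" key.
def send_welcome_email_phase1(task, mailer):
    name = task.get("first_name")  # None for tasks queued by the old code
    body = f"Welcome, {name}!" if name else "Welcome!"
    mailer.send(to=task["email"], subject="Welcome!", body=body)


# Phase 2: once no old-format tasks remain in transit, it is safe to
# require the field outright and simplify the code.
def send_welcome_email_phase2(task, mailer):
    body = f"Welcome, {task['first_name']}!"
    mailer.send(to=task["email"], subject="Welcome!", body=body)
```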
Further Reading
For more on the topics covered in this chapter, you can check out the following resources:
- Release It!, 2nd Edition by Michael T. Nygard (Pragmatic Bookshelf)
- Designing Data-Intensive Applications by Martin Kleppmann (O’Reilly). I strongly recommend Chapter 8, “The Trouble with Distributed Systems.” But you can skip the parts about designing consensus protocols, as it may be too advanced at this point of your journey.
- Site Reliability Engineering by Betsy Beyer et al. and The Site Reliability Workbook by Betsy Beyer et al. (both O’Reilly)
- Refactoring Databases: Evolutionary Database Design by Scott Ambler and Pramod Sadalage (Addison-Wesley)
Conclusion
We will zoom in further in the next chapter, which will cover a specific way to build a distributed system: microservices.
1 For instance, you can view this video for DynamoDB.
2 Or in the case of NoSQL, a Table.