Chapter 1. Why a New Type of Database?

In this chapter, we examine two major forces compelling a new type of database: application development and business needs.

Development has evolved to fit the distributed mindset. Applications are global, developers expect the best tools, and consumers have higher expectations of resilience. But the databases backing global applications sometimes hold them back with centrally managed transactions.

The distributed mindset also serves modern businesses. Enterprise IT needs ways to comply with increasingly complex legal requirements for data. Businesses expect better availability for large workloads. But the databases that run in enterprises don’t scale gracefully.

The next two sections go into more detail on the application development and business needs. The result is clear: the old paradigm for databases can’t keep up with cloud infrastructure, so we need to rebuild from the ground up.

Application Development Has Evolved

Developers build wherever they like and have a world of valuable services at their fingertips. Sometimes an application needs just a little dollop of cloud services; sometimes it combines many of them. For instance, with a service like Google’s Firebase, it is possible to set up data persistence and synchronization with Firebase Realtime Database, host web assets with Firebase Hosting, and ensure secure authentication with easily integrated Firebase Authentication. Pretty simple.

The application toolkit can run entirely on small, cloud-hosted development systems. There, developers run isolated versions of their system and unit-test feature builds. When they’re ready, they initiate a full suite of cloud-based automated tests. If those tests pass, code and assets automatically deploy to a global content distribution network, and automatically update a set of virtual machines or containers.

Google is not alone in the distributed application development game. See Table 1-1 for an incomplete list of products offered by major cloud vendors in this same space.

Table 1-1. Cloud application development offerings from major cloud vendors
Cloud vendor                  Cloud application development products
Amazon Web Services (AWS)     AWS CloudShell, AWS CodeBuild, AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline, AWS Elastic Beanstalk, AWS Lambda, AWS X-Ray
Google Cloud Platform (GCP)   App Engine, Cloud Build, Cloud Code, Cloud Monitoring, Cloud Run, Firebase suite, Google Kubernetes Engine
Microsoft Azure               Azure App Service, Azure DevOps, Azure DevTest Labs, Azure Functions, Azure Monitor, Azure Pipelines

Globally available resources have become the status quo. They’re accessible, distributed, and resilient. Cloud computing is not just the infrastructure that runs this work; it is the mindset that’s now required for doing fast, impactful development work.

Our traditional SQL database options haven’t kept up. Centralized SQL databases, even those with read replicas in the cloud, put all the transactional load on a central system. The further away a transaction happens from the user, the more the user experience suffers. Application resources like app store binaries or web assets can use content distribution networks to make downloading and running apps seem fast. But if the transactional data powering the application is greatly slowed down, fast-loading web pages mean nothing.

Distributed SQL databases solve this problem with easy scalability and data locality. They deliver a single logical database across disparate, widely distributed hardware. Anywhere on earth that has a datacenter can host additional resources for distributed SQL databases. And they support consistent transactions everywhere using advanced consensus algorithms, thus avoiding the bottleneck of reliance on a centralized node for transaction consistency.
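To make the idea of consensus without a central node a little more concrete, here is a minimal Python sketch of the majority-quorum rule that consensus protocols such as Raft and Paxos build on. It is an illustration only, not the implementation of any particular database; the function names are assumptions made for this example.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count // 2 + 1


def is_committed(acks: int, replica_count: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= majority(replica_count)


def tolerated_failures(replica_count: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replica_count - majority(replica_count)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, "replicas: quorum", majority(n),
              "| survives", tolerated_failures(n), "failures")
```

Because any two majorities of the same replica set overlap in at least one replica, no single node has to act as the sole arbiter of what was committed.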

Some distributed SQL databases even allow for data relevant to a particular region to be tightly bound there. Relevant user data is close to where it is needed, which speeds up the user experience and eases the challenges associated with scale. Chapter 2 goes more in depth on the particulars of scale and consensus.

Business IT Needs Have Evolved

Consumers have three expectations of our applications: they must be fast; they must be correct; and they must be always on and available. These are critical expectations, since consumers’ patience has waned and moving on to another option is radically easy. If you opened an account and your balance was wrong, would you have confidence in your bank?

People also expect their apps and services to work everywhere, without delay. If you take a trip from London to Sydney, you’ll still have quick access to your Gmail and Instagram pictures, and those services will adjust to your new presence.

Finally, people expect that their apps won’t lose availability just because of some backend infrastructure problem; the apps should survive such failures and keep serving requests. If you run a business with a customer-facing application, your customers’ app experience must be as good as using Facebook or Instagram, and it must be at the top of your priority list.

The cloud helps meet these requirements, and businesses know it. Table 1-2 shows Gartner’s recorded and forecast public cloud revenue. Infrastructure as a service (IaaS) and platform as a service (PaaS) are each projected to more than double between 2018 and 2022! The business IT environment is increasingly cloud-dependent and is seeing enormous productivity growth from adopting cloud-first approaches to solutions. IT departments that learn to think from a distributed perspective for their data, operations, and compute workloads far outpace their on-premises-centered competitors.

Table 1-2. Projected worldwide public cloud service revenue forecast into 2022 (billions of US dollars) [a]
Segment                                              2018   2019   2020   2021   2022
Cloud business process services (BPaaS)              41.7   43.7   46.9   50.2   53.8
Cloud application infrastructure services (PaaS)     26.4   32.2   39.7   48.3   58.0
Cloud application services (SaaS)                    85.7   99.5  116.0  133.0  151.1
Cloud management and security services               10.5   12.0   13.8   15.7   17.6
Cloud system infrastructure services (IaaS)          32.4   40.3   50.0   61.3   74.1
Total market                                        196.7  227.8  266.4  308.5  354.6

[a] BPaaS stands for business process as a service; IaaS stands for infrastructure as a service; PaaS stands for platform as a service; SaaS stands for software as a service. Note the totals may not add up due to rounding. Source: Gartner (November 2019).
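The "more than double" claim can be checked directly against the table. The short Python sketch below copies the 2018 and 2022 figures for PaaS, IaaS, and the total market from Table 1-2 and computes the growth factor for each; the variable and function names are just for this illustration.

```python
# Revenue in billions of US dollars, copied from Table 1-2.
revenue = {
    "PaaS": {2018: 26.4, 2022: 58.0},
    "IaaS": {2018: 32.4, 2022: 74.1},
    "Total market": {2018: 196.7, 2022: 354.6},
}


def growth_factor(series, start=2018, end=2022):
    """Ratio of end-year revenue to start-year revenue."""
    return series[end] / series[start]


for name, series in revenue.items():
    print(f"{name}: {growth_factor(series):.2f}x")
# PaaS: 2.20x
# IaaS: 2.29x
# Total market: 1.80x
```

Both infrastructure segments grow by more than a factor of two, while the overall market grows by roughly 1.8 times over the same four years.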

Again, existing SQL databases haven’t kept up. They are not particularly resilient to outages in infrastructure zones, especially when the zone that goes offline contains the main transactional instance. And while NoSQL alternatives help with this particular challenge, they typically can’t promise transactional consistency, and they present developers with the complexity of a document model that lacks the elegance and power of the relational model (for example, normalization, referential integrity, secondary indexes, and joins).

Distributed SQL databases fill in the gaps. They are highly resilient to any type of outage. They deliver the time-tested and familiar SQL query syntax that developers know and love, and they promise truly consistent ACID transactions. And most importantly, they are aligned with the elastic scale and ubiquitous nature of our cloud infrastructure.

The Evolution of the Database

In the cloud, the distributed mindset should permeate everything we do. This will allow us to take advantage of the elastic scale, resilience, and ubiquity of cloud infrastructure. And this especially applies to the database.

Cloud databases have iterated to try to meet these needs, and each iteration has moved the ball forward, but none has completely closed the gap. Let’s explore our progress to date:

Lift and shift

The first approach to transactional databases in the cloud involved simply lifting traditional RDBMS instances and shifting them onto virtual machines in hyperscaler datacenters. Amazon Relational Database Service (RDS) and Google Cloud SQL are examples of this. Both are great options for some workloads and wildly popular with developers. But they don’t leverage cloud distributed thinking; their value lies in ease of access and deployment (and in capex/opex trade-offs). Transactions against these databases are solid but locked to a single instance. Scale is achieved either by deploying on a more powerful instance or by manually sharding data to scale horizontally. This approach is useful, but it doesn’t deliver on the promise of elastic scale, and it relies on an active/passive configuration for resilience. We no longer have to deal with complex operations to deploy and manage the database, but we haven’t taken full advantage of cloud infrastructure.

Move and improve

Some have chosen to augment existing databases with distributed technologies that automate sharding and enhance the database’s survivability by reworking a single layer of the database. This class of cloud database has moved a legacy database to the cloud and improved part of it to meet some cloud requirements. Amazon Aurora is a good example. Its major innovation is a distributed storage system that acts as a lake of data under many read-only MySQL- or PostgreSQL-compatible instances. It improves availability, is simple to scale for reads, and looks and feels like a traditional SQL database. However, it uses a single node for writes, which can create bottlenecks and limit the ability to scale this database beyond a single region. Additionally, there are several new databases that automate sharding, but they struggle with anything beyond a single region.

NoSQL

NoSQL options are often considered cloud databases, and they definitely deliver value in this environment. Most notably, Apache Cassandra, DynamoDB, and MongoDB provide lightning speed with often schemaless, flexible structures. They scale easily. For data that does not require tight global consistency, like social media feeds, NoSQL is an excellent choice. But when developers need tight consistency for important things like financial transactions, NoSQL often can’t meet those demands. Distributed transactions are not guaranteed to be correct, and many deployments offer only eventual consistency. Further, the document model limits their ability to deliver the elegance of the relational model: NoSQL databases typically lack, or only partially support, critical concepts like referential integrity, secondary indexes, joins, and normalization.
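To see why the manual sharding mentioned under lift and shift falls short of elastic scale, consider this minimal Python sketch of application-side shard routing. It assumes a simple hash-modulo scheme, and the shard names are hypothetical; real deployments vary, but the operational problem is similar.

```python
import hashlib

# Hypothetical instance names, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Route a key to a shard by hashing it and taking the remainder."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def moved_keys(keys, old_shards, new_shards) -> int:
    """Count keys whose shard changes when the shard list changes."""
    return sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    grown = SHARDS + ["db-shard-3"]
    print("keys that must move:", moved_keys(keys, SHARDS, grown))
```

With this scheme, adding a single shard remaps a large share of the keys, and the application team has to plan and run that data migration. Automating this kind of rebalancing is one of the things later generations of cloud databases set out to do.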

The Way Forward

These approaches cover a lot of ground, but the gap is clear. With application development and business needs both adopting the distributed mindset, databases need a new set of features to keep up:

Ease of scale

It should be easy to have as little or as much cloud power behind your database as you need.

Always-on resilience

Downtime should be eliminated.

Data locality

For both compliance and performance, you need to be able to tie certain data to certain geographic locations.

SQL

Developers love the SQL language for its expressive data features, and since they’re already familiar with it, many tasks are made easier.

ACID compliance

If you commit a record, you should be able to trust that subsequent queries will match that new state.
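As a small illustration of the last two items, the Python sketch below uses the standard library's sqlite3 module to show a transfer between two accounts that either commits as a whole or leaves no trace. SQLite is a single-node embedded database, so this shows only the guarantee a developer expects from SQL and ACID, not how a distributed SQL database provides it; the table and account names are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically; roll back on any failure."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


def balances(conn):
    """Return the current committed state as {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

transfer(conn, "alice", "bob", 30)
print(balances(conn))  # {'alice': 70, 'bob': 80}

try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice
except ValueError:
    pass
print(balances(conn))  # unchanged: {'alice': 70, 'bob': 80}
```

The committed transfer is visible to every later query, and the failed one changes nothing. Distributed SQL aims to keep exactly this contract while the rows live on machines spread across regions.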

Running globe-spanning database transactions consistently across a physical universe limited by the speed of light is an extremely difficult software engineering challenge. But it’s possible. A new set of players has emerged to meet that challenge with distributed SQL.

We’ve got the motivation down. Next, we’ll dive deep on those features, and how these new players are getting them done.
