Building—and scaling—a reliable distributed architecture
Five questions for Joseph Breuer and Robert Reta on managing dependencies, building for adaptability, and managing through change.
I recently asked Joseph Breuer and Robert Reta, both Senior Software Engineers at Netflix, to discuss what they have learned through implementing a service at scale at Netflix. Joseph and Robert will be presenting a session on Event Sourcing at Global Scale at Netflix at O’Reilly Velocity Conference, taking place October 1-4 in New York. Here are some highlights from our conversation.
What were some of the obstacles you faced while implementing at scale?
Joseph: The primary challenge when operating a service in a distributed architecture at scale is managing for the behavior of your downstream dependencies. Whether those dependencies are a datastore or a restful API defining timeouts, fallback data, and concurrency of the interactions will be the defining factor of your service. Your service may scale wonderfully, but if a dependency is not accounted for then the overall service can quickly fall over.
Robert: Having run only stateless servers until our offline work, we were too focused on scaling our CPU resources and ignoring other important metrics, such as disk capacity. The instance type we have chosen for our streaming licenses was not suitable for our offline licenses; we had way more CPU cycles then we could ever need but started to fill up our disks quickly. We didn’t take into account our instance type and how the resources scale differently.
What were some of the tradeoffs you had to make along the way?
Joseph: Implementing a service with Event Sourcing and Cassandra required two major paradigm shifts. First was the mutability of the data model. Second was the relational abstraction of data in a CRUD form. We did benefit from beginning with a clean slate in designing our service, but it was still a tradeoff.
Robert: We really wanted to implement a CQRS solution on top of our Event Sourcing architecture; it would have given us more power in terms of queries and big data analysis. But at some point you need to make a decision on initiating a feature freeze and stabilizing the release candidate.
What skills or experience do you need to operate at this scale?
Joseph: You need a willingness to experiment and to try new solutions until one works. Build a system that allows for rapid changes—meaning, avoid an architectural or design decision that prescribes an update or transition strategy.
Robert: You need a desire to live outside your comfort zone and take calculated risks. There was no option to rollback on launch day, so we had to fully scrutinize most of the decisions we make. But with the fast moving nature of the product, developers had to also be comfortable making a call and running with it.
You advocate for Cassandra over other tools—why?
Joseph: There are so many tools (our Velocity presentation goes into a number of them). When considering datastore architectures, your first choice is between relational versus distributed NoSQL. Think CAP theorem. For NoSQL your next choice is between document model versus time series. Cassandra supports time series data modeling, which was necessary for our implementation of Event Sourcing. For fine tuning, Cassandra differentiates consistency levels of read/writes independently per execution. Finally, it was critical that there was existing infrastructure to manage a multi-region Cassandra schema within Netflix.
Robert: I wouldn’t necessarily say we advocate it over other tools; it just made sense for this particular project. Each project has its own requirements in terms of consistency and availability, and for offline we really wanted to scale easily at the expense of consistency. Cassandra provides this out of the box so it made sense to run with it.
What other parts of the program for Velocity NY are of interest to you?
Joseph: Of particular interest are the sessions focusing on load testing and capacity setting of distributed systems—for instance, Susie Xia and Anant Rao’s How LinkedIn Determines the Capacity Limits of its Services Using Live Traffic and Jeffrey Valeo’s Lessons Learned from Load Testing Distributed Systems.
Robert: I’m most interested in socializing with other speakers who have similar professional backgrounds. People with these types of passions tend to be outside-the-box thinkers, and I like to bounce ideas back and forth with them.