Understanding tracing
Five questions for Bryan Liles on the complexities of tracing, recommended tools and skills, and how to learn more about monitoring.
I recently asked Bryan Liles, principal engineer on Capital One’s Cloud Engineering team, to discuss OpenTracing and the skills needed for successful tracing. Bryan is co-hosting a tutorial on OpenTracing and is also speaking about systems engineering career paths at the O’Reilly Velocity Conference, taking place Oct. 1-4 in New York.
What makes tracing complex?
The first thing that makes tracing complex is understanding how it fits into your application monitoring stack. I like to break down monitoring into metrics, logs, and tracing. Tracing allows you to understand how your application’s components interact with one another and with any potential consumers. Secondly, finding a good toolset that works across a diverse application infrastructure is also complex. This is why I’m hoping to see OpenTracing become more successful, since it provides a good interface based on real-world work at Google and Twitter. Finally, tracing is complex because of the number of components involved. If you are working in a large microservice-based application, you could have scores of microservices coupled with databases of many types and other applications as well. Combined with the tracing infrastructure, this leads to a large number of items to consider. OpenTracing helps again by providing standards and clients to help simplify integration for the developer and operations teams.
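To make the idea of components interacting within a trace more concrete, here is a minimal sketch of the data a tracer typically records: timed spans that share a trace ID and point at their parent. It uses only the Python standard library. `Span`, `ToyTracer`, and the operation names are hypothetical illustrations, not the OpenTracing API or any real client.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One timed operation in a trace (toy type, not the OpenTracing API)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this operation
    parent_id: Optional[str]  # None for the root span
    operation: str
    start: float = 0.0
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


class ToyTracer:
    """Keeps a stack of active spans so nested work gets the right parent."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def start_span(self, operation: str):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            operation=operation,
            start=time.perf_counter(),
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            span.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(span)


# Hypothetical request: one handler that calls a database and another service.
tracer = ToyTracer()
with tracer.start_span("handle_checkout") as root:
    with tracer.start_span("query_inventory_db"):
        time.sleep(0.01)
    with tracer.start_span("call_payment_service"):
        time.sleep(0.02)

for span in tracer.finished:
    print(f"{span.operation:22s} parent={span.parent_id} {span.duration * 1000:.1f} ms")
```

In a real microservice deployment the trace and parent IDs would also have to travel across process boundaries, for example in request headers, which is the part that standard clients aim to simplify.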
What skills do you need for tracing?
To be successful with tracing, you need to be comfortable in your chosen programming library. Since the OpenTracing ecosystem hasn’t fully matured, you might have to augment clients to allow them to interact with your tracing system. You should also understand how tracing fits into a well-monitored application, which means you need to understand the basics of monitoring as well. Something often forgotten by application developers is the need for statistics. Often, we get discrete numbers that we can perform calculations on. But more and more, it is impossible to count everything one by one, so knowledge of statistics is paramount to help you determine trends or decide whether you have enough data to be sure your measurements are valid.
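As one illustration of the statistics point, the sketch below estimates a mean latency from a sample and attaches an approximate 95% confidence interval, so you can see how much the estimate tightens as the sample grows. The latencies are synthetic, and `mean_with_ci`, the sample sizes, and the normal approximation are assumptions made for this example, not a method described in the interview.

```python
import math
import random
import statistics


def mean_with_ci(samples, z=1.96):
    """Return (mean, half_width) for an approximate 95% confidence interval.

    Uses the normal approximation: half_width = z * stdev / sqrt(n).
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(n)
    return mean, z * stderr


random.seed(42)

# Synthetic request latencies in milliseconds (skewed, like many real latencies).
population = [random.lognormvariate(4.0, 0.5) for _ in range(100_000)]

# Pretend we can only afford to sample some of the requests.
small = random.sample(population, 50)
large = random.sample(population, 5000)

small_mean, small_hw = mean_with_ci(small)
large_mean, large_hw = mean_with_ci(large)

print(f"n=50:   mean = {small_mean:.1f} ms, +/- {small_hw:.1f} ms")
print(f"n=5000: mean = {large_mean:.1f} ms, +/- {large_hw:.1f} ms")
```

The interval from the larger sample is much narrower, which is one simple way to judge whether you have collected enough data to trust a measurement.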
Why do you advocate OpenTracing?
I’m a believer in standing on the shoulders of giants. OpenTracing evolved from the research performed at Google in their Dapper project. I also believe in the power of confirmation: Twitter took the ideas of Dapper and produced Zipkin. The inclusion of the project in the CNCF affirms that the project has community support and is gearing itself up to be around for the long term. That, coupled with the fact that it is open source with multiple implementations, means that it is worthy of the community’s attention. Finally, the documentation and available sample code are of good quality, which makes it easy to get started.
What advice do you have for people interested in learning more about monitoring?
If you want to learn more about monitoring, there are a few things you can do. The Art of Monitoring by James Turnbull lays the theoretical foundation of what you should be monitoring and describes one method for building a monitoring system. Site Reliability Engineering describes how Google does monitoring, and there are lots of good lessons in there, including defining service level agreements (SLAs) and how to build systems to be monitored. After you understand the foundations, you should start on your own small project. Pay close attention to the items you are monitoring. Are they helping with your SLA or are they noise? How can logging help you with your SLA? Once you understand that, think about whether you’re logging too much or not logging enough. Next, you can introduce tracing. The most important part of monitoring is understanding that it is all about the SLA. If you can’t tell if your application is up and working correctly, you aren’t delivering value to your customers. Finally, there are lots of good talks about monitoring. The Monitorama conference has scores of talks online on the subject. Velocity is also a good conference for finding talks about all the components of monitoring, from the tools, to what you monitor, to how you determine the real causes and alert.
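For a small project, one way to keep monitoring tied to the SLA is to compute availability directly from request counts and compare it with a target. The sketch below is a simplified illustration. The 99.9% target, the request counts, and the function names are assumptions for the example, not figures from the interview or the books mentioned.

```python
def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    if total == 0:
        raise ValueError("no requests observed")
    return successes / total


def failures_remaining(successes: int, total: int, target: float) -> float:
    """How many more failures the target would tolerate over this window.

    A negative result means the target has already been missed.
    """
    allowed_failures = (1 - target) * total
    actual_failures = total - successes
    return allowed_failures - actual_failures


# Hypothetical numbers for one reporting window.
TARGET = 0.999
total_requests = 1_000_000
successful_requests = 999_400

observed = availability(successful_requests, total_requests)
remaining = failures_remaining(successful_requests, total_requests, TARGET)

print(f"availability: {observed:.4%}")
print(f"meets target: {observed >= TARGET}")
print(f"failures still tolerated: {remaining:.0f}")
```

A metric that feeds a calculation like this is clearly helping with the SLA. One that never influences it is a candidate for noise.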
What other parts of Velocity interest you?
At Velocity, I’m looking for people solving complex problems with novel solutions. I enjoy the stories from the field, whether it be a deep dive on a narrow topic, or a shallow dive on a myriad of topics. Specifically, I enjoy talks about monitoring and its difficulties. It is reassuring to see organizations running into the same problems I’ve seen over the years.