Linux performance: Tracing the cloud
Five Questions for Sasha Goldshtein: Thoughts on dynamic tracing, performance tools, and performance optimization in the cloud.
Sasha Goldshtein, CTO of Sela Group and expert in distributed architecture, production debugging, and mobile application development, will be speaking about the modern Linux tracing landscape at Velocity in New York in September. I recently sat down with Sasha to discuss the latest in Linux performance tools. Here are highlights from our conversation.
What is modern dynamic tracing on Linux?
Tracing is a diagnostics and monitoring approach that makes it possible to obtain accurate statistics and log messages from arbitrary points in the system, ranging from low-level kernel components to application-level libraries and runtimes. Unlike logging, tracing does not require developers to build it in while the system is being developed. Rather, tracing can be activated on demand, only when necessary, and can collect valuable information about the system’s behavior, ideally with low overhead. Modern tracing tools begin to make this ideal a reality by aggregating event data in the kernel and making it possible to generate averages and histograms without saving millions of events to disk and then processing them.
Over the last few years, new tracing tools have been added to the Linux kernel, making the choice of tool non-trivial for many applications and scenarios. Notable choices include perf (a Swiss Army knife-like front-end for performance investigations), SystemTap (which ships with a scripting language for building your own tracing tools), Sysdig (offering the simplicity of Wireshark-style rules for system-wide monitoring), and eBPF (providing a safe, in-kernel virtual machine for high-performance, low-overhead tracing). Some of these technologies are still evolving, while others are already mature and can be used in widely deployed production systems.
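As a concrete illustration of the in-kernel aggregation described above, here is a minimal sketch using the BCC Python front-end for eBPF (it assumes the bcc package and a kernel with eBPF support; the probe target and names are illustrative). It attaches kprobes to vfs_read() and builds a latency histogram entirely inside the kernel, so only the summary ever reaches user space:

```python
#!/usr/bin/env python
# Minimal sketch: measure vfs_read() latency and aggregate a log2 histogram
# in the kernel with eBPF; no per-event data is copied to user space.
# Assumes the bcc Python package and eBPF support in the running kernel.
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);      // per-thread entry timestamps
BPF_HISTOGRAM(dist);            // latency histogram, aggregated in the kernel

int trace_entry(struct pt_regs *ctx) {
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;                       // missed the entry probe
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_read", fn_name="trace_entry")
b.attach_kretprobe(event="vfs_read", fn_name="trace_return")

print("Tracing vfs_read() latency... Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")
```

The same pattern of aggregating in the kernel and reading back only a summary underlies many of the ready-made tools in the BCC project.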
Does the cloud require a different approach to performance optimization?
Yes, to a large extent.
First, some parts of the stack are not going to be under your control. You can evaluate and benchmark cloud solutions (especially when dealing with IaaS vendors, which often let you customize the amount of RAM, IOPS, and other performance features), but you’re still often getting a shared system with somewhat unpredictable performance patterns. Identifying the exact cause of a slowdown or a hiccup can be next to impossible when you don’t have full control over the environment.
Second, some of the performance monitoring tools that may hold up in a small lab don’t work at cloud scale. As it gets easier to deploy large-scale systems, monitoring them with low overhead and getting actionable performance information becomes harder; it is easy to saturate the ops dashboard and make alerts impossible to follow. The ability to drill down into an issue on demand, across a large number of servers, is becoming increasingly important.
Third, the elasticity of cloud environments makes some performance optimization techniques obsolete, or at least not quite as relevant. You often don’t have to lock in to a specific processor, memory size, or network configuration, or even a specific cloud vendor. When a configuration you used in the past doesn’t get you the same return on investment, you can tune your deployment for another configuration without spending millions of dollars on physical servers.
Which Linux performance tools do you think your average ops person should learn more about?
A lot of ops and reliability engineers are familiar with high-level performance tools, such as top, iostat, and vmstat. However, figuring out the tricky issues requires a finer-grained level of detail than these tools can provide. For example: Which code paths are allocating a lot of memory? Which IP addresses are experiencing slow TCP handshake times? Which URLs in the web application are responsible for most of the system load? Answering these questions requires low-level tools, and this is where dynamic tracing really shines. Without recompiling or reconfiguring the system, tools like the perf front-end and the BPF tool collection can give you detailed insight into what’s happening in your application.
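For instance, here is a hypothetical sketch (illustrative names and structure, not a stock BCC tool) of how a BPF-based tool can approach the first question, which code paths allocate a lot of memory: it attaches a uprobe to libc’s malloc() in a target process and sums the requested bytes per user-space stack trace, with the aggregation done in the kernel. It assumes the bcc Python package:

```python
#!/usr/bin/env python
# Hypothetical sketch: sum malloc() request sizes per user-space stack trace
# for one process, aggregating in the kernel. Usage: sudo ./mallocstacks.py <pid>
# Assumes the bcc Python package and eBPF support in the running kernel.
import sys
from time import sleep
from bcc import BPF

pid = int(sys.argv[1])

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(bytes, int, u64);              // stack id -> total bytes requested
BPF_STACK_TRACE(stack_traces, 4096);    // user-space stack traces

int trace_malloc(struct pt_regs *ctx) {
    u64 size = PT_REGS_PARM1(ctx);      // first argument of malloc()
    int stack_id = stack_traces.get_stackid(ctx, BPF_F_USER_STACK);
    if (stack_id < 0)
        return 0;
    u64 zero = 0, *total = bytes.lookup_or_try_init(&stack_id, &zero);
    if (total)
        (*total) += size;
    return 0;
}
"""

b = BPF(text=prog)
b.attach_uprobe(name="c", sym="malloc", fn_name="trace_malloc", pid=pid)

print("Tracing malloc() in pid %d... Ctrl-C to print the heaviest stacks." % pid)
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

stacks = b["stack_traces"]
for stack_id, total in sorted(b["bytes"].items(),
                              key=lambda kv: kv[1].value, reverse=True)[:10]:
    print("%d bytes requested from:" % total.value)
    for addr in stacks.walk(stack_id.value):
        print("    %s" % b.sym(addr, pid).decode("utf-8", "replace"))
```

In practice you would reach for the ready-made BCC tools first (for example, memleak for allocation code paths and tcpconnlat for TCP connection latency), but the sketch shows how little code such a tool requires.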
What are some of the challenges using Linux performance tools, and how can they be addressed in the future?
Even though a lot of effort has been invested over the years in improving the user-friendliness of Linux performance tools, there is still plenty of room for improvement. One particular area is graphical front-ends: a web-based dashboard for monitoring personnel, where a couple of clicks would let you drill into a server and obtain accurate information from low-level tools. Another area for potential improvement is visualization—charts, flame graphs, latency histograms, heat maps, and so on. There’s a lot of work in these areas in open source projects and in some commercial offerings as well. Making low-level performance and tracing tools accessible to a larger audience of engineers and support personnel is critically important for the next generation of systems at scale.
You’re speaking at the Velocity Conference in New York this September. What presentations are you looking forward to attending while there?
I’m looking forward to the performance testing workshop on Tuesday—performance and load testing skills are an absolute must before you can even dream of doing any performance optimization. Another talk I wouldn’t miss is the latency analysis of distributed systems talk on Thursday; with the microservices hype, it’s getting harder and harder to figure out where the latency of each individual request resides.