What is effective performance engineering?
And what does it mean for your hardware, software, culture, and business?
While performance engineering is often defined narrowly as ensuring that nonfunctional requirements (such as response times, resource utilization, and throughput) are met, the trend has moved toward a much broader application of the term.
“Performance engineering” doesn’t refer only to a specific role. More generally, it refers to the set of skills and practices that are gradually being understood and adopted across organizations that focus on achieving higher levels of performance in technology, in the business, and for end users.
Performance engineering embraces practices and capabilities that build in quality and performance throughout an organization, including functional requirements, security, usability, technology platform management, devices, third-party services, the cloud, and more.
Stakeholders of performance engineering run the gamut, including business, operations, development, testing/quality assurance, and end users.
Hardware
The traditional goal of performance engineering between the 1970s and the late 1990s was to optimize the application hardware to suit the needs of the business or, more accurately, the IT organization that provided the services to the business. This activity was really more of a capacity planning function, and many teams charged with carrying the mantle of performance reported to operations or infrastructure teams. Some still do (and that’s okay).
As hardware became more commoditized and the adoption of virtual infrastructure and “the cloud” more prevalent, this function took a backseat to development in an effort to deliver business applications and changes faster. It isn’t uncommon now for teams to have multiple environments to support development, test, production, and failover. While certainly more cost-effective than ever, virtualization has given us the false sense that these environments are free.
The cloud allows service providers to charge a premium for computing power in exchange for the promise of higher uptime, higher availability, and virtually unlimited capacity. However, the cloud doesn’t promise an optimal user experience. Applications need to be optimized for the cloud in order to maximize the potential return on investment.
Software
Over the last 30 years, software has transformed from monolithic to highly distributed, and even the concept of model-view-controller (MVC) has evolved to service-oriented (SOA) and micro-service architectures, all in an effort to reduce the number of points of change or failure, and improve the time-to-value when new functionality is implemented. Isolating components also allows developers to test their discrete behavior, which often leaves end-to-end integrated testing out of scope. The assumption here is that if every component behaves as it should, the entire system should perform well. In an isolated environment this may be true, but there are many factors introduced when you’re building large-scale distributed systems that impact performance and the end-user experience—factors that may not be directly attributed to software, but should be considered nonetheless, such as network latency, individual component failures, and client-side behavior. It is important to build and test application components with all of these factors represented in order to optimize around them.
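To illustrate why well-behaved components do not guarantee a well-behaved system, here is a minimal sketch of how availability and latency compound along a request path. It assumes the components are called in series, fail independently, and each sit behind one network hop; the function names and all figures are hypothetical, chosen only for illustration.

```python
from math import prod


def end_to_end_availability(component_availabilities):
    """Availability of a request that needs every component to succeed.

    Assumption: components are called in series and fail independently.
    """
    return prod(component_availabilities)


def end_to_end_latency_ms(component_latencies_ms, network_hop_ms):
    """Total latency when components are called one after another.

    Assumption: one network hop of fixed cost precedes each component.
    """
    hops = len(component_latencies_ms)
    return sum(component_latencies_ms) + network_hop_ms * hops


# Hypothetical figures: five services, each 99.9% available and
# answering in 40 ms, with 10 ms of network latency per hop.
availability = end_to_end_availability([0.999] * 5)
latency = end_to_end_latency_ms([40] * 5, network_hop_ms=10)
print(f"end-to-end availability: {availability:.4f}")
print(f"end-to-end latency: {latency} ms")
```

Under these assumptions, each service looks healthy when tested alone, yet the end-to-end path is both less available and noticeably slower than any single component, which is the gap that integrated testing is meant to expose.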
Culture
Every organization and group has a mission and vision. While they strive to attain these goals, performance tends to be left implicit. But performance needs to be part of every decision about the steps taken to achieve a goal; it forms the basis of how an organization will embody performance engineering throughout its culture to achieve its mission and vision. We need to treat performance as a design principle, similar to deciding whether to build applications using MVC or micro-services architectures, or asking why a new epic (a large requirement, in Agile terminology) is important to the business, and how performance for the business, the technology, and the end user will make a difference for all stakeholders. Performance needs to be an overarching requirement from the beginning, or we have already started on the wrong foot.
In order to build a culture that respects the performance requirements of the organization and our end users, there needs to be some incentive to do so. If it doesn’t come from the top down, then we can take a grassroots approach, but first we need to quantify what performance means to our business, users, and team. We must understand the impact and cost of every transaction in the system, and seek to optimize that for improved business success.
The key takeaway here should be that performance is everyone’s responsibility, not just the developers’, the testers’, or the operations team’s. It needs to be part of our collective DNA. “Performance First” can be a mantra for every stakeholder.
The end-user experience should be at the forefront of thinking when it comes to performance. The satisfaction of your end users will ultimately drive business success (or failure), and can be quantified by a number of metrics described in “Metrics for Success” on page 82.
The point is that it shouldn’t matter whether your servers can handle 1,000 hits/second or whether CPU usage stays below 80%. If the experience of the end user is slow or unreliable, the end result should be considered a failure.
Business
What does performance mean to your business? Aberdeen Group surveyed 160 companies with average annual revenue of over $1 billion, finding that a one-second delay in response time caused an 11% decrease in page views, a 7% decrease in “conversions,” and a 16% decrease in customer satisfaction.
Google conducted experiments on two different page designs, one with 10 results per page and another with 30. The page with 30 results took a few hundred milliseconds longer to load, reducing search usage by 20%. Traffic at Google is correlated with click-through rates, and click-through rates are correlated with revenue, so the 20% reduction in traffic would have led to a 20% reduction in revenue. Reducing Google Maps’ 100-kilobyte page weight by almost a third increased traffic by over one-third.
The correlation between response time and revenue is not restricted to Google. A former employee of Amazon.com discovered that 100 milliseconds of delay reduced revenues by 1%. Whether you are selling goods online or providing access to healthcare registration for citizens, there is a direct correlation between the performance of your applications and the success of your business.
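As a rough illustration of the arithmetic behind figures like these, the sketch below applies the Amazon observation (about 1% of revenue per 100 milliseconds of delay) to a hypothetical business. The linear scaling, the function name, and the revenue and delay figures are all assumptions made for the example; real sensitivity to delay has to be measured for each application.

```python
def estimated_revenue_loss(annual_revenue, added_delay_ms, loss_per_100ms=0.01):
    """Estimate the revenue lost to extra response-time delay.

    Assumption: the loss scales linearly with delay, which is a
    simplification. loss_per_100ms defaults to the 1% figure quoted
    for Amazon and should be replaced with a measured value.
    """
    fraction_lost = min(1.0, loss_per_100ms * added_delay_ms / 100)
    return annual_revenue * fraction_lost


# Hypothetical business: $50 million in annual online revenue and
# pages that respond 250 ms slower than they could.
loss = estimated_revenue_loss(50_000_000, 250)
print(f"Estimated annual revenue at risk: ${loss:,.0f}")
```

Even as a back-of-the-envelope estimate, expressing delay in terms of revenue gives business stakeholders a way to weigh performance work against other priorities.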