The challenges of holiday performance readiness
Five important things you can do to survive the holiday rush.
Preparing for the holidays, especially if you’re engaging in commerce, can be stressful if you haven’t done your homework. Black Friday and other holidays can cause traffic to skyrocket, often exceeding 100x the non-holiday peak. Failure during Black Friday or other peak times could put you out of a job, generate bad press for your company, and even impact your company’s stock price.
To survive, you need to first understand that holiday readiness is primarily a management activity. While technology is involved, it is not the focus. Let’s review some of the top ways to ensure you have a stress-free holiday or special event.
Remove the “Human” element
Humans are bad at repeatedly performing even basic tasks. From driving a car, to eating soup without spilling it on your shirt—humans make mistakes. In our industry, one mistyped character can mean the difference between success and failure. The airline industry has coped with this by making extensive use of automation (autopilot) and checklists. These two principles can be directly applied to holiday readiness.
Automation should be extensively employed to avoid downtime. Auto-scaling should be employed at all levels to avoid mistakes while provisioning hardware and software. The installation and configuration of all software should be scripted. Health checking should be automated. A human’s job should be to automate—not to manually perform tasks.
Where automation cannot be employed, checklists should be used—just like pilots use in the cockpit. For example, each team should develop an extensive checklist verifying that their part of the overall system is functioning. Check for file permission issues, web server configuration, iptables rules, etc. In addition to verification checklists, have checklists for what happens in an outage. Whose role is it to communicate with executives? Whose role is it to communicate with each vendor? There shouldn’t be any “guessing.” You should have checklists for everything.
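Parts of a verification checklist can themselves be scripted, so that a person only reviews the results. The following is a minimal Python sketch of that idea, not a definitive tool. The function names, the checklist structure, and the single file permission check are all assumptions made for illustration.

```python
import os
import stat
import tempfile
from typing import Callable, Dict, List, Tuple

# A checklist is an ordered list of (description, check) pairs.
# Each check returns True when the item passes.
Checklist = List[Tuple[str, Callable[[], bool]]]


def not_world_writable(path: str) -> bool:
    """Pass only if the file exists and is not writable by 'other' users."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return not (mode & stat.S_IWOTH)


def run_checklist(items: Checklist) -> Dict[str, bool]:
    """Run every item without stopping early and report pass/fail per item."""
    results = {}
    for description, check in items:
        try:
            results[description] = bool(check())
        except Exception:
            # A check that crashes counts as a failed item.
            results[description] = False
    return results


if __name__ == "__main__":
    # Hypothetical example: a temporary file stands in for a web server config.
    with tempfile.NamedTemporaryFile() as config:
        os.chmod(config.name, 0o644)
        report = run_checklist([
            ("config file is not world-writable",
             lambda: not_world_writable(config.name)),
        ])
        for description, passed in report.items():
            print("PASS" if passed else "FAIL", description)
```

Running every item instead of stopping at the first failure mirrors how a paper checklist is used: the team sees the full list of problems in one pass.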
Manage change properly
To avoid unexpected downtime, you should aim to make as few changes as possible. Changes include deploying new custom code, changing or upgrading supporting software (application servers, databases, firmware, etc.), manually adding new network gear, upgrading hardware, etc. Every change is a possible cause of an outage. Many retailers freeze their production systems for the months of October and November in preparation for Black Friday, with any change requiring the CIO to sign off. While this puts business and technology progress on hold, it does wonders for stability. Changes that need to go into production close to your special event should go through an extensive change management process. Remember, many have lost their jobs due to outages.
Cache everything
Most of the traffic for a holiday is likely to be for a handful of pages, like the home page, category overview pages, and product detail pages. A 10x, 100x, or even 1,000x spike in traffic can often be served if that handful of common pages is served directly by your Content Delivery Network (CDN). Rather than pass the requests back to your platform, the CDN can directly serve up a cached copy of the most common pages. Once cached, the pages are just static files served from a web server. A major CDN can serve millions of copies of those pages per second. In addition to caching entire pages, you can also cache page fragments, images, objects within your custom application, and objects within your datastore. Caching is well understood and should be done liberally.
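One common way to tell a CDN which pages it may cache is the standard HTTP Cache-Control response header. The Python sketch below shows one possible policy under stated assumptions: the URL prefixes, the lifetimes, and the function name are hypothetical, and whether your CDN honors s-maxage or needs its own configuration is something to confirm with your vendor.

```python
# Hypothetical URL layout: adjust prefixes and lifetimes to your own site.
CACHEABLE_PREFIXES = {
    "/category/": 300,  # category overview pages, cached for 5 minutes
    "/product/": 60,    # product detail pages, cached for 1 minute
}
HOME_PAGE_TTL = 60


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    if path == "/":
        return f"public, s-maxage={HOME_PAGE_TTL}"
    for prefix, ttl in CACHEABLE_PREFIXES.items():
        if path.startswith(prefix):
            return f"public, s-maxage={ttl}"
    # Carts, checkout, and anything unrecognized must never be shared
    # between users, so shared caches are told not to store them.
    return "private, no-store"


if __name__ == "__main__":
    for path in ("/", "/category/shoes", "/product/123", "/checkout"):
        print(path, "->", cache_control_for(path))
```

Defaulting to "private, no-store" for unknown paths is a deliberate choice in this sketch: a page that is accidentally left uncached costs performance, while a personalized page that is accidentally cached can leak one user's data to another.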
Test, test, test. And then test again.
Testing can occur in three locations: the developer’s local environment, an integration environment, and production. Let’s explore each.
Locally, your developers should always be running unit tests that verify code’s functionality. Ideally, at least 80% of all code written should be covered by one or more unit tests. Successful unit testing should always be a prerequisite before code is checked in to source control. Developers should also be executing white box security scans, which are scans that look inside the source code for vulnerabilities.
Next, once code is checked in, there should be at least one environment where your developers’ code is tested together. Functional testing should be at the application layer (e.g. direct API calls) and through the various user interfaces (e.g. web, mobile, IoT, etc.). At the application layer, you’ll want to make sure that each component produces the correct output for a given input. For example, the REST API for pricing should return something like {"price": 19.95} when invoked. There should be hundreds of tests for each component, ensuring that every conceivable input will be gracefully tolerated. You’ll also want to test your application’s functionality through the various user interfaces using some form of synthetic testing. Synthetic testing simulates real end-user behavior and interacts with your application through a web browser, mobile device, etc. It’s more comprehensive than component testing, which tests each component in isolation.
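To make the pricing example concrete, here is a small Python sketch of component tests that cover both the expected output and badly formed input. The get_price function, the PRICES catalog, and the error payloads are hypothetical stand-ins for a real pricing component, which would normally be called over HTTP.

```python
import unittest

# Hypothetical catalog and handler standing in for the pricing component.
PRICES = {"SKU-1": 19.95}


def get_price(sku):
    """Return the pricing payload for a SKU, or an error payload for bad input."""
    if not isinstance(sku, str) or not sku.strip():
        return {"error": "invalid sku"}
    price = PRICES.get(sku.strip())
    if price is None:
        return {"error": "unknown sku"}
    return {"price": price}


class PricingTests(unittest.TestCase):
    def test_known_sku_returns_price(self):
        self.assertEqual(get_price("SKU-1"), {"price": 19.95})

    def test_unknown_sku_is_tolerated(self):
        self.assertEqual(get_price("NOPE"), {"error": "unknown sku"})

    def test_malformed_input_is_tolerated(self):
        # None, empty strings, and wrong types should all yield an error
        # payload instead of an unhandled exception.
        for bad in (None, "", "   ", 42, ["SKU-1"]):
            self.assertEqual(get_price(bad), {"error": "invalid sku"})
```

Saved to a file (for example test_pricing.py, a hypothetical name), the tests can be run with python -m unittest test_pricing.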
You’ll also want to test the performance of each component and various transactions, both without and with load. Find out how much load your application can take before performance degrades to an unacceptable level. Work with your CDN vendor to throttle traffic at the edge, before your application’s breaking point is hit.
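Once load test results are in hand, picking a throttle point is simple arithmetic. The Python sketch below assumes you have recorded a 95th percentile latency at each tested load level. The function names, the latency budget, and the 80% headroom factor are illustrative assumptions, and the sample numbers are made up.

```python
from typing import List, Optional, Tuple


def max_sustainable_load(
    results: List[Tuple[int, float]], latency_budget_ms: float
) -> Optional[int]:
    """Return the highest tested load that stays within the latency budget.

    `results` holds (requests_per_second, p95_latency_ms) pairs from load
    test runs. Scanning stops at the first level that breaks the budget, so
    a lucky fast run at a higher load is not counted. Returns None when even
    the lowest tested load is over budget.
    """
    best = None
    for rps, p95_ms in sorted(results):
        if p95_ms > latency_budget_ms:
            break
        best = rps
    return best


def throttle_limit(sustainable_rps: int, headroom: float = 0.8) -> int:
    """Set the edge throttle below the measured limit to leave a safety margin."""
    return int(sustainable_rps * headroom)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    measured = [(1000, 120.0), (2000, 150.0), (4000, 310.0), (8000, 2400.0)]
    limit = max_sustainable_load(measured, latency_budget_ms=500.0)
    if limit is not None:
        print("sustainable:", limit, "throttle at:", throttle_limit(limit))
```

With these sample numbers, the last load level within a 500 ms budget is 4,000 requests per second, so the sketch would suggest throttling at 3,200.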
Security should also be tested, but in this environment it should be black box testing rather than white box. Black box testing is from the outside, with no access to the source code. This type of testing acts like an outside hacker would and includes port scans, testing for cross-site scripting vulnerabilities, etc.
Testing in production is largely the same as in the integration environment, but you’ll want your synthetic testing to be from multiple endpoints around the world in order to more accurately test the user experience. A new generation of cloud-based load generators can quickly generate load from around the world, simulating a variety of different devices (various web browsers, various mobile devices, etc.).
Intelligently monitor health
The health of each instance of your application should be continually monitored in production. Health checks should be automated and as in-depth as possible. Define a single URL (e.g. /healthcheck/) for checking an application’s health. That endpoint should test common application functions, like retrieving a product and placing an order in a commerce application. Once these tests are performed, a simple message should be returned, like {"healthy": true}. If an instance’s endpoint doesn’t respond or responds with {"healthy": false} too many times, that instance should be automatically pulled from the load balancer and a new instance should be spawned. Everything should be automated, including the checking of health and re-provisioning of failed instances.
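The logic on both sides of the health check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production monitor: the function and class names are hypothetical, "too many times" is interpreted here as three consecutive failures, and actually removing an instance from a load balancer depends on your platform and is left out.

```python
import json
from typing import Callable, Dict, Optional


def health_payload(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run every check and return the JSON body for the health endpoint."""
    healthy = True
    for check in checks.values():
        try:
            if not check():
                healthy = False
        except Exception:
            # A check that raises is treated as a failure, not a crash.
            healthy = False
    return json.dumps({"healthy": healthy})


class InstanceMonitor:
    """Track consecutive failed health checks for one instance."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record(self, response_body: Optional[str]) -> bool:
        """Record one poll; pass None when the endpoint did not respond.

        Returns True when the instance should be pulled and replaced.
        """
        try:
            ok = (response_body is not None
                  and json.loads(response_body).get("healthy") is True)
        except (ValueError, TypeError, AttributeError):
            # Unparseable or unexpected bodies count as failures.
            ok = False
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.max_failures
```

Counting consecutive failures, and resetting the count on any healthy response, keeps a single slow response from triggering a needless replacement.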
Conclusion
Holiday readiness is both a technical and human problem. Start by automating as much as possible to remove the human factor and reinforce all manual work with appropriate change management. Then, focus on technology. Optimize your stack by caching at all possible layers. Then, test everything. Finally, ensure you are properly monitoring your entire stack.
A good first step is to get your teams on the same page, with a shared set of metrics for what success looks like. An example might be handling 25,000 HTTP requests per second with 99.9999% uptime through the month of November, which leaves room for only about 2.6 seconds of downtime across the month’s 30 days. Once the different teams are unified in achieving a shared measurable goal, you can then begin to implement the topics discussed earlier. Good luck!
This post is a collaboration between O’Reilly and HPE. See our statement of editorial independence.