The Hadoop tipping point
Real-time automation is the key to Hadoop performance at scale.
Tools like Hadoop are an integral part of many modern IT departments. Clickstream data analysis was one of the first analytical workloads on Hadoop, soon joined by sentiment data analysis—à la Facebook and Twitter—to identify influential users and target marketing campaigns more effectively. Another Hadoop workhorse (though perhaps not as sexy) is server log data analysis—used to track usage and identify vulnerabilities. These are just a few examples of the numerous possibilities for the Hadoop framework, and the demand for additional use cases and applications continues to increase.
Our big data culture pushes for more and more data and real-time insights. So, we create massive infrastructures to feed the hungry beast. We build-build-build, but find that maintaining (and guaranteeing) Hadoop performance remains an unsolved problem. When performance issues appear, we start by building more. Are you seeing slower job completions? Scale out! If that doesn’t work, tune! Do you have one critical workload that absolutely must complete on time? Isolate it on its own cluster! These techniques can alleviate some of the symptoms of poor performance, but they don’t cure the underlying issue. While YARN can verify that requested resources are available before a job begins, its architecture is simply not built to manage the performance of actively running jobs.
In addition, we overestimate the resources we need. It’s not just poor planning: estimating resources accurately is genuinely difficult because performance depends on so many variables, including:
- Number of cores
- Number of nodes
- Load changes (size, intensity)
- Interference (for shared clusters)
With infinite time, you could test job performance while changing each of these variables, one by one. And this would, in theory, work perfectly—until a software upgrade is needed or the hardware of the cluster changes. Then, the experiments must start all over again. Obviously, no one has the time or resources to undertake such a complex and time-intensive experiment.

So, the workaround is to overestimate resources and allow our resource managers to “traffic cop” jobs until the maximum requested resources are available. We’re provisioning to meet peak need, but our day-to-day cluster usage only averages 6-12%. And—here’s the kicker—we’re not actually preventing resource contention at all! Though YARN, Mesos, and virtualization hypervisors wait until the requested resources are available to start a job, they can’t actually prevent resource contention once the job has begun.

So, what is there to do? It’s tempting to go back to our traditional “solutions” (provision, tune, isolate), but our analytic needs have outgrown them. We want to centralize cluster resources, allow ad-hoc analysis amongst several users or groups, and run different workloads simultaneously on a single cluster. We’ve passed a tipping point where manual solutions fail, and only a real-time, programmatic solution will suffice.
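To make the gap between provisioned and used resources concrete, here’s a minimal back-of-the-envelope sketch. The job sizes and durations are hypothetical, purely for illustration:

```python
# Minimal sketch (hypothetical numbers): why provisioning for peak demand
# leaves average utilization low. Each job reserves its peak memory request
# for its whole runtime, but actually uses far less on average.

jobs = [
    # (reserved_gb, avg_used_gb, runtime_hours) -- illustrative values only
    (64, 6, 4),    # ETL job: reserves 64 GB, averages ~6 GB
    (128, 12, 8),  # ad-hoc analysis
    (32, 4, 2),    # log aggregation
]

reserved_gb_hours = sum(reserved * hours for reserved, _, hours in jobs)
used_gb_hours = sum(used * hours for _, used, hours in jobs)

print(f"Reserved: {reserved_gb_hours} GB-hours, used: {used_gb_hours} GB-hours")
print(f"Average utilization of reserved capacity: "
      f"{used_gb_hours / reserved_gb_hours:.0%}")
# With numbers like these, utilization lands around 10% -- squarely in the
# 6-12% range cited above.
```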
Reiss, Tumanov, et al. studied a Google cluster to characterize the needs of a large, modern Hadoop cluster. The workloads on today’s clusters are dynamic, with heterogeneous resource needs and resource usage patterns. They found that traditional slot- and core-based schedulers are ineffective, and proposed new characteristics for an adaptive resource scheduler—modern characteristics to meet modern demands. Specifically, they recommend a next-gen scheduler that can do all of the following (a toy sketch after the list shows what these behaviors look like together):
- Make rapid task scheduling decisions
- Revoke previous assignments
- Dynamically accommodate resource requests from running jobs
- Predict (not just allocate) resource usage
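As a rough illustration of how those four behaviors fit together, here’s a toy sketch. It is not the API of YARN, Mesos, or any scheduler mentioned in the report; every class, method, and number below is hypothetical:

```python
# Toy sketch of the scheduler behaviors Reiss, Tumanov, et al. describe:
# rapid placement, revocation, dynamic requests, and usage prediction.
# Everything here is illustrative, not a real scheduler implementation.

from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int
    requested_mem_gb: float       # what the job asked for up front
    observed_mem_gb: float = 0.0  # what it has actually been using


@dataclass
class Node:
    capacity_gb: float
    tasks: list = field(default_factory=list)

    def predicted_free_gb(self):
        # Predict usage from observed behavior instead of trusting the
        # original (usually inflated) requests; 0.5 is a naive placeholder.
        predicted = sum(max(t.observed_mem_gb, 0.5 * t.requested_mem_gb)
                        for t in self.tasks)
        return self.capacity_gb - predicted

    def schedule(self, task):
        # Rapid decision: place the task if predicted headroom allows it.
        if self.predicted_free_gb() >= task.requested_mem_gb:
            self.tasks.append(task)
            return True
        # Revocation: evict lower-priority tasks until the new task fits.
        victims = sorted((t for t in self.tasks if t.priority < task.priority),
                         key=lambda t: t.priority)
        for victim in victims:
            self.tasks.remove(victim)
            if self.predicted_free_gb() >= task.requested_mem_gb:
                self.tasks.append(task)
                return True
        self.tasks.extend(victims)  # still no room: restore and refuse
        return False

    def grow(self, task, extra_gb):
        # Dynamically accommodate a resource request from a running job.
        if task in self.tasks and self.predicted_free_gb() >= extra_gb:
            task.requested_mem_gb += extra_gb
            return True
        return False


node = Node(capacity_gb=64)
node.schedule(Task("etl", priority=1, requested_mem_gb=40, observed_mem_gb=8))
node.schedule(Task("training", priority=5, requested_mem_gb=30, observed_mem_gb=12))
# Prediction let both tasks fit even though their requests total 70 GB.
node.schedule(Task("critical-report", priority=9, requested_mem_gb=45))
# The low-priority ETL task was revoked to make room for the critical job.
print([t.name for t in node.tasks])  # ['training', 'critical-report']
```

The key departure from slot-based scheduling is in `predicted_free_gb`: placement decisions are driven by what tasks are observed to use, not by what they asked for, and low-priority work can be revoked when something more important arrives.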
Their recommendations, published in 2012, did not fall on deaf ears. A variety of new tools are challenging the traditional max-resource-allocation architecture in response to the needs for real-time resource allocation and resource prediction. The new free report “The Hadoop Performance Myth” reviews the current state of Hadoop performance and how emerging tools are responding to industry needs.
For example, what if resources assigned to low-priority jobs could be revoked and re-allocated while the job is running? What if, instead of allocating maximum resources to jobs, minimum requested resources were assigned? Some of these new tools were born in academia, like Quasar, while others are already in the commercial space with experimental support, like Mesosphere’s “oversubscription” model (introduced in Mesos 0.23.0). There are also new commercial tools that offer a quicker path than upgrading YARN or Mesos versions—Pepperdata, for example, doesn’t alter the resource scheduler itself; it runs an agent on each node that updates the scheduler with available and utilized resources in real time (at 3-5 second intervals). It also works with the resource scheduler to re-assign unused resources while jobs are in progress. I wouldn’t be surprised if many of the major Hadoop distributions released adjustments to their resource schedulers in the coming months and years.
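That node-agent pattern is worth sketching, because it’s simple: a small daemon on each worker samples what the node is actually using and reports it back every few seconds, giving the scheduler ground truth instead of stale reservations. The sketch below shows the general idea only; the endpoint, payload format, and psutil-based sampling are assumptions for illustration, not Pepperdata’s actual agent:

```python
# Sketch of the general node-agent pattern: sample actual resource usage
# every few seconds and feed it back to a central scheduler so unused
# reservations can be reassigned. Hypothetical endpoint and payload.

import json
import socket
import time
import urllib.request

import psutil  # third-party: pip install psutil

SCHEDULER_ENDPOINT = "http://scheduler.example.internal:8080/node-report"
REPORT_INTERVAL_SECONDS = 4  # the post cites 3-5 second intervals


def collect_node_metrics():
    """Snapshot of what this node is actually using right now."""
    mem = psutil.virtual_memory()
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_used_gb": round(mem.used / 1e9, 2),
        "mem_available_gb": round(mem.available / 1e9, 2),
    }


def report(metrics):
    """Push the snapshot to the scheduler-side service."""
    request = urllib.request.Request(
        SCHEDULER_ENDPOINT,
        data=json.dumps(metrics).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=2)


if __name__ == "__main__":
    while True:
        report(collect_node_metrics())
        time.sleep(REPORT_INTERVAL_SECONDS)
```

In practice, the scheduler-side service would fold these reports into its view of each node, shrinking or reclaiming reservations that running jobs aren’t actually using.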
It will be exciting to see the impact emerging tools and re-imagined resource schedulers will have on cluster utilization. What if we could increase average cluster utilization from 6-12% to, say, 12-24%? Could a combination of these emerging technologies, plus efficiency measures like co-locating workloads, push us even farther? To, dare I say, unlock the 50% utilization range for companies other than Google and Twitter? There’s no single solution or silver bullet, but the possibilities are incredible when we intelligently combine techniques.
Download “The Hadoop Performance Myth” to learn more about the challenges of improving Hadoop performance and potential solutions that can help when traditional best practices fall short.
This post is a collaboration between O’Reilly and Pepperdata. See our statement of editorial independence.