Where should you manage a cloud-based Hadoop cluster?
Comparing AWS, GCP, and Azure for large-scale analytics.
It’s no secret that Hadoop and public cloud play very nicely with each other. Rather than provisioning and maintaining a fixed set of servers and expensive networking equipment in-house, you can spin up Hadoop clusters in the cloud as a managed service, paying only for what you use, only when you use it.
The scalability and per-workload customizability of public cloud are also unmatched. Rather than having one predefined set of servers (with a set amount of RAM, CPU, and network capability) in-house, public cloud offers the ability to stand up workload-specific clusters with varying amounts of those resources, tailored to each workload. The access to “infinite” amounts of hardware that public cloud offers is also a natural fit for Hadoop: running 100 nodes for 10 hours has the same cost and complexity as running 1,000 nodes for one hour.
But among cloud providers, the similarities largely end there. Although Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) each have their own offerings for both managed and VM-based clusters, there are many differentiators that may drive you to one provider over another.
High-level differentiators
When comparing the “Big 3” providers in the context of Hadoop operations, several important factors come into play. The high-level ones are:
- Network isolation: This refers to the ability to create “private” networks and control routing, IP address spaces, subnetting, and additional security. Here, AWS, Azure, and GCP provide roughly equivalent offerings: Amazon VPC, Azure Virtual Network, and Google subnetworks, respectively.
- Type and number of underlying VMs: For workload customizability, the more VM types, the better. Although all providers have “general,” “high CPU,” and “high RAM” instance types, AWS takes this further with “high storage,” GPU, and “high I/O” instance types. AWS also has the largest raw number of instance types (currently 55), while GCP and Azure each offer only 18.
- Cost granularity: For short-term workloads (those completing in just a few hours), costs can vary greatly. Azure offers the most granular model (per-minute billing), GCP the next best (a 10-minute minimum charge, then per-minute billing), and AWS the least flexible (usage billed in full-hour increments, with partial hours rounded up); the sketch after this list works through an example.
- Cost flexibility: How you pay for your compute nodes makes an even bigger difference in cost. AWS wins here with multiple models, such as Spot Instances and Reserved Instances, which can save up to 90% relative to the “on-demand” pricing that all three providers support. Azure and GCP both offer cost-saving mechanisms as well: Azure uses reservations (but only up to 12 months), while GCP applies “sustained-use discounts” automatically to heavily utilized instances. AWS's reservations can extend to three years, and therefore offer deeper discounts.
- Hadoop support: Each provider offers a managed, hosted version of Hadoop. AWS's is called Elastic MapReduce (EMR), Azure's is HDInsight, and GCP's is Cloud Dataproc. EMR and Dataproc both use core Apache Hadoop (EMR also supports MapR distributions), while Azure uses Hortonworks. Outside of the managed products, each provider also lets you build Hadoop clusters on raw instance capacity, trading the convenience of the managed service for much more customizability, including the ability to choose alternate distributions such as Cloudera.
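To make the cost-granularity differences concrete, here is a minimal sketch of how a short-lived job would be billed under each model described above. The hourly rate is hypothetical, and the billing rules are simplified to exactly what the list outlines (per-minute for Azure, a 10-minute minimum then per-minute for GCP, full-hour increments for AWS); real invoices depend on instance type, region, and current pricing.

```python
import math

HOURLY_RATE = 0.50  # hypothetical on-demand price per node-hour, in USD

def azure_cost(minutes):
    # Azure: per-minute billing granularity.
    return minutes * HOURLY_RATE / 60

def gcp_cost(minutes):
    # GCP: a 10-minute minimum charge, then per-minute billing.
    return max(minutes, 10) * HOURLY_RATE / 60

def aws_cost(minutes):
    # AWS: usage rounded up to the next full hour.
    return math.ceil(minutes / 60) * HOURLY_RATE

# A 70-minute, 100-node job under each model:
nodes, minutes = 100, 70
for name, cost in [("Azure", azure_cost), ("GCP", gcp_cost), ("AWS", aws_cost)]:
    print(f"{name}: ${nodes * cost(minutes):,.2f}")
# Azure: $58.33 / GCP: $58.33 / AWS: $100.00
```

For a job this short, the rounding model alone accounts for a roughly 70% cost difference; the longer the job runs, the less the granularity matters.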
Cloud ecosystem integration
In addition to the high-level differentiators, one of public cloud's biggest impacts on Hadoop operations is integration with other cloud-based services, such as object stores and archival systems. Each provider is roughly equivalent with regard to integration with and support of:
- Object storage and data archival: Each provider has near parity here in both cost and functionality, with their respective object stores (S3 for AWS, Blob Storage for Azure, and Google Cloud Storage for GCP) capable of acting as a data sink or source; see the sketch after this list.
- NoSQL integrations: Each provider has different, but comparable, managed NoSQL offerings (DynamoDB for AWS, DocumentDB and managed MongoDB for Azure, and Bigtable and BigQuery for GCP), which again can act as data sinks or sources for Hadoop.
- Dedicated point-to-point fiber interconnects: Each provider offers comparable capability for stretching dedicated, secured fiber connections between on-premises data centers and its cloud: AWS Direct Connect, Azure ExpressRoute, and Google Cloud Interconnect.
- High-speed networking: AWS and Azure each offer the ability to launch clusters on physically grouped hardware (ideally all machines in the same rack), letting often bandwidth-hungry Hadoop clusters take advantage of 10Gbps network interconnects. AWS offers Placement Groups, and Azure offers Affinity Groups. Dataproc offers no such capability, but GCP's network is widely regarded as the most performant of the three.
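As a concrete illustration of the object stores acting as data sinks and sources, here is a minimal PySpark sketch that reads from, and writes back to, each provider's store. The bucket, container, and account names are hypothetical, and it assumes the relevant connector (s3a for S3, the GCS connector for Cloud Storage, wasb for Blob Storage) and credentials are already configured on the cluster; the managed services typically ship with their own provider's connector preinstalled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-demo").getOrCreate()

# Hypothetical locations; each URI scheme maps to one provider's object store.
paths = [
    "s3a://my-bucket/logs/",                         # AWS S3
    "gs://my-bucket/logs/",                          # Google Cloud Storage
    "wasb://logs@myaccount.blob.core.windows.net/",  # Azure Blob Storage
]

# Read raw log lines straight from object storage (here, S3) as a source...
df = spark.read.text(paths[0])
print(df.count())

# ...and write results back to object storage as a sink, the same way.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```

Because all three stores expose the same Hadoop filesystem interface, swapping providers is usually just a matter of changing the URI scheme and credentials, not the job code.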
Big data is more than just Hadoop
Although the immediate Hadoop-related ecosystems discussed above have few differentiators, access to each provider's other services and features can give Hadoop administrators many more tools, either to run analytics elsewhere (off the physical cluster) or to make Hadoop operations easier.
AWS really shines here, with a richer service offering than either of the other two. Some big services that come into play for larger systems are Kinesis (near-real-time analytics and stream ingestion), Lambda (event-driven analytics architectures), AWS Import/Export Snowball (secure, large-scale data import/export), and AWS IoT (ingestion and processing of IoT device data). All of these are either completely absent from Azure and GCP or much less mature and feature-rich there.
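To give a taste of the stream ingestion Kinesis enables, here is a minimal boto3 sketch that pushes a record into a stream. The stream name, region, and record fields are hypothetical; it assumes the stream already exists and AWS credentials are configured. A downstream consumer, such as an EMR job, would then read and analyze these records.

```python
import json
import boto3

# Hypothetical region; assumes AWS credentials are already configured.
kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"device_id": "sensor-42", "temp_c": 21.7}
response = kinesis.put_record(
    StreamName="events",                      # hypothetical; must already exist
    Data=json.dumps(record).encode("utf-8"),  # payload bytes
    PartitionKey=record["device_id"],         # controls shard routing
)
print(response["SequenceNumber"])
```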
Key takeaways
While one could argue for any number of additions or edits to the comparisons above, they make a good checklist to use when deciding where to launch a managed or unmanaged cloud-based Hadoop cluster. One of the great things about using Hadoop in the cloud is that the experience is nearly identical regardless of distribution or cloud provider. Each of the big three has a mature Hadoop offering, so whichever partner you choose, you can bet that your cluster will work well, offer cost-saving options and strong security features, and provide all the flexibility of public cloud.
This post is a collaboration between O’Reilly and Pepperdata. See our statement of editorial independence.