Dataproc Cookbook

Book description

Get up to speed with Dataproc, Google Cloud's fully managed and highly scalable service for running open source big data tools and frameworks, including Hadoop, Spark, Flink, and Presto. This cookbook shows data engineers, data scientists, data analysts, and cloud architects how to use Dataproc, integrated with the rest of Google Cloud, for data lake modernization, ETL, and secure data science at a fraction of the cost of comparable on-premises deployments.

Narasimha Sadineni of Google and former Googler Anu Venkataraman show you how to set up and run Hadoop and Spark jobs on Dataproc. You'll learn how to create Dataproc clusters and run data engineering and data science workloads on long-running clusters, on ephemeral clusters, and on Dataproc Serverless. In the process, you'll gain an understanding of Dataproc itself, along with orchestration, logging and monitoring, the Spark History Server, and migration patterns.

This cookbook includes hands-on examples for cluster configuration, logging, and security, and for migrating from on-premises Hadoop to Dataproc. You'll learn how to do the following (a minimal cluster-creation sketch follows the list):

  • Create Dataproc clusters on Compute Engine and Kubernetes Engine
  • Run data science workloads on Dataproc
  • Execute Spark jobs on Dataproc Serverless
  • Optimize Dataproc clusters to be cost effective and performant
  • Monitor Spark jobs in various ways
  • Orchestrate various workloads and activities
  • Use different methods for migrating data and workloads from existing Hadoop clusters to Dataproc
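
To give a flavor of the recipes, here is a minimal sketch of the first task above: creating a cluster with the google-cloud-dataproc Python client. The project ID, region, cluster name, and machine types are placeholder values, not the book's exact example:

    from google.cloud import dataproc_v1

    project_id = "my-project"  # placeholder: your Google Cloud project
    region = "us-central1"     # placeholder: the cluster's region

    # The client must target the regional Dataproc endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "example-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
        },
    }

    # create_cluster returns a long-running operation; result() blocks
    # until the cluster is up and running.
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")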

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Creating a Dataproc Cluster
    1. 1.1. Installing the Google Cloud CLI
    2. 1.2. Granting IAM Privileges to a User
    3. 1.3. Configuring a Network and Firewall Rules
    4. 1.4. Creating a Dataproc Cluster from the UI
    5. 1.5. Creating a Dataproc Cluster Using gcloud
    6. 1.6. Creating a Dataproc Cluster Using API Endpoints
    7. 1.7. Creating a Dataproc Cluster Using Terraform
    8. 1.8. Creating a Cluster Using Python
    9. 1.9. Duplicating a Dataproc Cluster
  3. 2. Running Hive/Spark/Sqoop Workloads
    1. 2.1. Adding Required Privileges for Jobs
    2. 2.2. Generating 1 TB of Data Using a MapReduce Job
    3. 2.3. Running a Hive Job to Show Records from an Employee Table
    4. 2.4. Converting XML Data to Parquet Using Scala Spark on Dataproc
    5. 2.5. Converting XML Data to Parquet Using PySpark on Dataproc
    6. 2.6. Submitting a SparkR Job
    7. 2.7. Migrating Data from Cloud SQL to Hive Using a Sqoop Job
    8. 2.8. Choosing Deploy Modes When Submitting a Spark Job to Dataproc
  4. 3. Advanced Dataproc Cluster Configuration
    1. 3.1. Creating an Auto Scaling Policy
    2. 3.2. Attaching an Auto Scaling Policy to a Dataproc Cluster
    3. 3.3. Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Auto Scaling Policy
    4. 3.4. Adding Local SSDs to Dataproc Worker Nodes
    5. 3.5. Creating a Cluster with a Custom Image
    6. 3.6. Building a Cluster with Custom Machine Types
    7. 3.7. Bootstrapping Dataproc Clusters with Initialization Scripts
    8. 3.8. Scheduling Automatic Deletion of Unused Clusters
    9. 3.9. Overriding Hadoop Configurations
  5. 4. Serverless Spark and Ephemeral Dataproc Clusters
    1. 4.1. Running on Dataproc: Serverless vs. Ephemeral Clusters
    2. 4.2. Running a Sequence of Jobs on an Ephemeral Cluster
    3. 4.3. Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless
    4. 4.4. Running a Serverless Job Using Premium Tier Configuration
    5. 4.5. Giving a Unique Custom Name to a Dataproc Serverless Spark Job
    6. 4.6. Cloning a Dataproc Serverless Spark Job
    7. 4.7. Running a Serverless Job with the Spark RAPIDS Accelerator
    8. 4.8. Configuring a Spark History Server
    9. 4.9. Writing Spark Events to the Spark History Server from Dataproc Serverless
    10. 4.10. Monitoring Serverless Spark Jobs
    11. 4.11. Calculating the Price of a Serverless Batch
  6. 5. Dataproc Metastore
    1. 5.1. Creating a Dataproc Metastore Service Instance
    2. 5.2. Attaching a DPMS Instance to One or More Clusters
    3. 5.3. Creating Tables and Verifying Metadata in DPMS
    4. 5.4. Installing an Open Source Hive Metastore
    5. 5.5. Attaching an External Apache Hive Metastore to the Cluster
    6. 5.6. Searching for Metadata in a Dataplex Data Catalog
    7. 5.7. Automating the Backup of a DPMS Instance
  7. 6. Dataproc Security
    1. 6.1. Managing Identities in Dataproc Clusters
    2. 6.2. Securing Your Perimeter Using VPC Service Controls
    3. 6.3. Authenticating Using Kerberos
    4. 6.4. Installing Ranger
    5. 6.5. Securing Cluster Resources Using Ranger
    6. 6.6. Managing Credentials in the Google Cloud Environment
    7. 6.7. Enforcing Restriction Across All Clusters
    8. 6.8. Tokenizing Sensitive Data
  8. About the Authors

Product information

  • Title: Dataproc Cookbook
  • Author(s): Narasimha Sadineni, Anuyogam Venkataraman
  • Release date: June 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098157708