Data Exploration and Preparation with BigQuery

Book description

Leverage BigQuery to understand and prepare your data to ensure that it's accurate, reliable, and ready for analysis and modeling

Key Features

  • Use mock datasets to explore data with the BigQuery web UI, bq CLI, and BigQuery API in the Cloud console
  • Master optimization techniques for storage and query performance in BigQuery
  • Engage with case studies on data exploration and preparation for advertising, transportation, and customer support data
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Data professionals encounter a multitude of challenges such as handling large volumes of data, dealing with data silos, and the lack of appropriate tools. Datasets often arrive in different conditions and formats, demanding considerable time from analysts, engineers, and scientists to process and uncover insights. The complexity of the data life cycle often hinders teams and organizations from extracting the desired value from their data assets. Data Exploration and Preparation with BigQuery offers a holistic solution to these challenges.

The book begins with the basics of BigQuery while covering the fundamentals of data exploration and preparation. It then progresses to demonstrate how to use BigQuery for these tasks and explores the array of big data tools at your disposal within the Google Cloud ecosystem.

The book doesn’t merely offer theoretical insights; it’s a hands-on companion that walks you through properly structuring your tables for query efficiency and ensures adherence to data preparation best practices. You’ll also learn when to use Dataflow, BigQuery, and Dataprep for ETL and ELT workflows. The book will skillfully guide you through various case studies, demonstrating how BigQuery can be used to solve real-world data problems.

By the end of this book, you’ll have mastered the use of SQL to explore and prepare datasets in BigQuery, unlocking deeper insights from data.

What you will learn

  • Assess the quality of a dataset and learn best practices for data cleansing
  • Prepare data for analysis, visualization, and machine learning
  • Explore approaches to data visualization in BigQuery
  • Apply acquired knowledge to real-life scenarios and design patterns
  • Set up and organize BigQuery resources
  • Use SQL and other tools to navigate datasets
  • Implement best practices to query BigQuery datasets
  • Gain proficiency in using data preparation tools, techniques, and strategies

Who this book is for

This book is for data analysts seeking to enhance their data exploration and preparation skills using BigQuery. It guides anyone using BigQuery as a data warehouse to extract business insights from large datasets. A basic understanding of SQL, reporting, data modeling, and transformations will assist with understanding the topics covered in this book.

Table of contents

  1. Data Exploration and Preparation with BigQuery
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Introduction to BigQuery
  7. Chapter 1: Introducing BigQuery and Its Components
    1. Technical requirements
    2. What is BigQuery?
    3. Understanding how BigQuery works
      1. Dremel, the execution engine
      2. Colossus distributed storage
      3. The Borg compute platform
      4. Jupiter network infrastructure
    4. BigQuery administration and access
      1. Tools for administration
      2. Understanding identity and access management
    5. BigQuery best practices and cost management
      1. Best practices
      2. Understanding and controlling costs
    6. Extending your data
      1. BigQuery ML
      2. External datasets
      3. External connections
    7. Summary
    8. References
  8. Chapter 2: BigQuery Organization and Design
    1. Technical requirements
    2. Understanding BigQuery’s resource hierarchy
      1. Organizations, folders, and projects
      2. BigQuery-specific resources
      3. BigQuery storage
    3. Exploring architecture patterns
      1. The centralized enterprise data warehouse
      2. The decentralized data warehouse
      3. The cross-org data exchange
      4. Schema design
      5. Table design
    4. Summary
  9. Part 2: Data Exploration with BigQuery
  10. Chapter 3: Exploring Data in BigQuery
    1. Technical requirements
    2. What is data exploration?
      1. Fundamentals
      2. Data life cycle
      3. Common challenges and solutions
    3. Introduction to exploring data in BigQuery
      1. Exploring data in the BigQuery SQL workspace
      2. Exploring schema and table structure
      3. Exploring data using SQL
      4. Exploring data using the bq command-line interface
      5. Exploring data with visualization tools
    4. Enhancing data exploration in BigQuery
      1. Advanced approaches
      2. Best practices
    5. Summary
  11. Chapter 4: Loading and Transforming Data
    1. Technical requirements
    2. Exploring data loading techniques
      1. Batch loading data
      2. Streaming ingestion of data
      3. Scheduled loading of data
      4. Situations where you do not need to load data
    3. Data transformation with BigQuery
    4. Evaluating ETL and ELT approaches for data integration
    5. Hands-on exercise – data loading and transformation in BigQuery
      1. Understanding the scenario
      2. Loading data from a local file
      3. Preparing and transforming data
    6. Summary
  12. Chapter 5: Querying BigQuery Data
    1. Technical requirements
    2. Understanding query structure
      1. Action command – the SELECT clause
      2. Location command – the FROM clause
      3. Filtering command – the WHERE clause
      4. Selection handling commands – the GROUP BY, HAVING, ORDER BY, and LIMIT clauses
    3. Understanding data types
    4. Using expressions and aggregations
      1. Expressions
      2. Aggregations
    5. Joining tables
      1. Inner joins
      2. Outer joins
    6. Using functions
    7. Advanced querying techniques
      1. Subqueries
      2. Window functions
      3. Common table expressions
      4. Array functions
    8. Saving, sharing, and scheduling queries
    9. Optimizing queries
    10. Troubleshooting queries
    11. Summary
    12. Further reading
  13. Chapter 6: Exploring Data with Notebooks
    1. Technical requirements
    2. Understanding the value of using notebooks
      1. Jupyter notebooks
    3. Using Workbench notebook instances in Vertex AI
      1. Creating a managed notebook
      2. Executions and schedules
      3. Hands-on exercise – analyzing Google Trends data with Workbench
    4. Using Colab notebooks
      1. Comparing Workbench instances and Colab
    5. Summary
    6. Further reading
  14. Chapter 7: Further Exploring and Visualizing Data
    1. Technical requirements
    2. Understanding data distributions
      1. Why is it important to understand data distributions?
    3. Uncovering relationships in data
    4. Exploring BigQuery data with Google Sheets
      1. Connecting to Sheets from BigQuery using Explore with Sheets
      2. Connecting to BigQuery using Connected Sheets
      3. Column statistics
      4. Collaboration with BigQuery data in Sheets
    5. Visualizing BigQuery data using Looker Studio
      1. Creating the right visualizations
    6. Hands-on exercise – creating visualizations with Looker Studio
      1. Commonly created charts
      2. Calculations in visualization tools
      3. Data quality discovery while creating visualizations
      4. Filtering data in visualizations
    7. Integrating other visualization tools with BigQuery
    8. Summary
    9. Further reading
  15. Part 3: Data Preparation with BigQuery
  16. Chapter 8: An Overview of Data Preparation Tools
    1. Technical requirements
    2. Getting started with data preparation
      1. Clearly defining your data preparation goals
      2. Evaluating your current data quality
      3. Data cleansing and transformation
      4. Validating prepared data
    3. Data preparation approaches
    4. Data preparation tools
      1. Visual data preparation tools
      2. Query and code-based tools
      3. Automated data preparation
    5. Summary
    6. Further reading
  17. Chapter 9: Cleansing and Transforming Data
    1. Technical requirements
    2. Using ELT for cleansing and transforming data
    3. Assessing dataset integrity
      1. The shape of the dataset
      2. Skew of the dataset
      3. Data profiling
      4. Data validation
      5. Data visualization
    4. Using SQL for data cleansing and transformation
      1. SQL data cleansing strategies and examples
      2. SQL data transformation strategies and examples
      3. Writing query results
    5. Using Cloud Dataprep for visual cleansing and transformation
    6. Summary
    7. Further reading
  18. Chapter 10: Best Practices for Data Preparation, Optimization, and Cost Control
    1. Technical requirements
    2. Data preparation best practices
      1. Understanding your data and business requirements
      2. Denormalizing your data
      3. Optimizing schema design
      4. Considering nested and repeated fields
      5. Using correct data types
      6. Data cleansing and validation
      7. Partitioning and clustering
      8. Optimizing data loading
    3. Best practices for optimizing storage
      1. Long-term and compressed storage
      2. Cross-cloud data analytics with federated access model and BigQuery Omni
      3. Backup and recovery
    4. Best practices for optimizing compute
      1. Analysis cost options
      2. Query optimization
      3. Query optimization cheat sheet
    5. Monitoring and controlling costs
      1. Query plan and query performance insights
      2. Monitoring, estimating, and optimizing costs
      3. Controlling costs
    6. Summary
    7. Further reading
  19. Part 4: Hands-On and Conclusion
  20. Chapter 11: Hands-On Exercise – Analyzing Advertising Data
    1. Technical requirements
    2. Exercise and use case overview
    3. Loading CSV data files from local upload
    4. Data preparation
      1. Standardizing date formats
    5. Data exploration, analysis, and visualization
      1. Analyzing ads and sales data
      2. Return on ad spend
      3. Visualizations
    6. Summary
    7. References
  21. Chapter 12: Hands-On Exercise – Analyzing Transportation Data
    1. Technical requirements
    2. Exercise and use case overview
    3. Loading data from GCS to BigQuery
      1. Uploading data files to Google Cloud Storage
      2. Loading data into BigQuery
    4. Data preparation
    5. Data exploration and analysis
      1. Visualizing data with BigQuery geography functions
    6. Summary
    7. Further reading
  22. Chapter 13: Hands-On Exercise – Analyzing Customer Support Data
    1. Technical requirements
    2. Exercise and use case overview
    3. Data loading from CSV upload
    4. Data preparation
    5. Data exploration and analysis
      1. Count of ticket_type across both datasets
      2. The most common support issues using ticket_subject data
      3. Average resolution time per ticket_type
      4. Customer demographics using customer_age and customer_gender
    6. Analyzing emotions with sentiment analysis
      1. Creating a connection
      2. Granting access to the external connection service account
      3. Creating a model
      4. Querying the model
    7. Summary
    8. References and further reading
  23. Chapter 14: Summary and Future Directions
    1. Summary of key points
      1. Chapter 1, Introducing BigQuery and Its Components
      2. Chapter 2, BigQuery Organization and Design
      3. Chapter 3, Exploring Data in BigQuery
      4. Chapter 4, Loading and Transforming Data
      5. Chapter 5, Querying BigQuery Data
      6. Chapter 6, Exploring Data with Notebooks
      7. Chapter 7, Further Exploring and Visualizing Data
      8. Chapter 8, An Overview of Data Preparation Tools
      9. Chapter 9, Cleansing and Transforming Data
      10. Chapter 10, Best Practices for Data Preparation, Optimization, and Cost Control
      11. Chapter 11, Hands-On Exercise – Analyzing Advertising Data
      12. Chapter 12, Hands-On Exercise – Analyzing Transportation Data
      13. Chapter 13, Hands-On Exercise – Analyzing Customer Support Data
    2. Future directions
      1. More integration with AI and ML
      2. Generative AI
      3. Natural language queries
      4. DataOps
      5. Hybrid and multi-cloud data analysis
      6. Zero-ETL and real-time analytics
      7. Data governance and privacy
      8. Federated learning
      9. Data clean rooms
      10. Data monetization
    3. Additional resources
    4. Final words
  24. Index
    1. Why subscribe?
  25. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Data Exploration and Preparation with BigQuery
  • Author(s): Mike Kahn
  • Release date: November 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781805125266