Modern Data Architectures with Python

Book description

Build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka

Key Features

  • Develop modern data skills used in emerging technologies
  • Learn pragmatic design methodologies such as Data Mesh and data lakehouses
  • Gain a deeper understanding of data governance
  • Purchase of the print or Kindle book includes a free PDF eBook

Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science workstreams into your open data platforms. You’ll learn how to take your data and create open lakehouses that work with any technology, using tried-and-true techniques including the medallion architecture and Delta Lake.

Starting with the fundamentals, this book will help you build pipelines on Databricks, an open data platform, using SQL and Python. You’ll gain an understanding of notebooks and applications written in Python using standard software engineering tools such as Git, pre-commit, Jenkins, and GitHub. Next, you’ll delve into streaming and batch-based data processing using Apache Spark and Confluent Kafka. As you advance, you’ll learn how to deploy your resources using infrastructure as code and how to automate your workflows and code development. Since the ability to handle and work with AI and ML is a vital component of any data platform, you’ll also explore the basics of ML and how to work with modern MLOps tooling. Finally, you’ll get hands-on experience with Apache Spark, one of the key data technologies in today’s market.

By the end of this book, you’ll have amassed a wealth of practical and theoretical knowledge to build, manage, orchestrate, and architect your data ecosystems.

What you will learn

  • Understand data patterns, including the Delta architecture
  • Discover how to increase performance with Spark internals
  • Find out how to design critical data diagrams
  • Explore MLOps with tools such as AutoML and MLflow
  • Get to grips with building data products in a data mesh
  • Discover data governance and build confidence in your data
  • Introduce data visualizations and dashboards into your data practice

Who this book is for

This book is for developers, analytics engineers, and managers looking to further develop a data ecosystem within their organization. While not prerequisites, basic knowledge of Python and prior experience with data will help you follow along with the examples.

Table of contents

  1. Modern Data Architectures with Python
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Fundamental Data Knowledge
  7. Chapter 1: Modern Data Processing Architecture
    1. Technical requirements
    2. Databases, data warehouses, and data lakes
      1. OLTP
      2. OLAP
      3. Data lakes
      4. Event stores
      5. File formats
      6. Data platform architecture at a high level
    3. Comparing the Lambda and Kappa architectures
      1. Lambda architecture
      2. Kappa architecture
    4. Lakehouse and Delta architectures
      1. Lakehouses
      2. The seven central tenets
      3. The medallion data pattern and the Delta architecture
      4. Data mesh theory and practice
      5. Defining terms
      6. The four principles of data mesh
    5. Summary
    6. Practical lab
      1. Solution
  8. Chapter 2: Understanding Data Analytics
    1. Technical requirements
    2. Setting up your environment
      1. Python
      2. venv
      3. Graphviz
      4. Workflow initialization
    3. Cleaning and preparing your data
      1. Duplicate values
      2. Working with nulls
      3. Using RegEx
      4. Outlier identification
      5. Casting columns
      6. Fixing column names
      7. Complex data types
    4. Data documentation
      1. diagrams
      2. Data lineage graphs
    5. Data modeling patterns
      1. Relational
    6. Dimensional modeling
      1. Key terms
    7. OBT
    8. Practical lab
      1. Loading the problem data
      2. Solution
    9. Summary
  9. Part 2: Data Engineering Toolset
  10. Chapter 3: Apache Spark Deep Dive
    1. Technical requirements
    2. Setting up your environment
      1. Python, AWS, and Databricks
      2. Databricks CLI
    3. Cloud data storage
      1. Object storage
      2. Relational
      3. NoSQL
    4. Spark architecture
      1. Introduction to Apache Spark
      2. Key components
      3. Working with partitions
      4. Shuffling partitions
      5. Caching
      6. Broadcasting
      7. Job creation pipeline
    5. Delta Lake
      1. Transaction log
      2. Grouping tables with databases
      3. Table
    6. Adding speed with Z-ordering
      1. Bloom filters
    7. Practical lab
      1. Problem 1
      2. Problem 2
      3. Problem 3
    8. Solution
    9. Summary
  11. Chapter 4: Batch and Stream Data Processing Using PySpark
    1. Technical requirements
    2. Setting up your environment
      1. Python, AWS, and Databricks
      2. Databricks CLI
    3. Batch processing
      1. Partitioning
      2. Data skew
      3. Reading data
    4. Spark schemas
      1. Making decisions
      2. Removing unwanted columns
      3. Working with data in groups
    5. The UDF
    6. Stream processing
      1. Reading from disk
      2. Debugging
      3. Writing to disk
      4. Batch stream hybrid
      5. Delta streaming
      6. Batch processing in a stream
    7. Practical lab
      1. Setup
      2. Creating fake data
      3. Problem 1
      4. Problem 2
      5. Problem 3
    8. Solution
      1. Solution 1
      2. Solution 2
      3. Solution 3
    9. Summary
  12. Chapter 5: Streaming Data with Kafka
    1. Technical requirements
    2. Setting up your environment
      1. Python, AWS, and Databricks
      2. Databricks CLI
    3. Confluent Kafka
      1. Signing up
    4. Kafka architecture
      1. Topics
      2. Partitions
      3. Brokers
      4. Producers
      5. Consumers
    5. Schema Registry
    6. Kafka Connect
    7. Spark and Kafka
    8. Practical lab
    9. Solution
    10. Summary
  13. Part 3: Modernizing the Data Platform
  14. Chapter 6: MLOps
    1. Technical requirements
    2. Setting up your environment
      1. Python, AWS, and Databricks
      2. Databricks CLI
    3. Introduction to machine learning
      1. Understanding data
      2. The basics of feature engineering
      3. Splitting up your data
      4. Fitting your data
      5. Cross-validation
    4. Understanding hyperparameters and parameters
      1. Training our model
      2. Working together
    5. AutoML
    6. MLflow
      1. MLOps benefits
    7. Feature stores
      1. Hyperopt
    8. Practical lab
      1. Create an MLflow project
    9. Summary
  15. Chapter 7: Data and Information Visualization
    1. Technical requirements
      1. Setting up your environment
    2. Principles of data visualization
      1. Understanding your user
      2. Validating your data
    3. Data visualization using notebooks
      1. Line charts
      2. Bar charts
      3. Histograms
      4. Scatter plots
      5. Pie charts
      6. Bubble charts
      7. A single line chart
      8. A multiple line chart
      9. A bar chart
      10. A scatter plot
      11. A histogram
      12. A bubble chart
      13. GUI data visualizations
    4. Tips and tricks with Databricks notebooks
      1. Magic
      2. Markdown
      3. Other languages
      4. Terminal
      5. Filesystem
      6. Running other notebooks
      7. Widgets
    5. Databricks SQL analytics
      1. Accessing SQL analytics
      2. SQL Warehouses
      3. SQL editors
      4. Queries
      5. Dashboards
      6. Alerts
      7. Query history
    6. Connecting BI tools
    7. Practical lab
      1. Loading problem data
      2. Problem 1
      3. Solution
      4. Problem 2
      5. Solution
    8. Summary
  16. Chapter 8: Integrating Continuous Integration into Your Workflow
    1. Technical requirements
    2. Setting up your environment
      1. Databricks
      2. Databricks CLI
      3. The DBX CLI
      4. Docker
      5. Git
      6. GitHub
      7. Pre-commit
      8. Terraform
      9. Install Jenkins, container setup, and compose
    3. CI tooling
      1. Git and GitHub
      2. Pre-commit
    4. Python wheels and packages
      1. Anatomy of a package
    5. DBX
      1. Important commands
    6. Testing code
    7. Terraform – IaC
      1. IaC
      2. The CLI
      3. HCL
    8. Jenkins
      1. Jenkinsfile
    9. Practical lab
      1. Problem 1
      2. Problem 2
    10. Summary
  17. Chapter 9: Orchestrating Your Data Workflows
    1. Technical requirements
    2. Setting up your environment
      1. Databricks
      2. Databricks CLI
      3. The DBX CLI
    3. Orchestrating data workloads
      1. Making life easier with Autoloader
      2. Reading
      3. Writing
      4. Two modes
      5. Useful options
    4. Databricks Workflows
    5. Terraform
      1. Failed runs
    6. REST APIs
      1. The Databricks API
      2. Python code
      3. Logging
    7. Practical lab
      1. Solution
      2. Lambda code
      3. Notebook code
    8. Summary
  18. Part 4: Hands-on Project
  19. Chapter 10: Data Governance
    1. Technical requirements
    2. Setting up your environment
      1. Python, AWS, and Databricks
      2. The Databricks CLI
    3. What is data governance?
      1. Data standards
    4. Data catalogs
      1. Data lineage
      2. Data security and privacy
      3. Data quality
      4. Great Expectations
      5. Creating test data
      6. Data context
      7. Data source
      8. Batch request
      9. Validator
      10. Adding tests
      11. Saving the suite
      12. Creating a checkpoint
      13. Datadocs
      14. Testing new data
      15. Profiler
      16. Databricks Unity
    5. Practical lab
    6. Summary
  20. Chapter 11: Building out the Groundwork
    1. Technical requirements
    2. Setting up your environment
      1. The Databricks CLI
      2. Git
      3. GitHub
      4. pre-commit
      5. Terraform
      6. PyPI
    3. Creating GitHub repos
    4. Terraform setup
      1. Initial file setup
      2. Schema repository
      3. ML repository
      4. Infrastructure repository
    5. Summary
  21. Chapter 12: Completing Our Project
    1. Technical requirements
    2. Documentation
      1. Schema diagram
      2. C4 System Context diagram
    3. Faking data with Mockaroo
    4. Managing our schemas with code
    5. Building our data pipeline application
    6. Creating our machine learning application
    7. Displaying our data with dashboards
    8. Summary
  22. Index
    1. Why subscribe?
  23. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Modern Data Architectures with Python
  • Author(s): Brian Lipp
  • Release date: September 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781801070492