Building ETL Pipelines with Python

Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases

Key Features

  • Understand how to set up a Python virtual environment with PyCharm
  • Learn functional and object-oriented approaches to create ETL pipelines
  • Create robust CI/CD processes for ETL pipelines
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Modern extract, transform, and load (ETL) pipelines for data engineering are increasingly built in Python, thanks to its broad range of uses and its large ecosystem of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as a leading choice for data processing.

In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with the fundamentals of data pipelines and the setup of a Python development environment for building them. Once you’ve explored ETL pipeline design principles and the ETL development process, you’ll be equipped to design custom ETL pipelines. Next, you’ll get to grips with each step of the ETL process: extracting valuable data, transforming it through cleaning and manipulation while preserving data integrity, and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons for building data pipelines, and leverage cloud tools such as AWS to create scalable pipelines. Lastly, you’ll learn about test-driven development for ETL pipelines to ensure safe deployments.
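
To give a concrete sense of those three stages, here is a minimal, illustrative sketch in Python (not taken from the book) that extracts a CSV file, cleans it with pandas, and loads the result into SQLite; the file, column, and table names are hypothetical placeholders.

    # A minimal ETL sketch: extract a CSV, transform it with pandas, and
    # load it into SQLite. File, column, and table names are hypothetical.
    import sqlite3

    import pandas as pd

    def extract(path: str) -> pd.DataFrame:
        # Extract: read raw data from a source file.
        return pd.read_csv(path)

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Transform: clean and reshape while preserving data integrity.
        df = df.drop_duplicates()
        df = df.dropna(subset=["order_id"])  # drop rows missing the key column
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
        return df

    def load(df: pd.DataFrame, db_path: str, table: str) -> None:
        # Load: write the processed data to a storage system.
        with sqlite3.connect(db_path) as conn:
            df.to_sql(table, conn, if_exists="replace", index=False)

    if __name__ == "__main__":
        load(transform(extract("orders.csv")), "warehouse.db", "clean_orders")

Real-world pipelines add logging, error handling, and incremental loads on top of this skeleton, which is what the chapters listed below build up to.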

By the end of this book, you’ll have worked through several hands-on examples of creating high-performance ETL pipelines and will be able to build robust, scalable, and resilient data environments using Python.

What you will learn

  • Explore the available libraries and tools to create ETL pipelines using Python
  • Write clean and resilient ETL code in Python that can be extended and easily scaled
  • Understand the best practices and design principles for creating ETL pipelines
  • Orchestrate the ETL process and scale the ETL pipeline effectively
  • Discover tools and services available in AWS for ETL pipelines
  • Understand different testing strategies and implement them with the ETL process

Who this book is for

If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.

Table of contents

  1. Building ETL Pipelines with Python
  2. Contributors
  3. About the authors
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Introduction to ETL, Data Pipelines, and Design Principles
  7. Chapter 1: A Primer on Python and the Development Environment
    1. Introducing Python fundamentals
      1. An overview of Python data structures
      2. Python if…else conditions or conditional statements
      3. Python looping techniques
      4. Python functions
      5. Object-oriented programming with Python
      6. Working with files in Python
    2. Establishing a development environment
      1. Version control with Git tracking
      2. Documenting environment dependencies with requirements.txt
      3. Utilizing module management systems (MMSs)
      4. Configuring a Pipenv environment in PyCharm
    3. Summary
  8. Chapter 2: Understanding the ETL Process and Data Pipelines
    1. What is a data pipeline?
    2. How do we create a robust pipeline?
      1. Pre-work – understanding your data
      2. Design planning – planning your workflow
      3. Architecture development – developing your resources
      4. Putting it all together – project diagrams
    3. What is an ETL data pipeline?
      1. Batch processing
      2. Streaming method
      3. Cloud-native
    4. Automating ETL pipelines
    5. Exploring use cases for ETL pipelines
    6. Summary
    7. References
  9. Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
    1. Technical requirements
    2. Understanding the design patterns for ETL
      1. Basic ETL design pattern
      2. ETL-P design pattern
      3. ETL-VP design pattern
      4. ELT two-phase pattern
    3. Preparing your local environment for installations
    4. Open source Python libraries for ETL pipelines
      1. Pandas
      2. NumPy
    5. Scaling for big data packages
      1. Dask
      2. Numba
    6. Summary
    7. References
  10. Part 2: Designing ETL Pipelines with Python
  11. Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
    1. Technical requirements
    2. What is data sourcing?
    3. Accessibility to data
    4. Types of data sources
    5. Getting started with data extraction
      1. CSV and Excel data files
      2. Parquet data files
      3. API connections
      4. Databases
      5. Data from web pages
    6. Creating a data extraction pipeline using Python
      1. Data extraction
      2. Logging
    7. Summary
    8. References
  12. Chapter 5: Data Cleansing and Transformation
    1. Technical requirements
      1. Scrubbing your data
      2. Data transformation
      3. Data cleansing and transformation in ETL pipelines
      4. Understanding the downstream applications of your data
    2. Strategies for data cleansing and transformation in Python
      1. Preliminary tasks – the importance of staging data
      2. Transformation activities in Python
      3. Creating data pipeline activity in Python
    3. Summary
  13. Chapter 6: Loading Transformed Data
    1. Technical requirements
    2. Introduction to data loading
      1. Choosing the load destination
      2. Types of load destinations
    3. Best practices for data loading
    4. Optimizing data loading activities by controlling the data import method
      1. Creating demo data
      2. Full data loads
      3. Incremental data loads
    5. Precautions to consider
    6. Tutorial – preparing your local environment for data loading activities
      1. Downloading and installing PostgreSQL
      2. Creating data schemas in PostgreSQL
    7. Summary
  14. Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python
    1. Technical requirements
    2. Introducing the project
      1. The approach
      2. The data
    3. Creating tables in PostgreSQL
    4. Sourcing and extracting the data
    5. Transformation and data cleansing
    6. Loading data into PostgreSQL tables
    7. Making it deployable
    8. Summary
  15. Chapter 8: Powerful ETL Libraries and Tools in Python
    1. Technical requirements
    2. Architecture of Python files
    3. Configuring your local environment
      1. config.ini
      2. config.yaml
    4. Part 1 – ETL tools in Python
      1. Bonobo
      2. Odo
      3. Mito ETL
      4. Riko
      5. pETL
      6. Luigi
    5. Part 2 – pipeline workflow management platforms in Python
      1. Airflow
    6. Summary
  16. Part 3: Creating ETL Pipelines in AWS
  17. Chapter 9: A Primer on AWS Tools for ETL Processes
    1. Common data storage tools in AWS
      1. Amazon RDS
      2. Amazon Redshift
      3. Amazon S3
      4. Amazon EC2
    2. Discussion – Building flexible applications in AWS
      1. Leveraging S3 and EC2
    3. Computing and automation with AWS
      1. AWS Glue
      2. AWS Lambda
      3. AWS Step Functions
    4. AWS big data tools for ETL pipelines
      1. AWS Data Pipeline
      2. Amazon Kinesis
      3. Amazon EMR
    5. Walk-through – creating a Free Tier AWS account
      1. Prerequisites for running AWS from your device
      2. AWS CLI
      3. Docker
      4. LocalStack
      5. AWS SAM CLI
    6. Summary
  18. Chapter 10: Tutorial – Creating an ETL Pipeline in AWS
    1. Technical requirements
    2. Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
      1. Setting the stage with the AWS CLI
      2. Creating a “proof of concept” data pipeline in Python
      3. Using Boto3 and Amazon S3 to read data
      4. AWS Lambda functions
      5. AWS Step Functions
    3. An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
      1. Configuring your AWS environment with EC2 and RDS
      2. Creating an RDS instance
      3. Creating an EC2 instance
      4. Creating a data pipeline locally with Bonobo
      5. Adding the pipeline to AWS
    4. Summary
  19. Chapter 11: Building Robust Deployment Pipelines in AWS
    1. Technical requirements
    2. What is CI/CD and why is it important?
      1. The six key elements of CI/CD
      2. Essential steps for CI/CD adoption
      3. CI/CD is a continual process
    3. Creating a robust CI/CD process for ETL pipelines in AWS
      1. Creating a CI/CD pipeline
    4. Building an ETL pipeline using various AWS services
      1. Setting up a CodeCommit repository
      2. Orchestrating with AWS CodePipeline
      3. Testing the pipeline
    5. Summary
  20. Part 4: Automating and Scaling ETL Pipelines
  21. Chapter 12: Orchestration and Scaling in ETL Pipelines
    1. Technical requirements
      1. Performance bottlenecks
      2. Inflexibility
      3. Limited scalability
      4. Operational overheads
    2. Exploring the types of scaling
      1. Vertical scaling
      2. Horizontal scaling
    3. Choose your scaling strategy
      1. Processing requirements
      2. Data volume
      3. Cost
      4. Complexity and skills
      5. Reliability and availability
    4. Data pipeline orchestration
      1. Task scheduling
      2. Error handling and recovery
      3. Resource management
      4. Monitoring and logging
      5. Putting it together with a practical example
    5. Summary
  22. Chapter 13: Testing Strategies for ETL Pipelines
    1. Technical requirements
    2. Benefits of testing data pipeline code
      1. How to choose the right testing strategies for your ETL pipeline
      2. How often should you test your ETL pipeline?
      3. Creating tests for a simple ETL pipeline
      4. Unit testing
      5. Validation testing
      6. Integration testing
      7. End-to-end testing
      8. Performance testing
      9. Resilience testing
    3. Best practices for a testing environment for ETL pipelines
      1. Defining testing objectives
      2. Establishing a testing framework
      3. Automating ETL tests
      4. Monitoring ETL pipelines
    4. ETL testing challenges
      1. Data privacy and security
      2. Environment parity
      3. Top ETL testing tools
    5. Summary
  23. Chapter 14: Best Practices for ETL Pipelines
    1. Technical requirements
      1. Data quality
      2. Poor scalability
      3. Lack of error-handling and recovery methods
    2. ETL logging in Python
      1. Debugging and issue resolution
      2. Auditing and compliance
      3. Performance monitoring
      4. Including contextual information
      5. Handling exceptions and errors
      6. The Goldilocks principle
      7. Implementing logging in Python
    3. Checkpoint for recovery
    4. Avoiding SPOFs
    5. Modularity and auditing
      1. Modularity
      2. Auditing
    6. Summary
  24. Chapter 15: Use Cases and Further Reading
    1. Technical requirements
    2. New York Yellow Taxi data, ETL pipeline, and deployment
      1. Step 1 – configuration
      2. Step 2 – ETL pipeline script
      3. Step 3 – unit tests
    3. Building a robust ETL pipeline with US construction data in AWS
      1. Prerequisites
      2. Step 1 – data extraction
      3. Step 2 – data transformation
      4. Step 3 – data loading
      5. Running the ETL pipeline
      6. Bonus – deploying your ETL pipeline
    4. Summary
    5. Further reading
  25. Index
    1. Why subscribe?
  26. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Building ETL Pipelines with Python
  • Author(s): Brij Kishore Pandey, Emily Ro Schoof
  • Release date: September 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781804615256