Data Engineering on Azure

Book description

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure.

In Data Engineering on Azure you will learn how to:

  • Pick the right Azure services for different data scenarios
  • Manage data inventory
  • Implement production-quality data modeling, analytics, and machine learning workloads
  • Handle data governance
  • Use DevOps to increase reliability
  • Ingest, store, and distribute data
  • Apply best practices for compliance and access control

Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning.

About the Technology
Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify.

About the Book
In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This guide offers clear, practical advice for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.

What's Inside
  • Manage data inventory and data governance
  • Assure data quality, compliance, and distribution
  • Build automated pipelines to increase reliability
  • Ingest, store, and distribute data
  • Implement production-quality data modeling, analytics, and machine learning


About the Reader
For data engineers familiar with cloud computing and DevOps.

About the Author
Vlad Riscutia is a software architect at Microsoft.

Quotes
A definitive and complete guide on data engineering, with clear and easy-to-reproduce examples.
- Kelum Prabath Senanayake, Echoworx

An all-in-one Azure book, covering all a solutions architect or engineer needs to think about.
- Albert Nogués, Danone

A meaningful journey through the Azure ecosystem. You’ll be building pipelines and joining components quickly!
- Todd Cook, Appen

A gateway into the world of Azure for machine learning and DevOps engineers.
- Krzysztof Kamyczek, Luxoft

Table of contents

  1. inside front cover
    1. Data Platform Architecture
  2. Data Engineering on Azure
  3. Copyright
  4. dedication
  5. brief contents
  6. contents
  7. front matter
    1. preface
    2. acknowledgments
    3. about this book
    4. about the author
    5. about the cover illustration
  8. 1 Introduction
    1. 1.1 What is data engineering?
    2. 1.2 Who this book is for
    3. 1.3 What is a data platform?
      1. 1.3.1 Anatomy of a data platform
      2. 1.3.2 Infrastructure as code, codeless infrastructure
    4. 1.4 Building in the cloud
      1. 1.4.1 IaaS, PaaS, SaaS
      2. 1.4.2 Network, storage, compute
      3. 1.4.3 Getting started with Azure
      4. 1.4.4 Interacting with Azure
    5. 1.5 Implementing an Azure data platform
    6. Summary
  9. Part 1 Infrastructure
  10. 2 Storage
    1. 2.1 Storing data in a data platform
      1. 2.1.1 Storing data across multiple data fabrics
      2. 2.1.2 Having a single source of truth
    2. 2.2 Introducing Azure Data Explorer
      1. 2.2.1 Deploying an Azure Data Explorer cluster
      2. 2.2.2 Using Azure Data Explorer
      3. 2.2.3 Working around query limits
    3. 2.3 Introducing Azure Data Lake Storage
      1. 2.3.1 Creating an Azure Data Lake Storage account
      2. 2.3.2 Using Azure Data Lake Storage
      3. 2.3.3 Integrating with Azure Data Explorer
    4. 2.4 Ingesting data
      1. 2.4.1 Ingestion frequency
      2. 2.4.2 Load type
      3. 2.4.3 Restatements and reloads
    5. Summary
  11. 3 DevOps
    1. 3.1 What is DevOps?
      1. 3.1.1 DevOps in data engineering
    2. 3.2 Introducing Azure DevOps
      1. 3.2.1 Using the az azure-devops extension
    3. 3.3 Deploying infrastructure
      1. 3.3.1 Exporting an Azure Resource Manager template
      2. 3.3.2 Creating Azure DevOps service connections
      3. 3.3.3 Deploying Azure Resource Manager templates
      4. 3.3.4 Understanding Azure Pipelines
    4. 3.4 Deploying analytics
      1. 3.4.1 Using Azure DevOps marketplace extensions
      2. 3.4.2 Storing everything in Git; deploying everything automatically
    5. Summary
  12. 4 Orchestration
    1. 4.1 Ingesting the Bing COVID-19 open dataset
    2. 4.2 Introducing Azure Data Factory
      1. 4.2.1 Setting up the data source
      2. 4.2.2 Setting up the data sink
      3. 4.2.3 Setting up the pipeline
      4. 4.2.4 Setting up a trigger
      5. 4.2.5 Orchestrating with Azure Data Factory
    3. 4.3 DevOps for Azure Data Factory
      1. 4.3.1 Deploying Azure Data Factory from Git
      2. 4.3.2 Setting up access control
      3. 4.3.3 Deploying the production data factory
      4. 4.3.4 DevOps for the Azure Data Factory recap
    4. 4.4 Monitoring with Azure Monitor
    5. Summary
  13. Part 2 Workloads
  14. 5 Processing
    1. 5.1 Data modeling techniques
      1. 5.1.1 Normalization and denormalization
      2. 5.1.2 Data warehousing
      3. 5.1.3 Semistructured data
      4. 5.1.4 Data modeling recap
    2. 5.2 Identity keyrings
      1. 5.2.1 Building an identity keyring
      2. 5.2.2 Understanding keyrings
    3. 5.3 Timelines
      1. 5.3.1 Building a timeline view
      2. 5.3.2 Using timelines
    4. 5.4 Continuous data processing
      1. 5.4.1 Tracking processing functions in Git
      2. 5.4.2 Keyring building in Azure Data Factory
      3. 5.4.3 Scaling out
    5. Summary
  15. 6 Analytics
    1. 6.1 Structuring storage
      1. 6.1.1 Providing development data
      2. 6.1.2 Replicating production data
      3. 6.1.3 Providing read-only access to the production data
      4. 6.1.4 Storage structure recap
    2. 6.2 Analytics workflow
      1. 6.2.1 Prototyping
      2. 6.2.2 Development and user acceptance testing
      3. 6.2.3 Production
      4. 6.2.4 Analytics workflow recap
    3. 6.3 Self-serve data movement
      1. 6.3.1 Support model
      2. 6.3.2 Data contracts
      3. 6.3.3 Pipeline validation
      4. 6.3.4 Postmortems
      5. 6.3.5 Self-serve data movement recap
    4. Summary
  16. 7 Machine learning
    1. 7.1 Training a machine learning model
      1. 7.1.1 Training a model using scikit-learn
      2. 7.1.2 High spender model implementation
    2. 7.2 Introducing Azure Machine Learning
      1. 7.2.1 Creating a workspace
      2. 7.2.2 Creating an Azure Machine Learning compute target
      3. 7.2.3 Setting up Azure Machine Learning storage
      4. 7.2.4 Running ML in the cloud
      5. 7.2.5 Azure Machine Learning recap
    3. 7.3 MLOps
      1. 7.3.1 Deploying from Git
      2. 7.3.2 Storing pipeline IDs
      3. 7.3.3 DevOps for Azure Machine Learning recap
    4. 7.4 Orchestrating machine learning
      1. 7.4.1 Connecting Azure Data Factory with Azure Machine Learning
      2. 7.4.2 Machine learning orchestration
      3. 7.4.3 Orchestrating recap
    5. Summary
  17. Part 3 Governance
  18. 8 Metadata
    1. 8.1 Making sense of the data
    2. 8.2 Introducing Azure Purview
    3. 8.3 Maintaining a data inventory
      1. 8.3.1 Setting up a scan
      2. 8.3.2 Browsing the data dictionary
      3. 8.3.3 Data dictionary recap
    4. 8.4 Managing a data glossary
      1. 8.4.1 Adding a new glossary term
      2. 8.4.2 Curating terms
      3. 8.4.3 Custom templates and bulk import
      4. 8.4.4 Data glossary recap
    5. 8.5 Understanding Azure Purview's advanced features
      1. 8.5.1 Tracking lineage
      2. 8.5.2 Classification rules
      3. 8.5.3 REST API
      4. 8.5.4 Advanced features recap
    6. Summary
  19. 9 Data quality
    1. 9.1 Testing data
      1. 9.1.1 Availability tests
      2. 9.1.2 Correctness tests
      3. 9.1.3 Completeness tests
      4. 9.1.4 Detecting anomalies
      5. 9.1.5 Testing data recap
    2. 9.2 Running data quality checks
      1. 9.2.1 Testing using Azure Data Factory
      2. 9.2.2 Executing tests
      3. 9.2.3 Creating and using a template
      4. 9.2.4 Running data quality checks recap
    3. 9.3 Scaling out data testing
      1. 9.3.1 Supporting multiple data fabrics
      2. 9.3.2 Testing at rest and during movement
      3. 9.3.3 Authoring tests
      4. 9.3.4 Storing tests and results
    4. Summary
  20. 10 Compliance
    1. 10.1 Data classification
      1. 10.1.1 Feature data
      2. 10.1.2 Telemetry
      3. 10.1.3 User data
      4. 10.1.4 User-owned data
      5. 10.1.5 Business data
      6. 10.1.6 Data classification recap
    2. 10.2 Changing classification through processing
      1. 10.2.1 Aggregation
      2. 10.2.2 Anonymization
      3. 10.2.3 Pseudonymization
      4. 10.2.4 Masking
      5. 10.2.5 Processing classification changes recap
    3. 10.3 Implementing an access model
      1. 10.3.1 Security groups
      2. 10.3.2 Securing Azure Data Explorer
      3. 10.3.3 Access model recap
    4. 10.4 Complying with GDPR and other considerations
      1. 10.4.1 Data handling
      2. 10.4.2 Data subject requests
      3. 10.4.3 Other considerations
    5. Summary
  21. 11 Distributing data
    1. 11.1 Data distribution overview
    2. 11.2 Building a data API
      1. 11.2.1 Introducing Azure Cosmos DB
      2. 11.2.2 Populating the Cosmos DB collection
      3. 11.2.3 Retrieving data
      4. 11.2.4 Data API recap
    3. 11.3 Serving machine learning
    4. 11.4 Sharing data for bulk copy
      1. 11.4.1 Separating compute resources
      2. 11.4.2 Introducing Azure Data Share
      3. 11.4.3 Sharing data for bulk copy recap
    5. 11.5 Data sharing best practices
    6. Summary
  22. Appendix A. Azure services
    1. Azure Storage
    2. Azure SQL
    3. Azure Synapse Analytics
    4. Azure Data Explorer
    5. Azure Databricks
    6. Azure Cosmos DB
  23. Appendix B. KQL quick reference
    1. Common query reference
    2. SQL to KQL
  24. Appendix C. Running code samples
  25. index
  26. inside back cover
    1. MLOps

Product information

  • Title: Data Engineering on Azure
  • Author(s): Vlad Riscutia
  • Release date: August 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617298929