Book description
- Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies
- Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process
- Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse
- Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality
Table of contents
- Copyright
- Credits
- Acknowledgments
- About the Authors
- Introduction
- I. Requirements, Realities, and Architecture
- 1. Surrounding the Requirements
- 2. ETL Data Structures
- 2.1. To Stage or Not to Stage
- 2.2. Designing the Staging Area
- 2.3. Data Structures in the ETL System
- 2.3.1. Flat Files
- 2.3.2. XML Data Sets
- 2.3.3. Relational Tables
- 2.3.4. Independent DBMS Working Tables
- 2.3.5. Third Normal Form Entity/Relation Models
- 2.3.6. Nonrelational Data Sources
- 2.3.7. Dimensional Data Models: The Handoff from the Back Room to the Front Room
- 2.3.8. Fact Tables
- 2.3.9. Dimension Tables
- 2.3.10. Atomic and Aggregate Fact Tables
- 2.3.11. Surrogate Key Mapping Tables
- 2.4. Planning and Design Standards
- 2.5. Summary
- II. Data Flow
- 3. Extracting
- 3.1. Part 1: The Logical Data Map
- 3.2. Inside the Logical Data Map
- 3.3. Building the Logical Data Map
- 3.4. Integrating Heterogeneous Data Sources
- 3.5. Mainframe Sources
- 3.5.1. Working with COBOL Copybooks
- 3.5.2. EBCDIC Character Set
- 3.5.3. Converting EBCDIC to ASCII
- 3.5.4. Transferring Data between Platforms
- 3.5.5. Handling Mainframe Numeric Data
- 3.5.6. Using PICtures
- 3.5.7. Unpacking Packed Decimals
- 3.5.8. Working with Redefined Fields
- 3.5.9. Multiple OCCURS
- 3.5.10. Managing Multiple Mainframe Record Type Files
- 3.5.11. Handling Mainframe Variable Record Lengths
- 3.6. Flat Files
- 3.7. XML Sources
- 3.8. Web Log Sources
- 3.9. ERP System Sources
- 3.10. Part 3: Extracting Changed Data
- 3.11. Summary
- 4. Cleaning and Conforming
- 4.1. Defining Data Quality
- 4.2. Assumptions
- 4.3. Part 1: Design Objectives
- 4.4. Part 2: Cleaning Deliverables
- 4.5. Part 3: Screens and Their Measurements
- 4.5.1. Anomaly Detection Phase
- 4.5.2. Types of Enforcement
- 4.5.3. Column Property Enforcement
- 4.5.4. Structure Enforcement
- 4.5.5. Data and Value Rule Enforcement
- 4.5.6. Measurements Driving Screen Design
- 4.5.7. Overall Process Flow
- 4.5.8. The Show Must Go On—Usually
- 4.5.9. Screens
- 4.5.10. Known Table Row Counts
- 4.5.11. Column Nullity
- 4.5.12. Column Numeric and Date Ranges
- 4.5.13. Column Length Restriction
- 4.5.14. Column Explicit Valid Values
- 4.5.15. Column Explicit Invalid Values
- 4.5.16. Checking Table Row Count Reasonability
- 4.5.17. Checking Column Distribution Reasonability
- 4.5.18. General Data and Value Rule Reasonability
- 4.6. Part 4: Conforming Deliverables
- 4.6.1. Conformed Dimensions
- 4.6.2. Designing the Conformed Dimensions
- 4.6.3. Taking the Pledge
- 4.6.4. Permissible Variation of Conformed Dimensions
- 4.6.5. Conformed Facts
- 4.6.6. The Fact Table Provider
- 4.6.7. The Dimension Manager: Publishing Conformed Dimensions to Affected Fact Tables
- 4.6.8. Detailed Delivery Steps for Conformed Dimensions
- 4.6.9. Implementing the Conforming Modules
- 4.6.10. Matching Drives Deduplication
- 4.6.11. Surviving: Final Step of Conforming
- 4.6.12. Delivering
- 4.7. Summary
- 5. Delivering Dimension Tables
- 5.1. The Basic Structure of a Dimension
- 5.2. The Grain of a Dimension
- 5.3. The Basic Load Plan for a Dimension
- 5.4. Flat Dimensions and Snowflaked Dimensions
- 5.5. Date and Time Dimensions
- 5.6. Big Dimensions
- 5.7. Small Dimensions
- 5.8. One Dimension or Two
- 5.9. Dimensional Roles
- 5.10. Dimensions as Subdimensions of Another Dimension
- 5.11. Degenerate Dimensions
- 5.12. Slowly Changing Dimensions
- 5.13. Type 1 Slowly Changing Dimension (Overwrite)
- 5.14. Type 2 Slowly Changing Dimension (Partitioning History)
- 5.15. Precise Time Stamping of a Type 2 Slowly Changing Dimension
- 5.16. Type 3 Slowly Changing Dimension (Alternate Realities)
- 5.17. Hybrid Slowly Changing Dimensions
- 5.18. Late-Arriving Dimension Records and Correcting Bad Data
- 5.19. Multivalued Dimensions and Bridge Tables
- 5.20. Ragged Hierarchies and Bridge Tables
- 5.21. Technical Note: Populating Hierarchy Bridge Tables
- 5.22. Using Positional Attributes in a Dimension to Represent Text Facts
- 5.23. Summary
- 6. Delivering Fact Tables
- 6.1. The Basic Structure of a Fact Table
- 6.2. Guaranteeing Referential Integrity
- 6.3. Surrogate Key Pipeline
- 6.4. Fundamental Grains
- 6.5. Transaction Grain Fact Tables
- 6.6. Preparing for Loading Fact Tables
- 6.6.1. Managing Indexes
- 6.6.2. Managing Partitions
- 6.6.3. Outwitting the Rollback Log
- 6.6.4. Loading the Data
- 6.6.5. Incremental Loading
- 6.6.6. Inserting Facts
- 6.6.7. Updating and Correcting Facts
- 6.6.8. Negating Facts
- 6.6.9. Updating Facts
- 6.6.10. Deleting Facts
- 6.6.11. Physically Deleting Facts
- 6.6.12. Logically Deleting Facts
- 6.7. Factless Fact Tables
- 6.8. Augmenting a Type 1 Fact Table with Type 2 History
- 6.9. Graceful Modifications
- 6.10. Multiple Units of Measure in a Fact Table
- 6.11. Collecting Revenue in Multiple Currencies
- 6.12. Late Arriving Facts
- 6.13. Aggregations
- 6.14. Delivering Dimensional Data to OLAP Cubes
- 6.15. Summary
- III. Implementation and Operations
- 7. Development
- 7.1. Current Marketplace ETL Tool Suite Offerings
- 7.2. Current Scripting Languages
- 7.3. Time Is of the Essence
- 7.3.1. Push Me or Pull Me
- 7.3.2. Ensuring Transfers with Sentinels
- 7.3.3. Sorting Data during Preload
- 7.3.4. Sorting on Mainframe Systems
- 7.3.5. Sorting on Unix and Windows Systems
- 7.3.6. Trimming the Fat (Filtering)
- 7.3.7. Extracting a Subset of the Source File Records on Mainframe Systems
- 7.3.8. Extracting a Subset of the Source File Fields
- 7.3.9. Extracting a Subset of the Source File Records on Unix and Windows Systems
- 7.3.10. Extracting a Subset of the Source File Fields
- 7.3.11. Creating Aggregated Extracts on Mainframe Systems
- 7.3.12. Creating Aggregated Extracts on UNIX and Windows Systems
- 7.4. Using Database Bulk Loader Utilities to Speed Inserts
- 7.5. Managing Database Features to Improve Performance
- 7.6. Troubleshooting Performance Problems
- 7.7. Increasing ETL Throughput
- 7.7.1. Reducing Input/Output Contention
- 7.7.2. Eliminating Database Reads/Writes
- 7.7.3. Filtering as Soon as Possible
- 7.7.4. Partitioning and Parallelizing
- 7.7.5. Updating Aggregates Incrementally
- 7.7.6. Taking Only What You Need
- 7.7.7. Bulk Loading/Eliminating Logging
- 7.7.8. Dropping Database Constraints and Indexes
- 7.7.9. Eliminating Network Traffic
- 7.7.10. Letting the ETL Engine Do the Work
- 7.8. Summary
- 8. Operations
- 9. Metadata
- 10. Responsibilities
- 10.1. Planning and Leadership
- 10.1.1. Having Dedicated Leadership
- 10.1.2. Planning Large, Building Small
- 10.1.3. Hiring Qualified Developers
- 10.1.4. Building Teams with Database Expertise
- 10.1.5. Don't Try to Save the World
- 10.1.6. Enforcing Standardization
- 10.1.7. Monitoring, Auditing, and Publishing Statistics
- 10.1.8. Maintaining Documentation
- 10.1.9. Providing and Utilizing Metadata
- 10.1.10. Keeping It Simple
- 10.1.11. Optimizing Throughput
- 10.2. Managing the Project
- 10.3. Summary
- IV. Real Time Streaming ETL Systems
- 11. Real-Time ETL Systems
- 11.1. Why Real-Time ETL?
- 11.2. Defining Real-Time ETL
- 11.3. Challenges and Opportunities of Real-Time Data Warehousing
- 11.4. Real-Time Data Warehousing Review
- 11.5. Categorizing the Requirement
- 11.5.1. Data Freshness and Historical Needs
- 11.5.2. Reporting Only or Integration, Too?
- 11.5.3. Just the Facts or Dimension Changes, Too?
- 11.5.4. Alerts, Continuous Polling, or Nonevents?
- 11.5.5. Data Integration or Application Integration?
- 11.5.6. Point-to-Point versus Hub-and-Spoke
- 11.5.7. Customer Data Cleanup Considerations
- 11.6. Real-Time ETL Approaches
- 11.7. Summary
- 12. Conclusions
Product information
- Title: The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
- Author(s): Ralph Kimball, Joe Caserta
- Release date: October 2004
- Publisher(s): Wiley
- ISBN: 9780764567575