Book description
Extract, Transform, and Load (ETL) is the essence of data integration, and this book shows you how to achieve it quickly and efficiently using Pentaho Data Integration. It is a hands-on guide that you'll find an indispensable time-saver.
- Manipulate your data by exploring, transforming, validating, and integrating it
- Learn to migrate data between applications
- Explore several features of Pentaho Data Integration 5.0
- Connect to any database engine, explore the databases, and perform all kinds of operations on databases
In Detail
Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Accomplishing these tasks requires either devoting people to developing extensive software programs or investing in ETL or data integration tools that can simplify this work.
Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing.
"Pentaho Data Integration Beginner's Guide, Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.
"Pentaho Data Integration Beginner's Guide, Second Edition" starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all of these concepts by implementing a simple datamart.
With "Pentaho Data Integration Beginner's Guide, Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Table of contents
- Pentaho Data Integration Beginner's Guide
- Table of Contents
- Pentaho Data Integration Beginner's Guide
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Preface
- 1. Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Exploring the Pentaho Demo
- Installing PDI
- Time for action – installing PDI
- Launching the PDI graphical designer – Spoon
- Time for action – starting and customizing Spoon
- Time for action – creating a hello world transformation
- Installing MySQL
- Time for action – installing MySQL on Windows
- Time for action – installing MySQL on Ubuntu
- Summary
- 2. Getting Started with Transformations
- Designing and previewing transformations
- Time for action – creating a simple transformation and getting familiar with the design process
- Running transformations in an interactive fashion
- Time for action – generating a range of dates and inspecting the data as it is being created
- Handling errors
- Time for action – avoiding errors while converting the estimated time from string to integer
- Time for action – configuring the error handling to see the description of the errors
- Summary
- 3. Manipulating Real-world Data
- Reading data from files
- Time for action – reading results of football matches from files
- Time for action – reading all your files at a time using a single text file input step
- Time for action – reading all your files at a time using a single text file input step and regular expressions
- Sending data to files
- Time for action – sending the results of matches to a plain file
- Getting system information
- Time for action – reading and writing matches files with flexibility
- Time for action – running the matches transformation from a terminal window
- XML files
- Time for action – getting data from an XML file with information about countries
- Summary
- 4. Filtering, Searching, and Performing Other Useful Operations with Data
- Sorting data
- Time for action – sorting information about matches with the Sort rows step
- Calculations on groups of rows
- Time for action – calculating football match statistics by grouping data
- Filtering
- Time for action – counting frequent words by filtering
- Time for action – refining the counting task by filtering even more
- Looking up data
- Time for action – finding out which language people speak
- 5. Controlling the Flow of Data
- Splitting streams
- Time for action – browsing new features of PDI by copying a dataset
- Time for action – assigning tasks by distributing
- Splitting the stream based on conditions
- Time for action – assigning tasks by filtering priorities with the Filter rows step
- Time for action – assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
- Time for action – gathering progress and merging it all together
- Time for action – giving priority to Bouchard by using the Append Stream
- Treating invalid data by splitting and merging streams
- Time for action – treating errors in the estimated time to avoid discarding rows
- Summary
- 6. Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step
- Time for action – counting frequent words by coding in JavaScript
- Reading and parsing unstructured files with JavaScript
- Time for action – changing a list of house descriptions with JavaScript
- Doing simple tasks with the Java Class step
- Time for action – counting frequent words by coding in Java
- Transforming the dataset with Java
- Time for action – splitting the field to rows using Java
- Avoiding coding by using purpose built steps
- Summary
- 7. Transforming the Rowset
- Converting rows to columns
- Time for action – enhancing the films file by converting rows to columns
- Aggregating data with a Row Denormaliser step
- Time for action – aggregating football matches data with the Row Denormaliser step
- Normalizing data
- Time for action – enhancing the matches file by normalizing the dataset
- Generating a custom time dimension dataset by using Kettle variables
- Time for action – creating the time dimension dataset
- Time for action – parameterizing the start and end date of the time dimension dataset
- Summary
- 8. Working with Databases
- Introducing the Steel Wheels sample database
- Time for action – creating a connection to the Steel Wheels database
- Time for action – exploring the sample database
- Querying a database
- Time for action – getting data about shipped orders
- Time for action – getting orders in a range of dates using parameters
- Time for action – getting orders in a range of dates by using Kettle variables
- Sending data to a database
- Time for action – loading a table with a list of manufacturers
- Time for action – inserting new products or updating existing ones
- Time for action – testing the update of existing products
- Eliminating data from a database
- Time for action – deleting data about discontinued items
- Summary
- 9. Performing Advanced Operations with Databases
- Preparing the environment
- Time for action – populating the Jigsaw database
- Looking up data in a database
- Time for action – using a Database lookup step to create a list of products to buy
- Time for action – using a Database join step to create a list of suggested products to buy
- Introducing dimensional modeling
- Loading dimensions with data
- Time for action – loading a region dimension with a Combination lookup/update step
- Time for action – testing the transformation that loads the region dimension
- Time for action – keeping a history of changes in products by using the Dimension lookup/update step
- Time for action – testing the transformation that keeps history of product changes
- Summary
- 10. Creating Basic Task Flows
- Introducing PDI jobs
- Time for action – creating a folder with a Kettle job
- Designing and running jobs
- Time for action – creating a simple job and getting familiar with the design process
- Running transformations from jobs
- Time for action – generating a range of dates and inspecting how things are running
- Receiving arguments and parameters in a job
- Time for action – generating a hello world file by using arguments and parameters
- Running jobs from a terminal window
- Time for action – executing the hello world job from a terminal window
- Using named parameters and command-line arguments in transformations
- Time for action – calling the hello world transformation with fixed arguments and parameters
- Deciding between the use of a command-line argument and a named parameter
- Summary
- 11. Creating Advanced Transformations and Jobs
- Re-using part of your transformations
- Time for action – calculating statistics with the use of a subtransformation
- Time for action – generating top average scores by copying and getting rows
- Iterating jobs and transformations
- Time for action – generating custom files by executing a transformation for every input row
- Enhancing your processes with the use of variables
- Time for action – generating custom messages by setting a variable with the name of the examination file
- What just happened?
- Setting variables inside a transformation
- Running a job inside another job with a Job job entry
- Have a go hero – processing several files at once
- Have a go hero – enhancing the jigsaw database update process
- Have a go hero – executing the proper jigsaw database update process
- Pop quiz – deciding the scope of variables
- Summary
- 12. Developing and Implementing a Simple Datamart
- Exploring the sales datamart
- Loading the dimensions
- Time for action – loading the dimensions for the sales datamart
- Extending the sales datamart model
- Loading a fact table with aggregated data
- Time for action – loading the sales fact table by looking up dimensions
- Getting facts and dimensions together
- Time for action – loading the fact table using a range of dates obtained from the command line
- Time for action – loading the SALES star
- Automating the administrative tasks
- Time for action – automating the loading of the sales datamart
- Summary
- A. Working with Repositories
- Creating a database repository
- Time for action – creating a PDI repository
- Working with the repository storage system
- Time for action – logging into a database repository
- Examining and modifying the contents of a repository with the Repository Explorer
- Migrating from file-based system to repository-based system and vice versa
- Summary
- B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line
- C. Quick Reference – Steps and Job Entries
- D. Spoon Shortcuts
- E. Introducing PDI 5 Features
- F. Best Practices
- G. Pop Quiz Answers
- Chapter 1, Getting Started with Pentaho Data Integration
- Chapter 2, Getting Started with Transformations
- Chapter 3, Manipulating Real-world Data
- Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data
- Chapter 5, Controlling the Flow of Data
- Chapter 6, Transforming Your Data by Coding
- Chapter 8, Working with Databases
- Chapter 9, Performing Advanced Operations with Databases
- Chapter 10, Creating Basic Task Flows
- Chapter 11, Creating Advanced Transformations and Jobs
- Chapter 12, Developing and Implementing a Simple Datamart
- Index
Product information
- Title: Pentaho Data Integration Beginner's Guide
- Author(s):
- Release date: October 2013
- Publisher(s): Packt Publishing
- ISBN: 9781782165040