Building a Data Warehouse with MySQL and Perl
Pages: 1, 2
ETL
If you want to build a data warehouse, your first job is to come up with a dimensional model for your data. This task is largely outside the scope of this article, but I'll point out some good references in Going Further.
Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse.
ETL is typically a batch process--often running on a nightly basis. Some systems do real-time ETL, updating the warehouse as new data appears in the operational databases.
Extract
Our ETL process uses MySQL's SELECT ... INTO OUTFILE system to create CSVs containing data from our operational database. Many tables are dumped as-is, but for the more complex cases we do mix some transformations into the SELECT statement. For example, here's a statement that extracts all the data from the link table that has corresponding records in person_link or campaign_link_click:
SELECT l.link_id, l.link_type_id, l.url, l.from_url,
l.position, l.message_id, l.message_part_id,
COALESCE(l.campaign_id, m.campaign_id) AS campaign_id
FROM link l
LEFT OUTER JOIN mailing_message m ON (l.message_id = m.message_id)
WHERE EXISTS (
SELECT 1 FROM person_link WHERE link_id = l.link_id
) OR EXISTS (
SELECT 1 FROM campaign_link_click WHERE link_id = l.link_id
)
INTO OUTFILE ?
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
In the process the link is assigned a single campaign_id from either the mailing_message or the link record itself. The details of why this is done aren't important, but suffice it to say that doing this in the SELECT statement is easier and faster than it would be in the transform step proper.
You may be wondering what kind of strain extraction places on the operational database. The answer is quite a lot! To deal with this problem we use MySQL's replication system to produce a low-cost copy of the operational database on the warehouse machine. This means the warehouse loader can run without affecting operational performance.
Transform
The end result of the extraction process is a CSV file containing data destined for the warehouse. We use the excellent Text::CSV_XS module to read these files, and apply any additional transformations needed in Perl code.
For example, we compute data about the most recent, highest, and first contribution given by each person in our database. These values are frequently queried, and queries can run much faster with these values in the person dimension.
Another example is geo-coding. We feed address data to Geo::Coder::US and Geo::PostalCode to produce a latitude and longitude for each person. These values go into the person dimension and power "within X miles of" queries. The warehouse transform step provides an ideal time to run potentially expensive data-sweetening operations like geo-coding.
The transform step ends by writing out another CSV file with the transformed data. We use Text::CSV_XS to format the data for writing, too.
Load
Finally, the data needs to get into the warehouse. Here we use MySQL's LOAD DATA INFILE command to load CSV data directly into the database. To speed up the load, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load, we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading, which we do with the MySQL TRUNCATE command.
You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it's full we swap it into place with a big ol' RENAME:
RENAME TABLE
live_db.table_foo TO loading_db.table_foo_tmp,
loading_db.table_foo TO live_db.table_foo,
loading_db.table_foo_tmp TO loading_db.table_foo
live_db.table_bar TO loading_db.table_bar_tmp,
loading_db.table_bar TO live_db.table_bar,
loading_db.table_bar_tmp TO loading_db.table_bar
...
This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
ETL Observations
CSV Isn't a Standard
One major problem we had in constructing our ETL process was in finding a way to speak MySQL's rather perverse CSV syntax (as used by SELECT ... INTO OUTFILE and LOAD DATA INFILE) with Text::CSV_XS. MySQL uses \N to indicate NULLs in CSV data output, while also using the backslash as a quote character in string. Thus, when Text::CSV_XS reads \N it gets turned into N before we can process it!
Worse, trying to get Text::CSV_XS to write a literal \N proved impossible. Fortunately LOAD DATA INFILE will also treat the string "NULL" as NULL.
An ETL process based on something better standardized like XML wouldn't have these problems, although it might not perform as well due to formatting overhead.
Transforms Are Not Always Needed
Many tables in our operational database don't need any transformation to become dimensions in the warehouse. We have plans to provide a straight database-to-database extract-and-load for these tables using MySQL's SELECT ... INTO TABLE syntax. This should perform much better by entirely skipping the intermediate CSV files.
Incremental Loading Is Highly Desirable
As we developed more tools that use the warehouse, it became more and more desirable to have it kept up-to-date more frequently. We are in the process of designing a system that will keep the warehouse up-to-date in real time. The system will be incremental, moving data as it is added and changed in the main database, likely using triggers to help us know what needs moving without burdening application code.
Using MySQL as a Data Warehouse
Any relational database system can be used to build a data warehouse. At Plus Three we chose MySQL v4.1.x for our warehouse system. This choice was natural for us as we're largely a MySQL shop. We chose v4.1.x because it was the stable version when we started development.
Our operational database uses MySQL's InnoDB backend, providing referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse. Warehouse applications are read-only, and as such, have little need for run-time referential integrity checks and transactions. MyISAM can also be faster than InnoDB in some circumstances.
An upside to using MySQL for the warehouse is high capacity and decent performance with little tuning. Our larger installations handle tables with more than one hundred million rows without major difficulties. Although we've spent some time optimizing for particular warehouse queries, the majority were fast enough out of the gate.
All that said, there are features we'd like to have that MySQL is lacking. Perhaps chief among them is support for bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.
We also found MySQL's subquery optimization to be of poor quality. Subqueries are relatively new to MySQL, so there's a good chance this will improve when we upgrade to MySQL v5.1.x soon.
A nontraditional alternative is a column-oriented database, like Vertica and MonetDB (which has a DBI driver). Column-oriented database systems store data by column rather than by row, and as such, offer performance advantages for typical data-warehouse usage patterns.
Going Further
Data warehousing is a deep topic, and I've only really scratched the surface. If this article has piqued your interest, I suggest you pick up a copy of The Data Warehouse Toolkit by Ralph Kimball and Margy Ross. This book is invaluable in presenting dimensional modeling, the key to getting the most out of a data warehouse.
Sam Tregar is employed as a Perl programmer for Plus Three LP and is an active participant in the Perl development community.
Return to Databases.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 3 of 3.
-
delimiters
2007-07-25 10:55:16 awillis [Reply | View]
I've found that not using csv at all when loading data into and out of mysql is often easier. SELECT INTO OUTFILE and LOAD DATA INFILE default to tab separated output. If you have text fields that contain tabs, and you can't convert them to spaces, then perhaps I would use a comma delimiter.
-
Awesome Article!
2007-05-01 09:40:11 Matthew Russell |
[Reply | View]
I just wanted to say that this is one of the best articles I've ever read on the O'Reilly Network. Great job of introducing this topic and providing a great example of how it can actually be used in real world applications. Thanks again!






There is only one thing missing that is to visualize the progress of the ETL process.
What would be greater than to have the ability to monitor the whole ETL process with the help of a Perl programme. The TK module from Perl would be an excellent tool to visualize the process. Even better would be to define different OO objects in Perl for different tasks that have to be carried out in the ETL process. For each ETL OO object specific properties could be defined. For example to carry out and sql task, it has to have a name, a SQL command, and so forth.
When this is compiled as a Perl service the ETL process can be run continously.
When at the end of the process a email will be sent to confirm that a process has been carried out succesfully or has gone wrong the whole programme is completely selfsufficient.
DutchAlbatros