Chapter 8. Creating Complex Data Sets for Analysis

In Chapters 3 through 7, we looked at a number of ways in which SQL can be used to perform analysis on data in databases. In addition to these specific use cases, sometimes the goal of a query is to assemble a data set that is specific yet general-purpose enough that it can be used to perform a variety of further analyses. The destination might be a database table, a text file, or a business intelligence tool. The SQL that is needed might be simple, requiring only a few filters or aggregations. Often, however, the code or logic needed to achieve the desired data set can become very complex. Additionally, such code is likely to be updated over time, as stakeholders request additional data points or calculations. The organization, performance, and maintainability of your SQL code become critical in a way that isn’t the case for one-time analyses.

In this chapter, I’ll discuss principles for organizing code so that it’s easier to share and update. Then I’ll discuss when to keep query logic in the SQL and when to consider moving to permanent tables via ETL (extract-transform-load) code. Next, I’ll explain the options for storing intermediate results—subqueries, temp tables, and common table expressions (CTEs)—and considerations for using them in your code. Finally, I’ll wrap up with a look at techniques for reducing data set size and ideas for handling data privacy and removing personally identifiable information (PII).

When to Use ...

Get SQL for Data Analysis now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.