
Smarter SQL for Data Science

Published by Pearson

Content level: Beginner to advanced

Rising to the Challenge of Data Science

  • Understand why SQL remains a powerful tool for the modern age
  • Develop greater fluency in your use of SQL—reduce fiddling
  • Learn how to make use of advanced features of database products such as PostgreSQL

Do you work with SQL databases? Do you ever write select statements? Have you ever considered the dizzying array of possibilities when writing a select statement to perform a complex analysis?

Do you wish you were more confident when it comes to writing joins, self joins, group bys, window functions, common table expressions, and recursive common table expressions?

Given an analysis problem, do you struggle with the exact features and phrasing you need to solve that problem within a single SQL statement?

If that sounds like you, you’re going to want to join us for this comprehensive training on SQL for the modern age!

Disclaimer: This training will include PostgreSQL-specific features & functionality that may not be present in other database products like MariaDB, MySQL, or SQLite.

Rising to the Challenge of Data Science Series: Data science is more than creating complex models or analyses from data. An effective data scientist must be able to navigate, normalize, clean, analyze, draw conclusions from, and communicate their data without relying on data engineers (cleaning/storage) or software developers (dashboarding/communication). To become an independent, full-stack data scientist, it is essential to master the skills and mechanics covered in this series, not only to write more robust, maintainable, and extensible code, but also to communicate your results effectively to stakeholders and colleagues, keeping the message within your own control.

Courses in this series:

  • Smarter Shell for Data Science
  • Smarter SQL for Data Science
  • Smarter Statistics for Data Science
  • Smarter Dashboarding for Data Science
  • Smarter Pandas for Data Science
  • Smarter Plotting for Data Science

What you’ll learn and how you can apply it

By the end of the live online course, you’ll understand:

  • The underlying design of SQL and how it influences our decision-making
  • How to use the capabilities of the modern SQL dialect to effectively solve analytical problems
  • How to determine when the core data retrieval functionalities within SQL are, or are not, appropriate for solving your problems
  • Similarities and differences between pandas and SQL, and the strengths and weaknesses of each

And you’ll be able to:

  • Construct both simple and complex queries, combine multiple queries, and make your queries more programmatic
  • Write sophisticated queries using features such as SQL variables and temporary tables
  • Write queries that work with hierarchical or graph data of variable (or unknown) depth
  • Avoid problems from anomalous data using table normalization

This live event is for you because...

  • You come from a non-traditional (non-programming) background and need to become effective with data science and analytics
  • You need the foundational knowledge to empower the data science/data analysis needs of your team
  • You encounter SQL and other database products (like PostgreSQL) regularly in your work
  • You aren’t using all of the functionalities in your database and you would like to see which advanced features of SQL could benefit you in your work

Prerequisites

  • Some prior experience with SQL and database tools
  • Knowledge of relevant syntax & core functionality

Course Set-up

  • No specific set-up required. Course notes and all materials provided during the session.
  • Recommended: Up-to-date Python installation and coding environment.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Easy Selects for Complex Analyses (45 minutes)

We’ll begin with a review of how to construct simple select queries with group by and join clauses. We’ll discuss the underlying design of SQL and how it influences our select statements—e.g., when and why do we use a where clause vs a having clause? We’ll cover common extensions to select, such as the ability to union, intersect, and subtract (except) query results. Finally, we’ll discuss basic dynamic mechanisms (e.g., prepared statements, string interpolation, and dynamic SQL), their uses, and their fundamental limitations.

In this section, we will aim to answer the following questions:

  • How do I query row-by-row, and what if I need to look at multiple rows at once? (simple select, limit, distinct, order by, group by, where vs having)
  • What complex expressions can I select? (null, is null, coalesce, case)
  • Do I care about data types? (json, jsonb, date/times, arbitrary precision numbers, enumerations, and ranges)
  • How do I query across multiple tables? (join, natural join, inner join vs outer join vs left join vs cross join)
  • How do I combine multiple queries? (union, intersect, except)
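The where-vs-having distinction above can be seen directly. Here is a minimal sketch using Python’s built-in sqlite3 module and an invented orders table; this is standard SQL that behaves the same way in PostgreSQL:

```python
import sqlite3

# In-memory database with a small, hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 120.0), ('alice', 80.0), ('bob', 30.0), ('bob', 45.0);
""")

# WHERE filters rows *before* grouping; HAVING filters groups *after*
# aggregation. Here: total spend per customer, counting only orders
# over 40, then keeping customers whose filtered total exceeds 100.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 40
    GROUP BY customer
    HAVING SUM(amount) > 100
""").fetchall()

print(rows)  # [('alice', 200.0)]
```

Note that bob’s 45.0 order survives the where clause but his group total (45.0) fails the having clause, so only alice appears in the result.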

Break (10 minutes)

Q&A (5 minutes)

Segment 2: Hard Selects for Complex Analyses (45 minutes)

Next, we’ll discuss sophisticated querying functionality, such as the use of SQL variables and temporary tables. We’ll discuss what problems self joins can solve (and when they are necessary), when we need to reach for subqueries, and what the advantages of common table expressions are. We’ll discuss how group by relates to window functions (and what problems window functions uniquely solve). Finally, we’ll cover recursive CTEs and the complex analyses they enable.

In this section, we will aim to answer the following questions:

  • Why doesn’t SQL have variables? Isn’t it a programming language? (variables, temporary tables)
  • What do I do if I have to look at two rows at the same time? (self join)
  • If there are no variables, how do I reuse results (and what if I need one query to drive another)? (subqueries)
  • How do I eliminate repetition (and write queries that compose multiple operations)? (common table expressions)
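As a taste of the recursive CTEs mentioned above, the following sketch walks a hypothetical reporting hierarchy of unknown depth (names and table design invented for the example; again using Python’s sqlite3, since recursive CTEs are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A hypothetical reporting hierarchy: manager_id points at another employee.
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'ada', NULL),   -- root of the tree
        (2, 'grace', 1),
        (3, 'alan', 1),
        (4, 'edsger', 2);
""")

# A recursive CTE walks the hierarchy to an arbitrary, unknown depth:
# the base case selects the root; the recursive step joins in children.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(rows)  # [('ada', 0), ('alan', 1), ('grace', 1), ('edsger', 2)]
```

No matter how many levels the hierarchy grows, the same single query traverses all of it—this is the class of problem that a fixed number of self joins cannot solve.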

Break (10 minutes)

Q&A (5 minutes)

Segment 3: Designing for Easy Selects (45 minutes)

We’ll discuss how to approach the design of our tables and data pipelines to make them amenable to easy analysis with SQL. We’ll discuss the necessity of bitemporal design when working with (external) temporal data. We’ll cover common SQL/DDL-level mechanisms such as constraints, primary keys, and foreign keys, and when they are appropriate tools for our use. We’ll discuss anomalies that might appear in our data and how to apply simple forms of table normalization to avoid these issues. Finally, we’ll discuss some theory of database design and querying, such as transactions, transactional properties (ACID), and isolation levels.

In this section, we will aim to answer the following questions:

  • What do I do if I want to guarantee that analytical data is free from (modeling) errors? (constraints, foreign keys)
  • What kind of anomalies might arise when data is updated? (insert, delete, update anomalies)
  • When and how might I choose (or not choose) to “normalize” my table design, and what do I gain? (database normalization, 1NF, 2NF, 3NF)
  • How can I tell when my queries are fast or slow, and what simple things can I do about it? (explain, analyze, and indices)
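To illustrate the constraint mechanisms above, here is a minimal sketch of a foreign key rejecting anomalous data (invented tables; SQLite is used for portability and enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL CHECK (amount > 0)
    );
    INSERT INTO customers VALUES (1, 'alice');
""")

conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 50.0)")  # ok

# An order for a customer that does not exist violates the foreign key,
# so the database refuses to store the anomalous row.
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

Declaring the constraint once in the schema means every insert path—application code, ad hoc scripts, bulk loads—gets the same guarantee.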

Break (10 minutes)

Q&A (5 minutes)

Segment 4: pandas vs. SQL (45 minutes)

We’ll discuss the pros and cons of using SQL over pandas (and vice versa) in order to best leverage their strengths. Moreover, we will discuss the features that make SQL unique and why it is one of the most popular choices for data persistence. We’ll present some data-driven problems and compare and contrast how they would be solved in pandas vs. in SQL. Finally, we’ll discuss when you would want your analyses to rely on SQL or pandas.

In this section, we will aim to answer the following questions:

  • What if I have tabular data and not enough memory? (in-core vs out-of-core, chunking, mmap/numpy.memmap)
  • How do I aggregate tabular data? (select … group by vs pandas.DataFrame.groupby(…).agg)
  • How do I group and transform tabular data? (pandas.DataFrame.groupby(...).transform vs window functions)
  • What if I want to combine related datasets? (pandas.DataFrame.join vs pandas.merge vs select … join …)
  • What else is different? (list comprehensions vs pandas vs SQL)
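The aggregation comparison above can be sketched without any third-party dependencies: the same per-customer total computed by a SQL group by and by a plain-Python loop (which is essentially what pandas.DataFrame.groupby('customer')['amount'].sum() does under the hood; data invented for the example):

```python
import sqlite3
from collections import defaultdict

data = [('alice', 120.0), ('alice', 80.0), ('bob', 30.0)]

# SQL side: SELECT ... GROUP BY with an aggregate function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", data)
sql_totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
))

# Plain-Python side: the same aggregation with a dictionary, analogous
# to a pandas groupby-then-sum on a 'customer'/'amount' DataFrame.
py_totals = defaultdict(float)
for customer, amount in data:
    py_totals[customer] += amount

print(sql_totals == dict(py_totals))  # True
```

The results agree; the interesting differences are operational—where the data lives, how much fits in memory, and which engine can use an index or a query plan to do the work.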

Break (10 minutes)

Q&A (5 minutes)

Segment 5: What Makes PostgreSQL so Cool? (45 minutes)

Finally, we’ll discuss features and extensions to SQL that PostgreSQL offers. We’ll discuss what problems these features allow us to solve, and why we might choose to use PostgreSQL over alternatives such as MariaDB, MySQL, or SQLite.

In this section, we will aim to answer the following questions:

  • How can I model arrays? (array type)
  • Can I define my own types and operations on those types? (user defined types, conversions, user defined aggregations)
  • How can I model records? (composite types)
  • Can I write functions in SQL? (functions vs procedures)
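As a flavor of what these questions cover, here is an illustrative sketch of PostgreSQL-specific DDL (table, type, and function names are invented; this will not run in SQLite, MySQL, or MariaDB):

```sql
-- Array column: one row can hold many tags without a separate join table.
CREATE TABLE posts (id serial PRIMARY KEY, tags text[]);
INSERT INTO posts (tags) VALUES (ARRAY['sql', 'postgres']);
SELECT id FROM posts WHERE 'sql' = ANY (tags);

-- Composite (record) type: a structured value usable as a column type.
CREATE TYPE address AS (street text, city text);

-- A function written in plain SQL rather than a procedural language.
CREATE FUNCTION total_tags() RETURNS bigint AS $$
    SELECT count(*) FROM posts, unnest(tags);
$$ LANGUAGE SQL;
```

Arrays, composite types, and SQL-language functions let you model richer data directly in the database instead of flattening everything into scalar columns.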

Break (10 minutes)

Q&A (5 minutes)

Your Instructors

  • James Powell

    James Powell is a world-renowned expert in data science and open source scientific computing. His subject mastery and guidance are sought after by global organizations for consulting, coaching, and staff training. He is one of the most prolific speakers in the community, and he shares new content each week on his YouTube page, LinkedIn, and Discord.

    In addition to O’Reilly Online Learning, James and his team at Don't Use This Code provide Python training and consulting services. He would love to talk to you about how your team uses open source tools for AI/ML; email him at james@dutc.io.

    James’s courses are dense and fast-paced. Be sure to stay for the full session and don’t worry if it goes by quickly. You’ll want to re-watch the video later to capture all of the details.

  • Cameron Riddell

    Cameron Riddell is a leading specialist on data analysis. He is an expert on analytic and visualization tools such as pandas, Matplotlib, and Bokeh. He delivers key insights to corporate clients to help design and maintain robust, efficient systems for data analysis and visualization.

    Cameron is also a prolific presenter, and you will want to check out his weekly blog by signing up for our newsletter.

    You can expect that any course led by Cameron will pack a lot of content into a short amount of time! You’ll want to stay for the entire session to see how it comes together and refer back to the recording to take in all of the detail.
