Chapter 1. A SELECT Review

This chapter is a very quick review of the SQL elements used throughout this book. In particular, we will cover simple SELECT statements, common table expressions (CTEs), the CASE expression, inner and outer joins, the behavior of NULL values in joins, and the COALESCE expression.

These will all be used in examples in the remainder of the book, so let’s take some time and get used to them here.

Simple SELECT Statements

In the beginning was the word, and the word was SELECT. To get anything out of a relational database table, you need SELECT. Here it is in its simplest form (to save space, only the first 10 rows are shown in the result set, and we’re counting DC as a state):

SELECT * FROM crm.CustomerCountByState;
  State Total
0 AK 6
1 AL 0
2 AR 1
3 AZ 9
4 CA 72
5 CO 9
6 CT 5
7 DC 1
8 DE 0
9 FL 28

That gives us a “raw” dump of all the columns in the table (the *), in whatever order the rows may happen to be stored in the database table (database order).

Now we will switch to a bit “wider” dataset with more columns. Let’s take the top 10 by whatever states happen to sort first in the customer dataset. We’re also going to be choosy about which columns we want and not just grab them all:

SELECT TOP 10   /* Because we only want 10           */
    LastName,
    FirstName,
    City,
    State
FROM crm.NormalizedCustomer
ORDER BY State; -- This sort determines which rows the TOP 10 returns
  LastName FirstName City State
0 Campain Roxane Fairbanks AK
1 Ferencz Erick Fairbanks AK
2 Giguere Wilda Anchorage AK
3 Kitty Gail Anchorage AK
4 Paprocki Lenna Anchorage AK
5 Weight Penney Anchorage AK
6 Deleo Carin Little Rock AR
7 Borgman Keneth Phoenix AZ
8 Eschberger Christiane Phoenix AZ
9 Kannady Regenia Scottsdale AZ

Let’s stop and take a look at all of that. First, look at the SQL SELECT itself. I tend to break up my SQL statements in this manner because it makes the different clauses easier to understand and lets me quickly scan the “pattern” of the query. Below it we see a table with the 10 rows and 4 columns we asked for.

Let’s keep going. How many customers do we have in each state? The following code is a common pattern—I use it all the time while doing exploratory data analysis (EDA). This example may not count true unique customers if there are duplicates in our data (often likely, even in a production CRM system), but this will be close enough for our example:

SELECT TOP 10
    State,
    COUNT(*) Total  /* "Total" is the name assigned to the column */
FROM crm.NormalizedCustomer
GROUP BY State      /* Required when selecting non-aggregated columns */
ORDER BY 2 DESC, 1; -- States with most customers first
  State Total
0 CA 72
1 NJ 52
2 NY 46
3 TX 32
4 PA 29
5 FL 28
6 OH 22
7 MD 17
8 IL 15
9 MI 14
Note

In most examples, I am going to use something like SELECT TOP 10 to limit the resulting output, as shown here.
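
SELECT TOP is SQL Server (T-SQL) syntax. If you are following along in a different database, the following sketch shows the common equivalents, using the first table from this chapter; check your own dialect’s documentation for the exact form it supports:

-- SQL Server / T-SQL
SELECT TOP 10 State FROM crm.CustomerCountByState ORDER BY State;

-- PostgreSQL, MySQL, SQLite
SELECT State FROM crm.CustomerCountByState ORDER BY State LIMIT 10;

-- ANSI-standard OFFSET/FETCH (SQL Server 2012+, PostgreSQL, Oracle 12c+)
SELECT State FROM crm.CustomerCountByState ORDER BY State
OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY;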

In this type of distribution query, we are looking at one (or more) columns and doing a COUNT(*) on the number of rows for each group (we will cover COUNT in Chapter 2 on useful SQL functions). Note we can give the output of the COUNT a column name in the results, in this case Total.

The results show us a nice distribution with a “long tail” of 1s in the Total column. The GROUP BY is required and needs to contain all columns that aren’t part of the aggregation (more on that in a moment).

The ORDER BY is interesting. Here I am sorting by the total, with the most common states first (2 DESC specifies the second column in the result set, in descending order), and then by the first column in the result set (the 1). Normally, using column ordinals in this clause is frowned on in production code, because if the column list changes, you often have to go back and rework the ORDER BY. Remove a column and suddenly your ordinal may be out of range, causing an error. But we aren’t writing production code in this book, and I often use this shorthand form to save typing when using the pattern:

SELECT Foo, COUNT(*) Total FROM Bar GROUP BY Foo ORDER BY 2 DESC, 1;

Many dialects support using ordinals in the GROUP BY as well. Again, I don’t think using them is a good idea for anything that might be pushed into production.
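
For example, in a dialect that allows it (PostgreSQL and MySQL do; T-SQL does not accept ordinals in GROUP BY), the shorthand pattern can be written entirely with ordinals. A sketch only, not something to push into production:

SELECT State, COUNT(*) Total FROM crm.NormalizedCustomer GROUP BY 1 ORDER BY 2 DESC, 1;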

If I didn’t want to use ordinals, I would do the following. This is the same query, but I am not using ordinals, and I am giving meaningful names to the result columns:

SELECT TOP 10
    State    [Top 10 States],   /* You can put spaces in column names to make */
    COUNT(*) [Total Customers]  /* the results more readable (and if you do,  */
                                /* surround the name with square brackets).   */
FROM crm.NormalizedCustomer
GROUP BY State
ORDER BY COUNT(*) DESC, State;  -- Note using the COUNT function again
  Top 10 States Total Customers
0 CA 72
1 NJ 52
2 NY 46
3 TX 32
4 PA 29
5 FL 28
6 OH 22
7 MD 17
8 IL 15
9 MI 14

Common Table Expressions

That second COUNT(*) brings up another point: we can use functions in places other than the column list, such as the ORDER BY clause (and, for non-aggregate functions, the WHERE clause). What if, instead of the top 10, we simply want all states with 10 or more customers? We wouldn’t know that number in advance, so we couldn’t use it in a SELECT TOP 10 (or LIMIT 10) clause. Here is one way to do it, using common table expressions (CTEs), aka the WITH statement:

/*
   Think of a CTE as a temporary "view,"
   although it isn't as performant. In
   this case the CTE name is CustomerCountByState.
*/
WITH CustomerCountByState
AS
(
    SELECT
        State,
        COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
)
/*
   Now we can use our CTE.
*/
SELECT
    State [States w/10 or More Customers],
    Total [# in State]
FROM CustomerCountByState
WHERE Total > 9             /* Or >= 10, your choice */
ORDER BY Total DESC, State;
  States w/10 or More Customers # in State
0 CA 72
1 NJ 52
2 NY 46
3 TX 32
4 PA 29
5 FL 28
6 OH 22
7 MD 17
8 IL 15
9 MI 14
10 MA 12
11 WI 11
12 TN 10

CTEs aren’t particularly efficient. If you reference one in multiple places in a query, it behaves more like a macro expansion than a true view. But I like them because they let me build up queries in an orderly manner that others can understand. There are many ways to solve the preceding problem in SQL; a subquery would get us to the same place, but in my opinion it isn’t as readable. Basically, you can think of a CTE as lifting the subquery out of the following SELECT, predefining it, and giving it a name. I have aliased the subquery CustomerCountByState so that the similarity to the preceding CTE is explicit:

SELECT
    State [States w/10 or More Customers],
    Total [# in State]
FROM
(
    SELECT
        State,
        COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
) CustomerCountByState   /* Gotta give the subquery an alias.
                            It can be anything. I often use "A"
                            for simple queries like this, but am
                            naming it the same as the CTE to be
                            explicit. */
WHERE Total > 9
ORDER BY Total DESC, State;
  States w/10 or More Customers # in State
0 CA 72
1 NJ 52
2 NY 46
3 TX 32
4 PA 29
5 FL 28
6 OH 22
7 MD 17
8 IL 15
9 MI 14
10 MA 12
11 WI 11
12 TN 10
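
For completeness, this particular question can also be answered without a CTE or a subquery by filtering the aggregate directly with a HAVING clause. A minimal sketch against the same table (it returns the same 13 rows):

SELECT
    State [States w/10 or More Customers],
    COUNT(*) [# in State]
FROM crm.NormalizedCustomer
GROUP BY State
HAVING COUNT(*) > 9         /* Or >= 10, your choice */
ORDER BY COUNT(*) DESC, State;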
Note

In many cases I will use CTEs during EDA and then, as I get closer to rolling something into production, change them into proper views with the same names. Hence:

-- This:
WITH Foo
AS
(
    SELECT * FROM Bar WHERE Status = 'Foo'
)
SELECT * FROM Foo;
-- Becomes this:

CREATE VIEW Foo
AS
(
    SELECT * FROM Bar WHERE Status = 'Foo'
)

The query can remain unchanged (presuming it is running in the default schema), and the view can be used in other queries as well:

SELECT * FROM Foo;

The interesting thing about CTEs is that they can be chained, each one building on and filtering the one before it:

WITH CustomerCountByState
AS
(
    SELECT
        State,
        COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
),
CitiesInTopStates
AS
(
    SELECT
        City,
        State,
        COUNT(*) Total
    FROM crm.NormalizedCustomer
    WHERE
        /*
           States with more than 4 (or >= 5) customers.
        */
        State IN (SELECT State FROM CustomerCountByState WHERE Total > 4)
    GROUP BY City, State
)
SELECT
    City,
    State,
    Total
FROM CitiesInTopStates
/*
   Cities with 5 or more (or > 4) customers.
*/
WHERE Total >= 5
ORDER BY Total DESC, City, State;
  City State Total
0 New York NY 14
1 Philadelphia PA 8
2 Chicago IL 7
3 Miami FL 6
4 Raton NM 6
5 Baltimore MD 5
6 Gardena CA 5
7 Milwaukee WI 5
8 Orlando FL 5
9 Phoenix AZ 5
10 San Francisco CA 5

Note in the following example that a CTE can be a nice way to get you away from using ordinals or SQL functions in your ORDER BY clause, but I wouldn’t recommend using them only for that:

WITH TopTenStates
AS
(
    SELECT
        State,
        COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
)
SELECT TOP 10
    State [Top 10 States],
    Total [Total Customers]
FROM TopTenStates
ORDER BY Total DESC, State;
  Top 10 States Total Customers
0 CA 72
1 NJ 52
2 NY 46
3 TX 32
4 PA 29
5 FL 28
6 OH 22
7 MD 17
8 IL 15
9 MI 14

In CASE of Emergency

I use the CASE expression a lot. It is basically an expression form of the switch statement found in C-style programming languages. Remember your ancient history? Something like this:

/* Ye olde C-style switch statement - brought to you by K&R. */
switch(foo) {
    case 0:
        bar = "It was zero.";
        break;
    case 1:
        bar = "It was one.";
        break;
    default:
        bar = "Dunno what it was.";
        break;
}

You can achieve similar results in SQL as follows:

SELECT
    CASE foo
        WHEN 0 THEN 'It was zero.'
        WHEN 1 THEN 'It was one.'
        ELSE 'Dunno what it was'
    END bar
FROM xyzzy;

In this book, you will see me use all three of the styles shown in the following code block, for varying reasons: mood, the space available to show an example, and whether I am emphasizing the CASE or something else in the query.

Consider this simple table:

City USPSCity
O’Fallon OFallon

How do we check whether the two cities are “the same”? The following code shows how. I expand upon this technique later in the book; here the point is simply to show three styles of CASE expression, all checking for the same thing:

/*
   We will cover this example in depth in Chapter 3.
*/
WITH NormalizeCities
AS
(
    SELECT
        City,
        USPSCity,
        REPLACE(TRANSLATE(City,     '.,-''()[]`', '         '), 
                ' ', '') NormCity,
        REPLACE(TRANSLATE(USPSCity, '.,-''()[]`', '         '), 
                ' ', '') NormUSPSCity
    FROM Foo
)
SELECT
    City,
    USPSCity [USPS City],
    /*
       Single-line CASE expression.
    */
    CASE City WHEN USPSCity THEN 'Match!' ELSE 'No Match!' END [Match?],
    NormCity [Norm. City],
    NormUSPSCity [Norm. USPS City],
    /*
       Multi-line CASE with expression test after the CASE.
    */
    CASE NormCity
        WHEN NormUSPSCity THEN 'Match!'
        ELSE 'No Match!'
    END [Do Normalized Match?],
    /*
       Multi-line CASE with the test expression in the WHEN. Note the "=".
    */ 
    CASE
        WHEN NormCity = NormUSPSCity THEN 'Match!'
        ELSE 'No Match!'
    END [Still?]
FROM NormalizeCities;
City USPS City Match? Norm. City Norm. USPS City Do Normalized Match? Still?
O’Fallon OFallon No Match! OFallon OFallon Match! Match!

Gee, that’s a lot! Let’s take it apart piece by piece. The WITH statement creating the NormalizeCities CTE is there simply to keep the CASE expressions in the example SELECT simpler (the REPLACE and TRANSLATE functions will be explained in Chapter 2).

The SELECT then works against the NormalizeCities CTE. We are simply looking at different ways to express the CASE expression, so that’s what I want to emphasize here. The four columns in the SELECT dumping out the various forms of the city are obvious, so we will look at the three CASE expressions in order. All are basically a simple “if…then…else” expression—if a match occurs, then return “Match!”, else “No Match!” For something as simple as this, see also the IIF SQL function in Chapter 2.

The first CASE is expressed all in a line:

CASE City WHEN USPSCity THEN 'Match!' ELSE 'No Match!' END [Match?]
Note

You can read it like this: “In the CASE of the value in City, WHEN it is equal to USPSCity THEN return the value ‘Match!’, ELSE return the value ‘No Match!’, and name the result (the column in the result set) Match?”

The next one varies only in adding new lines and indentation to make the syntactic elements more obvious to the script reader (that’s you). It returns a column named Do Normalized Match?:

CASE NormCity
    WHEN NormUSPSCity THEN 'Match!'
    ELSE 'No Match!'
END [Do Normalized Match?],

The third is of more interest:

CASE
    WHEN NormCity = NormUSPSCity THEN 'Match!'
    ELSE 'No Match!'
END [Still?]

In this case (ahem) there is no value following the CASE. Instead, the WHEN has the test clause in it, followed by the THEN.

Note

Think of it like this: “In the following CASEs, WHEN the value of NormCity is equal to the value of NormUSPSCity, THEN return the value ‘Match!’, ELSE return the value ‘No Match!’, and name the result Still?”

This is often the preferred form, because it allows more complexity. See the next example.

As mentioned, the SQL function IIF does a similar job (as can COALESCE or ISNULL in the right circumstances), so why go to all the complication of a CASE expression? For something like this, I wouldn’t. However, let’s change the city and see how we can use CASE to handle a much more complicated situation. We will use only the last style of CASE expression for this, so we can look at each test line by line:

DECLARE @City VARCHAR(50) = 'St Louis'           /* How we store it           */
DECLARE @ImportCity VARCHAR(50) = 'Saint Louis'  /* How Marketing receives it */
SELECT
    @City City,
    @ImportCity [Imported City],
    CASE
        /*
           Really crude approach, but effective. Note the trailing space in
           each match term, to make sure we don't match embedded "St"
           instances, for example.
        */
        WHEN REPLACE(@City, 'St ',   'Saint ') =  @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'St. ',  'Saint ') =  @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity THEN 'Match!'
        /*
           And so on, and so forth...however many CASEs you got...
        */
        ELSE 'No Match!'
    END [Result?];
City Imported City Result?
St Louis Saint Louis Match!

Even that’s not too exciting, since it could be replaced by a single IIF function (explained in Chapter 2):

DECLARE @City VARCHAR(50) = 'St Louis'           /* Same declarations as before */
DECLARE @ImportCity VARCHAR(50) = 'Saint Louis'
SELECT
    @City City,
    @ImportCity [Imported City],
    IIF(REPLACE(@City, 'St ',   'Saint ') =  @ImportCity
        OR REPLACE(@City, 'St. ',  'Saint ') =  @ImportCity
        OR REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity
        OR REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity,
        'Match!', 'No Match!') [Result?];

That brings us back to why we would use CASE over something simpler like IIF. How about this? (The LEN, LEFT, and RIGHT functions will also be explained in Chapter 2.)

DECLARE @City VARCHAR(50) = 'Ste Genevieve'       /* How we store it           */
DECLARE @ImportCity VARCHAR(50) = 'St Genevieve'  /* How Marketing receives it */
SELECT
    @City City,
    @ImportCity [Imported City],
    CASE
        /*
           WHENs can actually check all variety of tests.
        */
        WHEN REPLACE(@City, 'St ',   'Saint ') =  @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'St. ',  'Saint ') =  @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity THEN 'Match!'
        WHEN REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity THEN 'Match!'
        WHEN LEFT(@ImportCity, LEN('St ')) = 'St '
            AND RIGHT(@ImportCity, LEN('Genevieve')) = 'Genevieve'
            THEN 'Chauvinist!'
        WHEN LEFT(@ImportCity, LEN('St. ')) = 'St. '
            AND RIGHT(@ImportCity, LEN('Genevieve')) = 'Genevieve'
            THEN 'Chauvinist!'
        ELSE 'No Match!'
    END [Result?];
City Imported City Result?
Ste Genevieve St Genevieve Chauvinist!

Here, we went beyond simple tests of equality to see whether there are any underlying data quality issues, which we politely call out (since the town and its namesake are both of French origin and Genevieve is feminine, the proper title is “Sainte,” not “Saint”).

Joins

We’ve covered simple SELECT statements, so let’s look at joins next. Like belly buttons, there are basically two types: INNER and OUTER. Outer joins also come in LEFT and RIGHT flavors, but for our purposes I almost always use LEFT. Let’s look at an INNER JOIN first. For every row in the customer data, we find the corresponding “friendly” state name in the postal reference table. You may not have seen this written with the SQL INNER JOIN syntax; you may know it in the older style, where the join condition lives in the WHERE clause:

SELECT TOP 10
    C.LastName [Last Name],
    C.FirstName [First Name],
    C.City,
    A.StateOrProvince State
/*
   Both tables are in the FROM
*/
FROM crm.NormalizedCustomer C,
     ref.PostalAbbreviations A
WHERE
    C.State = A.Abbreviation
ORDER BY C.LastName, C.FirstName;
  Last Name First Name City State
0 NULL NULL Jefferson City Missouri
1 NULL NULL Boulder Colorado
2 NULL NULL Raton New Mexico
3 Abdallah Johnetta Chapel Hill North Carolina
4 Acey Geoffrey Palatine Illinois
5 Acuff Weldon Arlington Heights Illinois
6 Adkin Barbra Brooklyn New York
7 Agramonte Fausto New York New York
8 Ahle Delmy Providence Rhode Island
9 Albares Cammy Laredo Texas

We can also use the INNER JOIN...ON syntax to be more explicit about our intentions:

SELECT TOP 10
    C.LastName [Last Name],
    C.FirstName [First Name],
    C.City,
    A.StateOrProvince State
FROM crm.NormalizedCustomer C
INNER JOIN ref.PostalAbbreviations A ON
    C.State = A.Abbreviation
ORDER BY C.LastName, C.FirstName;
  Last Name First Name City State
0 NULL NULL Jefferson City Missouri
1 NULL NULL Boulder Colorado
2 NULL NULL Raton New Mexico
3 Abdallah Johnetta Chapel Hill North Carolina
4 Acey Geoffrey Palatine Illinois
5 Acuff Weldon Arlington Heights Illinois
6 Adkin Barbra Brooklyn New York
7 Agramonte Fausto New York New York
8 Ahle Delmy Providence Rhode Island
9 Albares Cammy Laredo Texas

Note the two result sets are identical. I will tend to use explicit JOIN syntax throughout this book.

The thing to remember with inner joins is that if there isn’t a match between the two tables, the row doesn’t make it into the result set. INNER is like “intersection,” if you remember sets from math. For example, our customer data contains only United States states plus the District of Columbia. If there were a customer from the Canadian province of New Brunswick, that customer’s data would not be included in the preceding results.

One thing that’s new in both examples is that we’ve given each table an alias: “C” for NormalizedCustomer and “A” for PostalAbbreviations. This helps distinguish which columns we are talking about when both tables share a column name: is it the column in the customer data or the one in the reference data? When no column names are shared between the two tables, the aliases aren’t strictly needed, but I find it good practice to always use them on joins, again to be explicit about my intentions. Notice we also gave more user-friendly column names in the result set, with spaces in Last Name and First Name instead of just CamelCase, e.g., LastName.

Also, on a JOIN the ON is not a complete replacement for the WHERE clause. In fact, trying to make it so can lead to some surprising results.
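
The classic surprise shows up with outer joins (covered later in this chapter): a filter placed in the ON clause of a LEFT OUTER JOIN does not remove unmatched left-side rows; it just turns their right-side columns into NULLs. Here is a minimal sketch using the same two tables (the filter value 'Anchorage' is purely illustrative):

/* Filter in the WHERE: only states that actually have a customer in Anchorage. */
SELECT A.StateOrProvince, C.City
FROM ref.PostalAbbreviations A
LEFT OUTER JOIN crm.NormalizedCustomer C ON A.Abbreviation = C.State
WHERE C.City = 'Anchorage';

/* Same filter moved into the ON: every state still comes back (it is a LEFT
   OUTER JOIN, after all); the non-Anchorage rows simply have NULL for City. */
SELECT A.StateOrProvince, C.City
FROM ref.PostalAbbreviations A
LEFT OUTER JOIN crm.NormalizedCustomer C ON A.Abbreviation = C.State
                                        AND C.City = 'Anchorage';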

A Diversion into NULL Values

Check this out: we have NULL values in our CRM data in some name fields:

SELECT LastName, FirstName from crm.NormalizedCustomer WHERE LastName IS NULL;
  LastName FirstName
0 NULL NULL
1 NULL NULL
2 NULL NULL

We also have them in our import dataset:

SELECT LastName, FirstName from staging.Customers ORDER BY LastName, FirstName;
  LastName FirstName
0 NULL NULL
1 NULL NULL
2 NULL NULL
3 Abdallah Johnetta
4 Acey Geoffrey
5 Acuff Weldon
6 Adkin Barbra
7 Agramonte Fausto
8 Ahle Delmy
9 Albares Cammy

So why don’t they get found in a JOIN?

SELECT
	SC.LastName,
	SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC ON
	SC.LastName = NC.LastName
	AND SC.FirstName = NC.FirstName
ORDER BY SC.LastName, SC.FirstName;
  LastName FirstName
0 Abdallah Johnetta
1 Acey Geoffrey
2 Acuff Weldon
3 Adkin Barbra
4 Agramonte Fausto
5 Ahle Delmy
6 Albares Cammy

“Wait!” you exclaim. “I remember something about ANSI_NULLS.” Let’s try that:

SET ANSI_NULLS OFF;
SELECT
	SC.LastName,
	SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC ON
	SC.LastName = NC.LastName
	AND SC.FirstName = NC.FirstName
ORDER BY SC.LastName, SC.FirstName;
  LastName FirstName
0 Abdallah Johnetta
1 Acey Geoffrey
2 Acuff Weldon
3 Adkin Barbra
4 Agramonte Fausto
5 Ahle Delmy
6 Albares Cammy

Hmmm…still missing those NULL rows. Why? Because it is working as designed; as the documentation says, “SET ANSI_NULLS ON affects a comparison only if one of the operands of the comparison is either a variable that is NULL or a literal NULL. If both sides of the comparison are columns or compound expressions, the setting does not affect the comparison.” This is where you have to change your approach, for example with a correlated EXISTS in the WHERE clause that explicitly treats NULLs as equal:

SELECT
	SC.LastName,
	SC.FirstName
FROM staging.Customers SC
WHERE EXISTS (SELECT 1 FROM crm.NormalizedCustomer NC
              WHERE (NC.LastName = SC.LastName
                     OR (NC.LastName IS NULL AND SC.LastName IS NULL))
                AND (NC.FirstName = SC.FirstName
                     OR (NC.FirstName IS NULL AND SC.FirstName IS NULL)))
ORDER BY SC.LastName, SC.FirstName;
  LastName FirstName
0 NULL NULL
1 NULL NULL
2 NULL NULL
3 Abdallah Johnetta
4 Acey Geoffrey
5 Acuff Weldon
6 Adkin Barbra
7 Agramonte Fausto
8 Ahle Delmy
9 Albares Cammy

So, be aware when joining tables whether any of the join columns can contain NULLs; if they can, you will have to change your approach.
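
Another common workaround is to wrap the join columns so that NULLs compare as equal. The sketch below assumes the tilde character never occurs in real name data; any sentinel value with that property will do:

SELECT
	SC.LastName,
	SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC ON
	ISNULL(SC.LastName, '~') = ISNULL(NC.LastName, '~')
	AND ISNULL(SC.FirstName, '~') = ISNULL(NC.FirstName, '~')
ORDER BY SC.LastName, SC.FirstName;

Be aware that wrapping join columns in a function like this can keep the optimizer from using any indexes on them, which matters on large tables.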

OUTER JOINs

Now for an OUTER JOIN. This is not a “real” example but is useful for our limited data so far. What states do we not have customers in? A left outer join returns every row from the “left-side” table; wherever there is no matching row in the “right-side” table (in the following, dbo.PotentialMatches), the right-side columns come back as NULL:

SELECT TOP 10
    A.StateOrProvince State,
    P.last_name [Last Name],
    P.first_name [First Name],
    P.company_name Company,
    P.City
FROM ref.PostalAbbreviations A
/*
   Attempt to join against our import data.
*/
LEFT OUTER JOIN dbo.PotentialMatches P ON
    A.Abbreviation = P.State
ORDER BY A.StateOrProvince;
  State Last Name First Name Company City
0 Alabama NULL NULL NULL NULL
1 Alaska NULL NULL NULL NULL
2 Arizona NULL NULL NULL NULL
3 Arkansas NULL NULL NULL NULL
4 California Hamilton Charlene NULL Santa Rosa
5 California Rochin Xuan NULL San Mateo
6 California Waycott Kanisha NULL L.A.
7 Colorado NULL NULL Foo, Inc. Boulder
8 Connecticut Layous Ma NULL North Haven
9 Delaware NULL NULL NULL NULL

Those first four rows with NULL values in the import data indicate there were four states found in the ref.PostalAbbreviations table (left) that had no matches within the import data (right).

Now we can add the WHERE clause using this fact:

SELECT TOP 10
    A.StateOrProvince [1st 10 States Not In Import File]
FROM ref.PostalAbbreviations A
/*
   Join against our import data.
*/
LEFT OUTER JOIN dbo.PotentialMatches P ON
    A.Abbreviation = P.State
WHERE
    P.State IS NULL     /* Where no match in import data */
ORDER BY A.StateOrProvince;
  1st 10 States Not In Import File
0 Alabama
1 Alaska
2 Arizona
3 Arkansas
4 Delaware
5 District of Columbia
6 Georgia
7 Idaho
8 Illinois
9 Iowa

If the “left” and “right” seems confusing, think of the SELECT as all on one line, reading from left to right:

SELECT * FROM lefttable l LEFT OUTER JOIN righttable r ON l.somecol = r.somecol

In a left OUTER JOIN, the columns from the right table will have NULLs in the result set if there wasn’t a match.

The preceding is a bit contrived and may not be useful, however. There could be lots of states without potential customers. Here’s a better question: What states do we not have current customers in? Where do we need to concentrate our sales efforts?

SELECT TOP 10
    A.StateOrProvince [State Without Customers]
FROM ref.PostalAbbreviations A
/*
   Join against our customer data.
*/
LEFT OUTER JOIN crm.NormalizedCustomer C ON
    A.Abbreviation = C.State
WHERE
    C.State IS NULL     /* Where no match in customer data */
ORDER BY A.StateOrProvince;
  State Without Customers
0 Alabama
1 Delaware
2 Vermont
3 West Virginia

Of course, we could have gotten the same result with a sub-SELECT. Which is clearer? I will let you decide:

SELECT
    StateOrProvince [State Without Customers]
FROM ref.PostalAbbreviations
WHERE
    Abbreviation NOT IN (SELECT State FROM crm.NormalizedCustomer)
ORDER BY StateOrProvince;
  State Without Customers
0 Alabama
1 Delaware
2 Vermont
3 West Virginia
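
One caution about the NOT IN form: if the subquery can ever return a NULL (imagine a customer row with no State at all), NOT IN matches nothing and silently returns zero rows. A NOT EXISTS version, sketched here, sidesteps that:

SELECT
    StateOrProvince [State Without Customers]
FROM ref.PostalAbbreviations A
WHERE NOT EXISTS (SELECT 1 FROM crm.NormalizedCustomer C
                  WHERE C.State = A.Abbreviation)
ORDER BY StateOrProvince;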

Obviously the reverse is useful, too, and actually the more common case:

SELECT
    COUNT(DISTINCT A.StateOrProvince) [# States With Customers]
FROM ref.PostalAbbreviations A
LEFT OUTER JOIN crm.NormalizedCustomer C ON
    A.Abbreviation = C.State
WHERE
    C.State IS NOT NULL;  -- At least 1 match in customer data
# States With Customers
47

Let’s go back to the import data. It wasn’t useful to ask what states weren’t in it; it could be many. It wasn’t useful to do a join to ask what states were in it; we can just extract that with a SELECT DISTINCT or a GROUP BY State. But import data is noisy. Are there any records in the import data that aren’t from a valid United States state or the District of Columbia? For once I will show a RIGHT OUTER JOIN, just to keep the tables in the SELECT in the same order:

SELECT
    P.State [Invalid State In Import File]
FROM ref.PostalAbbreviations A
/*
   Join against our import data.
*/
RIGHT OUTER JOIN dbo.PotentialMatches P ON
    A.Abbreviation = P.State
WHERE
    A.Abbreviation IS NULL     /* Where no match in left dataset */
ORDER BY P.State;
  Invalid State In Import File
0 NULL
1 NB

We see the import file has at least one missing state (not unusual) and one for “NB”. What’s that?

SELECT
    P.country,
    P.state [Invalid State In Import File]
FROM ref.PostalAbbreviations A
/*
   Join against our import data.
*/
RIGHT OUTER JOIN dbo.PotentialMatches P ON
    A.Abbreviation = P.state
WHERE
    A.Abbreviation IS NULL     /* Where no match in left dataset */
ORDER BY P.state;
  Country Invalid State In Import File
0 U.S. NULL
1 Canada NB

Ah, “NB” is in Canada, so it is New Brunswick! One mystery solved, another opened. What are we going to do with it? In many cases you will have to ask “The Business” how to deal with this. For regulatory reasons you may not be able to deal with someone in Canada. Do you “drop the record on the floor” or send it somewhere so a “We’re sorry” communication can be sent? Or does it need to be routed to a subsidiary or parent company in that country? That is beyond the scope of the book but does show some of the issues you will have to deal with. But you know how to identify the person in any case:

SELECT
    last_name [Last Name],
    first_name [First Name]
FROM dbo.PotentialMatches
WHERE state = 'NB';
Last Name First Name
Snedley Clive

Again, outer joins are useful when you know data on one “side” (I typically choose the right) could be missing or unmatched when joining with data from the other side, in this case the left. In other words, if you think of the canonical list of all states as living in the reference table on the “left,” there may or may not be customers from every one of those states in the CRM table on the “right.” When joining, then, you can expect NULL values in the right-side columns of any result row whose left-side row has no match on the right.

Finding the Most Current Value

Consider the following “real-life” scenario.

A loan may move from an opportunity in a CRM system to a loan application in the loan origination system. Being different applications, these are different tables in different databases. Then perhaps the loan origination system moves the loan from the application phase to underwriting and the processing phase, which moves it to another table, too. Perhaps it then goes to a servicing table after the loan is signed by the customer and booked. That table may in fact be in another application database as well, as part of the banking “core” system. In all four of those tables across three applications will be a value for the total loan amount, which may naturally change along the way with the normal back-and-forth of getting a loan and then paying it off over time. So, the question becomes, “Regardless of the state of the loan—opportunity, application, processing, or servicing—what is the current loan amount?”

Behold the COALESCE expression and its beauty (see also Chapter 2), which returns the first non-NULL value in a list of values. Also notice how I naturally include the use of outer joins and their utility, too. Sneaky:

SELECT
   COALESCE(L.LoanAmount, P.LoanAmount, 
            A.LoanRequestAmount, O.OpportunityAmount) LoanAmount
FROM
   crm.Opportunity O
LEFT OUTER JOIN loan.LoanApplication A ON
   O.CustomerId = A.CustomerId
LEFT OUTER JOIN loan.LoanProcessing P ON
   O.CustomerId = P.CustomerId
LEFT OUTER JOIN core.Loan L ON
   O.CustomerId = L.CustomerId
WHERE O.CustomerId = 12345;

First, we know the preceding is unreal because there is no way three disparate systems would share the same customer IDs, let alone have the same column names for them. We will cover how those types of joins are done across applications that don’t know about each other in later chapters, especially Chapter 9. But looking beyond that, we start the SELECT with the Opportunity table, because that’s where any deal is going to start—with a loan officer opening an opportunity and filling in a few details. The LEFT OUTER JOIN clauses then bring in the rows from the other applications, if they exist.

Since the flow of the loan is from opportunity to application to processing to servicing, we see those stages in reverse order in the COALESCE. A dashboard or report will typically want the “most current” information, so we start by looking at the core system’s L.LoanAmount field. If it is not NULL, the loan has booked, and the core’s loan accounting system will have the currently calculated daily loan balance. If it is NULL, however, then the other columns are examined, left to right, until a non-NULL value is found, with O.OpportunityAmount being the guaranteed backstop (and also the fuzziest value when you think about it, since it will almost certainly change over the life of the loan process).

If the previous COALESCE had been expressed as either a CASE expression or an IIF function, it would have been much more complicated and ugly!
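
To see why, here is roughly what the same logic looks like spelled out as a CASE expression (a sketch using the same columns; COALESCE is effectively shorthand for exactly this kind of cascade):

CASE
    WHEN L.LoanAmount IS NOT NULL        THEN L.LoanAmount
    WHEN P.LoanAmount IS NOT NULL        THEN P.LoanAmount
    WHEN A.LoanRequestAmount IS NOT NULL THEN A.LoanRequestAmount
    ELSE O.OpportunityAmount
END LoanAmount

Nest the equivalent IIF calls and it only gets worse.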

Final Thoughts on SELECT

To do anything “in real life” that I discuss in this book, you will need at least SELECT access to the tables and views under investigation. In Microsoft SQL Server, this can be easily granted with the db_datareader role on the database, but since that is a database-level role, it may not be appropriate to your environment if you work in a highly secured industry. You will need to be granted at least the following or equivalent on every table you will be accessing:

GRANT SELECT ON Foo.Bar TO YourLogin;

YourLogin can obviously be a database role or group in which your ID is a member. Of course, if you created the database(s), you are db_owner and can do what you want.
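
For example, in SQL Server a read-only setup might look like the following sketch; the role name DataReviewers and the crm schema are just placeholders for whatever your environment uses:

CREATE ROLE DataReviewers;
GRANT SELECT ON SCHEMA::crm TO DataReviewers;   -- or GRANT SELECT table by table
ALTER ROLE DataReviewers ADD MEMBER YourLogin;  -- YourLogin must be a database user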

I discussed the utility of CTEs (see “Common Table Expressions”). I use them a lot in my work but can’t emphasize enough that they are not a panacea. If a query is going to get run repeatedly, it is better to create a view for it and let the SQL optimizer do its job. CTEs behave basically like macros, that is, “expanding in place” where they are referenced, and can be very slow if you create complicated chains of them. However, sometimes in a consulting job all you have is SELECT granted and no Data Definition Language (DDL) access to CREATE a view. Then you have to do what you have to do, and CTEs are often “good enough.”

In case you missed it, I find the CASE expression flowing out of my fingertips almost every time I am doing matching or mapping of data. I find it much more readable than the IIF function, even if it is a bit more typing:

-- This?
SELECT IIF(Foo = 1, 'Foo', 'Unfoo') [Foo?] FROM Xyzzy;

-- Or this?
SELECT CASE WHEN Foo = 1 THEN 'Foo' ELSE 'Unfoo' END [Foo?] FROM Xyzzy;

In this case (I promise, that’s the last pun), they are equivalent, and the choice is a matter of personal taste. However, as so often happens, requirements change, and suddenly you may find yourself wanting to do the following:

-- No. Just no. This way lies (t)error!
SELECT IIF(Foo = 1, 'Foo', 
       IIF(Bar = 1, 'Bar', 'Unbarrable')) [Foo or Bar?] FROM Xyzzy;

-- But this approach adapts pretty easily.
SELECT
    CASE
        WHEN Foo = 1 THEN 'Foo'
        WHEN Bar = 1 THEN 'Bar'
        ELSE 'Unbarrable'
    END [Foo or Bar?]
FROM Xyzzy;

You can see that the IIF approach will get ugly pretty fast as the tests to be checked multiply.

We’ve now seen that the absence of data, such as NULLs returned from LEFT OUTER JOIN queries, can be as interesting as its presence and that, in fact, missing data is the first level of “fuzzy.” Let’s move on to Chapter 2 and look at some useful SQL functions to add to our SELECT statements to start dealing with all that fuzzy data out there!
