Preface
Welcome to Learning Snowflake SQL and Scripting. Perhaps you are brand new to databases and will need to run queries or reports against a Snowflake database. Or perhaps, like myself, you have been working with databases such as Oracle, SQL Server, or MySQL for years, and your company has begun transitioning to cloud-based platforms. Whatever the case, this book strives to empower you with a detailed understanding of Snowflake’s SQL implementation so that you can be as effective as possible.
To help put things in context, I’ll start with a brief history of databases, starting with the introduction of relational databases in the 1980s and leading up to the availability of cloud-based database platforms such as Snowflake. If you’re ready to jump right into learning SQL, feel free to move on to Chapter 1, but you should read “Setting Up a Sample Database” if you want to create your own database with which to experiment.
Relational Database Primer
Computerized database systems have been around since the 1960s, but for the purposes of this book, let’s start with the introduction of relational databases, which started hitting the market in the 1980s with products such as Oracle, Sybase, SQL Server (Microsoft), and Db2 (IBM). Relational databases are based on rows of data stored in tables, and related rows stored in different tables are linked using redundant values. For example, the ACME Wholesale customer can be identified using customer ID 123
in the Customer
table, and any of ACME’s orders in the Orders
table would also be identified using customer ID 123
.
A table, such as Customer
or Orders
mentioned above, is comprised of multiple columns, such as name, address, and telephone number. One or more of these columns is used to uniquely identify each row in the table (known as the primary key). For the example database used for this book, there is a Customer
table whose primary key consists of a single column named custkey
, and every row in the Customer
table must have a unique custkey
value. There is also an Orders
table that includes the column custkey
to reference the customer who placed the order. The Orders.custkey
column is referred to as a foreign key to the Customer
table and must contain a value that exists in the Customer.custkey
column. Table P-1 shows a quick recap of the terminology introduced so far.
Column | An individual piece of data |
Row | A set of related columns |
Table | A set of rows |
Primary key | One or more columns that can be used as a unique identifier for each row in a table |
Foreign key | One or more columns that can be used together to identify a single row in another table |
So far, we’ve discussed the use of redundant column values to link tables via primary and foreign keys, but there are rules regarding the storage of redundant values. For example, it is perfectly fine for the Orders
table to include a column to hold values of Customer.custkey
, but it is not okay to include other columns from the Customer
table, such as the name or address columns. If you were looking at a row in the Orders
table and wanted to know the name and address of the customer who placed the order, you should go get these values from the Customer
table rather than storing the customer’s name and address in the Orders
table. The process of designing a database to ensure that each independent piece of information is in only one place (except for foreign keys) is known as normalization.
Normalization rules also apply to individual columns, in that a column should hold only a single independent value. One common example would be a mailing address, which is comprised of multiple elements, such as street, city, state, and zip code. A normalized design would therefore include multiple columns, as demonstrated in Table P-2.
Address1 | 3 Maple Street |
Address2 | Suite 305 |
City | Anytown |
State | TX |
Zip code | 12345 |
Companies often have multiple databases used for different purposes, and the degree of normalization can vary greatly. A database used exclusively by a company’s shipping department, for example, may include a single address column used to print shipping labels, which for the example above might contain the value "3 Maple Street, Suite 305, Anytown, TX 12345.
" It may also be the case that the shipping database is refreshed daily from a central, normalized database.
Snowflake
First launched in 2014, Snowflake is a cloud-based, full-featured, relational database. Snowflake databases can be hosted on any of the three major cloud platforms (Amazon AWS, Microsoft Azure, and Google Cloud), which allows customers with existing cloud deployments to stick with what they know. Both storage and compute engines can be scaled on demand, and Snowflake’s software as a service (SaaS) model frees companies from the need to hire legions of network, server, and database administrators, allowing organizations to focus on their core business.
There are many ways to interact with Snowflake, but for the purposes of this book I suggest you use Snowflake’s browser-based graphical tool named Snowsight, which is an excellent tool and is regularly updated and enhanced. Read Snowsight’s online documentation for an overview of its capabilities.
What Is SQL?
Structured query language (SQL) is the language originally developed for querying and manipulating data in relational databases. The SQL language has evolved to handle complex data, such as JavaScript Object Notation (JSON) documents, allowing easier integration between SQL and procedural languages, such as Java.
The SQL language is comprised of several groups of commands, as shown in Table P-3.
Category | Usage | Examples |
---|---|---|
Schema statements | Creating and modifying database structures | Create table, Create index, Alter table |
Data statements | Querying and manipulating data | Select, Insert, Update, Delete, Merge |
Transaction statements | Creating and ending transactions | Commit, Rollback |
You may also see schema statements classified as data definition language (DDL), and data statements classified as data manipulation language (DML). The schema statements are used to create or alter tables, indexes, views, and various other database structures. Once these structures are in place, you will use the data statements to insert, modify, and delete rows in your tables, and to retrieve data.
While you will see some schema statements used in this book, the vast majority of examples cover the data statements, which, though few in number, are rich and powerful statements worthy of in-depth study.
SQL is a nonprocedural language, meaning that you define what you want done, but not how to do it. For example, if you are running a report that lists the top ten customers per geographic region, you would write a select
statement that sums sales for each customer, but it would be up to the database server to determine how best to retrieve the data. There are generally multiple ways to generate a particular set of results, and some are more efficient than others, so it is left to the database server to determine how to pull data from multiple tables in an efficient manner.
What Is SQL Scripting?
If you have programmed with a procedural language such as Java, C#, or Go, you are familiar with such programming constructs as looping, if-then-else, and exception handling. SQL, being a nonprocedural language, has none of these constructs. To bridge this gap, most database platforms provide both a nonprocedural SQL implementation along with a procedural language that includes both the SQL data statements such as select
and insert
along with all of the usual procedural programming constructs. Oracle, for example, provides the PL/SQL procedural language, while Microsoft provides the Transact-SQL language.
Snowflake provides the Snowflake Scripting language, which allows you to declare variables, incorporate looping and if-then-else statements, and detect and handle exceptions. Snowflake Scripting language will be covered in Chapters 15, 16, and 17 of this book.
Setting Up a Sample Database
The nice people at Snowflake have provided several sample databases so that potential customers can gain experience with their SQL implementation. One of the sample databases, TPCH_SF1, is used for the majority of the examples in this book. However, since the TPCH_SF1 database is quite large (over 8.7 million rows of data), I chose to use a small subset (about 330,000 rows) of TPCH_SF1. You will have two options for setting up your own sample database (see “Sample Database Setup”), which will depend on whether the TPCH_SF1 sample database is still being made available by Snowflake.
The sample database contains eight tables containing information about customer orders of a set of parts provided by a set of suppliers, a real-life example of which might be a company that sells automobile parts made by other companies. Appendix A contains a visual representation of these tables along with the relationships between the tables.
If you want to run the example queries in this book, setting up a free 30-day Snowflake account is very easy. Once your account is active, you can follow my instructions for setting up your sample database.
Sample Database Setup
No matter which of the options you choose for creating your sample database tables, there are a couple of things you will need to do first.
Create a worksheet
In Snowsight, worksheets are where you will interact with your database. You can create different worksheets for different purposes, so let’s create a worksheet called Learning_Snowflake_SQL. To do so, click Worksheets in the left-hand menu, then click the “+” button at the top right and choose SQL Worksheet. A new worksheet tab will appear and will be given a default name based on the current date/time. You can click the menu next to the worksheet name and choose Rename, at which point you can name it LEARNING_SNOWFLAKE_SQL
, as shown in Figure P-2.
You can use this worksheet to run your SQL commands, starting with the create database
statement in the next section.
Create your database
Now that you have a worksheet, you can start entering commands. The first task will be creating your sample database, as shown in Figure P-3.
After typing create database learning_sql
into your worksheet, click the Run button (the white arrow with a blue background at the top right) to execute your command. Your database will be created and a schema named Public will be created by default. This is where the tables for your sample database will be created.
Sample Database Option #1: Copy from TPCH_SF1
In order to choose this option, which is the simpler of the two methods, you must first check to see which Snowflake sample databases are available. To do so, choose the Data>Databases menu option to see the list of available databases. If you see TPCH_SF1 under the SNOWFLAKE_SAMPLE_DATA database, you’re in luck, as shown in Figure P-4.
The next section describes how to copy the data from TPCH_SF1 into your own database.
Create and populate the tables
Before executing the commands to create your tables, you will need to specify the database and schema in which you will be working, as shown in Figure P-5.
After entering the use schema
command and pressing the Run button, you’re ready to create your tables. Here’s the set of commands:
create table region as select * from snowflake_sample_data.tpch_sf1.region; create table nation as select * from snowflake_sample_data.tpch_sf1.nation; create table part as select * from snowflake_sample_data.tpch_sf1.part where mod(p_partkey,50) = 8; create table partsupp as select * from snowflake_sample_data.tpch_sf1.partsupp where mod(ps_partkey,50) = 8; create table supplier as with sp as (select distinct ps_suppkey from partsupp) select s.* from snowflake_sample_data.tpch_sf1.supplier s inner join sp on s.s_suppkey = sp.ps_suppkey; create table lineitem as select l.* from snowflake_sample_data.tpch_sf1.lineitem l inner join part p on p.p_partkey = l.l_partkey; create table orders as with li as (select distinct l_orderkey from lineitem) select o.* from snowflake_sample_data.tpch_sf1.orders o inner join li on o.o_orderkey = li.l_orderkey; create table customer as with o as (select distinct o_custkey from orders) select c.* from snowflake_sample_data.tpch_sf1.customer c inner join o on c.c_custkey = o.o_custkey;
This script can also be found at my GitHub page.
Once you have loaded these eight create table
commands into your worksheet, you can run them individually by highlighting and executing each statement, or you can run all of them in a single execution by dropping down the menu on the right side of the Run button and choosing Run All, as shown in Figure P-6.
Whether you run them one at a time or all together, the end result should be eight new tables in the Public schema of your Learning_SQL database. If you run into problems, you can simply start again by re-creating the database via create or replace database learning_sql
, which will drop any existing tables.
Sample Database Option #2: Load Data from GitHub Files
You will need to use this option if the TPCH_SF1 sample database is no longer available from Snowflake (or if you were unable for any reason to use option #1). For this option, you will need to create eight tables in the learning_sql.public schema, and then load each table using CSV files available from my GitHub page. The following sections will lead you through the process.
Creating sample database tables
Load the Learning_Snowflake_SQL_Schema.sql file from GitHub into your worksheet. Figure P-7 shows what it looks like.
You can highlight each create table
statement and click the Run button, or you can use the drop-down menu next to the Run button to choose the Run All option, which will execute all eight statements. After executing, you should be able to see all eight tables under Database>LEARNING_SQL>PUBLIC>Tables, as shown in Figure P-8.
Once again, if you run into problems, you can simply start again by recreating the database via create or replace database learning_sql
, which will drop any existing tables.
Load files into tables
Now that the tables are in place, you can load each one individually using CSV files available from my GitHub page. Each table has a similarly named file; for example, you will load the Customer
table using the customer.csv.gz file, the Region
table using the region.csv.gz file, etc. Table P-4 shows the files for each of the eight tables.
To download all eight files from GitHub, select each file individually and use the “Download raw file” option.
To load the Customer
table, find the table in the Database>LEARNING_SQL>PUBLIC>Tables menu and click the Load Table menu option, as shown in Figure P-9.
Figure P-10 shows the pop-up window that is launched from the Load Data menu option.
Click Browse and choose the appropriate GitHub file, as shown in Figure P-11.
Click Next and choose the “Delimited Files (CSV or TSV)” option under the File Format drop-down, as shown in Figure P-12.
Leave all the other fields as is (default options) and click Next. After completing, you should see a window similar to Figure P-13.
Click Done and move on to the next table. Be careful to choose the appropriate file for each of the other seven tables.
Conventions Used in This Book
The following typographical conventions are used in this book:
- Italic
-
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
-
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
-
Shows commands or other text that should be typed literally by the user, as well as code of particular interest to the present discussion.
Constant width italic
or<constant width in angle brackets>
-
Shows text that should be replaced with user-supplied values or by values determined by context.
Note
This element signifies a general note.
Warning
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/alanbeau/learningsnowflakesql.
If you have a technical question or a problem using the code examples, please send an email to bookquestions@oreilly.com.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Learning Snowflake SQL and Scripting by Alan Beaulieu (O’Reilly). Copyright 2024 Alan Beaulieu, 978-1-098-14032-8.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O’Reilly Online Learning
Note
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- O’Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-889-8969 (in the United States or Canada)
- 707-829-7019 (international or local)
- 707-829-0104 (fax)
- support@oreilly.com
- https://www.oreilly.com/about/contact.html
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/learning-snowflake-and-sql.
For news and information about our books and courses, visit https://oreilly.com.
Find us on LinkedIn: https://linkedin.com/company/oreilly-media.
Follow us on Twitter: https://twitter.com/oreillymedia.
Watch us on YouTube: https://youtube.com/oreillymedia.
Acknowledgments
I would like to thank several people at O’Reilly Media for helping bring this book to life, including Andy Kwan for helping with various technical issues, Corbin Collins for his excellent advice and editing skills, and Katherine Tozer and Carol Keller for the final push to get us to the finish line. I would also like to thank my tech reviewers, including Ed Crean and Joyce Kay Avila, whose expertise with the Snowflake ecosystem helped make this book much more detailed. Also, thanks to Pankaj Gupta and Nadir Doctor, both of whom saw the book at the early release time and enthusiastically volunteered to review the book. Lastly, thanks to my wife Nancy for encouraging me through not just this book but the five editions of Learning SQL and Mastering Oracle SQL over the past 20 years.
Get Learning Snowflake SQL and Scripting now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.