Chapter 4. Organizing Data in Greenplum

To make effective use of Greenplum, architects, designers, developers, and users must be aware of the various methods by which data can be stored because these will affect performance in loading, querying, and analyzing datasets. A simple “lift and shift” from a transactional data model is almost always suboptimal. Data warehouses generally prefer a data model that is flatter than a normalized transactional model. Data model aside, Greenplum offers a wide variety of choices in how data is organized, including the following:

Distribution

Determines the segment to which each table row is assigned.

Partitioning

Determines how the data is stored on each of the segments.

Orientation

Determines whether the data is stored by rows or by columns.

Compression

Used to minimize the disk storage required for table data.

Append-optimized tables

Used to enhance performance for data that is rarely changed.

External tables

Provide a method for accessing data outside Greenplum.

Indexing

Used to speed lookups of individual rows in a table.

Distributing Data

One of the most important methods for achieving good query performance from Greenplum is the proper distribution of data. All other things being equal, having roughly the same number of rows in each segment of a database is a huge benefit. In Greenplum, the data distribution policy is determined at table creation time. Greenplum adds a distribution clause to the Data Definition Language (DDL) for a CREATE TABLE statement. Prior to Greenplum 6, there were two distribution methods. In random distribution, each row is randomly assigned a segment when the row is initially inserted. The other distribution method uses a hash function computed on the values of some columns in the table.

Here are some examples:

CREATE TABLE bar
(id INT, stuff TEXT, dt DATE) DISTRIBUTED BY (id);

CREATE TABLE foo
(id INT, more_stuff TEXT, size FLOAT8 ) DISTRIBUTED RANDOMLY;

CREATE TABLE gruz
(id INT, still_more_stuff TEXT, zipcode CHAR(5));

CREATE TABLE repl
(id INT, even_more_stuff TEXT) DISTRIBUTED REPLICATED;

In the case of the bar table, Greenplum will compute a hash value on the id column when a row is created and then use that value to determine in which segment the row should reside.

The foo table will have rows distributed randomly among the segments. Although this will generate a good distribution with almost no skew, it might not be useful when colocated joins will help performance. Random distribution is fine for small-dimension tables and lookup tables, but probably not for large fact tables.

For the gruz table there is no distribution clause. Greenplum will use a set of default rules. If a column is declared to be a primary key, Greenplum will hash on this column for distribution. Otherwise, it will choose the first column in the text of the table definition if it is an eligible data type. If the first column is a user-defined type or a geometric type, the distribution policy will be random. It’s a good idea to explicitly distribute the data and not rely on the set of default rules; they might change in a future major release.
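
To confirm which policy Greenplum actually chose, describing the table in psql is the quickest check; this is a minimal sketch, and the exact output format varies slightly by version:

-- In psql, the footer of the \d+ output reports the distribution policy,
-- for example "Distributed by: (id)".
\d+ gruz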

New in Greenplum Version 6

Replicated tables first appear in version 6.

There will be a copy of the repl table on each segment if the DISTRIBUTED REPLICATED clause is supplied. You can use replicated tables to improve query performance by eliminating broadcast motions for the table. Typically, you might use this for larger-dimension tables in a star schema. A more sophisticated use is for user-defined functions that run in the segments rather than the master where functions require access to all rows of the table. This is not advised for large tables.

Currently, an update on the distribution column(s) is allowed only with the use of the Pivotal Query Optimizer (discussed later in this chapter). With the legacy query planner, updates on the distribution column were not permitted, because this would require moving the data to a different segment.

A poorly chosen distribution key can result in suboptimal performance. Consider a database with 64 segments, and suppose the gruz table had been distributed by a gender column. Let’s suppose that there are three values, “M,” “F,” and NULL for unknown. It’s likely that roughly 50% of the values will be “M,” roughly 50% “F,” and a small proportion of NULLs. This would mean that only 2 of the 64 segments would be doing useful work and the others would have nothing to do. Instead of achieving an increase in speed of 64 times over a nonparallel database, Greenplum could generate an increase of only two times or, in the worst case, no speed increase at all if the two segments with the “F” and “M” data live on the same segment host.

The distribution key can be a set of columns, but experience has shown that one or two columns are usually sufficient. Distribution columns should always have high cardinality, although that in and of itself will not guarantee good distribution. That’s why it is important to check the distribution. To determine how the data actually is distributed, Greenplum provides a simple SELECT statement:

SELECT gp_segment_id, count(*) FROM foo GROUP BY 1;
 gp_segment_id | count
---------------+--------
             1 | 781632
             0 | 749898
             2 | 761375
             3 | 753063

gp_segment_id is a pseudocolumn, one that is not explicitly defined by the CREATE TABLE statement but maintained by Greenplum.

A best practice is to run this query after the initial data load to see that the chosen distribution policy is generating equidistribution. You also should run it occasionally in the life of the data warehouse to see whether changing conditions suggest a different distribution policy.
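
The gp_toolkit schema also provides views that summarize skew without writing a per-table query. The following is a sketch that assumes the gp_skew_coefficients view is available in your release; note that it examines every table and can be expensive on a large database:

-- Lower coefficients indicate a more even distribution.
SELECT skcnamespace, skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC;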

Another guiding principle in data distribution is colocating joins. If there are queries that occur frequently and consume large amounts of resources in the cluster, it is useful to distribute the tables on the same columns.

For example, if queries frequently join tables on the id column, there might be performance gains if both tables are distributed on the id column because the rows to be joined will be in the same segment. This is less important if one of the tables is very small, perhaps a smaller-dimension table in a data warehouse. In that case, Greenplum will automatically broadcast or redistribute this table to all the segments at a relatively small cost completely transparently to the user.

Here’s an example to illustrate colocation:

SELECT f.dx, f.item, f.product_name, f.store_id, s.store_address
FROM fact_table f, store_table s
WHERE f.store_id = s.id AND f.dx = 23
ORDER BY f.product_name;

To colocate the join, use the following distribution clause for the fact_table:

DISTRIBUTED BY (store_id)

For the store_table use this:

DISTRIBUTED BY (id)

Of course, some other queries will use different join conditions, so it’s not possible to have every join colocated. Some experience with the use case will go a long way toward finding the best distribution policy. It’s important that the join columns have the same data type. The integer 1 and the character “1” are not the same and will hash differently.

If there is no prior knowledge of how the tables are to be used in SELECT statements, random distribution can be effective. If further usage determines that another distribution pattern is preferable, it’s possible to redistribute the table. This is often done with a CTAS, or CREATE TABLE AS SELECT. To redistribute a table, use the following commands:

CREATE TABLE new_foo AS SELECT * from foo
   DISTRIBUTED BY (some_other_column);
DROP TABLE foo;
ALTER TABLE new_foo RENAME TO foo;

Or, more simply, use the following:

ALTER TABLE foo SET DISTRIBUTED BY (some_other_column);

Both of these methods require resource usage and should be done at relatively quiet times. In addition, Greenplum requires ample disk space to accommodate both versions of the table during the reorganization.
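
A related option, sketched below, rewrites a table in place without changing its policy, which can be useful for rebalancing a randomly distributed table; check the ALTER TABLE documentation for your release:

-- Redistribute the existing rows without changing the distribution policy.
ALTER TABLE foo SET WITH (REORGANIZE=true);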

Polymorphic Storage

Greenplum employs the concept of polymorphic storage. That is, there are a variety of methods by which Greenplum can store data in a persistent manner. This can involve partitioning data, organizing the storage by row or column orientation, various compression options, and even storing it externally to the Greenplum Database. There is no single best way that suits all or even the majority of use cases. The storage method is completely transparent to user queries, and thus users do not do coding specific to the storage method. Utilizing the storage options appropriately can enhance performance.

Partitioning Data

Partitioning a table is a common technique in many databases, particularly in Greenplum. It involves separating table rows for efficiency in both querying and archiving data. You partition a table when you create it by using a clause in the DDL for table creation. Here is an example of a table partitioned by range on the fdate column into partitions for each week:

CREATE TABLE foo (fid INT, ftype TEXT, fdate DATE)
DISTRIBUTED by (fid)
PARTITION BY RANGE(fdate)
(
    PARTITION week START ('2017-11-01'::DATE)
                  END ('2018-01-31'::DATE)
                  EVERY ('1 week'::INTERVAL)
);

This example creates 13 partitions of the table, one for each week from 2017-11-01 to 2018-01-31. Be aware that the end date does not include January 31, 2018. To include that date, end on February 1 or use the INCLUSIVE clause.
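
For example, the final week can be made to include January 31 by marking the END value as inclusive; this is the same partition clause with INCLUSIVE added:

PARTITION week START ('2017-11-01'::DATE)
              END ('2018-01-31'::DATE) INCLUSIVE
              EVERY ('1 week'::INTERVAL)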

What is happening here? As rows enter the system, they are distributed to segments based upon the hash values of the fid column. Partitioning works at the storage level in each and every segment. All rows whose fdate values fall in the same week will be stored in a separate file in the segment host’s filesystem. Why is this a good thing? Because Greenplum usually does full-table scans rather than index scans, a WHERE clause that limits the date range lets the optimizer eliminate reading the files for all dates outside that range.

If our query were

SELECT * FROM foo WHERE fdate > '2018-01-14'::DATE;

the optimizer would know that it had to read data from only the three latest partitions, which eliminates about 77% of the partition scans and would likely reduce query time by a factor of four. This is known as partition elimination.
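
You can confirm that elimination is happening by examining the query plan, which will show scans only on the partitions that survive the date filter; the exact plan text differs between the legacy planner and GPORCA:

-- Look for the number of partitions (leaf tables) actually scanned.
EXPLAIN SELECT * FROM foo WHERE fdate > '2018-01-14'::DATE;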

In addition to range partitioning, there is also list partitioning, in which the partitions are defined by discrete values, as demonstrated here:

CREATE TABLE bar (bid integer, bloodtype text, bdate date)
DISTRIBUTED by (bid)
PARTITION BY LIST(bloodtype)
(PARTITION a values('A+', 'A', 'A-'),
 PARTITION b values ('B+', 'B-', 'B'),
 PARTITION ab values ('AB+', 'AB-', 'AB'),
 PARTITION o values ('O+', 'O-', 'O'),
 DEFAULT PARTITION unknown);

This example also exhibits the use of a default partition into which rows will go if they do not match any of the values of the a, b, ab, or o partitions. Without a default partition, an attempt to insert a row that does not map to a partition will result in an error. The default partition is handy for catching data errors but has a performance hit. The default partition will always be scanned even if explicit values are given in the WHERE clause that map to specific partitions. Also, if the default partition is used to catch errors and it is not pruned periodically, it can grow quite large.

In this example, the partitions are named. In the range partitioning example, they are not. Greenplum will assign names to the partitions if they are unnamed in the DDL.

Greenplum allows subpartitioning, that is, the creation of partitions within partitions, using a different column for the subpartitions than the major partition column. This might sound appealing, but it can lead to a huge number of small partitions, many of which have little or no data but still need to be opened and closed by the filesystem during table scans. In this example, there are 52 × 5 = 260 partitions. If this were column oriented, we would get 52 × 5 × (number of columns) files. In Greenplum, as in Postgres, each partition is actually a table:

CREATE TABLE gruz (bid INT, bloodtype TEXT, bdate DATE)
DISTRIBUTED by (bid)
PARTITION BY RANGE (bdate)
SUBPARTITION BY LIST(bloodtype)
SUBPARTITION TEMPLATE
(SUBPARTITION a VALUES('A+', 'A', 'A-'),
 SUBPARTITION b VALUES ('B+', 'B-', 'B'),
 SUBPARTITION ab VALUES ('AB+', 'AB-', 'AB'),
 SUBPARTITION o VALUES ('O+', 'O-', 'O'),
 DEFAULT SUBPARTITION unknown)
(START (DATE '2017-01-01')
 END (DATE '2017-12-25')
   EVERY (INTERVAL '1 week'));

Often, new Greenplum users confuse partitioning data with distributing data. All data must be distributed (this is how Greenplum works), but partitioning is optional. Partitioning data is orthogonal to distribution. There is no notion of randomly partitioning data because it would make no sense. The optimizer would not be able to do partition elimination because it could not use the WHERE clause to determine where data resided on the filesystem. Distribution works to separate data across the segments, and partitioning works to separate data within segments. You can envision this two-dimensional “striping” of data as “plaiding,” with stripes running in perpendicular directions. There is a further confusion: PARTITION is a keyword in the definition of Window Functions. This use of partition has nothing to do with table partitioning.

Just as Greenplum did not allow updates of the distribution columns, it also forbade updates of the partition column until the arrival of the GPORCA optimizer.

In many analytic data warehouses, recent data is more important than older data, and it is helpful to archive older data. This older data may be occasionally useful, but it might be best archived outside the database. The external table mechanism is ideal for this use.
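
One common pattern is to unload an old partition to a file, define a readable external table over that file, and then exchange the partition with the external table so that the data remains queryable but no longer occupies database storage. The following is only a sketch: foo_2017w01_ext is a hypothetical external table, and you should verify the EXCHANGE PARTITION syntax against the documentation for your release:

-- Swap the leaf partition holding the week of 2017-11-01 with an
-- external table that points at the archived file.
ALTER TABLE foo
  EXCHANGE PARTITION FOR ('2017-11-01'::DATE)
  WITH TABLE foo_2017w01_ext WITHOUT VALIDATION;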

It is possible to add, drop, and split partitions. The Greenplum Database documentation covers this in detail.
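
As a rough sketch of what adding and dropping look like on the weekly foo table (the full syntax and options are in the ALTER TABLE documentation):

-- Add a new weekly partition to the end of the range.
ALTER TABLE foo ADD PARTITION
  START ('2018-01-31'::DATE) INCLUSIVE
  END ('2018-02-07'::DATE) EXCLUSIVE;

-- Drop the oldest partition once its data has been archived.
ALTER TABLE foo DROP PARTITION FOR ('2017-11-01'::DATE);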

Here are some best practices on partitioning:

  • It makes little sense to partition small lookup tables and the like.

  • Never partition on columns that are part of the distribution clause.

  • In column-oriented tables, the number of files for heavily partitioned tables can be very large as each column will have its own file. A table with 50 columns, partitioned hourly with 10 subpartitions over two years, will have 50 × 24 × 365 × 2 × 10 = 8,760,000 files per segment. With a segment host having eight segments this would be more than 70 million files that the server must open and close to do a full-table scan. This will have a huge performance impact.

To obtain more detail on the partitions, the following query works nicely on range partitioned tables. A WHERE clause for particular schemas and tables will shorten the output:

SELECT
       schemaname || '.' || tablename AS "Schema.Table"
      ,partitiontablename AS "PartTblName"
      ,partitiontype AS "Type"
      ,split_part(partitionrangestart, '::', 1) AS "Start"
      ,split_part(partitionrangeend, '::', 1) AS "End"
      ,partitionendinclusive AS "End Inclusive?"
FROM
     pg_partitions;

Orientation

Greenplum offers several methods for storing the data in tables. The default is to store the data on the disk in blocks of rows. This method is useful if any alterations to the data are likely and frequent. You might do this for a load table, then do some data verification and cleansing, and then add the data to an existing fact table in a star schema.

You can also opt to store data in column orientation. Here, Greenplum groups data by columns when storing it on the disk. Why would you choose one rather than the other?

Many data warehouses have tables with very wide rows, with hundreds of columns or more. Often, the analytics or reports derived from these tables use only a fraction of the columns in any particular query. In these cases, column orientation results in many fewer I/O operations. Also, compression works best when there are similar values in the data, and repeated values are much more likely within a single column, such as zip code, country name, or department name.

Like partitioning and distribution, table orientation is done at table creation with an orientation clause, as follows:

create table foo_col (id int, name text,...)
    with (appendonly=true, orientation=column)
    distributed by (id);

An optimal orientation is highly dependent on the data and queries and may require some trial and error in test and development before going into production.

Compression

Compression is normally considered a method to decrease the storage requirements for data. In Greenplum, it has an additional benefit. When doing analytics with large datasets, queries often read the entire table from disk. If the table is compressed, the smaller size can result in many fewer disk reads, enhancing query performance. Small tables need not be compressed because it will yield neither significant performance gain nor storage gains. Compression and expansion require CPU resources. Systems already constrained by CPU need to experiment to determine whether compression will be helpful in query performance.

Like distribution, you specify compression when you create a table. Greenplum offers many compression options, but only for append-optimized (row- or column-oriented) tables. The first statement in the following example creates a table; the second generates an error:

CREATE TABLE foobar (fid INT, ft TEXT)
WITH (compresstype = quicklz,orientation=column,appendonly=true)
DISTRIBUTED by (fid);

CREATE TABLE gruz (gid INT, gt TEXT)
WITH (compresstype = quicklz,orientation=row)
DISTRIBUTED by (gid);

ERROR: invalid option "compresstype" for base relation.
Only valid for Append Only relations

You can use zlib and QuickLZ to compress append-optimized row-oriented tables. Column-oriented tables can additionally use run-length encoding (RLE), which can be very effective on columns with sorted data.

Note

Zstandard compression is only available in Greenplum 6. It uses a newer algorithm, doing lossless compression at reasonably high speed.

CREATE TABLE snarf (snid INT, sntype TEXT)
WITH (compresstype = zstd,orientation=column,appendonly=true)
DISTRIBUTED by (snid);

A column-oriented table also can have different compression strategies for different columns. In the following example, each of the four columns is compressed differently, although this should not be interpreted as a best practice:

create table comp1
   ( cid int ENCODING(compresstype=quicklz),
     ct text ENCODING(compresstype=zlib, compresslevel =  5),
     cd date ENCODING(compresstype=rle_type),
     cx smallint)
with (appendonly=true, orientation=column)
distributed by (cid);

You can run zlib compression at many levels. It’s tempting to try the higher levels, but they will take considerably longer to compress and might not provide any significant advantage in compression ratio or in improving query speed.
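
As a sketch, the level is set with the compresslevel storage parameter, and after loading data you can measure what a given choice actually bought you; the get_ao_compression_ratio() function reports the achieved ratio for an append-optimized table (verify its availability in your release):

CREATE TABLE zlib9_test (id INT, payload TEXT)
WITH (appendonly=true, compresstype=zlib, compresslevel=9)
DISTRIBUTED BY (id);

-- After loading data, compare the achieved ratio against a copy of the
-- table created with a lower compression level.
SELECT get_ao_compression_ratio('zlib9_test');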

Append-Optimized Tables

An append-optimized table also stores table data by row but more compactly. This format excludes information that makes updates faster, so you should use it only for data that rarely changes. Prior to Greenplum 4.3, Greenplum used append-only tables, which allowed rows to be appended but not updated, as shown here:

CREATE TABLE foobar (fid INT, ft TEXT)
WITH (APPENDONLY = TRUE) DISTRIBUTED BY (fid);

Append-optimized tables are particularly useful in denormalized fact tables in a data warehouse where new data is loaded into staging or ETL tables, cleansed and corrected if necessary, and then placed in the production tables.
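
A minimal sketch of that pattern, using hypothetical staging and production tables:

-- Staging table: default heap storage, so rows can be updated during cleansing.
CREATE TABLE stage_sales (sale_id INT, amount NUMERIC, sale_date DATE)
DISTRIBUTED BY (sale_id);

-- Production fact table: append-optimized, because it is written once
-- and rarely changed afterward.
CREATE TABLE fact_sales (sale_id INT, amount NUMERIC, sale_date DATE)
WITH (appendonly=true)
DISTRIBUTED BY (sale_id);

-- After cleansing, append the staged rows into production.
INSERT INTO fact_sales SELECT * FROM stage_sales;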

External Tables

An external table in Greenplum is data that is stored outside the database proper but can be accessed via SQL statements as though it were a normal database table. The external table is useful as a mechanism for:

  • Loading data

  • Archiving data

  • Integrating data into distributed queries

Because no data in an external table lives within a Greenplum database, an external table is a metadata object that instructs Greenplum how to access the data when required in a query.

External tables can be either readable or writable, and they can refer to files or, in the case of external web tables, can be defined in terms of a URL or a process that generates data.

Creating a readable external table is much like creating a normal table. Data manipulation language (DML) operations, such as update and delete, are not permitted; only those involved in reading the table are allowed. It does not have a DISTRIBUTED BY clause, but does have a LOCATION clause that informs Greenplum where the data can be found and which protocol is used to access it.

Chapter 8 has details on the many methods and protocols for using external data in Greenplum. This chapter illustrates the use of the simple file protocol. Chapter 5, on loading data, focuses on the gpfdist protocol, Greenplum’s parallel transport protocol.

In the following example, raw data to be loaded into Greenplum is located in a small flat file on a segment host. It has comma-separated values (CSV) with a header line. The head of the file would look like the following:

dog_id, dog_name, dog_dob
123,Fido,09/09/2010
456,Rover,01/21/2014
789,Bonzo,04/15/2016

The corresponding table would be as follows:

CREATE TABLE dogs
(dog_id int, dog_name text, dog_dob date) distributed randomly;

The external table would be defined like this:

CREATE READABLE EXTERNAL TABLE dogs_ext (
   dog_id int, dog_name text, dog_dob date)
LOCATION ('file://seghost1/external/dogs.csv')
FORMAT 'csv' (header);

It can be queried like this:

SELECT * from dogs_ext;
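
The same definition also serves the loading use case mentioned earlier; an ordinary INSERT ... SELECT moves the external rows into the dogs table:

INSERT INTO dogs SELECT * FROM dogs_ext;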

Indexing

In transaction systems, indexes are used very effectively to access a single row or a small number of rows. In analytic data warehouses, this sort of access is relatively rare. Most of the time, queries access all or many of the rows of a table, and a full-table scan gives better performance in an MPP architecture than an index scan. In general, Greenplum favors full-table scans over index scans.

For that reason, when moving data warehouses to Greenplum, best practice is to drop all the indexes and add them back on an as-needed basis for particular commonly occurring queries. This has two added benefits: it saves considerable disk space, and it saves time during loads, because maintaining indexes during large data loads degrades performance. Best practice is to drop the indexes on a table before a large load and re-create them afterward. Even though index creation is a parallel operation, it does consume computing resources.

This is not to suggest that you shouldn’t use indexes in Greenplum, but you should use them only for specific situations for which there is a measurable benefit.
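
For example, if an application frequently looks up individual dogs by their identifier, an index on that column might be one of those measurable-benefit cases; this is a sketch using the dogs table from the previous section:

-- A B-tree index to support highly selective single-row lookups.
CREATE INDEX dogs_dog_id_idx ON dogs (dog_id);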

GPText includes specific indexing for text. This is critical for doing text analytics. In addition, GIN indexes can be used for text data. These indexes are quite different from the usual B-tree indexes.

Note

GIN indexes for text data are available in Greenplum 6.
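
Here is a sketch of a GIN index over a text column, using the standard PostgreSQL full-text machinery available in Greenplum 6; the docs table and its body column are hypothetical:

CREATE TABLE docs (doc_id INT, body TEXT) DISTRIBUTED BY (doc_id);

-- GIN index over the tsvector representation of the document body.
CREATE INDEX docs_body_gin ON docs
  USING gin (to_tsvector('english', body));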

Additional Resources

The Pivotal Greenplum Administrator Guide has a section on table creation and storage options. It’s well worth reading before beginning any Greenplum project.

Jon Roberts discusses Greenplum storage in some blog posts on his PivotalGuru website. Topics include append-optimized tables, notes on tuning that rely heavily on data organization in Greenplum, and a section on external storage with Amazon S3.

As mentioned earlier, although Greenplum does not require extensive indexing, the documentation includes a detailed discussion of indexing strategies.
