Chapter 9. Schema Design
Hive looks and acts like a relational database. Users have a familiar nomenclature such as tables and columns, as well as a query language that is remarkably similar to SQL dialects they have used before. However, Hive is implemented and used in ways that are very different from conventional relational databases. Often, users try to carry over paradigms from the relational world that are actually Hive anti-patterns. This section highlights some Hive patterns you should use and some anti-patterns you should avoid.
Table-by-Day
Table-by-day is a pattern where a table
named supply
is appended with a
timestamp such as supply_2011_01_01
,
supply_2011_01_02
, etc. Table-by-day is
an anti-pattern in the database world, but due to common implementation
challenges of ever-growing data sets, it is still widely used:
hive
>
CREATE
TABLE
supply_2011_01_02
(
id
int
,
part
string
,
quantity
int
);
hive
>
CREATE
TABLE
supply_2011_01_03
(
id
int
,
part
string
,
quantity
int
);
hive
>
CREATE
TABLE
supply_2011_01_04
(
id
int
,
part
string
,
quantity
int
);
hive
>
....
load
data
...
hive
>
SELECT
part
,
quantity
supply_2011_01_02
>
UNION
ALL
>
SELECT
part
,
quantity
from
supply_2011_01_03
>
WHERE
quantity
<
4
;
With Hive, a partitioned table should be used instead. Hive uses
expressions in the WHERE
clause to
select input only from the partitions needed for the query. This query
will run efficiently, and it is clean and easy on the eyes:
hive
>
CREATE
TABLE
supply
(
id
int
,
part
string
,
quantity
int
)
Get Programming Hive now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.