Chapter 17. Storage Handlers and NoSQL
Storage handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to treat an external entity as a standard Hive table. This allows the user to issue queries seamlessly whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage handlers are not limited to NoSQL databases; a storage handler could be designed for many different kinds of data stores.
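As a rough illustration of how a storage handler surfaces in HiveQL, the sketch below declares a Hive table backed by an HBase table, using the HBase storage handler class that ships with Hive. The table name hbase_stocks, the HBase table name stocks, and the column family cf1 are hypothetical names chosen for this example; the statement also assumes the HBase handler jars are on Hive's classpath.

```sql
-- Hypothetical table: each row maps to one HBase row.
-- ":key" binds the first Hive column to the HBase row key;
-- "cf1:val" binds the second column to qualifier "val" in family "cf1".
CREATE TABLE hbase_stocks (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "stocks");
```

Once declared, the table can be queried like any other Hive table, for example with SELECT key, value FROM hbase_stocks.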
Note
A specific storage handler may implement only some of these capabilities. For example, a given storage handler may allow read-only access or impose some other restriction.
Storage handlers offer a streamlined system for ETL. For example, a Hive query could select from a table that is backed by sequence files and write its output to text files.
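The sequence-file-to-text-file case just described might look like the following minimal sketch. The table names events_seq and events_text and their columns are hypothetical, and the tab delimiter is an arbitrary choice for the example.

```sql
-- Source table stored as Hadoop sequence files (hypothetical schema).
CREATE TABLE events_seq (id INT, msg STRING)
STORED AS SEQUENCEFILE;

-- Destination table stored as tab-delimited text files.
CREATE TABLE events_text (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hive reads with the source table's format and writes with the
-- destination table's format; the query itself does not mention either.
INSERT OVERWRITE TABLE events_text
SELECT id, msg FROM events_seq;
```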
Storage Handler Background
Hadoop has an abstraction known as InputFormat that allows data from different sources and formats to be used as input for a job. The TextInputFormat is a concrete implementation of InputFormat. It works by telling Hadoop how to split a given path into multiple tasks, and it supplies a RecordReader with methods for reading data from each split.
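To connect this to Hive, the sketch below names TextInputFormat explicitly in a table definition instead of relying on the STORED AS TEXTFILE shorthand. The table raw_lines is hypothetical. Hive's syntax requires an OUTPUTFORMAT clause alongside INPUTFORMAT, so the example pairs it with HiveIgnoreKeyTextOutputFormat, the class Hive reports for plain text tables.

```sql
-- Hypothetical single-column table over plain text files.
-- TextInputFormat splits the files and hands Hive one line per record.
CREATE TABLE raw_lines (line STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```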
Hadoop also has an abstraction known as OutputFormat, which takes the output from a job and writes it to an entity. The TextOutputFormat is a concrete implementation of OutputFormat. It works ...