Chapter 11. Other File Formats and Compression
One of Hive’s unique features is that it does not force data to be converted to a specific format. Hive leverages Hadoop’s InputFormat APIs to read data from a variety of sources, such as text files, sequence files, or even custom formats. Likewise, the OutputFormat API is used to write data to various formats.
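For example, the file format is chosen per table at creation time. The following is a minimal sketch, assuming hypothetical table names and columns; the STORED AS SEQUENCEFILE shorthand and the explicit INPUTFORMAT/OUTPUTFORMAT class names shown are those Hive conventionally maps to sequence files, but verify them against your Hive version:

-- Shorthand form: store the table as a Hadoop sequence file
CREATE TABLE logs (ts STRING, line STRING)
STORED AS SEQUENCEFILE;

-- Roughly equivalent long form, naming the InputFormat and OutputFormat classes directly
CREATE TABLE logs_explicit (ts STRING, line STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';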
While Hadoop offers linear scalability in file storage for uncompressed data, storing data in compressed form has many benefits. Compression typically saves significant disk space; for example, text-based files may compress by 40% or more. Compression can also increase throughput and performance. This may seem counterintuitive, because compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound rather than CPU bound, and in that case compression will usually improve performance. However, if your jobs are CPU bound, compression will probably lower performance. The only way to know for sure is to experiment with different options and measure the results.
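As a concrete illustration, turning on compression in Hive is largely a matter of setting a few properties. The sketch below assumes the older mapred.* property names; newer Hadoop releases use mapreduce.output.fileoutputformat.* equivalents, so check your cluster’s configuration:

-- Compress intermediate map output between MapReduce stages
SET hive.exec.compress.intermediate=true;
-- Compress the final output of queries
SET hive.exec.compress.output=true;
-- Pick a codec for the final output (property name varies by Hadoop version)
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;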
Determining Installed Codecs
Depending on your Hadoop version, different codecs will be available to you. The set command in Hive can be used to display the value of hiveconf or Hadoop configuration variables. The available codecs appear in a comma-separated list named io.compression.codecs:
# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache ...
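The same information is available from inside an interactive Hive session; issuing set with just a property name prints that property’s current value (output truncated here as above):

hive> set io.compression.codecs;
io.compression.codecs=org.apache ...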