Chapter 1. Introduction to Apache Drill
If you do any kind of data analytics, you will most certainly agree that there is a major problem in the industry. As data volumes have increased exponentially in the past 10 years, so too have the types of systems and formats designed to analyze this data. All of this comes back to the basic fact that data is not optimally arranged for ad hoc analysis. Quite often, a data scientist or analyst will be confronted with data in a wide variety of formats, such as CSV, JSON, and XML, to say nothing of the various systems in which data can be stored, such as Hadoop, Hive, MongoDB, and relational databases. According to a 2016 article by Forbes, data scientists spend nearly 80% of their time gathering and cleaning their data—88% of them say this is the least enjoyable part of their job.
In his book Taming the Big Data Tidal Wave (Wiley), Bill Franks, who was chief analytics officer at Teradata when he wrote it, introduces a metric called Time to Insight (TTI), which is the amount of time it takes from when an organization receives or generates data to when it is able to derive an actionable insight from that data. Driving down TTI is important because often the value of the insights decreases over time. Consider security information. If you are protecting a computer system, you want to detect intrusions as soon as possible to prevent the intruder from stealing data. If it takes you years to detect an intrusion, the value of that knowledge is pretty low because the intruder has likely already stolen all your valuable data, whereas if you can identify that intrusion within seconds, you can prevent the theft.
The cost of analytic insight also increases with the complexity of your tools. If your analysts need to write complex code to analyze your data, you will need more highly skilled, more costly analysts to extract insights from the data, and it will take them longer to do so, driving up the TTI.
Companies have approached this problem in different ways. Many have taken the route of simply storing all of their data in a massive Hadoop cluster, or data lake. However, although the data-lake approach does solve the problem of data residing in multiple disparate systems, it does not necessarily solve the problem of data being stored in multiple formats.
Another approach is to use a central analytic platform that ingests an organization’s data via an extract, transform, and load (ETL) process. This approach is also costly in that it requires you to send your data over a network, thereby increasing network usage; store copies of the same data in multiple systems; and transform the data from its original format into a standardized format, which might or might not be suitable for it. Additionally, it creates complexity when the original source systems want to change their schemas. More serious, however, is that this approach requires extensive involvement from an IT department to set up and configure.
Regardless of which approach you take, you can see that deriving value from these disparate datasets is time-consuming, costly, and difficult. This is where Drill can have a major impact.
Drill is an extremely versatile, powerful, extensible, and easy-to-use tool that can help you analyze your data more effectively, drive down your time to insight, and get more value from your data.
What Is Apache Drill?
At its core, Apache Drill is a SQL engine for big data. In practical terms, what this means is that Drill acts as an intermediary that allows you to query self-describing data using standard ANSI SQL. To use a comparison from the science fiction series Star Trek, Drill acts as a universal translator for your data and enables you to use SQL to interact with your data as if it were a table in a database, whether it is or not. Bringing this down to earth, Drill enables an analyst, armed only with a knowledge of SQL or a business intelligence (BI) tool such as Tableau, to analyze and query their data without having to transform the data or move it to a centralized data store.
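For example, the following query reads a plain CSV file directly, as if the file were a database table. This is only a sketch: the file path is hypothetical, and it assumes the default dfs storage plugin configuration, in which a delimited file without a header row is exposed as a columns array.

-- Query a headerless CSV file; Drill exposes the fields as the columns array.
SELECT columns[0] AS customer_name,
       columns[2] AS city
FROM dfs.`/data/customers.csv`
LIMIT 10;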
Drill Is Versatile
One of Drill’s major strong points is its versatility. Out of the box, Drill can query a wide variety of file formats, including:
- CSV, TSV, PSV, or any delimited data
- JSON
- Parquet
- Avro
- Hadoop Sequence files
- Apache and Nginx web server logs
- Log files
- PCAP/PCAP-NG
Chapters 4 and 5 explain how to query these file formats.
In addition to being able to query a wide variety of file formats, Drill can query data in external systems including:
- HBase
- Hive
- Kafka for streaming data
- MapR-DB
- MongoDB
- Open Time Series Database (OpenTSDB)
- Nearly any relational database with a JDBC driver
Finally, Drill supports files stored in a variety of distributed filesystems such as:
- Hadoop Distributed File System (HDFS)
- MapR-FS
- Amazon Simple Storage Service (S3)
Chapter 6 explains in detail how to connect Drill to these source systems and query them.
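Because each of these sources appears to Drill as just another schema, a single query can combine data across them. The following sketch joins a Hive table with a raw JSON file; the plugin names, database, paths, and column names are hypothetical and depend on how your storage plugins are configured.

-- Hypothetical cross-source join between a Hive table and a JSON file.
SELECT o.order_id,
       o.amount,
       c.name AS customer_name
FROM hive.sales.orders o
JOIN dfs.`/data/customers.json` c
  ON o.customer_id = c.customer_id
WHERE o.amount > 100;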
Additionally, Drill has extensive APIs that enable you to write your own custom functions and readers for other data sources. You will learn how to write custom user-defined functions in Chapter 11 and storage and format plug-ins in Chapter 12.
Drill Is Easy to Use
One of the aspects of Drill that is striking to first-time users is how easy it is to use. Because Drill uses standard ANSI SQL with some additions, if you are comfortable with SQL, you already know 80 to 90% of what you need to execute queries using Drill. Drill has Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) interfaces, so it can connect to most BI tools (such as Tableau, Qlik, and others) quite easily.
If you prefer a scripting language such as Python or R, modules exist that enable you to quickly and easily execute a query in Drill and import the results into those languages. Chapter 7 explains how to connect to Drill from a wide variety of other languages and platforms.
Drill does not require you to define a schema
One aspect of Drill that often intrigues new users is that you do not need to define a schema prior to querying your data. Relational databases require you to define a rigid schema before you can insert data. This schema enables the database to build indexes that support rapid querying. Tools that follow this model are known as schema-on-write.
One of the limitations of a schema-on-write system is that schema modifications after the fact are quite difficult, if not impossible. As tools have progressed, a new technique of schema building has evolved, known as schema-on-read, whereby data is loaded into the system without any schema definition; instead, the user must define a schema before querying the data. This allows flexibility in the data model, but at a performance cost.
Drill does not follow either of these design paradigms. When using Drill, it is not necessary to define a schema at all prior to querying. Instead, Drill infers the schema from the structure of the data. This approach has its advantages, in that it is extremely flexible and requires no up-front modeling effort from users. It does have limitations, though, in that Drill can struggle with malformed data. We discuss the technical details and limitations of this approach later in the book.
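To make this concrete, here is a hedged example that queries a nested JSON file directly, with no table definition or schema registration beforehand. The file path and field names are invented for illustration; Drill works out the record structure as it reads the data.

-- Drill infers the structure of the JSON records at query time.
SELECT t.`user`.name AS user_name,
       t.event_type,
       t.`timestamp`
FROM dfs.`/logs/events.json` t
WHERE t.event_type = 'login';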
A Word About Drill’s Performance
Drill makes extensive use of in-memory data buffers, CPU optimizations, and direct data transfers between the nodes of a cluster, with the result that Drill can process data much faster than MapReduce or Hive. Drill’s performance is comparable to, if not better than, Apache Spark’s for common operations such as joins and aggregation. What is interesting about Drill is that, unlike many big data tools, it also works very well with small datasets on a single-computer installation.
Drill’s performance is best when querying data stored in Parquet format—a columnar format that is particularly well suited for use with Drill.
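A common pattern, therefore, is to convert data that arrives in a slower format into Parquet once, using a CREATE TABLE AS (CTAS) statement, and to run subsequent queries against the Parquet copy. The sketch below assumes the default writable dfs.tmp workspace and a hypothetical JSON source file.

-- Write the query results as Parquet (Drill's default CTAS output format).
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`events_parquet` AS
SELECT * FROM dfs.`/logs/events.json`;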
Drill Is Not a Database
Because Drill looks like a relational database to a user, users often expect database-like performance in all situations. Drill is very fast and does take advantage of many optimizations when querying data sources, but you should not expect the near-instant response times of an indexed operational database when using Drill to query raw data. We discuss performance tuning later in the book.
Where Drill offers a significant advantage over other big data tools is that it dramatically reduces the amount of human effort needed to query data sources. Because this is directly correlated with cost, Drill can quantifiably reduce the cost of accessing your big data systems.
A Very Brief History of Big Data
Since humans began processing digital data, there has been a perpetual struggle between not having enough processing power to deal with the data and not having the storage capacity to store the data. Before the turn of the century, most systems that stored great amounts of data were large databases developed by companies like Oracle, Teradata, and others. These systems were closed source, proprietary, and dependent on extremely expensive, high-end hardware to run.
In 2004, there was a watershed moment in computing that changed all that. That year, Google released a research paper titled “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat that defined the MapReduce process—a relatively simple two-step process that enabled Google to process incredibly large amounts of data on commodity hardware. This paper and an earlier Google research paper titled “The Google File System” by Sanjay Ghemawat et al. inspired what became one of the most popular open source big data platforms: Apache Hadoop, which was first released in April 2006 and soon demonstrated its scalability by sorting 1.8 TB in 47.9 hours on 188 nodes of commodity hardware.
Hadoop
It quickly became apparent that Hadoop and MapReduce were going to change analytics. Around the same time, Amazon was releasing its Amazon Web Services (AWS) offerings, which enabled companies to quickly set up entire virtual systems to process ever-increasing amounts of data. The combination of Apache Hadoop and cloud platforms such as AWS made it much easier and less expensive for companies to store and process large datasets.
However, as more companies adopted Hadoop and MapReduce, they discovered a few problems with this system. Although it was now possible to process, store, and derive insights from extremely large datasets (now known as big data), doing so was extremely difficult and inefficient. Some problems, particularly those that involve grouping and counting things, are easy to port to the MapReduce paradigm, but others, such as joining datasets together, are quite difficult. Additionally, MapReduce in Hadoop requires Java development skills and hence is out of the reach of most business analysts and many data scientists.
Facebook and Yahoo! both developed answers to this challenge in the form of higher-level platforms that abstracted the technical details of the MapReduce framework behind a simpler language. Yahoo!’s solution, Apache Pig, was released in late 2008, and Facebook’s solution, which became the Apache Hive project, was released in the same time frame. Hive and Pig accomplish a similar goal of providing a high-level language with which a non–software developer can analyze large datasets; however, the two platforms took fundamentally different approaches to this problem. Hive uses a subset of SQL known as HiveQL to interact with the data, whereas Pig has its own language, known as Pig Latin.
Pig, Hive, and a few other tools were major leaps forward in democratizing access to big data, but at the time they were basically wrappers for MapReduce, which by design relies on intense disk input/output (I/O) and as a result is slow. In 2012, there was another big shift in the big data landscape with the release of Apache Spark. Spark is an open source, distributed computing framework that works with Hadoop; but unlike the other systems, Spark makes extensive use of in-memory operations, which, when combined with other algorithmic improvements over MapReduce, results in performance improvements of several orders of magnitude. Spark also includes a robust machine learning library and interfaces for Python, SQL, Scala, Java, and R.
The 2010 Google Dremel paper by Sergey Melnik et al. was another game changer in that it inspired a number of tools that enabled ad hoc analysis of very large datasets, including Drill, Impala, and Presto. Earlier tools such as MapReduce and Spark broke processing into stages, with the output of each stage materialized to disk and a shuffle that sent these results to the next stage. Dremel and the tools it inspired divide each stage into batches, then exchange batches from one node to another directly over the network, without the disk I/O of the prior generation of tools. This is often called “in-memory processing,” though the term does not capture the exchange aspect of the Dremel model. See Chapter 3 for details.
Drill in the Big Data Ecosystem
Drill occupies a unique place in the big data ecosystem in that it facilitates ad hoc analysis of a wide variety of different data sources and formats using a common language: SQL. If you have a problem that can be expressed as a table—and many data analysis problems can—Drill might be a good addition to your toolkit.
With that said, Drill is not a good tool for operations that are highly iterative, such as machine learning, or that cannot be expressed easily in SQL. Drill is a query engine: it does not perform transactions and is not meant to write data (although Drill can create views and tables, as sketched below). You wouldn’t want to use Drill to replace a database for tasks like running a website.
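To illustrate the extent of that write support, the sketch below creates a view over a hypothetical JSON file in the default writable dfs.tmp workspace; the path and field names are assumptions.

-- Drill's limited write support: persist a reusable view definition.
CREATE OR REPLACE VIEW dfs.tmp.`recent_logins` AS
SELECT t.`user`.name AS user_name,
       t.`timestamp`
FROM dfs.`/logs/events.json` t
WHERE t.event_type = 'login';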
Comparing Drill with Similar Tools
Over the years, various tools have attempted to provide a SQL layer on top of Hadoop. These tools, Drill included, were largely inspired by the Google Dremel paper by Melnik et al. The tools most similar to Drill are Apache Impala and Presto; both offer users the ability to query large datasets stored in Hadoop using SQL. Presto and Drill are both implemented in Java, whereas Impala is written in C++ and Java, which complicates development. Impala is also tied to Hive’s metastore: you cannot use it independently of the metastore.
Presto was developed by Facebook to be used on its large data stores. It offers capabilities similar to Drill’s; however, unlike Drill, Presto requires users to define a schema before querying their data.
Additional Drill Resources
Comprehensive documentation for Drill is available on the Drill project website. Drill’s source code is published on GitHub. There is additional developer documentation available in the GitHub repository.