Navigating the Hadoop ecosystem
A field guide to the Apache Hadoop projects, subprojects, and related technologies.
Marshall Presser, co-author of Field Guide to Hadoop, contributed to this post.
IT managers, developers, data analysts, and system architects are encountering the largest and most disruptive change in data analysis since the ascendancy of the relational database in the early 1980s: the challenge of processing, organizing, and taking full advantage of big data. With 73% of organizations making big data investments in 2014 and 2015, this transition is occurring at a historic pace, requiring new ways of thinking to go along with new tools and techniques.
Hadoop is the cornerstone of this change, transforming the landscape of systems and skills we’ve traditionally possessed. In the nine short years since the project revolutionized data science at Yahoo!, an entire ecosystem of technologies has sprung up around it. While the power of this ecosystem is plain to see, it can be a challenge to navigate your way through its complex and rapidly evolving collection of projects and products.
A couple of years ago, my coworker Marshall Presser and I started our journey into the world of Hadoop. Like many folks, we found that the company we worked for was making a major investment in the Hadoop ecosystem, and we had to find a way to adapt. We started in all of the typical places: blog posts, trade publications, Wikipedia articles, and project documentation. We quickly learned that many of these sources are highly biased, either too shallow or too deep, and often inconsistent with one another.
While there are a lot of powerful things you can do with Hadoop, they aren’t all immediately apparent. As we worked more within the Hadoop ecosystem, we discovered which products work well for which applications, and we began to document specifically what makes one product different from the next, and which conditions are relevant for choosing one over another. What follows are three things you should know as you navigate the Hadoop ecosystem.
Discover how to index and search free-form customer feedback
What do you do when you’re asked how often a product is mentioned in your customer feedback? What if the feedback form you’ve been using collects free-form text and you have no way to search it? Blur is a Hadoop-integrated tool whose main purpose is to serve as a document warehouse. It’s used for indexing and searching text with Hadoop, and because it has Lucene at its core, it offers many useful features, such as wildcard search, fuzzy matching, and paged results. You can load data into Blur using MapReduce when you have large amounts of data you want to index in bulk, or by using the mutation interface when you want to stream data in. Blur gives you the power to search through unstructured data in a way that would otherwise be very challenging.
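Because Blur has Lucene at its core, a small Lucene example gives a feel for the kinds of searches involved. This is a minimal, illustrative sketch using the Lucene API directly rather than Blur’s own client, with a made-up "feedback" field and document:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class FeedbackSearch {
        public static void main(String[] args) throws Exception {
            RAMDirectory index = new RAMDirectory();
            IndexWriter writer = new IndexWriter(index,
                new IndexWriterConfig(new StandardAnalyzer()));

            // Index one free-form feedback comment (note the typo "widgit").
            Document doc = new Document();
            doc.add(new TextField("feedback",
                "The new widgit stopped working after a week", Field.Store.YES));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));

            // Fuzzy matching: searching for "widget" still finds "widgit".
            Query fuzzy = new FuzzyQuery(new Term("feedback", "widget"));
            System.out.println(searcher.search(fuzzy, 10).totalHits);

            // Wildcard search: any term starting with "work".
            Query wildcard = new WildcardQuery(new Term("feedback", "work*"));
            System.out.println(searcher.search(wildcard, 10).totalHits);
        }
    }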
Automate management of your Hadoop environment
Once you’ve built your big data architecture, how do you keep tabs on it, and how do you find out if part of your system is failing? Once you find a problem, how do you bring that component back into the fold after you’ve fixed it? Within the big data ecosystem, there is a wide variety of tools to help you automate management of your Hadoop architecture. These tools fall into three primary categories: node configuration management, resource tracking, and coordination. The first category, node configuration tools, includes products such as Puppet and Chef, which enable you to do things like change operating system parameters and install software. The second category, resource tracking tools (such as Ganglia), gives you a single dashboard for insight into your overall architecture; while many individual components within your architecture include their own performance monitoring, that capability is restricted to the individual component. The third category, coordination tools (such as ZooKeeper), allows you to synchronize all of the moving parts within your big data system to accomplish a single goal, as the sketch below illustrates.
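As an illustration of the coordination category, here is a minimal ZooKeeper sketch in Java: a process registers an ephemeral znode that disappears automatically if the process dies, a common building block for tracking which workers are alive. The ensemble address and the /workers path are placeholders, and the sketch assumes the parent /workers znode already exists:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class WorkerRegistration {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble (placeholder address).
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});

            // An ephemeral, sequential znode vanishes if this process dies,
            // so peers watching /workers always see the set of live nodes.
            zk.create("/workers/worker-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> live = zk.getChildren("/workers", false);
            System.out.println("Live workers: " + live);
        }
    }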
Caging the elephant? How to build a truly secure big data environment
Now that Hadoop has hit the mainstream, the issue of security is gaining attention. In earlier days, Hadoop’s basic security model might have been described as “build a fence around the elephant, but once inside the fence, security is a bit lax.” With an ever-growing array of security-focused tools, you might wonder which ones meet your needs for everything from authentication and authorization to data protection, data governance, and auditing.
In a secure computing system, authentication is the first checkpoint, asking the user a critical question: who are you? Traditional methods for authentication operate outside of Hadoop, usually on the client side or within the web server, and include Kerberos, Lightweight Directory Access Protocol (LDAP), and Active Directory (AD). The second checkpoint, authorization, addresses the question: what can you do? When it comes to authorization, Hadoop is spread all over the place. MapReduce, with its job queue system, differs from HDFS, which uses familiar read/write/execute permissions; HBase offers table-level and column-family-level authorization, while Accumulo goes further with cell-level authorization. Each of these products stores its authorization rules in a different way, allowing you to choose the best option for your situation.
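To make the cell-level model concrete, here is a small sketch of writing a cell with an Accumulo column visibility label. The instance name, table, user, and labels are invented for illustration; only users whose authorizations satisfy the visibility expression ever see the cell:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class CellLevelWrite {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("myInstance", "zk1.example.com:2181")
                .getConnector("writer", new PasswordToken("secret"));

            BatchWriter writer = conn.createBatchWriter("feedback", new BatchWriterConfig());

            // Only users whose authorizations include "support" or "admin"
            // can ever read this individual cell.
            Mutation m = new Mutation("customer-42");
            m.put("notes", "complaint", new ColumnVisibility("support|admin"),
                new Value("shipment arrived damaged".getBytes()));
            writer.addMutation(m);
            writer.close();
        }
    }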
Data needs protection both in transit and at rest. In transit, the protocols Hadoop speaks over the wire, including HTTP, RPC, JDBC, and ODBC, can each be configured to encrypt data as it moves. At rest, HDFS currently has no native encryption, though there’s a proposal to include this in a future release. Data governance and auditing are handled at the component level within Hadoop: HDFS and MapReduce include basic mechanisms for governance and auditing, the Hive metastore provides logging services, and Oozie provides logging for its job-management service.
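As an example of wire-level protection, two standard Hadoop settings enable encrypted RPC and encrypted HDFS block transfers. This sketch sets them programmatically for illustration, though in practice they would live in core-site.xml and hdfs-site.xml on the cluster:

    import org.apache.hadoop.conf.Configuration;

    public class WireEncryption {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // core-site.xml: "privacy" adds encryption on top of the
            // authentication and integrity levels for Hadoop RPC.
            conf.set("hadoop.rpc.protection", "privacy");

            // hdfs-site.xml: encrypt block data streamed between
            // clients and DataNodes.
            conf.setBoolean("dfs.encrypt.data.transfer", true);

            System.out.println(conf.get("hadoop.rpc.protection"));
        }
    }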
Additionally, with new tools such as Sentry and Knox, as well as the more established Kerberos, there’s a range of options to consider for your situation.
A shifting landscape
One of the most exciting aspects of the Hadoop ecosystem is the rate at which it’s growing; it seems like a major feature or project announcement happens every month. For an up-to-date and in-depth exploration of the Hadoop ecosystem, including sample code, purchase the Field Guide to Hadoop.