Book description
Explore the Hadoop MapReduce v2 ecosystem to gain insights from very large datasets
In Detail
This book starts with the installation of Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components. You will then learn about many exciting topics, such as MapReduce patterns, and using Hadoop to solve analytics, classification, online marketing, recommendation, and data indexing and searching problems. You will also learn how to take advantage of Hadoop ecosystem projects, including Hive, HBase, Pig, Mahout, Nutch, and Giraph, and be introduced to deploying clusters in cloud environments.
Finally, you will be able to apply the knowledge you have gained to your own real-world scenarios to achieve the best-possible results.
What You Will Learn
Configure and administer Hadoop YARN, MapReduce v2, and HDFS clusters
Use Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2 to solve your big data problems easily and effectively
Solve large-scale analytics problems using MapReduce-based applications
Tackle complex problems such as classifications, finding relationships, online marketing, recommendations, and searching using Hadoop MapReduce and other related projects
Perform massive text data processing using Hadoop MapReduce and other related projects
Deploy your clusters to cloud environments
Table of contents
- Hadoop MapReduce v2 Cookbook Second Edition
- Credits
- About the Author
- Acknowledgments
- About the Reviewers
- www.PacktPub.com
- Preface
- 1. Getting Started with Hadoop v2
- Introduction
- Setting up Hadoop v2 on your local machine
- Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
- Adding a combiner step to the WordCount MapReduce program
- Setting up HDFS
- Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
- Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
- HDFS command-line file operations
- Running the WordCount program in a distributed cluster environment
- Benchmarking HDFS using DFSIO
- Benchmarking Hadoop MapReduce using TeraSort
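The WordCount and combiner recipes above follow the classic map → combine → reduce flow. As a rough, language-neutral sketch (the book's own examples are in Java; the function names and in-process driver here are illustrative, in the style of the book's Hadoop streaming recipes):

```python
from collections import Counter

def map_words(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def combine(pairs):
    """Combiner: pre-aggregate counts locally to cut shuffle traffic.
    For WordCount the combiner logic is identical to the reducer's."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

def reduce_counts(pairs):
    """Reducer: sum the (possibly pre-combined) partial counts per word."""
    return combine(pairs)

if __name__ == "__main__":
    sample = ["to be or not to be", "be quick"]
    for word, n in reduce_counts(combine(map_words(sample))):
        print(f"{word}\t{n}")
```

In a real Hadoop streaming job, the mapper and reducer would be separate scripts reading stdin and printing tab-separated key/value lines; the combiner step is what the dedicated WordCount recipe above adds to reduce data shuffled between the map and reduce phases.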
- 2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
- Introduction
- Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
- Saving money using Amazon EC2 Spot Instances to execute EMR job flows
- Executing a Pig script using EMR
- Executing a Hive script using EMR
- Creating an Amazon EMR job flow using the AWS Command Line Interface
- Deploying an Apache HBase cluster on Amazon EC2 using EMR
- Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
- Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
- 3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
- Introduction
- Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
- Shared user Hadoop clusters – using Fair and Capacity schedulers
- Setting classpath precedence to user-provided JARs
- Speculative execution of straggling tasks
- Unit testing Hadoop MapReduce applications using MRUnit
- Integration testing Hadoop MapReduce applications using MiniYarnCluster
- Adding a new DataNode
- Decommissioning DataNodes
- Using multiple disks/volumes and limiting HDFS disk usage
- Setting the HDFS block size
- Setting the file replication factor
- Using the HDFS Java API
- 4. Developing Complex Hadoop MapReduce Applications
- Introduction
- Choosing appropriate Hadoop data types
- Implementing a custom Hadoop Writable data type
- Implementing a custom Hadoop key type
- Emitting data of different value types from a Mapper
- Choosing a suitable Hadoop InputFormat for your input data format
- Adding support for new input data formats – implementing a custom InputFormat
- Formatting the results of MapReduce computations – using Hadoop OutputFormats
- Writing multiple outputs from a MapReduce computation
- Hadoop intermediate data partitioning
- Secondary sorting – sorting Reduce input values
- Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
- Using Hadoop with legacy applications – Hadoop streaming
- Adding dependencies between MapReduce jobs
- Hadoop counters to report custom metrics
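The secondary-sorting recipe in this chapter relies on a composite key: the natural key plus the value, with partitioning and grouping on the natural key only, so each reduce call receives its values already sorted. A minimal in-process simulation of that idea (plain Python rather than the book's Java Writable/Partitioner classes, with illustrative names):

```python
from itertools import groupby

def secondary_sort(records):
    """Simulate Hadoop secondary sort: treat each (key, value) pair as a
    composite key, sort on both parts, then group on the natural key only,
    so every group's values arrive at the 'reducer' pre-sorted."""
    composite = sorted(records)  # sort by (natural_key, value)
    return [(key, [v for _, v in group])
            for key, group in groupby(composite, key=lambda kv: kv[0])]
```

In real Hadoop code this maps to a custom composite key type, a Partitioner that hashes only the natural key, and a grouping comparator that compares only the natural key.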
- 5. Analytics
- Introduction
- Simple analytics using MapReduce
- Performing GROUP BY using MapReduce
- Calculating frequency distributions and sorting using MapReduce
- Plotting the Hadoop MapReduce results using gnuplot
- Calculating histograms using MapReduce
- Calculating scatter plots using MapReduce
- Parsing a complex dataset with Hadoop
- Joining two datasets using MapReduce
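The join recipe above is typically implemented as a reduce-side join: mappers tag records from each dataset with their source and emit them under the join key, and each reduce group pairs the matching records. A small in-memory sketch of that pattern (assumed record shapes and names, not the book's code):

```python
from collections import defaultdict

def reduce_side_join(customers, orders):
    """Simulate a reduce-side join. Mappers would emit (join_key, tagged
    record) pairs; here we bucket both datasets by key, then each 'reduce'
    group cross-pairs its customer and order records."""
    groups = defaultdict(lambda: {"customer": [], "order": []})
    for cust_id, name in customers:      # map side, tagged "customer"
        groups[cust_id]["customer"].append(name)
    for cust_id, item in orders:         # map side, tagged "order"
        groups[cust_id]["order"].append(item)
    joined = []
    for key in sorted(groups):           # shuffle delivers one group per key
        for name in groups[key]["customer"]:
            for item in groups[key]["order"]:
                joined.append((key, name, item))
    return joined
```

Keys that appear in only one dataset produce no output, giving inner-join semantics; emitting unmatched records instead would give an outer join.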
- 6. Hadoop Ecosystem – Apache Hive
- Introduction
- Getting started with Apache Hive
- Creating databases and tables using Hive CLI
- Simple SQL-style data querying using Apache Hive
- Creating and populating Hive tables and views using Hive query results
- Utilizing different storage formats in Hive – storing table data using ORC files
- Using Hive built-in functions
- Hive batch mode – using a query file
- Performing a join with Hive
- Creating partitioned Hive tables
- Writing Hive User-defined Functions (UDFs)
- HCatalog – performing Java MapReduce computations on data mapped to Hive tables
- HCatalog – writing data to Hive tables from Java MapReduce computations
- 7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
- Introduction
- Getting started with Apache Pig
- Joining two datasets using Pig
- Accessing Hive table data in Pig using HCatalog
- Getting started with Apache HBase
- Data random access using Java client APIs
- Running MapReduce jobs on HBase
- Using Hive to insert data into HBase tables
- Getting started with Apache Mahout
- Running K-means with Mahout
- Importing data to HDFS from a relational database using Apache Sqoop
- Exporting data from HDFS to a relational database using Apache Sqoop
- 8. Searching and Indexing
- Introduction
- Generating an inverted index using Hadoop MapReduce
- Intradomain web crawling using Apache Nutch
- Indexing and searching web documents using Apache Solr
- Configuring Apache HBase as the backend data store for Apache Nutch
- Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
- Elasticsearch for indexing and searching
- Generating the in-links graph for crawled web pages
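The inverted-index recipe that opens this chapter has a simple core: mappers emit (term, document ID) pairs and reducers collect each term's posting list. A compact in-process sketch of that logic (Python stand-in for the book's Java MapReduce version; names are illustrative):

```python
def inverted_index(docs):
    """Build an inverted index the way the MapReduce recipe does:
    'mappers' emit (term, doc_id) for each distinct term in a document,
    'reducers' gather the doc IDs into a sorted posting list."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):   # distinct terms per doc
            postings.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}
```

The posting lists produced this way are the data structure that Solr and Elasticsearch (covered later in the chapter) maintain internally for search.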
- 9. Classifications, Recommendations, and Finding Relationships
- 10. Mass Text Data Processing
- Introduction
- Data preprocessing using Hadoop streaming and Python
- De-duplicating data using Hadoop streaming
- Loading large datasets to an Apache HBase data store – importtsv and bulkload
- Creating TF and TF-IDF vectors for the text data
- Clustering text data using Apache Mahout
- Topic discovery using Latent Dirichlet Allocation (LDA)
- Document classification using Mahout Naive Bayes Classifier
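The TF and TF-IDF recipe above weights each term by its frequency in a document, discounted by how common it is across the corpus. A minimal sketch using one common weighting variant, tf × log(N/df) — Mahout's exact formula differs slightly, so treat this as illustration only:

```python
import math

def tf_idf(docs):
    """Sketch of TF-IDF vector creation: tf = raw term count in the
    document, idf = log(N / df) computed over the whole corpus."""
    n_docs = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = {}                              # document frequency per term
    for tokens in tokenized.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {d: {t: tokens.count(t) * math.log(n_docs / df[t])
                for t in set(tokens)}
            for d, tokens in tokenized.items()}
```

A term that appears in every document gets weight 0, which is why such vectors work better than raw counts as input to the clustering and classification recipes in this chapter.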
- Index
Product information
- Title: Hadoop MapReduce v2 Cookbook - Second Edition
- Author(s):
- Release date: February 2015
- Publisher(s): Packt Publishing
- ISBN: 9781783285471