Book description
If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark. Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You’ll learn how to implement the appropriate MapReduce solution with code that you can use in your projects.
Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark.
Table of contents
- Foreword
- Preface
- What Is MapReduce?
- Hadoop and Spark
- What Is in This Book?
- What Is the Focus of This Book?
- Who Is This Book For?
- Online Resources
- What Software Is Used in This Book?
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Comments and Questions for This Book
- 1. Secondary Sort: Introduction
- 2. Secondary Sort: A Detailed Example
- 3. Top 10 List
- 4. Left Outer Join
- 5. Order Inversion
- 6. Moving Average
- 7. Market Basket Analysis
- 8. Common Friends
- 9. Recommendation Engines Using MapReduce
- 10. Content-Based Recommendation: Movies
- 11. Smarter Email Marketing with the Markov Model
- 12. K-Means Clustering
- 13. k-Nearest Neighbors
- 14. Naive Bayes
- 15. Sentiment Analysis
- 16. Finding, Counting, and Listing All Triangles in Large Graphs
- 17. K-mer Counting
- 18. DNA Sequencing
- 19. Cox Regression
- 20. Cochran-Armitage Test for Trend
- 21. Allelic Frequency
- 22. The T-Test
- 23. Pearson Correlation
- Pearson Correlation Formula
- Pearson Correlation Example
- Data Set for Pearson Correlation
- POJO Solution for Pearson Correlation
- POJO Solution Test Drive
- MapReduce Solution for Pearson Correlation
- Hadoop Implementation Classes
- Spark Solution for Pearson Correlation
- Input
- Output
- Spark Solution
- High-Level Steps
- Step 1: Import required classes and interfaces
- smaller() method
- MutableDouble class
- toMap() method
- toListOfString() method
- readBiosets() method
- Step 2: Handle input parameters
- Step 3: Create a Spark context object
- Step 4: Create list of input files/biomarkers
- Step 5: Broadcast reference as global shared object
- Step 6: Read all biomarkers from HDFS and create the first RDD
- Step 7: Filter biomarkers by reference
- Step 8: Create (Gene-ID, (Patient-ID, Gene-Value)) pairs
- Step 9: Group by gene
- Step 10: Create Cartesian product of all genes
- Step 11: Filter redundant pairs of genes
- Step 12: Calculate Pearson correlation and p-value
- Pearson Correlation Wrapper Class
- Testing the Pearson Class
- Pearson Correlation Using R
- YARN Script to Run Spark Program
- Spearman Correlation Using Spark
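Chapter 23's outline moves from a plain-Java (POJO) Pearson implementation to MapReduce and Spark versions. As a rough sketch of what the POJO computation involves (the class and method names here are illustrative, not the book's actual code):

```java
// Illustrative Pearson correlation over two equal-length samples --
// a minimal sketch only, not the book's POJO solution.
public class PearsonSketch {
    public static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("need non-empty, equal-length arrays");
        }
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < n; i++) {
            sumX  += x[i];
            sumY  += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        // r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt(n * sumXX - sumX * sumX)
                   * Math.sqrt(n * sumYY - sumY * sumY);
        return num / den;
    }
}
```

In the Spark steps listed above, each pair of per-gene value vectors surviving the Cartesian product and filtering steps would feed a function of this shape.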
- 24. DNA Base Count
- 25. RNA Sequencing
- 26. Gene Aggregation
- 27. Linear Regression
- 28. MapReduce and Monoids
- Introduction
- Definition of Monoid
- Monoidic and Non-Monoidic Examples
- Maximum over a Set of Integers
- Subtraction over a Set of Integers
- Addition over a Set of Integers
- Multiplication over a Set of Integers
- Mean over a Set of Integers
- Non-Commutative Example
- Median over a Set of Integers
- Concatenation over Lists
- Union/Intersection over Integers
- Functional Example
- Matrix Example
- MapReduce Example: Not a Monoid
- MapReduce Example: Monoid
- Spark Example Using Monoids
- Conclusion on Using Monoids
- Functors and Monoids
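The monoid chapter outlined above turns on one property: an associative operation with an identity element can be applied to partial results in any grouping, which is what makes a MapReduce combiner safe. A minimal Java sketch of that idea (the helper names are illustrative, not the book's code):

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Folding with an associative op and an identity element: the monoid
// structure that makes partial (combiner-side) aggregation safe.
public class MonoidSketch {
    static <T> T fold(List<T> xs, T identity, BinaryOperator<T> op) {
        T acc = identity;
        for (T x : xs) {
            acc = op.apply(acc, x);
        }
        return acc;
    }

    // Mean over integers is NOT a monoid (means of means are wrong),
    // but (sum, count) pairs are: combine them associatively and
    // divide only once at the very end.
    static long[] combine(long[] a, long[] b) {
        return new long[] { a[0] + b[0], a[1] + b[1] };
    }
}
```

`fold(data, 0, Integer::sum)` gives the same answer no matter how the data is partitioned across mappers, which is exactly the property a combiner exploits; the `(sum, count)` trick is the standard fix for non-monoidic aggregates like the mean.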
- 29. The Small Files Problem
- 30. Huge Cache for MapReduce
- 31. The Bloom Filter
- A. Bioset
- B. Spark RDDs
- Spark Operations
- Tuple<N>
- RDDs
- How to Create RDDs
- Creating RDDs Using Collection Objects
- Collecting Elements of an RDD
- Transforming an Existing RDD into a New RDD
- Creating RDDs by Reading Files
- Grouping by Key
- Mapping Values
- Reducing by Key
- Combining by Key
- Filtering an RDD
- Saving an RDD as an HDFS Text File
- Saving an RDD as an HDFS Sequence File
- Reading an RDD from an HDFS Sequence File
- Counting RDD Items
- Spark RDD Examples in Scala
- PySpark Examples
- How to Package and Run Spark Jobs
- Creating the JAR for Data Algorithms
- Running a Job in a Spark Cluster
- Running a Job in Hadoop’s YARN Environment
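Appendix B's "Reducing by Key" entry describes merging all values that share a key with an associative function. The semantics can be illustrated without Spark at all (plain Java streams; the data and class name are made up for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of reduceByKey semantics: values sharing a
// key are merged pairwise with an associative function (here, addition).
public class ReduceByKeySketch {
    public static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.toMap(
                Map.Entry::getKey,    // key extractor
                Map.Entry::getValue,  // initial value for a new key
                Integer::sum));       // merge function for duplicate keys
    }
}
```

Given the pairs ("a",1), ("b",2), ("a",3), this produces {a=4, b=2}; a Spark `JavaPairRDD` would reach the same result with `reduceByKey`, doing the merge per partition before shuffling.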
- Bibliography
- Index
Product information
- Title: Data Algorithms
- Author(s): Mahmoud Parsian
- Release date: July 2015
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491906187