Video description
9+ Hours of Video Instruction
The perfect (and fast) way to get started with Hadoop and Spark
Hadoop and Spark Fundamentals LiveLessons provides 9+ hours of video introduction to the Apache Hadoop Big Data ecosystem. The tutorial includes background information and explains the core components of Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, the YARN resource manager, and YARN frameworks. In addition, it demonstrates how to use Hadoop at several levels, including the native Java interface, C++ pipes, and the universal streaming program interface. Examples include how to use benchmarks and high-level tools, including the Apache Pig scripting language, the Apache Hive "SQL-like" interface, Apache Flume for streaming input, Apache Sqoop for import and export of relational data, and Apache Oozie for Hadoop workflow management. In addition, there is comprehensive coverage of Spark, PySpark, and the Zeppelin web-GUI. The steps for easily installing a working Hadoop/Spark system on a desktop/laptop and on a local stand-alone cluster using the powerful Ambari GUI are also included. All software used in these LiveLessons is open source and freely available for your use and experimentation. A bonus lesson includes a quick primer on the Linux command line as used with Hadoop and Spark.
Downloads associated with this LiveLesson can be found at https://www.clustermonkey.net/download/LiveLessons/Hadoop_Fundamentals/
About the Instructor
Douglas Eadline, PhD, began his career as a practitioner and a chronicler of the Linux cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering High Performance Computing (HPC) and Data Analytics. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine, and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/Data Analytics industry and leader of the Limulus Personal Cluster Project. He is author of Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, and book coauthor of Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale. He is also the sole author of Hadoop 2 Quick Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem.
Skill Level
- Beginner
- Intermediate
Learn How To
- Understand Hadoop design and key components
- Understand how the MapReduce process works in Hadoop
- Understand the relationship of Spark and Hadoop
- Understand key aspects of the new YARN design and frameworks
- Use, administer, and program HDFS
- Run and administer Hadoop/Spark programs
- Write basic MapReduce/Spark programs
- Install Hadoop/Spark on a laptop/desktop
- Run Apache Pig, Hive, Flume, Sqoop, Oozie, and Spark applications
- Perform basic data ingest with Hive and Spark
- Use the Zeppelin web-GUI for Spark/Hive programming
- Install and administer Hadoop with the Apache Ambari GUI tool
Who Should Take This Course
- Users, developers, and administrators interested in learning the fundamental aspects and operations of the open source Hadoop and Spark ecosystems
Course Requirements
- Basic understanding of programming and development
- A working knowledge of Linux systems and tools
- Familiarity with Bash, Python, Java, and C++
Lesson 1: Background Concepts
This lesson introduces Hadoop and Spark along with the many aspects and features that enable the analysis of large unstructured data sets. Many discussions about Hadoop ignore the fundamental change Hadoop brings to data management. Doug explains this key point using the data lake metaphor, and then provides background on how the Hadoop data platform, MapReduce, and Spark fit into the data analytics landscape. A bonus lesson for new Linux users is also included; it covers the basics of the command line interface used throughout these lessons.
Lesson 2: Running Hadoop on a Desktop or Laptop
A real Hadoop installation, whether on a local cluster or in the cloud, can be difficult to configure and possibly expensive. To make the examples in this tutorial more accessible, you learn how to install the Hortonworks HDP Sandbox on a desktop or laptop. The "Sandbox" is a freely available Hadoop virtual machine that provides a full Hadoop environment (including Spark). You can use this environment to try most of the examples in this tutorial. If you would rather learn about Hadoop and Spark installation details, Doug also walks through a direct installation on a single Linux machine using the latest Hadoop and Spark binaries.
Lesson 3: The Hadoop Distributed File System
The backbone of Hadoop is the Hadoop Distributed File System, or HDFS. In this lesson you learn the basics of HDFS and how it differs from many standard file systems used today. In particular, Doug explains why various design trade-offs give HDFS a performance edge in big data applications. You also learn how to navigate HDFS using the Hadoop tools and how to use HDFS in user programs. Finally, he presents some of the new features available in HDFS, including high availability, federation, snapshots, and NFS access.
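As a rough illustration of the block-based design this lesson covers, the following Python sketch estimates how many blocks a file occupies in HDFS and how much raw storage its replicas consume. It assumes a 128 MB block size and a replication factor of 3, which are common Hadoop 2 defaults that any cluster may override. The function name hdfs_footprint is hypothetical and is not part of any Hadoop API.

```python
import math

# Assumed values: 128 MB blocks and 3 replicas are common Hadoop 2 defaults,
# but both are configurable per cluster and per file.
BLOCK_SIZE_MB = 128
REPLICATION = 3


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw storage in MB) for a single file.

    The last block holds only the remaining data, so raw storage is
    simply the file size multiplied by the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


# A 1000 MB file under the assumed defaults.
blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

Large blocks keep the number of entries the NameNode must track small, which is one of the trade-offs that favors a few very large files over many small ones.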
Lesson 4: Hadoop MapReduce
If the Hadoop Distributed File System is the backbone of Hadoop, then MapReduce is the muscle that operates on big data. In this lesson, Doug shows you how MapReduce compares to a traditional search approach. From there, he shows you how to compile and run a Java MapReduce application. Deeper background on how MapReduce works is presented along with how to use MapReduce with other languages and how to do simple debugging of a MapReduce program.
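The MapReduce flow described in this lesson can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps for a word count, not the Java application used in the course and not Hadoop code. All function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

In a real cluster the mappers and reducers run as separate tasks on different nodes, and the shuffle moves data between them over the network. The sketch keeps only the data flow.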
Lesson 5: Hadoop MapReduce Examples
This lesson continues with MapReduce examples. Doug first shows you a multifile word count program, and then moves on to a more practical log file analysis. From there, he demonstrates how to process a really large text file, like Wikipedia. The lesson concludes with some examples of running MapReduce benchmarks and using the YARN job browser.
Lesson 6: Higher Level Tools
While Hadoop is very effective at presenting a basic scalable MapReduce model, some higher-level approaches have been developed. In this lesson, Doug teaches you how to use Apache Pig, a Hadoop scripting language that simplifies using MapReduce. In addition, he shows you how to use Apache Hive QL, an SQL-like language that enables higher-level "ad hoc" queries using MapReduce and HDFS. Finally, the Oozie workflow manager is presented.
Lesson 7: Using the Spark Language
Spark has become a popular tool for data analytics. In this lesson, Doug provides some of the basic aspects of the Spark language and demonstrates the Python-Spark interface, PySpark, with a simple command line example. Additional aspects of the Spark language are also used in the next two lessons.
Lesson 8: Getting Data into Hadoop HDFS
The first, and often overlooked, step in data analytics is "data ingest." As was demonstrated in Lesson 3, files can simply be copied into HDFS. However, there are methods that can preserve and import structure that could be lost with simple copying. In this lesson, Doug demonstrates how to import data into Hive tables and use Spark to import data into HDFS. He also demonstrates importing log and other streaming data directly into HDFS using Apache Flume. Finally, a complete example of using Apache Sqoop to import and export a relational database into and out of HDFS is presented.
Lesson 9: Using the Zeppelin Web Interface
Although many of the early Hadoop applications were developed using the command line interface, new web-based GUI tools such as Apache Zeppelin offer a more user-friendly approach to application development. In this lesson, a walk-through of the Zeppelin interface is provided and includes an example of how to create an interactive Zeppelin notebook for a simple Spark application.
Lesson 10: Learning Basic Hadoop Installation and Administration
One of the challenges facing Hadoop users and administrators is setting up a real cluster for production use. In this lesson, Doug teaches you how to use the Ambari web GUI to install, monitor, and administer a full Hadoop installation. He also provides a few important command line tools that will help with basic administration. Finally, some additional HDFS features such as snapshots and NFSv3 mounts are demonstrated.
About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video Training at http://www.informit.com/video.
Table of contents
- Introduction
- Lesson 1: Background Concepts
- Lesson 2: Running Hadoop on a Desktop or Laptop
- Lesson 3: The Hadoop Distributed File System
- Lesson 4: Hadoop MapReduce
- Lesson 5: Hadoop MapReduce Examples
- Lesson 6: Higher Level Tools
- Lesson 7: Using the Spark Language
- Lesson 8: Getting Data into Hadoop HDFS
- Lesson 9: Using the Zeppelin Web Interface
- Lesson 10: Learning Basic Hadoop Installation and Administration
- Summary
Product information
- Title: Hadoop and Spark Fundamentals
- Author(s):
- Release date: June 2018
- Publisher(s): Pearson
- ISBN: 0134770862