Running Hadoop HDFS

A distributed processing framework wouldn't be complete without distributed storage. One of them is HDFS. Even if Spark is run on local mode, it can still use a distributed file system at the backend. Like Spark breaks computations into subtasks, HDFS breaks a file into blocks and stores them across a set of machines. For HA, HDFS stores multiple copies of each block, the number of copies is called replication level, three by default (refer to Figure 3-5).

NameNode is managing the HDFS storage by remembering the block locations and other metadata such as owner, file permissions, and block size, which are file-specific. Secondary Namenode is a slight misnomer: its function is to merge the metadata modifications, edits, into ...

Get Scala:Applied Machine Learning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.