HDFS Daemons

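An HDFS cluster has two types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers). A third daemon, the secondary NameNode, helps the NameNode with housekeeping. The daemons are: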
  • NameNode: Manages the filesystem's directory structure and the metadata for all the files. This information is persisted on local disk in the form of two files:
    1. fsimage: This is the master copy of the metadata for the file system.
    2. edits: This file stores the changes (deltas/modifications) made to the metadata. In newer versions of Hadoop (I am using 2.4) there are multiple edits files, each covering a range of transactions, that together store the changes made to the metadata.
    In addition to this, the NameNode also holds the mapping of blocks to the DataNodes where each block is stored, but that information is not persisted to disk. Instead, DataNodes send the list of blocks they hold to the NameNode on startup, and the NameNode keeps it in memory. The NameNode serves the filesystem metadata entirely from RAM for fast lookup and retrieval, which places a cap on how much metadata the NameNode can handle.
  • Secondary NameNode: The job of the secondary NameNode is to merge the fsimage and edits files for the primary NameNode. The basic issue is that taking the fsimage and applying all the edits to it is very CPU-intensive, so that work is delegated to the secondary NameNode. The secondary NameNode downloads the fsimage and edits files from the primary, merges them into a new fsimage, and then sends it back to the primary.
  • DataNode: This is the workhorse daemon that is responsible for storing and retrieving blocks of data. It is also responsible for maintaining the block report (the list of blocks stored on that DataNode). It sends a heartbeat to the NameNode at a regular interval (every 3 seconds by default, set by dfs.heartbeat.interval). Block reports are sent separately and much less often (on the order of hours by default, set by dfs.blockreport.intervalMsec).
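
To make the two ideas above concrete, here is a toy sketch in Python. It is not Hadoop code: the function names (apply_edits, build_block_map), the edit tuple format, and the sample paths and block IDs are all made up for illustration. It only shows the shape of a checkpoint (fsimage plus edits gives a new fsimage) and of rebuilding the in-memory block-to-DataNode map from block reports.

```python
from collections import defaultdict


def apply_edits(fsimage, edits):
    """Toy checkpoint: replay each edit, in order, on a copy of the fsimage."""
    merged = dict(fsimage)  # copy, so the old checkpoint is left untouched
    for op, path, meta in edits:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return merged


def build_block_map(block_reports):
    """Rebuild the block -> DataNodes map purely from block reports."""
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            block_map[block_id].add(datanode)
    return dict(block_map)


# Hypothetical namespace: one file in the last checkpoint, two later edits.
fsimage = {"/a.txt": {"blocks": ["blk_1"]}}
edits = [
    ("create", "/b.txt", {"blocks": ["blk_2", "blk_3"]}),
    ("delete", "/a.txt", None),
]
new_fsimage = apply_edits(fsimage, edits)

# Hypothetical block reports sent by two DataNodes on startup.
block_reports = {"dn1": ["blk_2", "blk_3"], "dn2": ["blk_2"]}
block_map = build_block_map(block_reports)

print(new_fsimage)
print(block_map)
```

Note that build_block_map never reads anything from disk: if the NameNode restarts, the map is simply rebuilt from the next round of block reports, which is why HDFS does not need to persist it.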
There are two ways to start the daemons necessary for HDFS: you can start them individually using hadoop-daemon.sh start <daemontype>, e.g. hadoop-daemon.sh start namenode, or you can start all of them at once using start-dfs.sh.


Anonymous said...

nice explanation....

Anonymous said...

aren't jobtracker and tasktracker also daemons?