Finding the blocks used to store a file in HDFS

HDFS stores a file by breaking it into blocks. The default block size was 64 MB in older Hadoop releases (Hadoop 2.x and later default to 128 MB), and you can change it. The blocks are spread across one or more disks or machines. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. Map tasks in MapReduce normally operate on one block at a time (the InputSplit mechanism takes care of assigning a different map task to each block).

Replication is handled at the block level: HDFS replicates each block to different machines, so if a block is corrupted or a machine that stores it is down, the block can be read from another machine. When a block is corrupted or unreachable, HDFS also re-replicates it from the remaining copies to bring the replication factor back to its normal level. Some applications may choose to set a high replication factor for the blocks of a popular file to spread the read load across the cluster.

I wanted to figure out how a file gets stored in HDFS (using /user/user/aesop.txt as an example), so I tried these steps on my local single-node Hadoop cluster.
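To make the block arithmetic concrete before walking through the steps, here is a minimal Python sketch of how a file of a given size would be cut into blocks. The function name block_layout is my own and is not part of Hadoop, and the 64 MB default is only the value assumed in this post. It shows that only the last block can be shorter than the block size, and that a small file takes up only its own size.

```python
MB = 1024 * 1024


def block_layout(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes splits into.

    Hypothetical helper for illustration only; HDFS does this internally.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


# A 150 MB file with 64 MB blocks: two full blocks and one 22 MB block.
print([size // MB for size in block_layout(150 * MB)])  # prints [64, 64, 22]
```

A 1 MB file gives a single 1 MB block here, which matches the point above that a small file does not use up a whole block of storage. The steps I followed on my cluster are listed next.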
  1. First I looked at my hadoop-site.xml to find the value of the dfs.data.dir property, which points to where the data is stored. In my case it is the /home/user/data directory.
    
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/user/data</value>
      </property>
    </configuration>
    
  2. There is a file /user/user/aesop.txt stored in my HDFS, and I wanted to see where it is stored, so I ran the hdfs fsck /user/user/aesop.txt -blocks -files command to get the list of blocks that make up this file.
    It is stored in block BP-1574103969-127.0.1.1-1403309876533:blk_1073742129_1306
  3. When I looked under the /home/user/data/current directory, I saw BP-1574103969-127.0.1.1-1403309876533, which matches the first part of the block name. It has a bunch of subdirectories.
  4. The next step was to find blk_1073742129_1306, the second part of the block name. I searched for it without the trailing _1306 and found a .meta file and a normal file that holds the actual content of aesop.txt. When I opened the normal file, I could see the content of aesop.txt.
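The steps above can be sketched as a small Python script. This is only a rough sketch of what I did by hand, under the same assumptions as the rest of this post: the helper names (read_property, split_block_id, find_block_files) are made up for illustration, and the current/<block pool> directory layout is simply what I saw on my own machine, not something I found documented. It reads dfs.data.dir from the config text, splits the block name reported by fsck into its parts, and walks the data directory looking for the block file and its .meta file.

```python
import os
import re
import xml.etree.ElementTree as ET

# Matches names such as BP-1574103969-127.0.1.1-1403309876533:blk_1073742129_1306
BLOCK_RE = re.compile(r"(BP-[\w.\-]+):(blk_\d+)_(\d+)")


def read_property(config_xml, name):
    """Return the <value> of the named <property> in a Hadoop config string, or None."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def split_block_id(fsck_text):
    """Return (block_pool, block_name, suffix) for the first block in fsck output, or None."""
    match = BLOCK_RE.search(fsck_text)
    return match.groups() if match else None


def find_block_files(data_dir, block_pool, block_name):
    """Return paths of the block file and its .meta file under data_dir/current/<block_pool>.

    Assumes the on-disk layout observed in this post; other setups may differ.
    """
    pool_dir = os.path.join(data_dir, "current", block_pool)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(pool_dir):
        for fname in filenames:
            is_data = fname == block_name
            is_meta = fname.startswith(block_name + "_") and fname.endswith(".meta")
            if is_data or is_meta:
                matches.append(os.path.join(dirpath, fname))
    return sorted(matches)


if __name__ == "__main__":
    config = """<configuration>
      <property><name>dfs.data.dir</name><value>/home/user/data</value></property>
    </configuration>"""
    fsck_line = "BP-1574103969-127.0.1.1-1403309876533:blk_1073742129_1306"

    data_dir = read_property(config, "dfs.data.dir")
    pool, block, _suffix = split_block_id(fsck_line)
    # Prints nothing if the data directory does not exist on this machine.
    for path in find_block_files(data_dir, pool, block):
        print(path)
```

I pass the config and the fsck line in as plain strings so the sketch does not depend on a running cluster. On a real setup you would read the config file from disk and paste in the block name that fsck printed.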
Note: I found this on my local Hadoop install, but I haven't seen this method in the documentation, so it might not work for you. The good part is that as long as your HDFS commands are working, you don't need to worry about this.

1 comment:

Abhishek Singh said...

Were you able to read the meta file? Does it contain the physical disk memory address at which the file data is stored?