In this post, I take an example file in HDFS and run fsck to find the locations of the file's block replicas, its block pool ID and its block ID. This information helps me locate the file's block on the local filesystem of one of the DataNodes.
In the second part, I alter the file on the local filesystem (from the HDFS standpoint, it is a block). As a result, the NameNode marks the block replica as corrupted and a new replica is created on another DataNode.
HDFS
Show details of the example file in HDFS:
hadoop fs -ls /tmp/test_spark.csv
Output:
-rw-r--r-- 3 ubuntu hdfs 56445434 2016-03-06 18:17 /tmp/test_spark.csv
Run tail on the file:
hadoop fs -tail /tmp/test_spark.csv
The output is this:
804922,177663.1,793945.2,"factor_1_10000","factor_2_10000"
93500,378660.1,120037.2,"factor_1_10000","factor_2_10000"
394490,149354.1,253562.2,"factor_1_10000","factor_2_10000"
253001,446918.1,602891.2,"factor_1_10000","factor_2_10000"
196553,945027.1,97370.2,"factor_1_10000","factor_2_10000"
83715,56758.1,888537.2,"factor_1_10000","factor_2_10000"
593831,369048.1,844320.2,"factor_1_10000","factor_2_10000"
721077,109160.1,604853.2,"factor_1_10000","factor_2_10000"
383946,111066.1,779658.2,"factor_1_10000","factor_2_10000"
461973,695670.1,596577.2,"factor_1_10000","factor_2_10000"
70845,360039.1,479357.2,"factor_1_10000","factor_2_10000"
813333,839700.1,568456.2,"factor_1_10000","factor_2_10000"
967549,721770.1,998214.2,"factor_1_10000","factor_2_10000"
919219,466408.1,583846.2,"factor_1_10000","factor_2_10000"
977914,169416.1,412922.2,"factor_1_10000","factor_2_10000"
739637,25221.1,626499.2,"factor_1_10000","factor_2_10000"
223358,918445.1,337362.2,"factor_1_10000","factor_2_10000"
I run fsck on the file:
hdfs fsck /tmp/test_spark.csv -files -blocks -locations
The output is:
Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:18:44 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s): OK
BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.103:50010,DS-1c68e4c7-d424-47e8-b7cc-941198fe2415,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]
Status: HEALTHY
Total size: 56445434 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 56445434 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Sun Mar 06 18:18:44 CET 2016 in 1 milliseconds
The filesystem under path '/tmp/test_spark.csv' is HEALTHY
The file is stored in one block, because its size (56445434 bytes) is below dfs.blocksize (134217728 bytes by default).
The replication factor is 3 (the default), and replicas of the block can be found on the following DataNodes: 10.0.XXX.103, 10.0.XXX.104, 10.0.XXX.105
BP-1553412973-10.0.160.75-1456844185620 - Block Pool ID
blk_1073741903 - Block ID (the trailing _1079 in blk_1073741903_1079 is the block's generation stamp)
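To avoid picking these values out of the fsck output by eye, they can be extracted with plain shell tools. This is a minimal sketch that works on the block line copied from the output above. The variable names are my own, and the parsing assumes the line format shown in this post, which may differ in other Hadoop versions.

```shell
#!/bin/sh
# Block line copied verbatim from the fsck output above.
line='BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.103:50010,DS-1c68e4c7-d424-47e8-b7cc-941198fe2415,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]'

# First whitespace-separated field: <block pool ID>:<block name>
block_ref=${line%% *}
block_pool_id=${block_ref%%:*}   # everything before the colon
block_name=${block_ref#*:}       # blk_1073741903_1079
gen_stamp=${block_name##*_}      # part after the last underscore
block_id=${block_name%_*}        # block name without the generation stamp

# One DataNode address per line: keep the text between
# "DatanodeInfoWithStorage[" and the following ":port".
datanodes=$(printf '%s\n' "$line" \
  | grep -o 'DatanodeInfoWithStorage\[[^:]*' \
  | sed 's/.*\[//')

echo "block pool ID:    $block_pool_id"
echo "block ID:         $block_id"
echo "generation stamp: $gen_stamp"
echo "DataNodes:"
printf '%s\n' "$datanodes"
```

In practice the hard-coded line would be replaced by the matching line of the real fsck output.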
Linux
Now I can look for the file in Linux.
I connect to one of the DataNodes listed in the fsck output.
ssh -i .ssh/key 10.0.XXX.103
The property dfs.datanode.data.dir tells us where on the local filesystem the DataNode stores its blocks. It is set in hdfs-site.xml if you administer the cluster manually, or in Ambari under HDFS -> Configs -> Settings -> DataNode -> DataNode directories.
On this cluster the value is /hadoop/hdfs/data.
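The block file sits several directory levels below the data directory, so it can be easier to search for it by block ID than to type the whole path. Below is a small sketch. The function name find_block_files is hypothetical, and it assumes the usual DataNode layout, in which the block data is stored in a file named after the block ID and a checksum file named blk_<id>_<generation stamp>.meta lies next to it.

```shell
#!/bin/sh
# find_block_files DATA_DIR BLOCK_ID
# Prints the block file and its .meta companion (if present) found below DATA_DIR.
# The exact names are matched so that blk_1073741903 does not also match
# a longer ID such as blk_10737419030.
find_block_files() {
  find "$1" -type f \( -name "$2" -o -name "${2}_*.meta" \) | sort
}
```

For the file in this post the call would be find_block_files /hadoop/hdfs/data blk_1073741903, run as a user that can read the data directory (hdfs on this cluster).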
I list the details of the block file:
sudo -u hdfs ls -l /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903
The output is the following:
-rw-r--r-- 1 hdfs hadoop 56445434 Mar 6 18:17 /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903
The size of the block file is the same as the size reported by hadoop fs -ls earlier, because the whole file fits into this one block.
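The one-block result follows directly from the two numbers seen so far. As a quick check, here is the arithmetic in shell, assuming dfs.blocksize is left at its default of 134217728 bytes:

```shell
#!/bin/sh
file_size=56445434      # bytes, as reported by hadoop fs -ls
block_size=134217728    # bytes, default dfs.blocksize (128 MiB)

# Ceiling division: number of blocks needed to hold the file.
blocks=$(( (file_size + block_size - 1) / block_size ))

# The last block (here the only one) holds just the remaining bytes;
# it is not padded up to block_size on the local disk.
last_block=$(( file_size - (blocks - 1) * block_size ))

echo "blocks=$blocks last_block_bytes=$last_block"
```

With these values the script reports one block of 56445434 bytes, which matches the size of blk_1073741903 listed above.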
Now I run tail on this file:
sudo -u hdfs tail /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903
Result:
721077,109160.1,604853.2,"factor_1_10000","factor_2_10000"
383946,111066.1,779658.2,"factor_1_10000","factor_2_10000"
461973,695670.1,596577.2,"factor_1_10000","factor_2_10000"
70845,360039.1,479357.2,"factor_1_10000","factor_2_10000"
813333,839700.1,568456.2,"factor_1_10000","factor_2_10000"
967549,721770.1,998214.2,"factor_1_10000","factor_2_10000"
919219,466408.1,583846.2,"factor_1_10000","factor_2_10000"
977914,169416.1,412922.2,"factor_1_10000","factor_2_10000"
739637,25221.1,626499.2,"factor_1_10000","factor_2_10000"
223358,918445.1,337362.2,"factor_1_10000","factor_2_10000"
These ten lines match the last ten lines of the output of hadoop fs -tail shown earlier.
Changing the file in Linux
I open this file for editing:
sudo -u hdfs vi /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903
After I change and save it, the file disappears from the parent folder.
fsck in HDFS
Now I run fsck on the same file again:
hdfs fsck /tmp/test_spark.csv -files -blocks -locations
The output is the following:
Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:34:41 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s): OK
BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.102:50010,DS-db55f66a-e6b6-480a-87bf-2053fbed2960,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]
Status: HEALTHY
Total size: 56445434 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 56445434 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Sun Mar 06 18:34:41 CET 2016 in 1 milliseconds
The filesystem under path '/tmp/test_spark.csv' is HEALTHY
The file is still replicated 3 times on 3 DataNodes, but this time on DataNodes 10.0.XXX.102, 10.0.XXX.104, 10.0.XXX.105.
The output shows that there is no longer a replica on the DataNode with IP 10.0.XXX.103. That was the DataNode I connected to in order to tamper with the file.
The NameNode has identified the replica on that node as corrupted, and a new replica of the block has been created on 10.0.XXX.102.
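The change in replica locations can also be computed instead of compared by eye. The following sketch takes the block lines from the two fsck runs in this post and reports which DataNode lost its replica and which one gained it. The helper name replica_hosts is my own, and the parsing assumes the fsck line format shown here.

```shell
#!/bin/sh
# replica_hosts: read fsck output on stdin, print one DataNode address per line, sorted.
replica_hosts() {
  grep -o 'DatanodeInfoWithStorage\[[^:]*' | sed 's/.*\[//' | LC_ALL=C sort -u
}

# Block lines copied verbatim from the first and second fsck run.
before='BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.103:50010,DS-1c68e4c7-d424-47e8-b7cc-941198fe2415,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]'
after='BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.102:50010,DS-db55f66a-e6b6-480a-87bf-2053fbed2960,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]'

# comm needs sorted files, so write both host lists to temporary files.
f_before=$(mktemp)
f_after=$(mktemp)
printf '%s\n' "$before" | replica_hosts > "$f_before"
printf '%s\n' "$after"  | replica_hosts > "$f_after"

removed=$(LC_ALL=C comm -23 "$f_before" "$f_after")   # only in the first run
added=$(LC_ALL=C comm -13 "$f_before" "$f_after")     # only in the second run
rm -f "$f_before" "$f_after"

echo "replica removed from: $removed"
echo "replica added on:     $added"
```

On a live cluster the two variables would be filled from the output of hdfs fsck taken before and after the change.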