Apache Hadoop 3 as a Service on AWS

Apache Hadoop 3.1 cluster built from CLI. Link to github repository is below.

The general idea is to have a solution that builds an Apache Hadoop 3 cluster from command line. This can be useful for learning purposes, for testing or for spinning a Hadoop cluster for a certain job and then terminating it, hence minimizing costs.

Motivation

a couple of years ago I listened to a Spark Summit conference and one company introduced the following architectural solution: data were sitting in S3, when there was the need for analysis, a Hadoop cluster was created, data was pushed to HDFS and analyses were done. After the results were collected, the Hadoop cluster was terminated.

About

The code has no exception handling, it uses AWS’s t2.micro instances to prove the point. There is a lot of potential in building a friendly user interface to parametrize the solution. There is only one input parameter – number of datanodes. When using AWS’s free instances, make sure you do not have more than 20 of them running.

There are four files:

  • HaaS.sh
  • script_namenode.sh
  • script_datanode.sh
  • terminate_cluster.sh

The HaaS.sh file launches the instances for namenode and datanode(s) (namenode instance is dedicated for namenode related services – no datanode services are installed there). It is advised to start at least one datanode. Example on how to launch a cluster with 5 datanodes: . Haas.sh 5

When EC2 instance for namenode is ready, script_namenode.sh is executed on that instance. When EC2 instance(s) for datanode(s) are ready, script_datanode.sh is executed on the instance(s).

Prerequisities

I have defined one instance as “Initial” instance. This is where the scripts are located and this instance creates and terminates the cluster. This instance is not a part of the Hadoop cluster, it launches the cluster and terminates it. I am using Ubuntu 16.04 for all my instances. Make sure you have awscli package installed and aws configured on this initial instance.

Prerequisities on AWS

  • key pair
  • security group
    • open all traffic for all instances in the same subnet and security group
    • open port 9870 for Namenode Web Interface
    • open port 8088 for Resource Manager (YARN)
    • open port 19888 for MapReduce JobHistory server
  • subnet

Times

Launching a Hadoop cluster with 10 datanodes took less than 10 minutes. When testing, I did also come down to 8 minutes. I am using sleep command in the Haas.sh script in order to wait for the instances to either start running or for Hadoop to download and install (unpack). Room for optimization here as well.

Order of execution

The HaaS.sh script does the following actions:

  • launch namenode instance and read output text into a variable
  • parse the variable to collect instance id and private ip
  • create instances.list and add namenode instance id to it
  • append private ip and instance name to /etc/hosts
  • enable passwordless ssh to namenode
  • launch datanode(s)
  • update local /etc/hosts
  • create workers file
  • enable passwordless ssh to datanode(s)
  • start services on datanode(s)
  • copy /etc/hosts from initial instance to all Hadoop instances
  • copy workers file to namenode’s $HADOOP_HOME/etc/hadoop
  • start services on datanode(s)
  • remove temporary files

 

Link to the scripts can be found here.

Advertisements

Bash script for creating new user in Hadoop and Ambari Views

Here is a bash script I used a couple of years ago for creating Hadoop users from CLI (or batch). It might be useful for someone.

The script does the following:

  • creates a Linux user
  • generates keys
  • creates home directory in HDFS
  • adds user to a group
  • allocates HDFS space quota
  • gives access in Ambari Views
#!/bin/bash

NEW_USER="$1"
DEPT_NAME="$2"
NAMENODE="t-namenode1"
AMBARI="t-ambari"

#
echo "Creating user "$NEW_USER

#Creating user with no password with user's folder
sudo adduser --disabled-password --gecos "" $NEW_USER

#Create Linux user on the namenode
ssh -i /home/ubuntu/.ssh/key $NAMENODE 'sudo adduser --disabled-password --gecos "" $NEW_USER && sudo chown $NEW_USER:$NEW_USER /home/$NEW_USER'

#Prepare .ssh folder
cd /user/$NEW_USER
sudo mkdir .ssh
sudo chown $NEW_USER:$NEW_USER .ssh/
sudo chmod 700 .ssh

#Create private and public key
sudo -u $NEW_USER  ssh-keygen -t rsa -f $NEW_USER-key

#Copy public key to the authorized_keys
sudo -u $NEW_USER cp $NEW_USER-key.pub .ssh/authorized_keys
sudo -u $NEW_USER chmod 600 .ssh/authorized_keys

#######HDFS
echo "Create system folder for user"
sudo -u hdfs hadoop fs -mkdir /user/$NEW_USER
echo "Change owner of the system folder"
sudo -u hdfs hadoop fs -chown $NEW_USER:hdfs /user/$NEW_USER

#Defining HDFS space quota
echo "Allocate 100g of space on HDFS for the user"
sudo -su hdfs hdfs dfsadmin -setSpaceQuota 100g /department/$DEPT_NAME/users/$NEW_USER

#Access to Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d  '{"Users/user_name": "$USER_NAME", "Users/password":  "$USER_NAME", "Users/active": true, "Users/admin": false }' http://$AMBARI:8080/api/v1/users

#Add user to a group in Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d '[{"MemberInfo/user_name":"$NEW_USER", "MemberInfo/group_name":"$DEPT_NAME"}]' http://$AMBARI:8080/api/v1/groups/$DEPT_NAME/members

echo "User's folder on the client:"
ls -l /user/$NEW_USER

echo "User's system folder on HDFS:"
sudo -u $HDFS hadoop fs -ls /user/$NEW_USER

 

My Work cluster in detail

The cluster was built on OpenStack private cloud owned by a Swiss organization Switch.

The Hadoop distributor was Hortonworks, except for Spark and Zeppelin, who were Apache’s.

Potential users

Since the project owner was an organization supporting educational entities in Switzerland, the potential users were researchers, scientists, students…

I had the luxury of having almost unlimited resources on the infrastructure so I have built 5 Hadoop clusters – 4 were Hortonworks Hadoop clusters, one was Apache Hadoop cluster. Out of the 4, one was the Work environment which was exposed to the end users. And this is the cluster that is described in detail in this post.
Keep in mind that I was working on my own on this development – which meant administering and upgrading 5 clusters and doing data science at the same time. In order to make it work, I had to use the YARN inside me and distribute the limited resources effectively.

Initial resources

Keeping in mind the point of distributed systems is scalability, I have defined the initial cluster with the following capabilities.
6 instances with corresponding details:

  • Ambari Server
  • NameNode
  • DataNode (3)
  • Client
Instance RAM VCPU Default disk size Volume No. Volume Size Security group
Ambari 8GB 8 VCPU 20GB None None sg-ambari
NameNode 32GB 8 VCPU 20GB 1 200GB sg-namenode
DataNode (3) 32GB 8 VCPU 20GB 3 200GB sg-datanode
Client 16GB 16 VCPU 20GB 1 500GB sg-client

Note: There were three DataNodes in the initial cluster.

Characteristics of the cluster

The initial cluster had 1.7 TB HDFS, replication factor was 3, block size was the default 128MB. Rack awareness was not set in the initial cluster and the queue was the default.
On the YARN side, I have made some changes and had 84GB RAM (3 x 32GM = 96GB RAM. 4GB per DataNode was left for services on the instance -> 96GB – 12GB = 84GB) as maximum amount of RAM resources for the cluster – the default values by Apache (Hortonworks?) are quite more conservative.

In the cluster building process the versions were Ambari 2.1 and HDP 2.3. When Ambari 2.2 and HDP 2.4 were available, the cluster was upgraded.

Ambari

Ambari had a server for itself, the database for collecting statistics was MySql. The idea was always to migrate the Ambari Server if needed. The migration to new Ambari server is easy so I could afford to start small for this service.
The Ambari Views was enabled for the users who wanted to upload the files to the HDFS manually. Hive was also available through this service and I on my one of my test environments, I have even embedded Zeppelin in Ambari Views. Though, on the Work cluster, Zeppelin was offered only as an independent service on the Client.
All the ports for Ambari to properly work were in the sg-ambari security group.

NameNode

The initial plan with the NameNode was to run all the services on it except Spark and Zeppelin. When the resources would expand beyond the instance’s capabilities, some services would be moved to a new instance, or unused services would be stopped (experience showed Hive had little popularity among the academia). Using Ambari, migrating services is an easy process, I could afford to have all services running on one NameNode. Only cluster administrator had access to this instance. With other words, client tools were not installed on this instance.
All the ports for the NameNode to properly work were in the sg-namenode security group.

DataNode

I started with 3 DataNodes, which offered 1.7TB of storage on the HDFS. The DataNodes were also used as Workers for Spark and Supervisors for Storm. The users had no access to the DataNodes directly – no client was installed here. This would change according to the needs so that some jobs could access data directly locally.
All the ports for the DataNodes to properly work were in the sg-datanode security group.

Client

The client was users’ window to the cluster. Spark 2.0 (before Summer 2016 it was Spark 1.6) was offered as the computational engine – only one. One reason was also easier administration and optimizatoin from my side.
The users could use the command line interface (CLI), RStudio or Zeppelin. Ambari Views as well, but that was running on Ambari instance. More advanced users went with the CLI, users who wanted to learn Spark were using Zeppelin.
Client for Storm was also installed on this instance. Due to more complex programming (in Java), all the Topologies were handled by me, the users were defining requirements and using the data stored by the Storm.
All the ports for the Client to properly work were in the sg-datanode security group.

See below for page 2.

Streaming with Storm – simple example with HDFS bolt

This post describes a simple Storm topology – random words are written to HDFS. The topology is uploaded on the cluster from the client node. Nimbus is on the cluster’s NameNode. I have 4 DataNodes and on each of them a Supervisor is installed. More on how I installed and configured Storm can be found here.

Services used

I am using Hortonworks 2.4, Hadoop is version 2.7.1, Storm is version 0.10.0. All services were installed through Ambari.

Preparing development environment

Create a new maven project. How to install maven is explained here.

mvn archetype:generate -DgroupId=org.package -DartifactId=storm-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

When the project is created, step into the directory (in this case it is storm-project) where the pom.xml file is also located.

In the org.package (./src/main/java/org.package), create folder spout. The App.java can be deleted.

There are 3 files important for this topology: pom.xml, the spout file and the topology file.

Prepare pom.xml

The pom file for this case includes Storm dependencies, with scope provided. Storm jars are not packed together with the topology! It is important to match the versions.

maven-shade-plugin

Add build node with the plugin

    <build>
        <sourceDirectory>src/</sourceDirectory>
        <resources>
            <resource>
                <directory>${basedir}</directory>
                <includes>
                    <include>*</include>
                </includes>
            </resource>
        </resources>
        <outputDirectory>classes/</outputDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.4</version>
                <configuration>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

clojure

Add clojure in the dependencies node. Be sure to check for newer version

<dependency>
    <groupId>org.clojure</groupId>
    <artifactId>clojure</artifactId>
    <version>1.8.0</version>
</dependency>

storm-core

Make sure the version matches Storm installation

<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>0.10.0</version>
    <!-- keep storm out of the jar-with-dependencies -->
    <scope>provided</scope>
</dependency>

hadoop-client

Hadoop client XML node. Make sure the version matches your Hadoop installation. org.slf4j is omitted otherwise messages about multiple version of the package are appearing

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>2.7.1</version>
	<exclusions>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</exclusion>
	</exclusions>
</dependency>

hadoop-hdfs

Hadoop hdfs XML node. Make sure the version matches your Hadoop installation. org.slf4j is again omitted

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-hdfs</artifactId>
	<version>2.7.1</version>
	<exclusions>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</exclusion>
	</exclusions>
</dependency>

storm-hdfs

<dependency>
	<groupId>org.apache.storm</groupId>
	<artifactId>storm-hdfs</artifactId>
	<version>0.10.1</version>
</dependency>

Now that the pom.xml is in order, you can package the project to see if pom.xml is valid

mvn package

Build success should appear. If not, the pom.xml is invalid and should be taken care of.

Click on the next page for Spout.

Upgrading Hortonworks Data Platform from 2.3.4 to 2.4.0

This post describes how to do an Express Upgrade of Hortonworks Data Platform (HDP) with Ambari.

Ugrading HDP begins with upgrading Ambari, Ambari Metrics and, not mandatory but recommended, adding Grafana.

When this is in place and all services are up and running, Upgrading HDP to 2.4 can begin.

Backup

File backup

Creating a backup of all the important files and databases is the first step. The following steps are done on the NameNode.

Create backup directory

mkdir /home/ubuntu/HDP-2.3.4-backup

Run HDFS filesystem check and save the ouptut to a file in the backup directory

sudo -u hdfs hdfs fsck / -files -blocks -locations > /home/ubuntu/HDP-2.3.4-backup/dfs-old-fsck-1.log

Gather basic filesystem information and statistics in a report

sudo -u hdfs hdfs dfsadmin -report > /home/ubuntu/HDP-2.3.4-backup/dfs-old-report-1.log

List the whole HDFS directory and save the ouptut to a file

sudo -u hdfs hdfs dfs -ls -R > /home/ubuntu/HDP-2.3.4-backup/dfs-old-lsr-1.log

Enter Safemode, mandatory for next steps

sudo -u hdfs hdfs dfsadmin -safemode enter

Save current namespace and reset edits log

sudo -u hdfs hdfs dfsadmin -saveNamespace

Make a copy of the VERSION file (here is HDP’s default directoy, file VERSION should reside in ${dfs.namenode.name.dir}/current)

sudo cp /hadoop/hdfs/namenode/current/VERSION /home/ubuntu/HDP-2.3.4-backup/

Leave Safemode

sudo -u hdfs hdfs dfsadmin -safemode leave

Finalize upgrade of HDFS
According to the Apache Hadoop documentation:

“Datanodes delete their previous version working directories, followed by Namenode doing the same. This completes the upgrade process.”

sudo -u hdfs hdfs dfsadmin -finalizeUpgrade

Database backup

My cluster has MySql database that is used by Hive and Ranger. That means I have 3 databases to back up: hive, ranger and ranger_audit (since I am storing audit data in a database).

hive

DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot hive > /home/ubuntu/HDP-2.3.4-backup/hive_$DAT.sql

This is done beforehand so that you can check the checkbox and move on in the process of upgrade

Hive upgrade warning

Ranger

This is done beforehand so that you can check the checkbox and move on in the process of upgrade

Ranger Admin warning

ranger_admin

DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot ranger > /home/ubuntu/HDP-2.3.4-backup/ranger_$DAT.sql

ranger_audit

DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot ranger_audit > /home/ubuntu/HDP-2.3.4-backup/ranger_audit_$DAT.sql

 

Content of backup folder

/home/ubuntu/HDP-2.3.4-backup/
├── dfs-old-fsck-1.log
├── dfs-old-lsr-1.log
├── dfs-old-report-1.log
├── hive_20160804_074811.sql
├── ranger_20160804_074907.sql
├── ranger_audit_20160804_074914.sql
└── VERSION

Click below on Page 2 to continue with the process.

Creating and adding a DataNode with multiple volumes

In this example I am adding a new DataNode with 3 volumes 200 GB, each.

The DataNode is created through the WebUI in the cloud and so are the 3 volumes. Each volume is attached to device in the following order:

volume01 – /dev/vdb
volume02 – /dev/vdc
volume03 – /dev/vdd

After the new “soon-to-be” DataNode instance has been created and volumes attached there is some work to be done in the command line interface:

  1. Use ssh to connect to the new DataNode instance.
    ssh -i .ssh/key w-datanode04
  2. Update and upgrade the system.
    sudo apt-get update -y && sudo apt-get upgrade -y
  3. Create the directories where the data for each volume for the DataNode will be stored.
    sudo mkdir -p /data/vol1 /data/vol2 /data/vol3
  4. Format file system for every device attached to every volume.
    sudo mkfs.ext4 /dev/vdb
    sudo mkfs.ext4 /dev/vdc
    sudo mkfs.ext4 /dev/vdd
  5. Mount the volumes to the respective directory.
    sudo mount /dev/vdb /data/vol1
    sudo mount /dev/vdc /data/vol2
    sudo mount /dev/vdd /data/vol3
  6. Label the volumes for easier future work.
    sudo e2label /dev/vdb "vol1"
    sudo e2label /dev/vdc "vol2"
    sudo e2label /dev/vdd "vol3"
  7. Open and update /etc/fstab.
    This is smart to do to keep the volumes mounted to the directories after the DataNode is restarted.

    LABEL=vol1 /data/vol1 ext4 defaults,nobootwait 0 0
    LABEL=vol2 /data/vol2 ext4 defaults,nobootwait 0 0
    LABEL=vol3 /data/vol3 ext4 defaults,nobootwait 0 0
    
  8. Check if volumes are mounted to correct directories.
    df -h

    Something like this should appear:

    /dev/vdb      197G      241M      187G      1%      /data/vol1
    /dev/vdc      197G      299M      187G      1%      /data/vol2
    /dev/vdd      197G        65M      187G      1%      /data/vol3

  9. For future reference, you can check the size of all monuted folders under directory /data.
    sudo du -hs /data/vol*

    Something similar to this should be in the output.
    The used disk information in the below example shows data after some files have been done to the HDFS. Immidiately after the Datanode is added to the Hadoop claster, the DataNode holds no filesblocks.

    181M     /data/vol1
    240M    /data/vol2
    5.4M     /data/vol3

Now the DataNode with multiple volumes is ready to be added to the cluster.

It is important to change the property dfs.datanode.data.dir in hdfs-default.xml. Or if you are using Ambari: HDFS -> Configs -> Settings and on the right side, you find the first property under DataNode to be “DataNode directories”.

Note: if you are adding new DataNodes with new DataNodes directories, it is smart to first append the new directories to the existing ones (comma separated, no spaces) and after the DataNodes are added, then remove the old directories.
If there is a directory in this property that does not exist, HDFS will ignore it and will not fail.

How to add a DataNode to a cluster with Ambari is described here.

Where are HDFS files in Linux?

In this post, I take an example file in HDFS, run filecheck to find locations of file’s block replications, file’s block pool ID and block ID. This information will help me locate the file’s block on local filesystem on one of the DataNodes.

In second part, I alter the file on local filesystem (from HDFS standpoint, it is a block). This results in Namenode defining the block as corrupted and new replication is created on another DataNode.

HDFS

Show details of the example file in HDFS:

hadoop fs -ls  /tmp/test_spark.csv

Output:

-rw-r–r–   3 ubuntu hdfs   56445434 2016-03-06 18:17 /tmp/test_spark.csv

Run tail on the file:

hadoop fs -tail  /tmp/test_spark.csv

The output is this:

804922,177663.1,793945.2,”factor_1_10000″,”factor_2_10000″
93500,378660.1,120037.2,”factor_1_10000″,”factor_2_10000″
394490,149354.1,253562.2,”factor_1_10000″,”factor_2_10000″
253001,446918.1,602891.2,”factor_1_10000″,”factor_2_10000″
196553,945027.1,97370.2,”factor_1_10000″,”factor_2_10000″
83715,56758.1,888537.2,”factor_1_10000″,”factor_2_10000″
593831,369048.1,844320.2,”factor_1_10000″,”factor_2_10000″
721077,109160.1,604853.2,”factor_1_10000″,”factor_2_10000″
383946,111066.1,779658.2,”factor_1_10000″,”factor_2_10000″
461973,695670.1,596577.2,”factor_1_10000″,”factor_2_10000″
70845,360039.1,479357.2,”factor_1_10000″,”factor_2_10000″
813333,839700.1,568456.2,”factor_1_10000″,”factor_2_10000″
967549,721770.1,998214.2,”factor_1_10000″,”factor_2_10000″
919219,466408.1,583846.2,”factor_1_10000″,”factor_2_10000″
977914,169416.1,412922.2,”factor_1_10000″,”factor_2_10000″
739637,25221.1,626499.2,”factor_1_10000″,”factor_2_10000″
223358,918445.1,337362.2,”factor_1_10000″,”factor_2_10000″

I run filecheck:

hdfs fsck /tmp/test_spark.csv -files -blocks -locations

The output is:

Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:18:44 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s):  OK
BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.103:50010,DS-1c68e4c7-d424-47e8-b7cc-941198fe2415,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]

Status: HEALTHY
Total size:    56445434 B
Total dirs:    0
Total files:   1
Total symlinks:                0
Total blocks (validated):      1 (avg. block size 56445434 B)
Minimally replicated blocks:   1 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     3.0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Number of data-nodes:          4
Number of racks:               1
FSCK ended at Sun Mar 06 18:18:44 CET 2016 in 1 milliseconds
The filesystem under path ‘/tmp/test_spark.csv’ is HEALTHY

The file is stored in one block ( dfs.blocksize is by default 134217728).
Replication factor is 3 (default) and the block can be found on the following DataNodes: 10.0.XXX.103, 10.0.XXX.104, 10.0.XXX.105

BP-1553412973-10.0.160.75-1456844185620 - Block Pool ID
blk_1073741903_1079 - Block ID

Linux

Now I can look for the file in Linux.

I connect to one of the datanodes that was given in the output of hadoop filecheck command.

ssh -i .ssh/key 10.0.XXX.103

Property dfs.datanode.data.dir in hdfs-default.xml, if you are manually administrating the cluster, or, in Ambari, HDFS -> Configs -> Settings -> DataNode -> DataNode directories, tells us where on the local filesystem the DataNode should store its blocks.

Default is /hadoop/hdfs/data.

If I list details of the file:

sudo -u hdfs ls -l /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

The output is the following:

-rw-r–r– 1 hdfs hadoop 56445434 Mar  6 18:17 /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

The size of the file is the same as when listing the file using hadoop fs -ls earlier (one block for this file).

Now I run tail on this file:

sudo -u hdfs tail /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

Result:

721077,109160.1,604853.2,”factor_1_10000″,”factor_2_10000″
383946,111066.1,779658.2,”factor_1_10000″,”factor_2_10000″
461973,695670.1,596577.2,”factor_1_10000″,”factor_2_10000″
70845,360039.1,479357.2,”factor_1_10000″,”factor_2_10000″
813333,839700.1,568456.2,”factor_1_10000″,”factor_2_10000″
967549,721770.1,998214.2,”factor_1_10000″,”factor_2_10000″
919219,466408.1,583846.2,”factor_1_10000″,”factor_2_10000″
977914,169416.1,412922.2,”factor_1_10000″,”factor_2_10000″
739637,25221.1,626499.2,”factor_1_10000″,”factor_2_10000″
223358,918445.1,337362.2,”factor_1_10000″,”factor_2_10000″

Output of tail matches the output of tail ran with hadoop fs command.

 

Changing the file in Linux

If I open this file for editing:

sudo -u hdfs vi /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

and change it. The file disappears from the parent folder.

 

Filecheck in HDFS

Now I run filecheck on the same file again:

hdfs fsck /tmp/test_spark.csv -files -blocks -locations

The output is the following:

Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:34:41 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s):  OK

BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.102:50010,DS-db55f66a-e6b6-480a-87bf-2053fbed2960,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]

Status: HEALTHY
Total size:    56445434 B
Total dirs:    0
Total files:   1
Total symlinks:                0
Total blocks (validated):      1 (avg. block size 56445434 B)
Minimally replicated blocks:   1 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     3.0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Number of data-nodes:          4
Number of racks:               1
FSCK ended at Sun Mar 06 18:34:41 CET 2016 in 1 milliseconds
The filesystem under path ‘/tmp/test_spark.csv’ is HEALTHY

File is still replicated 3 times on 3 DataNodes, but this time on DataNodes 10.0.XXX.102,  10.0.XXX.104, 10.0.XXX.105.

The output shows that one replication is not on datanode with IP 10.0.XXX.103 anymore. That was the datanode I connected to temper with the file.

NameNode has identified that the block is corrupted and has created a new replica of the block.