My Work cluster in detail

The cluster was built on OpenStack private cloud owned by a Swiss organization Switch.

The Hadoop distributor was Hortonworks, except for Spark and Zeppelin, who were Apache’s.

Potential users

Since the project owner was an organization supporting educational entities in Switzerland, the potential users were researchers, scientists, students…

I had the luxury of having almost unlimited resources on the infrastructure so I have built 5 Hadoop clusters – 4 were Hortonworks Hadoop clusters, one was Apache Hadoop cluster. Out of the 4, one was the Work environment which was exposed to the end users. And this is the cluster that is described in detail in this post.
Keep in mind that I was working on my own on this development – which meant administering and upgrading 5 clusters and doing data science at the same time. In order to make it work, I had to use the YARN inside me and distribute the limited resources effectively.

Initial resources

Keeping in mind the point of distributed systems is scalability, I have defined the initial cluster with the following capabilities.
6 instances with corresponding details:

  • Ambari Server
  • NameNode
  • DataNode (3)
  • Client
Instance RAM VCPU Default disk size Volume No. Volume Size Security group
Ambari 8GB 8 VCPU 20GB None None sg-ambari
NameNode 32GB 8 VCPU 20GB 1 200GB sg-namenode
DataNode (3) 32GB 8 VCPU 20GB 3 200GB sg-datanode
Client 16GB 16 VCPU 20GB 1 500GB sg-client

Note: There were three DataNodes in the initial cluster.

Characteristics of the cluster

The initial cluster had 1.7 TB HDFS, replication factor was 3, block size was the default 128MB. Rack awareness was not set in the initial cluster and the queue was the default.
On the YARN side, I have made some changes and had 84GB RAM (3 x 32GM = 96GB RAM. 4GB per DataNode was left for services on the instance -> 96GB – 12GB = 84GB) as maximum amount of RAM resources for the cluster – the default values by Apache (Hortonworks?) are quite more conservative.

In the cluster building process the versions were Ambari 2.1 and HDP 2.3. When Ambari 2.2 and HDP 2.4 were available, the cluster was upgraded.

Ambari

Ambari had a server for itself, the database for collecting statistics was MySql. The idea was always to migrate the Ambari Server if needed. The migration to new Ambari server is easy so I could afford to start small for this service.
The Ambari Views was enabled for the users who wanted to upload the files to the HDFS manually. Hive was also available through this service and I on my one of my test environments, I have even embedded Zeppelin in Ambari Views. Though, on the Work cluster, Zeppelin was offered only as an independent service on the Client.
All the ports for Ambari to properly work were in the sg-ambari security group.

NameNode

The initial plan with the NameNode was to run all the services on it except Spark and Zeppelin. When the resources would expand beyond the instance’s capabilities, some services would be moved to a new instance, or unused services would be stopped (experience showed Hive had little popularity among the academia). Using Ambari, migrating services is an easy process, I could afford to have all services running on one NameNode. Only cluster administrator had access to this instance. With other words, client tools were not installed on this instance.
All the ports for the NameNode to properly work were in the sg-namenode security group.

DataNode

I started with 3 DataNodes, which offered 1.7TB of storage on the HDFS. The DataNodes were also used as Workers for Spark and Supervisors for Storm. The users had no access to the DataNodes directly – no client was installed here. This would change according to the needs so that some jobs could access data directly locally.
All the ports for the DataNodes to properly work were in the sg-datanode security group.

Client

The client was users’ window to the cluster. Spark 2.0 (before Summer 2016 it was Spark 1.6) was offered as the computational engine – only one. One reason was also easier administration and optimizatoin from my side.
The users could use the command line interface (CLI), RStudio or Zeppelin. Ambari Views as well, but that was running on Ambari instance. More advanced users went with the CLI, users who wanted to learn Spark were using Zeppelin.
Client for Storm was also installed on this instance. Due to more complex programming (in Java), all the Topologies were handled by me, the users were defining requirements and using the data stored by the Storm.
All the ports for the Client to properly work were in the sg-datanode security group.

See below for page 2.

Creating and adding a DataNode with multiple volumes

In this example I am adding a new DataNode with 3 volumes 200 GB, each.

The DataNode is created through the WebUI in the cloud and so are the 3 volumes. Each volume is attached to device in the following order:

volume01 – /dev/vdb
volume02 – /dev/vdc
volume03 – /dev/vdd

After the new “soon-to-be” DataNode instance has been created and volumes attached there is some work to be done in the command line interface:

  1. Use ssh to connect to the new DataNode instance.
    ssh -i .ssh/key w-datanode04
  2. Update and upgrade the system.
    sudo apt-get update -y && sudo apt-get upgrade -y
  3. Create the directories where the data for each volume for the DataNode will be stored.
    sudo mkdir -p /data/vol1 /data/vol2 /data/vol3
  4. Format file system for every device attached to every volume.
    sudo mkfs.ext4 /dev/vdb
    sudo mkfs.ext4 /dev/vdc
    sudo mkfs.ext4 /dev/vdd
  5. Mount the volumes to the respective directory.
    sudo mount /dev/vdb /data/vol1
    sudo mount /dev/vdc /data/vol2
    sudo mount /dev/vdd /data/vol3
  6. Label the volumes for easier future work.
    sudo e2label /dev/vdb "vol1"
    sudo e2label /dev/vdc "vol2"
    sudo e2label /dev/vdd "vol3"
  7. Open and update /etc/fstab.
    This is smart to do to keep the volumes mounted to the directories after the DataNode is restarted.

    LABEL=vol1 /data/vol1 ext4 defaults,nobootwait 0 0
    LABEL=vol2 /data/vol2 ext4 defaults,nobootwait 0 0
    LABEL=vol3 /data/vol3 ext4 defaults,nobootwait 0 0
    
  8. Check if volumes are mounted to correct directories.
    df -h

    Something like this should appear:

    /dev/vdb      197G      241M      187G      1%      /data/vol1
    /dev/vdc      197G      299M      187G      1%      /data/vol2
    /dev/vdd      197G        65M      187G      1%      /data/vol3

  9. For future reference, you can check the size of all monuted folders under directory /data.
    sudo du -hs /data/vol*

    Something similar to this should be in the output.
    The used disk information in the below example shows data after some files have been done to the HDFS. Immidiately after the Datanode is added to the Hadoop claster, the DataNode holds no filesblocks.

    181M     /data/vol1
    240M    /data/vol2
    5.4M     /data/vol3

Now the DataNode with multiple volumes is ready to be added to the cluster.

It is important to change the property dfs.datanode.data.dir in hdfs-default.xml. Or if you are using Ambari: HDFS -> Configs -> Settings and on the right side, you find the first property under DataNode to be “DataNode directories”.

Note: if you are adding new DataNodes with new DataNodes directories, it is smart to first append the new directories to the existing ones (comma separated, no spaces) and after the DataNodes are added, then remove the old directories.
If there is a directory in this property that does not exist, HDFS will ignore it and will not fail.

How to add a DataNode to a cluster with Ambari is described here.

Removing Datanode(s) from a cluster with Ambari

Datanodes come and go.
Proper removal of a Datanode is important, otherwise you might end up with missing blocks or unconsolidated Ambari meta database.
In this post, I am removing 2 Datanodes from my cluster. Before doing this, I have to know the replication factor (3 in my case) and the number of Datanodes left after the decommission. This is important because of the following rule:

If replication factor is higher than the number of existing datanodes after the removal, the removal process is not going to succeed!

In my case, I have 5 datanodes in the cluster, so I am going to be left with 3 Datanodes after removing 2.

Following procedure is needed to remove Datanode(s) properly:

  1. From the Hosts list, check the Datanodes you wish to remove.
  2. Decommission Nodemanagers
    decomission nodemanager
  3. Decommision Datanodesdecommission datanode
  4. Stop Ambari Metrics on each Datanode
    ambari metrics stop
  5. While on the same page, stop all components
    stop all componentsRepeat this for every Datanode.Click OK on the confirmation window.
    stop all components confirmation
    Background Operations window informs you when the components are stopped.
    status report
  6. Stop the Ambari Agent.
    Log in to each to-be-removed Datanode and stop the Ambari Agent.

    sudo ambari-agent stop
  7. Delete Host
    While on the same page, from the Host Actions menu, choose Delete Host
    delete host
    Repeat this for every Datanode.
  8. Restart HDFS and YARN services from Ambari.
    The soon to be removed Datanode(s) are in decommissioned status and will remain in it until HDFS service is restarted. Ambari reminds you to restart HDFS and YARN.HDFS status

The unwanted Datanodes have now been removed. I have a cluster with 3 Datanodes after removing 2.

Where are HDFS files in Linux?

In this post, I take an example file in HDFS, run filecheck to find locations of file’s block replications, file’s block pool ID and block ID. This information will help me locate the file’s block on local filesystem on one of the DataNodes.

In second part, I alter the file on local filesystem (from HDFS standpoint, it is a block). This results in Namenode defining the block as corrupted and new replication is created on another DataNode.

HDFS

Show details of the example file in HDFS:

hadoop fs -ls  /tmp/test_spark.csv

Output:

-rw-r–r–   3 ubuntu hdfs   56445434 2016-03-06 18:17 /tmp/test_spark.csv

Run tail on the file:

hadoop fs -tail  /tmp/test_spark.csv

The output is this:

804922,177663.1,793945.2,”factor_1_10000″,”factor_2_10000″
93500,378660.1,120037.2,”factor_1_10000″,”factor_2_10000″
394490,149354.1,253562.2,”factor_1_10000″,”factor_2_10000″
253001,446918.1,602891.2,”factor_1_10000″,”factor_2_10000″
196553,945027.1,97370.2,”factor_1_10000″,”factor_2_10000″
83715,56758.1,888537.2,”factor_1_10000″,”factor_2_10000″
593831,369048.1,844320.2,”factor_1_10000″,”factor_2_10000″
721077,109160.1,604853.2,”factor_1_10000″,”factor_2_10000″
383946,111066.1,779658.2,”factor_1_10000″,”factor_2_10000″
461973,695670.1,596577.2,”factor_1_10000″,”factor_2_10000″
70845,360039.1,479357.2,”factor_1_10000″,”factor_2_10000″
813333,839700.1,568456.2,”factor_1_10000″,”factor_2_10000″
967549,721770.1,998214.2,”factor_1_10000″,”factor_2_10000″
919219,466408.1,583846.2,”factor_1_10000″,”factor_2_10000″
977914,169416.1,412922.2,”factor_1_10000″,”factor_2_10000″
739637,25221.1,626499.2,”factor_1_10000″,”factor_2_10000″
223358,918445.1,337362.2,”factor_1_10000″,”factor_2_10000″

I run filecheck:

hdfs fsck /tmp/test_spark.csv -files -blocks -locations

The output is:

Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:18:44 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s):  OK
BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.103:50010,DS-1c68e4c7-d424-47e8-b7cc-941198fe2415,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]

Status: HEALTHY
Total size:    56445434 B
Total dirs:    0
Total files:   1
Total symlinks:                0
Total blocks (validated):      1 (avg. block size 56445434 B)
Minimally replicated blocks:   1 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     3.0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Number of data-nodes:          4
Number of racks:               1
FSCK ended at Sun Mar 06 18:18:44 CET 2016 in 1 milliseconds
The filesystem under path ‘/tmp/test_spark.csv’ is HEALTHY

The file is stored in one block ( dfs.blocksize is by default 134217728).
Replication factor is 3 (default) and the block can be found on the following DataNodes: 10.0.XXX.103, 10.0.XXX.104, 10.0.XXX.105

BP-1553412973-10.0.160.75-1456844185620 - Block Pool ID
blk_1073741903_1079 - Block ID

Linux

Now I can look for the file in Linux.

I connect to one of the datanodes that was given in the output of hadoop filecheck command.

ssh -i .ssh/key 10.0.XXX.103

Property dfs.datanode.data.dir in hdfs-default.xml, if you are manually administrating the cluster, or, in Ambari, HDFS -> Configs -> Settings -> DataNode -> DataNode directories, tells us where on the local filesystem the DataNode should store its blocks.

Default is /hadoop/hdfs/data.

If I list details of the file:

sudo -u hdfs ls -l /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

The output is the following:

-rw-r–r– 1 hdfs hadoop 56445434 Mar  6 18:17 /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

The size of the file is the same as when listing the file using hadoop fs -ls earlier (one block for this file).

Now I run tail on this file:

sudo -u hdfs tail /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

Result:

721077,109160.1,604853.2,”factor_1_10000″,”factor_2_10000″
383946,111066.1,779658.2,”factor_1_10000″,”factor_2_10000″
461973,695670.1,596577.2,”factor_1_10000″,”factor_2_10000″
70845,360039.1,479357.2,”factor_1_10000″,”factor_2_10000″
813333,839700.1,568456.2,”factor_1_10000″,”factor_2_10000″
967549,721770.1,998214.2,”factor_1_10000″,”factor_2_10000″
919219,466408.1,583846.2,”factor_1_10000″,”factor_2_10000″
977914,169416.1,412922.2,”factor_1_10000″,”factor_2_10000″
739637,25221.1,626499.2,”factor_1_10000″,”factor_2_10000″
223358,918445.1,337362.2,”factor_1_10000″,”factor_2_10000″

Output of tail matches the output of tail ran with hadoop fs command.

 

Changing the file in Linux

If I open this file for editing:

sudo -u hdfs vi /hadoop/hdfs/data/current/BP-1553412973-10.0.160.75-1456844185620/current/finalized/subdir0/subdir0/blk_1073741903

and change it. The file disappears from the parent folder.

 

Filecheck in HDFS

Now I run filecheck on the same file again:

hdfs fsck /tmp/test_spark.csv -files -blocks -locations

The output is the following:

Connecting to namenode via http://w-namenode1.domain.com:50070/fsck?ugi=ubuntu&files=1&blocks=1&locations=1&path=%2Ftmp%2Ftest_spark.csv
FSCK started by ubuntu (auth:SIMPLE) from /10.0.XXX.75 for path /tmp/test_spark.csv at Sun Mar 06 18:34:41 CET 2016
/tmp/test_spark.csv 56445434 bytes, 1 block(s):  OK

BP-1553412973-10.0.160.75-1456844185620:blk_1073741903_1079 len=56445434 repl=3 [DatanodeInfoWithStorage[10.0.XXX.102:50010,DS-db55f66a-e6b6-480a-87bf-2053fbed2960,DISK], DatanodeInfoWithStorage[10.0.XXX.105:50010,DS-26bc20ee-68d8-423b-b707-26ae6e986562,DISK], DatanodeInfoWithStorage[10.0.XXX.104:50010,DS-76aaea28-2822-4982-8602-f5db3c47d3fd,DISK]]

Status: HEALTHY
Total size:    56445434 B
Total dirs:    0
Total files:   1
Total symlinks:                0
Total blocks (validated):      1 (avg. block size 56445434 B)
Minimally replicated blocks:   1 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     3.0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Number of data-nodes:          4
Number of racks:               1
FSCK ended at Sun Mar 06 18:34:41 CET 2016 in 1 milliseconds
The filesystem under path ‘/tmp/test_spark.csv’ is HEALTHY

File is still replicated 3 times on 3 DataNodes, but this time on DataNodes 10.0.XXX.102,  10.0.XXX.104, 10.0.XXX.105.

The output shows that one replication is not on datanode with IP 10.0.XXX.103 anymore. That was the datanode I connected to temper with the file.

NameNode has identified that the block is corrupted and has created a new replica of the block.

 

Adding new DataNode to the cluster using Ambari

I am going to add one DataNode to my existing cluster. This is going to be done in Ambari. My Hadoop ditribution is Hortonworks.

Work on the node

Adding new node to the cluster affects all the existing nodes – they should know about the new node and the new node should know about the existing nodes. In this case, I am using /etc/hosts to keep nodes “acquainted” with each other.

My only source of truth for /etc/hosts is on Ambari server. From there I run scripts that update the /etc/hosts file on other nodes.

  1.  Open the file.
    sudo vi /etc/hosts
  2. Add a new line to it and save the file. In Ubuntu, this takes immediate effect.

    10.0.XXX.XX     t-datanode02.domain       t-datanode02

  3. Running the script to update the cluster.
    As per now, I have one line per node in the script, as shown below. it is on my to-do list to create a loop that would read from original /etc/hosts and update the cluster.
    So the following line is added to the existing lines in the script.

    cat /etc/hosts | ssh ubuntu@t-datanode02 -i /home/ubuntu/.ssh/key "sudo sh -c 'cat > /etc/hosts'";
  4. Updating the system on the new node
    I tend to run this from Ambari. If multiple nodes are added, I run a script.

    ssh -i /home/ubuntu/.ssh/key ubuntu@t-datanode02 'sudo apt-get update -y && sudo apt-get upgrade -y'
  5. Adjusting maximum number of open files and processes.
    Since this is a DataNode we are adding, number of open files and processes has to be adjusted.
    Open the limits.conf file on the node.

    sudo vi /etc/security/limits.conf
  6. Add the following two lines at the end of the file

    *                –       nofile          32768
    *                –       nproc           65536

  7. Save the file, exit the CLI and log in again.
  8. The changes can be seen by typing the following command.
    ulimit -a

    Output is the following:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 257202
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 32768
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 65536
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

Work from Ambari

  1. Log in to Ambari, click on Hosts and choose Add New Hosts from the Actions menu.
    ambari-add-new-host
  2. In step Install Options, add the node that is soon to become a DataNode.
    Hortonworks warns against using anything than FQDN as Target Hosts!

    If multiple nodes are added in this step, they can be written one per line. If there is a numerical pattern in the names of the nodes , Pattern Expressions can be used.
    Example nodes:
    datanode01
    datanode02
    datanode03
    Writing this in one line with Pattern Expressions:
    datanode[01-03]
    Worry not, Ambari will ask you to confirm the host names if you have used Pattern Expressions:ambari-pattern-expression-example
    (This is a print screen from one of my earlier cluster installations)Private key has to be defined and SSH User Account is by default root, but that will not work. In my case, I am using Ubuntu, so the user is ubuntu.
    ambari-new-host-install-options
    Now I can click Register and Confirm.
  3. In the Confirm Hosts step, Ambari server connects to the new node using SSH, it registers the new node to the cluster and installs Ambari Agent in order to keep control over it.Registering phase:
    ambari-new-host-registering-status
    New node has been registered successfully:
    ambari-new-host-success-status
    If anything else but this message is shown, click on the link to check the results. The list of checks performed is shown and everything should be in order before continuing (Earlier versions had a problem if ntpd or snappy was not installed/started, for example).
    ambari-new-host-check-passed
    All good in the hood here so I can continue with the installation.
  4. In step Assign Slaves and Clients I define my node to be a DataNode and has a NodeManager installed as well (if you are running Apache Storm, Supervisor is also an option).
    ambari-new-host-assign-slaves-clientsClick next.
  5. In step Configurations, there is not much to do, unless you operate with more than one Configuration Group.
    ambari-new-host-configurationsClick Next.
  6. In step Review, one can just doublecheck if everything is as planned.
    Click deploy if everything is as it should be.
  7. Step Install, Start and Test is the last step. After everything is installed, new DataNode has joined the cluster.Here is how this should look like:
    ambari-new-host-install-successClick Next.
  8. Final step – Summary – gives a status update.ambari-new-host-summaryClick on Complete and list of installed Hosts will load.