Creating and adding a DataNode with multiple volumes

In this example I am adding a new DataNode with 3 volumes 200 GB, each.

The DataNode is created through the WebUI in the cloud and so are the 3 volumes. Each volume is attached to device in the following order:

volume01 – /dev/vdb
volume02 – /dev/vdc
volume03 – /dev/vdd

After the new “soon-to-be” DataNode instance has been created and volumes attached there is some work to be done in the command line interface:

  1. Use ssh to connect to the new DataNode instance.
    ssh -i .ssh/key w-datanode04
  2. Update and upgrade the system.
    sudo apt-get update -y && sudo apt-get upgrade -y
  3. Create the directories where the data for each volume for the DataNode will be stored.
    sudo mkdir -p /data/vol1 /data/vol2 /data/vol3
  4. Format file system for every device attached to every volume.
    sudo mkfs.ext4 /dev/vdb
    sudo mkfs.ext4 /dev/vdc
    sudo mkfs.ext4 /dev/vdd
  5. Mount the volumes to the respective directory.
    sudo mount /dev/vdb /data/vol1
    sudo mount /dev/vdc /data/vol2
    sudo mount /dev/vdd /data/vol3
  6. Label the volumes for easier future work.
    sudo e2label /dev/vdb "vol1"
    sudo e2label /dev/vdc "vol2"
    sudo e2label /dev/vdd "vol3"
  7. Open and update /etc/fstab.
    This is smart to do to keep the volumes mounted to the directories after the DataNode is restarted.

    LABEL=vol1 /data/vol1 ext4 defaults,nobootwait 0 0
    LABEL=vol2 /data/vol2 ext4 defaults,nobootwait 0 0
    LABEL=vol3 /data/vol3 ext4 defaults,nobootwait 0 0
    
  8. Check if volumes are mounted to correct directories.
    df -h

    Something like this should appear:

    /dev/vdb      197G      241M      187G      1%      /data/vol1
    /dev/vdc      197G      299M      187G      1%      /data/vol2
    /dev/vdd      197G        65M      187G      1%      /data/vol3

  9. For future reference, you can check the size of all monuted folders under directory /data.
    sudo du -hs /data/vol*

    Something similar to this should be in the output.
    The used disk information in the below example shows data after some files have been done to the HDFS. Immidiately after the Datanode is added to the Hadoop claster, the DataNode holds no filesblocks.

    181M     /data/vol1
    240M    /data/vol2
    5.4M     /data/vol3

Now the DataNode with multiple volumes is ready to be added to the cluster.

It is important to change the property dfs.datanode.data.dir in hdfs-default.xml. Or if you are using Ambari: HDFS -> Configs -> Settings and on the right side, you find the first property under DataNode to be “DataNode directories”.

Note: if you are adding new DataNodes with new DataNodes directories, it is smart to first append the new directories to the existing ones (comma separated, no spaces) and after the DataNodes are added, then remove the old directories.
If there is a directory in this property that does not exist, HDFS will ignore it and will not fail.

How to add a DataNode to a cluster with Ambari is described here.

One thought on “Creating and adding a DataNode with multiple volumes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s