Hadoop Benchmark test – TestDFSIO

Preparing the environment

Create a folder where the benchmark result files are saved:

sudo -u hdfs mkdir /home/hdfs/benchmark

Give everyone access if several users will run benchmark tests; otherwise skip this step and run the commands as the hdfs user:

sudo -u hdfs chmod 777 /home/hdfs/benchmark

About TestDFSIO benchmark test

The TestDFSIO program is in the jar file /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar.

The TestDFSIO benchmark measures HDFS I/O (read/write) performance. It does this by running a MapReduce job that reads or writes files in parallel, so a working MapReduce setup is required.

The benchmark test uses one map task per file.

By invoking the benchmark with no arguments

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO

usage instructions are shown:

Missing arguments.

Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
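The commands in this post all follow the same pattern, so they can be assembled programmatically. The following is only a sketch: the helper name build_testdfsio_command is made up for illustration, it covers just the flags used later in this post (-nrFiles, -size, -resFile), and it only builds the argument list without running anything.

```python
import shlex

# Jar path as used throughout this post; adjust it for your own installation.
JAR = "/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"

# Test modes listed in the usage message above.
MODES = ("read", "write", "append", "truncate", "clean")


def build_testdfsio_command(mode, nr_files=None, size=None, res_file=None, jar=JAR):
    """Return the argument list for one TestDFSIO run (hypothetical helper).

    Nothing is executed here; the list can be handed to subprocess.run().
    """
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    cmd = ["yarn", "jar", jar, "TestDFSIO", f"-{mode}"]
    if nr_files is not None:
        cmd += ["-nrFiles", str(nr_files)]
    if size is not None:
        cmd += ["-size", size]
    if res_file is not None:
        cmd += ["-resFile", res_file]
    return cmd


command = build_testdfsio_command(
    "write", nr_files=10, size="10MB",
    res_file="/home/hdfs/benchmark/TestDFSIOwrite",
)
print(shlex.join(command))
```

Keeping the command as a list avoids quoting problems if a result file path ever contains spaces.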

Arguments

Defining result file (-resFile)

The -resFile argument sets the location and name of the local file that receives the test results.

Example:

-resFile /home/hdfs/benchmark/TestDFSIOwrite

If this argument is not given, the result file is written to the current directory under the name TestDFSIO_results.log.

If the argument points to an existing file, the result is appended to it. Here is an example of a result file after two test runs:

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:47:23 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 15.332720024532351
Average IO rate mb/sec: 16.875974655151367
IO rate std deviation: 3.72817574426085
Test exec time sec: 25.766
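Because results are appended, one file can hold many runs. A minimal sketch of how such a file could be split into one record per run is shown here. The function name parse_results is made up for this example, and the parser keys on the word TestDFSIO in the header line rather than on the exact dash characters around it.

```python
import re

# A header line contains "TestDFSIO", some decoration, a colon and the test type.
HEADER = re.compile(r"TestDFSIO\s+\S+\s*:\s*(\w+)")


def parse_results(text):
    """Split the content of a TestDFSIO result file into one dict per run."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.search(line)
        if header:
            # Start a new record for every header line.
            current = {"test": header.group(1)}
            entries.append(current)
        elif current is not None and ":" in line:
            # "Date & time: Sun Jun 19 17:39:20 ..." -> split on the first colon only.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return entries


sample = """\
----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:47:23 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 15.332720024532351
Average IO rate mb/sec: 16.875974655151367
IO rate std deviation: 3.72817574426085
Test exec time sec: 25.766
"""

for entry in parse_results(sample):
    print(entry["test"], entry["Date & time"], entry["Throughput mb/sec"])
```

All values are kept as strings; convert them with float() where numbers are needed.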

Write test (-write)

The -write argument tells the program to test writing to the cluster. Run it before a -read test, so that the files the read test needs already exist.

The written files are stored in HDFS in the folder /benchmarks/TestDFSIO. If that folder already exists when the write test starts, it is deleted first.

Read test (-read)

The -read argument tells the program to test reading from the cluster. Run a test with -write first, so that the input files for the read test exist.

If the read test is run before any write test, an error message like this shows up:

16/06/20 11:21:32 INFO mapreduce.Job: Task Id : attempt_1463992963604_0028_m_000005_0, Status : FAILED Error: java.io.FileNotFoundException: File does not exist: /benchmarks/TestDFSIO/io_data/test_io_5
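To avoid starting a read job that is bound to fail, the input files can be checked beforehand. The sketch below is an assumption-laden example, not part of TestDFSIO: the helper names are made up, and the file naming test_io_0 to test_io_(N-1) is inferred from the paths that appear in this post. It relies on `hdfs dfs -test -e`, which exits with status 0 when a path exists.

```python
import subprocess

# Directory taken from the error message above.
IO_DATA_DIR = "/benchmarks/TestDFSIO/io_data"


def expected_input_files(nr_files):
    """Paths the read test is assumed to open for the given -nrFiles value."""
    return [f"{IO_DATA_DIR}/test_io_{i}" for i in range(nr_files)]


def hdfs_path_exists(path, run=subprocess.run):
    """Return True if the path exists in HDFS.

    `hdfs dfs -test -e` exits with status 0 when the path exists.
    The `run` parameter allows a stand-in to be injected for testing.
    """
    return run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0


def missing_input_files(nr_files, exists=hdfs_path_exists):
    """List the expected input files that are not present yet."""
    return [path for path in expected_input_files(nr_files) if not exists(path)]
```

On a cluster node, calling missing_input_files(10) before Test 2 should return an empty list once the matching write test has completed.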

Number of files (-nrFiles)

The argument defines the number of files used in the test. For a write test, it is the number of output files. For a read test, it is the number of input files.

File size (-size)

The argument defines the size of each file used in the test. It takes a numerical value with an optional unit suffix (B, KB, MB, GB or TB). The default unit is MB.
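The total amount of data in a test is -nrFiles multiplied by -size. The sketch below shows that arithmetic. The helper names are made up, only whole numbers are accepted, and the units are assumed to be multiples of 1024, which is consistent with Test 5 further down, where 50 files of 1GB are reported as 51200 MB.

```python
import re

# Assumption: units are powers of 1024 (50 x 1GB is reported as 51200 MB in Test 5).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def size_to_bytes(size):
    """Convert a -size value such as '10MB' or '1GB' to bytes (MB if no unit)."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|KB|MB|GB|TB)?\s*", size, re.IGNORECASE)
    if not match:
        raise ValueError(f"cannot parse size: {size!r}")
    number, unit = match.groups()
    return int(number) * UNITS[(unit or "MB").upper()]


def total_megabytes(nr_files, size):
    """Total data volume of one run in MB: number of files times file size."""
    return nr_files * size_to_bytes(size) / UNITS["MB"]


for nr_files, size in [(10, "10MB"), (50, "100MB"), (50, "1GB")]:
    print(nr_files, size, total_megabytes(nr_files, size))
```

The three combinations are the ones used in the tests in this post, so the printed totals can be compared with the "Total MBytes processed" lines in the results.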

Remove previous test data (-clean)

The argument deletes the output directory /benchmarks/TestDFSIO in HDFS.

Command:

sudo -u hdfs yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

Example output:

INFO fs.TestDFSIO: TestDFSIO.1.8
16/06/20 11:16:54 INFO fs.TestDFSIO: nrFiles = 1
16/06/20 11:16:54 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
16/06/20 11:16:54 INFO fs.TestDFSIO: bufferSize = 1000000
16/06/20 11:16:54 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
16/06/20 11:16:55 INFO fs.TestDFSIO: Cleaning up test files

TestDFSIO tests

Test 1: write 10 files, size 10MB

Write 10 files, each 10MB in size, and put the results in /home/hdfs/benchmark/TestDFSIOwrite.

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIOwrite

Result:

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607

Test 2: read 10 files, size 10MB

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : read
Date & time: Mon Jun 20 10:49:24 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 68.87052341597796
Average IO rate mb/sec: 170.9973602294922
IO rate std deviation: 135.61526628958586
Test exec time sec: 39.793
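The reported "Throughput mb/sec" is not the same as the total data divided by the job's execution time. As far as I understand the TestDFSIO source, the throughput figure is computed from the summed I/O times of the individual map tasks, so it reads more like a per-task rate. The sketch below only does the simple division for Tests 1 and 2, using the numbers from their reports. The function name wall_clock_rate is made up for this example.

```python
def wall_clock_rate(total_mbytes, exec_time_sec):
    """Data volume divided by elapsed job time, in MB per second."""
    return total_mbytes / exec_time_sec


# (total MB, test exec time in seconds, reported throughput) from the reports above.
reports = {
    "Test 1 (write)": (100.0, 30.607, 17.13796058269066),
    "Test 2 (read)": (100.0, 39.793, 68.87052341597796),
}

for name, (total_mb, exec_sec, reported) in reports.items():
    rate = wall_clock_rate(total_mb, exec_sec)
    print(f"{name}: {rate:.2f} MB/s over the whole job, reported throughput {reported:.2f}")
```

With files this small, job startup takes up most of the execution time, so the whole-job rate says little about the disks. It becomes more meaningful with larger files.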

Test 3: write 50 files, size 100MB

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:22:00 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 3.393228337799953
Average IO rate mb/sec: 4.491838455200195
IO rate std deviation: 3.903713550708894
Test exec time sec: 65.042

Running the following command on one file written in the benchmark test:

hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0

Returns (selection):

Average block replication: 3.0
Number of data-nodes: 4

Test 4: write 50 files, size 100MB, replication factor 2

The cluster’s default replication factor is 3. In this test, we pass -D dfs.replication=2 as a generic option to lower the replication factor to 2.

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -D dfs.replication=2 -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:39:17 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 8.495729196932702
Average IO rate mb/sec: 11.038678169250488
IO rate std deviation: 8.691344876093968
Test exec time sec: 43.378

Running the following command on one file written in the benchmark test:

hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0

Returns (selection):

Average block replication: 2.0
Number of data-nodes: 4
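Tests 3 and 4 differ only in the replication factor, so their reports can be compared directly. The sketch below does that arithmetic with the numbers from the two reports. The helper names are made up, and each test was run only once here, so the ratios should be read as a rough indication rather than a measured effect.

```python
def ratio(new, old):
    """How many times larger `new` is than `old`."""
    return new / old


def percent_reduction(new, old):
    """Reduction from `old` to `new`, as a percentage of `old`."""
    return (old - new) / old * 100


# Values copied from the result reports of Test 3 and Test 4.
test3 = {"replication": 3, "throughput": 3.393228337799953, "exec_time": 65.042}
test4 = {"replication": 2, "throughput": 8.495729196932702, "exec_time": 43.378}

speedup = ratio(test4["throughput"], test3["throughput"])
saved = percent_reduction(test4["exec_time"], test3["exec_time"])

print(f"Throughput with replication 2 is {speedup:.2f}x the value with replication 3")
print(f"Execution time is {saved:.1f}% shorter")
```

Repeating each test a few times and comparing averages would give a more reliable picture, since identical runs can differ noticeably, as the two write runs in the result file example show.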

Test 5: write 50 files, size 1GB, time

time yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 1GB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : write
Date & time: Mon Jun 20 16:59:56 CEST 2016
Number of files: 50
Total MBytes processed: 51200.0
Throughput mb/sec: 3.3432171543606457
Average IO rate mb/sec: 3.805697441101074
IO rate std deviation: 1.8925752533465978
Test exec time sec: 421.994

Output (time):

real 7m11.604s
user 0m21.056s
sys 0m2.757s

The relevant metric here is real, the elapsed wall-clock time.
In many cases, running the command with time is not needed, since the execution time is already in the test report.
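The real value from time and the "Test exec time sec" value from the report can be compared once both are in seconds. A small sketch follows, assuming the default bash output format for time, such as 7m11.604s. The function name real_time_to_seconds is made up for this example.

```python
import re


def real_time_to_seconds(value):
    """Convert a bash `time` value such as '7m11.604s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if not match:
        raise ValueError(f"unexpected time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


real_sec = real_time_to_seconds("7m11.604s")  # "real" line from the time output
report_sec = 421.994                          # "Test exec time sec" from the report

print(f"real: {real_sec:.3f} s, report: {report_sec:.3f} s, "
      f"difference: {real_sec - report_sec:.3f} s")
```

The difference of a few seconds is time spent outside the measured test, presumably client startup and result handling.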

The MRbench test benchmarks how responsive a cluster is to small jobs.

4 thoughts on “Hadoop Benchmark test – TestDFSIO”

Craig Tierney says:

31/01/2017 at 1:15 am

Can you describe more about the architecture of the datanodes and if the performance you reported would be considered “good” for this HDFS architecture? Did the number of map tasks exceed the maximum amount supported on this cluster?

1. markobigdata says:
  
  01/02/2017 at 3:41 pm
  
  Hi Craig!
  These benchmarking results were used for comparison with another Hadoop cluster, so I can't give you any concrete answers. And this was done in my previous job.
  
Anonymous says:

03/04/2018 at 6:21 pm

Very helpful, Thanks!

Maria says:

03/01/2023 at 4:30 am

Thank you for writing this

markobigdata

Big Data documentation in a blog
