Preparing the environment
Create a folder where the benchmark result files are saved:
sudo -u hdfs mkdir /home/hdfs/benchmark
Give everyone access (only needed if several users will run benchmark tests; otherwise skip this step and run the commands as the hdfs user):
sudo -u hdfs chmod 777 /home/hdfs/benchmark
About TestDFSIO benchmark test
The TestDFSIO program is in the jar file /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar.
The TestDFSIO benchmark measures I/O (read/write) performance. It does this by using a MapReduce job to read and write files in parallel, so a working MapReduce setup is required.
The benchmark test uses one map task per file.
By invoking the benchmark with no arguments
yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO
usage instructions are shown:
Missing arguments. Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
Arguments
Defining result file (-resFile)
By using argument -resFile, the file location and name are defined for the results of the tests.
Example:
-resFile /home/hdfs/benchmark/TestDFSIOwrite
If this argument is not given, the result file is written to the current directory under the name TestDFSIO_results.log.
If the argument is pointing to an existing file, the result is appended to it. Here follows an example of a result file after 2 tests have been run:
----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607
----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:47:23 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 15.332720024532351
Average IO rate mb/sec: 16.875974655151367
IO rate std deviation: 3.72817574426085
Test exec time sec: 25.766
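Because every run is appended to the same file, it can be handy to pull a single metric out of all recorded runs. The following is a minimal sketch, assuming each result contains a line of the form "Throughput mb/sec: <number>" as in the example above; the function name dfsio_throughputs is made up for this illustration.

```shell
# Print the throughput value of every test recorded in a TestDFSIO result file.
# Assumption: each result has a line "Throughput mb/sec: <number>".
# Usage: dfsio_throughputs /home/hdfs/benchmark/TestDFSIOwrite
dfsio_throughputs() {
    # Split each line on the colon (plus any following spaces) and
    # print the value part of the throughput lines only.
    awk -F': *' '/Throughput mb\/sec/ { print $2 }' "$1"
}
```

For the example file above, this would print one throughput value per line, one for each of the two runs.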
Write test (-write)
The -write argument tells the program to test writing to the cluster. It is convenient to run this before the -read test, so that files are available for the read test.
The written files are located in HDFS under /benchmarks, in the folder TestDFSIO. If the write test is run and the TestDFSIO folder already exists, the folder is deleted first.
Read test (-read)
The -read argument tells the program to test reading from the cluster. It is convenient to run the test with -write first, so that files are available for the read test.
If the read test is run before any write test, an error message like this shows up:
16/06/20 11:21:32 INFO mapreduce.Job: Task Id : attempt_1463992963604_0028_m_000005_0, Status : FAILED Error: java.io.FileNotFoundException: File does not exist: /benchmarks/TestDFSIO/io_data/test_io_5
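One way to avoid this error is to check for the test data before starting the read test. The sketch below assumes the data directory /benchmarks/TestDFSIO/io_data seen in the error message; the function name run_read_test is hypothetical.

```shell
# Run the read test only if data from a previous write test exists.
JAR=/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
DATA_DIR=/benchmarks/TestDFSIO/io_data   # assumed from the error message above

run_read_test() {
    # "hdfs dfs -test -e" exits with status 0 when the path exists.
    if hdfs dfs -test -e "$DATA_DIR"; then
        yarn jar "$JAR" TestDFSIO -read "$@"
    else
        echo "No test data in $DATA_DIR; run the -write test first." >&2
        return 1
    fi
}
```

It would be called with the remaining arguments of the read test, for example run_read_test -nrFiles 10 -size 10MB.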
Number of files (-nrFiles)
The argument defines the number of files used in the test. For a write test, it is the number of output files; for a read test, it is the number of input files.
File size (-size)
The argument defines the size of each file used in the test. It takes a numerical value with an optional unit (B, KB, MB, GB or TB). MB is the default.
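To estimate how much data a test will process, the size value can be converted to megabytes and multiplied by the number of files. This is a rough sketch with a made-up helper, size_to_mb; it only handles MB, GB and TB, and assumes 1 GB counts as 1024 MB, which agrees with the 51200 MB reported for 50 files of 1GB in Test 5.

```shell
# Convert a TestDFSIO -size value (for example 10MB or 1GB) to whole megabytes.
# A bare number is treated as MB, matching the default.
size_to_mb() {
    value=${1%%[A-Za-z]*}      # numeric part, e.g. "10" from "10MB"
    unit=${1#"$value"}         # unit suffix, may be empty
    case "$unit" in
        ""|MB) echo "$value" ;;
        GB)    echo $((value * 1024)) ;;
        TB)    echo $((value * 1024 * 1024)) ;;
        *)     echo "unit $unit is not handled in this sketch" >&2; return 1 ;;
    esac
}

# Total volume for 50 files of 1GB each, in MB.
echo $(( 50 * $(size_to_mb 1GB) ))
```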
Remove previous test data (-clean)
The argument deletes the output directory /benchmarks/TestDFSIO in HDFS.
Command:
sudo -u hdfs yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
Example of an output:
INFO fs.TestDFSIO: TestDFSIO.1.8
16/06/20 11:16:54 INFO fs.TestDFSIO: nrFiles = 1
16/06/20 11:16:54 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
16/06/20 11:16:54 INFO fs.TestDFSIO: bufferSize = 1000000
16/06/20 11:16:54 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
16/06/20 11:16:55 INFO fs.TestDFSIO: Cleaning up test files
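The write, read and clean steps described above can be chained into one cycle. The following is only a sketch: the function names dfsio and run_cycle are invented here, the result file path is the one used in the examples, and it assumes the calling user is allowed to run all three steps.

```shell
# Run a write test, then a read test on the same data, then clean up.
JAR=/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
RES=/home/hdfs/benchmark/TestDFSIO

dfsio() {
    yarn jar "$JAR" TestDFSIO "$@"
}

# $1 = number of files, $2 = file size (for example 10MB)
run_cycle() {
    # Each step runs only if the previous one succeeded.
    dfsio -write -nrFiles "$1" -size "$2" -resFile "$RES" &&
    dfsio -read  -nrFiles "$1" -size "$2" -resFile "$RES" &&
    dfsio -clean
}
```

A call such as run_cycle 10 10MB would append one write result and one read result to the result file before removing the test data.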
TestDFSIO tests
Test 1: write 10 files, size 10MB
Write 10 files of 10MB each and put the results in /home/hdfs/benchmark/TestDFSIOwrite.
yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIOwrite
Result:
----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607
Test 2: read 10 files, size 10MB
yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIO
Result:
----- TestDFSIO ----- : read
Date & time: Mon Jun 20 10:49:24 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 68.87052341597796
Average IO rate mb/sec: 170.9973602294922
IO rate std deviation: 135.61526628958586
Test exec time sec: 39.793
Test 3: write 50 files, size 100MB
yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO
Result:
----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:22:00 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 3.393228337799953
Average IO rate mb/sec: 4.491838455200195
IO rate std deviation: 3.903713550708894
Test exec time sec: 65.042
Running the following command on one file written in the benchmark test:
hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0
Returns (selection):
Average block replication: 3.0
Number of data-nodes: 4
Test 4: write 50 files, size 100MB, replication factor 2
The cluster’s default replication factor is 3. In this case, we run the test with the -D argument and change the replication factor to 2.
yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -D dfs.replication=2 -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO
Result:
----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:39:17 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 8.495729196932702
Average IO rate mb/sec: 11.038678169250488
IO rate std deviation: 8.691344876093968
Test exec time sec: 43.378
Running the following command on one file written in the benchmark test:
hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0
Returns (selection):
Average block replication: 2.0
Number of data-nodes: 4
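Tests 3 and 4 differ only in the replication factor, so their throughput values can be compared directly. The snippet below is a small sketch of that arithmetic using the two reported numbers; the helper name ratio is made up, and since each test was run only once, the result is a rough indication rather than a measured speed-up.

```shell
# Print the ratio of two throughput values with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Throughput with replication factor 2 (Test 4) relative to factor 3 (Test 3).
ratio 8.495729196932702 3.393228337799953   # prints 2.50
```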
Test 5: write 50 files, size 1GB, time
time yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 1GB -resFile /home/hdfs/benchmark/TestDFSIO
Result:
----- TestDFSIO ----- : write
Date & time: Mon Jun 20 16:59:56 CEST 2016
Number of files: 50
Total MBytes processed: 51200.0
Throughput mb/sec: 3.3432171543606457
Average IO rate mb/sec: 3.805697441101074
IO rate std deviation: 1.8925752533465978
Test exec time sec: 421.994
Output (time):
real 7m11.604s
user 0m21.056s
sys 0m2.757s
The relevant metric here is real, the elapsed wall-clock time.
In some cases, running the command under time is not needed, since the execution time is already in the test report or output.
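To compare the real value with the "Test exec time sec" line of the report, the time output first has to be converted to seconds. This is a minimal sketch with a hypothetical helper, to_seconds, which assumes the NmS.SSSs format shown above.

```shell
# Convert a value printed by "time", such as 7m11.604s, to seconds.
to_seconds() {
    # Split on "m" and "s": field 1 is the minutes, field 2 the seconds.
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

to_seconds 7m11.604s   # prints 431.604
```

For Test 5 this gives 431.604 seconds against a reported execution time of 421.994 seconds, so the two numbers are close but not identical.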
The MRbench test benchmarks how responsive a cluster is to small jobs.
Can you describe the architecture of the datanodes in more detail, and say whether the performance you reported would be considered “good” for this HDFS architecture? Did the number of map tasks exceed the maximum supported on this cluster?
Hi Craig!
These benchmarking results were used for comparison with another Hadoop cluster, so I can't give you any concrete answers. This was also done at my previous job.
Very helpful, Thanks!