Hadoop Benchmark test – MRbench

The program runs a small job a number of times and checks whether small jobs are responsive. It is a complementary benchmark test to TeraSort.

The benchmark test creates a folder MRBench in /benchmarks on HDFS and generates an input file in the input folder. This input file holds one string per line. After the input file is created, it is split into files in the output folder; the number of files matches the value of the parameter reduces.
The job can be run many times, depending on the value of the parameter numRuns. The parameters are explained further down in the post.

In the Hortonworks distribution, the jar file needed for MRbench can be found under /usr/hdp/{hdp-version}/hadoop-mapreduce.

The cluster used for running tests for this post has HDP version 2.3.4.0-3485. One of my other clusters, running HDP 2.4, has version 2.4.0.0-169, for example.

Running the jar file with the --help argument

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench --help

returns all the available arguments:

MRBenchmark.0.0.2

Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>]
[-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>]
[-numRuns <number of times to run the job, default is 1>]
[-maps <number of maps for each run, default is 2>]
[-reduces <number of reduces for each run, default is 1>]
[-inputLines <number of input lines to generate, default is 1>]
[-inputType <type of input to generate, one of ascending (default), descending, random>]
[-verbose]

reduces defines the number of reduce tasks, which also determines the number of output files (the input data is split into as many files as the value of reduces).

numRuns defines the number of times the test is run. Each job starts with the following info:

INFO mapred.MRBench: Running job 0

The counter starts at 0 and ends at numRuns - 1.

inputLines defines the number of lines generated. These lines hold one number per line. If inputLines is 100, numbers from 000 to 099 are generated. The order in which they are generated is defined by inputType.

inputType defines the sort order of the generated numbers.

For example, if inputLines is 1000 and inputType is ascending (default), the input file contains the numbers from 0000 to 0999, one value per line.
The value descending for inputType sorts them from 0999 to 0000 in the input file.
The value random generates random numbers with no sorting. And trust me, they will be random.
Example:

6493553352002875669
-6076474133462114203
-4219424128611728137
3147428516996533652
8833283876026349807
-6231574853220588520
4464414501313572651
4107251190398906611
7209395140850842640
-8963297226854656877
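To make the three input types concrete, here is a minimal Python sketch that mimics the input file as described in this post. It is not the MRBench implementation; the function name generate_lines and the seed parameter are made up for illustration, and the padding width is assumed to be the number of digits in inputLines, which matches the 000 to 099 and 0000 to 0999 examples above.

```python
import random


def generate_lines(input_lines, input_type="ascending", seed=None):
    """Return the lines of an MRBench-style input file as described in the post.

    Assumption: numbers are zero-padded to the number of digits in input_lines.
    """
    width = len(str(input_lines))  # 100 -> 3 digits, 1000 -> 4 digits
    if input_type == "ascending":
        return [str(i).zfill(width) for i in range(input_lines)]
    if input_type == "descending":
        return [str(i).zfill(width) for i in range(input_lines - 1, -1, -1)]
    if input_type == "random":
        rng = random.Random(seed)
        # Signed 64-bit values, similar in shape to the sample shown above
        return [str(rng.randint(-2**63, 2**63 - 1)) for _ in range(input_lines)]
    raise ValueError(f"unknown input type: {input_type}")


if __name__ == "__main__":
    print(generate_lines(5, "ascending"))
    print(generate_lines(5, "descending"))
    print(generate_lines(3, "random", seed=42))
```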

The program can be run with no arguments as well, as Test 1 shows.

Test 1: No arguments

The default values of the arguments are the following:

Argument        Default value
numRuns         1
maps            2
reduces         1
inputLines      1
inputType       ascending

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench

Output (with my comments):

/*arguments maps and reduces*/
Job Counters
                Launched map tasks=2
                Launched reduce tasks=1

/*arguments inputLines, maps and reduces, and average time*/
DataLines       Maps    Reduces AvgTime (milliseconds)
1               2       1       28846

Average time to finish the job was almost 29 seconds.
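When the benchmark is run repeatedly, it can be handy to pull the average time out of the console output automatically. The following Python sketch parses the summary table shown above; the function name parse_mrbench_summary and the dictionary keys are my own choice, and the sketch assumes the header and value lines look exactly like in this post.

```python
def parse_mrbench_summary(output):
    """Find the MRBench summary table in console output and return its values."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    for header, values in zip(lines, lines[1:]):
        # The header line is followed directly by the line with the values
        if header.split()[:4] == ["DataLines", "Maps", "Reduces", "AvgTime"]:
            data_lines, maps, reduces, avg_ms = (int(v) for v in values.split()[:4])
            return {
                "data_lines": data_lines,
                "maps": maps,
                "reduces": reduces,
                "avg_time_ms": avg_ms,
                "avg_time_s": avg_ms / 1000.0,
            }
    raise ValueError("no MRBench summary found in output")


if __name__ == "__main__":
    sample = (
        "DataLines       Maps    Reduces AvgTime (milliseconds)\n"
        "1               2       1       28846\n"
    )
    print(parse_mrbench_summary(sample))
```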

Test 2: 10 runs, default values

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench -numRuns 10

Output (with my comments):

/*numRuns is 10, so the job is run 10 times; each job starts with a counter. Example of a job number: */
16/06/21 13:06:42 INFO mapred.MRBench: Running job 1
DataLines       Maps    Reduces AvgTime (milliseconds)
1               2       1       20986

Average time per job was almost 21 seconds.

Test with results

Here is an example of one benchmark test.

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench -numRuns 2 -maps 50 -reduces 10 -inputLines 100000000 -inputType random

Some results from running this test on one of my clusters.

Input parameters                                 Output statistics
inputLines    maps  reduces  numRuns  inputType  Input file size generated  Avg time
1.000.000     10    5        2        asc/desc   7.4MB                      26s
1.000.000     10    5        2        random     19.4MB                     28s
10.000.000    10    5        2        asc/desc   85.8MB                     36s
10.000.000    10    5        2        random     194.4MB                    35s
100.000.000   10    5        2        asc/desc   256MB                      97s
100.000.000   10    5        2        random     1.9GB                      100s
100.000.000   50    10       2        random     1.9GB                      78s

 

For measuring I/O (read/write) performance, check the post about TestDFSIO.

Hadoop Benchmark test – TestDFSIO

Preparing the environment

Create a local folder where the benchmark result files are saved:

sudo -u hdfs mkdir /home/hdfs/benchmark

Give access to everyone (if more users would like to run benchmark tests; otherwise skip this and run the commands as the hdfs user):

sudo -u hdfs chmod 777 /home/hdfs/benchmark

 

About TestDFSIO benchmark test

Program TestDFSIO can be found in jar file /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar.

The TestDFSIO benchmark is used for measuring I/O (read/write) performance. It does this by using a MapReduce job to read and write files in parallel. Hence, a working MapReduce setup is needed for it.

The benchmark test uses one map task per file.

By invoking the benchmark with no arguments

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO

usage instructions are shown:

Missing arguments.

Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

 

Arguments

Defining result file (-resFile)

The argument -resFile defines the location and name of the file in which the test results are saved.

Example:

-resFile /home/hdfs/benchmark/TestDFSIOwrite

If this argument is not given, the result file is written to the current directory under the name TestDFSIO_results.log.

If the argument points to an existing file, the result is appended to it. Here is an example of a result file after 2 tests have been run:

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:47:23 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 15.332720024532351
Average IO rate mb/sec: 16.875974655151367
IO rate std deviation: 3.72817574426085
Test exec time sec: 25.766
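Since results are appended to the same file, it quickly becomes useful to read the file back as a list of records. Below is a small Python sketch that does this for the result format shown above. The function name parse_results is hypothetical, and the sketch assumes every test starts with a header line containing "TestDFSIO" and ending in the test type, followed by "key: value" lines.

```python
import re

# Header line of one test, for example "----- TestDFSIO ----- : write"
HEADER = re.compile(r"TestDFSIO\b.*:\s*(\w+)\s*$")


def parse_results(text):
    """Split the content of a TestDFSIO result file into one dict per test."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.search(line)
        if match:
            current = {"test": match.group(1)}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, the date value contains colons too
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


if __name__ == "__main__":
    sample = """----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Throughput mb/sec: 17.13796058269066
Test exec time sec: 30.607
"""
    for record in parse_results(sample):
        print(record["test"], record["Throughput mb/sec"])
```

All values are kept as strings; convert them with float() where numbers are needed.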

Write test (-write)

The argument -write tells the program to test writing to the cluster. It is convenient to run this before the -read test, so that some files are prepared for the read test.

The written files are located in HDFS under folder /benchmarks, in folder TestDFSIO. If the write test is run and the TestDFSIO folder is already there, the folder is deleted first.

Read test (-read)

The argument -read tells the program to test reading from the cluster. It is convenient to run the test with argument -write first, so that some files are prepared for the read test.

If the test is run with this argument before it is run with argument -write, an error message like this shows up:

16/06/20 11:21:32 INFO mapreduce.Job: Task Id : attempt_1463992963604_0028_m_000005_0, Status : FAILED Error: java.io.FileNotFoundException: File does not exist: /benchmarks/TestDFSIO/io_data/test_io_5
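To avoid this error, a script can check that the input files exist before starting a read test. The Python sketch below relies on hdfs dfs -test -e, which exits with status 0 when a path exists. The function names are made up for illustration, and the file naming test_io_0 to test_io_(N-1) is an assumption based on the paths that appear in this post.

```python
import subprocess

IO_DATA = "/benchmarks/TestDFSIO/io_data"


def hdfs_path_exists(path):
    """Return True if `hdfs dfs -test -e PATH` exits with status 0."""
    return subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0


def missing_read_inputs(nr_files, exists=hdfs_path_exists):
    """List the input files a read test with nr_files files would not find.

    Assumption: the write test names its files test_io_0 .. test_io_(N-1).
    """
    paths = [f"{IO_DATA}/test_io_{i}" for i in range(nr_files)]
    return [p for p in paths if not exists(p)]


if __name__ == "__main__":
    # Demo with a stand-in for the HDFS check, so no cluster is needed here
    only_first_five = lambda p: int(p.rsplit("_", 1)[1]) < 5
    print(missing_read_inputs(10, exists=only_first_five))
```

On a cluster, calling missing_read_inputs(10) without the exists argument runs the real HDFS check; an empty list means the read test has everything it needs.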

Number of files (-nrFiles)

The argument defines the number of files used in the test. If the test is writing, this argument defines the number of output files. If the test is reading, this argument defines the number of input files.

File size (-size)

The argument defines the size of the file(s) used in testing. It takes a numerical value with an optional unit B|KB|MB|GB|TB. MB is the default.
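As a quick sanity check of the "Total MBytes processed" value in the reports, the size argument can be converted to megabytes and multiplied by the number of files. This is a Python sketch with a made-up function name size_to_mb, not TestDFSIO's own parsing. It assumes binary units (1GB = 1024MB), which agrees with Test 5 in this post, where 50 files of 1GB are reported as 51200 MBytes.

```python
import re

# Multipliers relative to MB, assuming binary units (1GB = 1024MB)
UNITS_IN_MB = {
    "B": 1 / (1024 * 1024),
    "KB": 1 / 1024,
    "MB": 1,
    "GB": 1024,
    "TB": 1024 * 1024,
}


def size_to_mb(size):
    """Convert a -size value such as '10MB', '1GB' or '10' to megabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)?\s*", size, re.IGNORECASE)
    if not match:
        raise ValueError(f"cannot parse size: {size!r}")
    number, unit = match.groups()
    # MB is the default when no unit is given
    return float(number) * UNITS_IN_MB[(unit or "MB").upper()]


if __name__ == "__main__":
    for nr_files, size in [(10, "10MB"), (50, "100MB"), (50, "1GB")]:
        print(nr_files, size, "->", nr_files * size_to_mb(size), "MB in total")
```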

Remove previous test data (-clean)

The argument deletes the output directory /benchmarks/TestDFSIO in HDFS.

Command:

sudo -u hdfs yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

Example of an output:

INFO fs.TestDFSIO: TestDFSIO.1.8
16/06/20 11:16:54 INFO fs.TestDFSIO: nrFiles = 1
16/06/20 11:16:54 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
16/06/20 11:16:54 INFO fs.TestDFSIO: bufferSize = 1000000
16/06/20 11:16:54 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
16/06/20 11:16:55 INFO fs.TestDFSIO: Cleaning up test files

 

TestDFSIO tests

Test 1: write 10 files, size 10MB

Write 10 files, each 10MB in size, and put the results in /home/hdfs/benchmark/TestDFSIOwrite.

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIOwrite

Result:

----- TestDFSIO ----- : write
Date & time: Sun Jun 19 17:39:20 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 17.13796058269066
Average IO rate mb/sec: 17.385766983032227
IO rate std deviation: 2.1966324914130517
Test exec time sec: 30.607

Test 2: read 10 files, size 10MB

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -size 10MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : read
Date & time: Mon Jun 20 10:49:24 CEST 2016
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 68.87052341597796
Average IO rate mb/sec: 170.9973602294922
IO rate std deviation: 135.61526628958586
Test exec time sec: 39.793
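Throughput and average IO rate are far apart in this result. To my understanding of TestDFSIO, throughput is the total amount of data divided by the summed I/O time of all map tasks, while the average IO rate is the mean of the per-file rates; treat this as an assumption and check the source of your Hadoop version. The Python sketch below uses made-up per-task numbers (not measurements from my cluster) to show how a single slow task pulls the two values apart.

```python
from statistics import mean


def summarize(tasks):
    """tasks: list of (megabytes, seconds) tuples, one per map task."""
    total_mb = sum(mb for mb, _ in tasks)
    total_s = sum(s for _, s in tasks)
    rates = [mb / s for mb, s in tasks]
    return {
        # Assumed definition: total data over summed task time
        "throughput": total_mb / total_s,
        # Assumed definition: mean of the individual file rates
        "avg_io_rate": mean(rates),
    }


# Hypothetical data: nine fast reads (200 MB/s) and one slow read (10 MB/s)
tasks = [(10, 0.05)] * 9 + [(10, 1.0)]
summary = summarize(tasks)

if __name__ == "__main__":
    print("Throughput mb/sec:      %.2f" % summary["throughput"])
    print("Average IO rate mb/sec: %.2f" % summary["avg_io_rate"])
```

With these numbers the slow task accounts for most of the summed time, so the throughput ends up well below the average rate, which is the same pattern as in the read result above.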

Test 3: write 50 files, size 100MB

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:22:00 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 3.393228337799953
Average IO rate mb/sec: 4.491838455200195
IO rate std deviation: 3.903713550708894
Test exec time sec: 65.042

Running the following command on one file written in the benchmark test:

hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0

Returns (selection):

Average block replication: 3.0
Number of data-nodes: 4

Test 4: write 50 files, size 100MB, replication factor 2

The cluster’s default replication factor is 3. In this case, we run the test with argument -D and change the replication factor to 2.

yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -D dfs.replication=2 -write -nrFiles 50 -size 100MB -resFile /home/hdfs/benchmark/TestDFSIO

Result:

----- TestDFSIO ----- : write
Date & time: Mon Jun 20 13:39:17 CEST 2016
Number of files: 50
Total MBytes processed: 5000.0
Throughput mb/sec: 8.495729196932702
Average IO rate mb/sec: 11.038678169250488
IO rate std deviation: 8.691344876093968
Test exec time sec: 43.378

Running the following command on one file written in the benchmark test:

hdfs fsck /benchmarks/TestDFSIO/io_data/test_io_0

Returns (selection):

Average block replication:     2.0
Number of data-nodes:          4
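Putting the numbers of Test 3 (replication factor 3) and Test 4 (replication factor 2) side by side makes the difference easier to see. The short Python sketch below only does arithmetic on the values reported above; the function and key names are my own. Keep in mind that each test was run once, so this is an indication rather than a solid measurement.

```python
# Values copied from the reports of Test 3 and Test 4
test3 = {"throughput": 3.393228337799953, "exec_time": 65.042}  # replication 3
test4 = {"throughput": 8.495729196932702, "exec_time": 43.378}  # replication 2


def compare(base, other):
    """Return the throughput ratio and the execution time saved in percent."""
    return {
        "throughput_ratio": other["throughput"] / base["throughput"],
        "time_saved_pct": 100 * (base["exec_time"] - other["exec_time"]) / base["exec_time"],
    }


result = compare(test3, test4)

if __name__ == "__main__":
    print("Throughput ratio: %.2f" % result["throughput_ratio"])
    print("Time saved:       %.1f%%" % result["time_saved_pct"])
```

In this run, the throughput with replication factor 2 is roughly 2.5 times higher and the test finishes in about a third less time.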

Test 5: write 50 files, size 1GB, time

time yarn jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 50 -size 1GB -resFile /home/hdfs/benchmark/TestDFSIO

Result:
----- TestDFSIO ----- : write
Date & time: Mon Jun 20 16:59:56 CEST 2016
Number of files: 50
Total MBytes processed: 51200.0
Throughput mb/sec: 3.3432171543606457
Average IO rate mb/sec: 3.805697441101074
IO rate std deviation: 1.8925752533465978
Test exec time sec: 421.994

Output (time):

real 7m11.604s
user 0m21.056s
sys 0m2.757s

The relevant metric here is real.
In some cases, running the command with time is not needed, since the execution time is already in the test report or output.
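To compare the two timings, the real value has to be converted to seconds first. Here is a small Python sketch that does this for the time output format shown above; the function name real_seconds is made up for illustration.

```python
import re


def real_seconds(time_output):
    """Extract the 'real' line from the output of time and return seconds."""
    match = re.search(r"^real\s+(?:(\d+)m)?([\d.]+)s", time_output, re.MULTILINE)
    if not match:
        raise ValueError("no 'real' line found")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


if __name__ == "__main__":
    output = "real 7m11.604s\nuser 0m21.056s\nsys 0m2.757s\n"
    wall = real_seconds(output)
    test_exec = 421.994  # "Test exec time sec" from the report of Test 5
    print("real: %.3f s, outside the measured test: %.3f s" % (wall, wall - test_exec))
```

For Test 5 this gives 431.604 seconds of wall-clock time, about 9.6 seconds more than the test execution time in the report. My guess is that the difference is start-up and job submission overhead on the client side.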

The MRbench test benchmarks how responsive small jobs can be in a cluster.