Benchmarks

Micro-Benchmarks
BigDataBench

The DataMPI Micro-Benchmark Suite is used to evaluate the performance of MPI-D implementations on various infrastructures.

DataMPI Micro-Benchmark Suite 0.6.0 consists mainly of four benchmarks:

WordCount - counts the number of occurrences of each word in the input files.
- Input format: any document (usually in text format)
- Output format: <word> <count>
- Dataset: Generated through RandomTextWriter in Hadoop.
- Command-line execution: $ ./dmb.sh wordcount <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>
Sort - sorts the input files based on the keys.
- Input format: <key> <value>
- Output format: <key> <value>
- Dataset: Generated through RandomTextWriter in Hadoop.
- Command-line execution: $ ./dmb.sh sort <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>
TeraSort - sorts 100-byte <key, value> tuples, each consisting of a 10-byte key and a 90-byte value.
- Input format: <key> <value>
- Output format: <key> <value>
- Dataset: Generated through TeraGen in Hadoop.
- Command-line execution: $ ./dmb.sh terasort <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>
Grep - extracts matching strings from the input files and counts the number of occurrences of each string.
- Input format: <key> <value>
- Output format: <key> <value>
- Dataset: Generated through RandomTextWriter in Hadoop.
- Command-line execution: $ ./dmb.sh grep <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>
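
To make the output formats above concrete, the following sketch reproduces the semantics of WordCount, Sort and Grep on a small local file with common Unix tools. It does not use DataMPI and is not how the suite implements these benchmarks. The function names (ref_wordcount, ref_sort, ref_grep) are hypothetical, and grep -o is assumed to be available (GNU and BSD grep both provide it). A reference like this may help spot-check benchmark output on a small sample.

```shell
#!/bin/sh
# Local reference pipelines (assumption: common Unix tools, not DataMPI).

# WordCount: print "<word> <count>" for every whitespace-separated word in file $1.
ref_wordcount() {
    tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | LC_ALL=C sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Sort: order "<key> <value>" lines of file $1 by their first field (the key).
ref_sort() {
    LC_ALL=C sort -k1,1 "$1"
}

# Grep: extract every match of pattern $1 in file $2 and print "<match> <count>".
# Note: matches that contain spaces are not handled by the awk step.
ref_grep() {
    grep -o -- "$1" "$2" | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}
```

For a file containing the two lines "b a b" and "a c", ref_wordcount prints "a 2", "b 2" and "c 1", one pair per line.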

The suite also contains several micro-benchmarks for basic MPI operations, such as Bandwidth and Latency. For more details, please refer to the DataMPI User Guide.

BigDataBench

BigDataBench is a big data benchmark suite abstracted from Internet services. It includes six real-world data sets and nineteen big data workloads, covering six application scenarios: micro benchmarks, Cloud “OLTP”, relational query, search engine, social networks, and e-commerce. By generating representative and diverse big data workloads, BigDataBench features an abstracted set of Operations and Patterns for big data processing. For the same workloads, BigDataBench provides different implementations based on various programming models, including MapReduce, MPI, MPI-D (DataMPI), and Spark. This page describes the DataMPI-based implementation of BigDataBench.

The DataMPI 0.6.0 release implements three commonly used BigDataBench micro-benchmarks, listed below. Other benchmarks will be added gradually in future releases.

WordCount - counts the number of occurrences of each word in the input files.
Sort - sorts the input files based on the keys.
Grep - extracts matching strings from the input files and counts the number of occurrences of each string.

DataMPI-BigDataBench uses the data generated by BigDataBench Text Generator, which can be downloaded from here.

The following demonstration shows how to run DataMPI-BigDataBench step by step. The two major steps are data preparation and running the benchmarks (WordCount, Sort, Grep). Note that the DataMPI-BigDataBench scripts can be found in $DATAMPI_HOME/benchmarks/dmbdb.

# Prepare Input Data
$ cd ${BIGDATABENCH_HOME}/BigDataGeneratorSuite/Text_datagen
$ sh gen_text_data.sh lda_wiki1w 10 100 1000 gen_data
$ ${HADOOP_HOME}/bin/hadoop fs -copyFromLocal gen_data /

# Convert Text files to Sequence files
$ cd ${BIGDATABENCH_HOME}/BigDataGeneratorSuite/Text_datagen/ToSeqFile
$ ./sort-transfer.sh gen_data gen_data_seq

# Run Benchmarks
$ cd ${DATAMPI_HOME}/benchmarks/dmbdb
$ ./dmbdb.sh wordcount 4 1 /gen_data out_wc
$ ./dmbdb.sh grep 4 1 /gen_data out_gp
$ ./dmbdb.sh sort 4 4 /gen_data_seq out_st
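
The benchmark step above can be collected into a small driver script. The sketch below is a convenience wrapper written for illustration and is not shipped with DataMPI: the helper names require_env, run_step and run_benchmarks and the DRY_RUN variable are hypothetical. It checks that DATAMPI_HOME is set and, by default, only prints the three benchmark commands with the task counts and paths from the demonstration. Setting DRY_RUN=0 changes into the dmbdb directory and executes them.

```shell
#!/bin/sh
# Hypothetical driver for the "Run Benchmarks" step (not part of DataMPI).

# Return non-zero if any named environment variable is unset or empty.
require_env() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            return 1
        fi
    done
}

# Print the command; execute it only when DRY_RUN=0 (default is a dry run).
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Run WordCount, Grep and Sort with the arguments used in the demonstration.
run_benchmarks() {
    require_env DATAMPI_HOME || return 1
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # dmbdb.sh is invoked from its own directory, as in the demonstration.
        cd "${DATAMPI_HOME}/benchmarks/dmbdb" || return 1
    fi
    run_step ./dmbdb.sh wordcount 4 1 /gen_data out_wc &&
    run_step ./dmbdb.sh grep 4 1 /gen_data out_gp &&
    run_step ./dmbdb.sh sort 4 4 /gen_data_seq out_st
}
```

Calling run_benchmarks after sourcing the file gives a dry run. Stopping at the first failing benchmark (the && chain) is a design choice of this sketch, so that a broken input directory does not trigger the remaining runs.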