DataMPI 0.6.0 User Guide

DataMPI team

http://datampi.org

April, 2014

Contents

1 Overview
2 Features
3 Installation Instructions
 3.1 Pre-requisites
  3.1.1 Supported Platforms
  3.1.2 Required Software
  3.1.3 MPI Environment
 3.2 Download
 3.3 Configure Build Environment
 3.4 Compile and Install DataMPI
 3.5 Performing Tests
 3.6 Performing Examples
  3.6.1 Configure DataMPI
  3.6.2 Run Examples
4 Basic Usage Instructions
5 Advanced Configuration
 5.1 User Application Code
 5.2 mpi-d.properties
 5.3 mpi-d-env.sh
6 Benchmarks
 6.1 Micro-Benchmarks
 6.2 BigDataBench
  6.2.1 Demo

1 Overview

DataMPI is an efficient, flexible, and productive communication library, which provides a set of key-value pair based communication interfaces that extends MPI for Big Data. Through utilizing the efficient communication technologies in the High-Performance Computing area, DataMPI can speedup the emerging data intensive computing applications. DataMPI takes a step in bridging the two fields of HPC and Big Data.

We are working on a draft version of our proposed specification to extend MPI for Big Data. We label the specification as MPI-D, which is still under revision and will be published later. DataMPI is a Java binding implementation for MPI-D.

DataMPI can support multiple modes for various Big Data Computing applications, including Common, MapReduce, Streaming, and Iteration. The current version implements the functionalities and features of the Common mode, which aims to support the single program, multiple data (SPMD) applications. The remaining modes will be released in the future.

The current implementation of DataMPI is extending mpiJava. We also integrate some features from Hadoop under Apache License 2.0. The current evaluations of DataMPI use MVAPICH2 as the backend. DataMPI also supports other MPI implementations, such as MPICH2.

2 Features

Following features are included in DataMPI 0.6.0:

3 Installation Instructions

3.1 Pre-requisites

3.1.1 Supported Platforms

GNU/Linux is supported as the development and production platform for DataMPI. DataMPI has been well tested on Linux clusters. Any mainstream Linux distribution can be chosen. A non-privileged account is recommended for security reasons during installation. However, if DataMPI is installed into a globally accessible location, the root account is necessary.

3.1.2 Required Software

Compliers and build toolchains are necessary only for building DataMPI from source code.

3.1.3 MPI Environment

DataMPI depends on a native MPI environment. We use MVAPICH2 1.8.1 as an example. MVAPICH2 1.8.1 can be installed as follows in a typical Ethernet environment.

$ cd <mvapich2_source_directory> 
$ ./configure --prefix=/home/mpiuser/mvapich2 --with-device=ch3:nemesis \ 
   CFLAGS=-fPIC --disable-f77 --disable-fc 
$ make && make install 
 
$ echo export MVAPICH2_HOME=/home/mpiuser/mvapich2 >> ~/.bashrc 
$ echo export PATH=$MVAPICH2_HOME/bin:$PATH >> ~/.bashrc 
$ source ~/.bashrc

For detailed installing configuration of MVAPICH2, please refer to its website ( http://mvapich.cse.ohio-state.edu/support/ ).

DataMPI also supports other MPI implementations, such as MPICH2.

3.2 Download

The binary release of DataMPI can be downloaded from the official website ( http://datampi.org ). The source code can be obtained from DataMPI team by email ( mpid.contact at gmail.com ).

For the binary package, we suppose it is unzipped into /home/mpiuser/datampi-0.6.0-install, which is referred to as the “install” directory in the following text, the following steps of configuration, compilation and test can be skipped and the examples can be run directly (see Section 3.6 ).

For the source package, we suppose it is unzipped into /home/mpiuser/datampi-0.6.0-src, which is referred to as the “source” directory in the following text, the steps below must be executed before running.

3.3 Configure Build Environment

A “build” directory should be created for storing the intermediate files. Here
/home/mpiuser/datampi-0.6.0-build is taken as an example.

$ mkdir /home/mpiuser/datampi-0.6.0-build

An “install” directory should be created as the destination location of DataMPI installation. Here
/home/mpiuser/datampi-0.6.0-install is taken as an example.

$ mkdir /home/mpiuser/datampi-0.6.0-install

CMake tool is used to created Makefile scripts.

$ cd <build_directory> 
$ cmake -D CMAKE_INSTALL_PREFIX=<install_directory> \ 
   -D MPI_D_BUILD_DOCS=<ON_or_OFF> -D MPI_D_BUILD_TESTS=<ON_or_OFF> \ 
   -D MPI_D_BUILD_EXAMPLES=<ON_or_OFF> -D MPI_D_BUILD_BENCHMARKS=<ON_or_OFF> \ 
   <source_directory>

An example command looks like this.

$ cd /home/mpiuser/datampi-0.6.0-build 
$ cmake -D CMAKE_INSTALL_PREFIX=/home/mpiuser/datampi-0.6.0-install \ 
   -D MPI_D_BUILD_DOCS=ON -D MPI_D_BUILD_TESTS=ON -D MPI_D_BUILD_EXAMPLES=ON \ 
   -D MPI_D_BUILD_BENCHMARKS=ON /home/mpiuser/datampi-0.6.0-src

If all goes well, the result will be printed like this.

-- The C compiler identification is GNU 4.4.5 
-- Check for working C compiler: /usr/bin/gcc 
-- Check for working C compiler: /usr/bin/gcc -- works 
... 
-- Configuring done 
-- Generating done 
-- Build files have been written to: /home/mpiuser/datampi-0.6.0-build

3.4 Compile and Install DataMPI

DataMPI can be compiled and installed with the following command.

$ make install

If all goes well, the result will be printed like this.

Scanning dependencies of target mpi 
[  1%] Building Java object src/Java/CMakeFiles/mpi.dir/mpi/CartParms.class 
[  1%] Building Java object src/Java/CMakeFiles/mpi.dir/mpi/Cartcomm.class 
... 
Scanning dependencies of target html-manual 
[100%] Compiling html-manual 
Install the project... 
... 
-- Installing: /home/mpiuser/datampi-0.6.0-install/share/datampi/doc/ 
   html/md-docs-style.css 
-- Installing: /home/mpiuser/datampi-0.6.0-install/share/datampi/doc/ 
   datampi-0.6.0-quick-start.md 
-- Installing: /home/mpiuser/datampi-0.6.0-install/share/datampi/doc/ 
   mpiJava-spec.pdf

Now, DataMPI is installed in the “install” directory. The “source” and “build” directories are no longer being accessed when using DataMPI normally. However, the “build” directory should be kept for performing the tests and the “source” directory should be kept for studying the code of examples.

3.5 Performing Tests

DataMPI tests are used to verify the newly complied DataMPI working properly. By performing the tests, specific problems in certain environment can be generally located. DataMPI tests are located in the build directory. To run all the tests, the following commands can be used.

$ cd /home/mpiuser/datampi-0.6.0-build 
$ make test

If all goes well, the result will be printed like this.

Running tests... 
Test project /home/mpiuser/datampi-0.6.0-build 
  Start   1: mpid.tests.SortJavaSerializable 
1/106 Test   #1: mpid.tests.SortJavaSerializable ............... Passed 1.50 sec 
  Start   2: mpid.tests.ReverseSortWritableBytesComparator 
2/106 Test   #2: mpid.tests.ReverseSortWritableBytesComparator . Passed 1.41 sec 
... 
100% tests passed, 0 tests failed out of 106 
Total Test time (real) = 104.29 sec

3.6 Performing Examples

DataMPI examples are supplied to help users to be familiar with the basic command-line operations of DataMPI, and to guide users to write MPI-D programs as well.

3.6.1 Configure DataMPI

Before performing DataMPI examples and writing MPI-D programs, the DataMPI installation must be well configured. The configuration files of DataMPI are located in the conf subdirectory of the install directory. To perform the following simple example on a single node, a few files should be modified, the variables in the angle brackets should be replaced by the real values of users’ environment.

hostfile:

<your_host_name>

mpi-d-env.sh:

# Java home to use.  Required. 
# export JAVA_HOME=<your_java_home>

To perform multi-node tests or examples, the install directory should be available on all the nodes with the same path (either employing a network file system or copying the directory to all the nodes), and all the hostnames should be listed line by line in hostfile.

3.6.2 Run Examples

The examples are grouped and located in the share/datampi/examples subdirectory of the install directory. For example, the “common” group includes some examples for the Common mode of MPI-D. The mpidrun command in the bin subdirectory is the launcher of MPI-D programs, which can be used to run examples. A Sort example is given to show the usage of mpidrun.

$ cd /home/mpiuser/datampi-0.6.0-install 
$ bin/mpidrun -mode COMMON -O 2 -A 2 \ 
   -jar share/datampi/examples/common/common-example.jar mpid.examples.Sort

The meanings of the parameters will be explained in Section 4

If all goes well, the result will be printed like this.

The O task 1 of 2 is working... 
The O task 0 of 2 is working... 
The A task 0 of 2 is working... 
... 
In the task 0, key is [hello], value is [3] 
In the task 0, key is [sort], value is [4] 
In the task 0, key is [world], value is [0] 
... 
In the task 1, key is [templar], value is [1] 
In the task 1, key is [templar], value is [8] 
In the task 1, key is [valvula], value is [2] 
... 
13/05/09 21:41:55 INFO common.MPI_D_Runner4Common: Job job_201305092141_0.64 
   ended, used time 1.03 sec.

4 Basic Usage Instructions

The most important command of DataMPI is mpidrun, which is used for launching a DataMPI job. Its available parameters are listed as follows.

$ mpidrun -f <hostfile> -O <n> -A <m> -mode <mode> -jar <jarname> <classname> <params ...>

In the user application code, there are two parameters of reserved keys that can be used to define data types of key-value pairs.

private static void initConf(HashMap<String, String> conf) { 
               conf.put(MPI_D_Constants.ReservedKeys.KEY_CLASS, Text.class.getName()); 
               conf.put(MPI_D_Constants.ReservedKeys.VALUE_CLASS, Text.class.getName()); 
               ... 
}

The meanings of the parameters are listed as follows.

5 Advanced Configuration

5.1 User Application Code

There are some advanced parameters of reserved keys that can be used in users’ appllications. There are some advanced parameters of reserved keys that can be used in users’ appllications.

private static void initConf(HashMap<String, String> conf) { 
               conf.put(MPI_D_Constants.ReservedKeys.COMBINER_CLASS, WordCountCombiner.class.getName()); 
               ... 
}

The meanings of the parameters in the MPI_D_Constants.ReservedKeys class are explained as follows.

5.2 mpi-d.properties

Parameters in mpi-d.properties are used to set the intermediate data cache path and tune the buffer management, these parameters are explained as follows.

5.3 mpi-d-env.sh

The JVM memory size and hosfile file path are defined in this file.

6 Benchmarks

DataMPI provides a set of benchmarks to help users conduct performance evaluation.

6.1 Micro-Benchmarks

DataMPI Micro-Benchmark Suite is used to evaluate the performance of MPI-D implementations over various infrastructures.

DataMPI Micro-Benchmark Suite 0.6.0 mainly consists of four benchmarks:

$ ./dmb.sh wordcount <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>

$ ./dmb.sh sort <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>

$ ./dmb.sh terasort <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>

$ ./dmb.sh grep <num-O-tasks> <num-A-tasks> <input-dir> <output-dir>

The suite also contains several primitive-level benchmarks like Bandwidth and Latency to evaluate Java-level MPI operation performance.

6.2 BigDataBench

BigDataBench ( http://prof.ict.ac.cn/BigDataBench ) is a big data benchmark suite abstracted from Internet services. It includes six real-world data sets, and nineteen big data workloads, covering six application scenarios: micro benchmarks, Cloud “OLTP”, relational query, search engine, social networks and e-commerce. Through generating representative and various big data workloads, BigDataBench features an abstracted set of Operations and Patterns for big data processing. For the same workloads, BigDataBench provides different implementations based on various programming models, including MapReduce, MPI, MPI-D (DataMPI), Spark, etc.

The DataMPI 0.6.0 release provides its implementation of three commonly used micro-benchmarks in BigDataBench, which are listed as follows. Other benchmarks will be included in future releases gradually.

DataMPI-BigDataBench uses data generated by BigDataBench Text Generator, which can be downloaded from this website ( http://prof.ict.ac.cn/BigDataBench/\#downloads ).

6.2.1 Demo

Here a demonstration is given to show how to run DataMPI-BigDataBench step by step. The major two steps include data preparation and running benchmarks (WordCount, Sort, Grep). Note that DataMPI-BigDataBench related scripts can be found in $DATAMPI_HOME/benchmarks/dmbdb.

# Prepare Input Data 
$ cd ${BIGDATABENCH_HOME}/BigDataGeneratorSuite/Text_datagen 
$ sh gen_text_data.sh lda_wiki1w 10 100 1000 gen_data 
$ ${HADOOP_HOME}/bin/hadoop fs -copyFromLocal gen_data / 
 
# Convert Text files to Sequence files 
$ cd ${BIGDATABENCH_HOME}/BigDataGeneratorSuite/Text_datagen/ToSeqFile 
$ ./sort-transfer.sh gen_data gen_data_seq 
 
# Run Benchmarks 
$ cd ${DATAMPI_HOME}/benchmarks/dmbdb 
$ ./dmbdb.sh wordcount 4 1 /gen_data out_wc 
$ ./dmbdb.sh grep 4 1 /gen_data out_gp 
$ ./dmbdb.sh sort 4 4 /gen_data_seq out_st