Overview

DataMPI is an efficient, flexible, and productive communication library, which provides a set of key-value pair based communication interfaces that extends MPI for Big Data. Through utilizing the efficient communication technologies in the High-Performance Computing area, DataMPI can speedup the emerging data intensive computing applications. DataMPI takes a step in bridging the two fields of HPC and Big Data.

We are working on a draft version of our proposed specification to extend MPI for Big Data. We label the specification as MPI-D, which is still under revision and will be published later. DataMPI is a Java binding implementation for MPI-D.

DataMPI can support multiple modes for various Big Data Computing applications, including Common, MapReduce, Streaming, and Iteration. The current version implements the functionalities and features of the Common mode, which aims to support the single program, multiple data (SPMD) applications. The remaining modes will be released in the future.

The current implementation of DataMPI is extending mpiJava. We also integrate some features from Hadoop under Apache License 2.0. The current evaluations of DataMPI use MVAPICH2 as the backend. DataMPI also supports other MPI implementations, such as MPICH2.

Please read the Quick Start to get started with DataMPI. For detailed instructions of usage, please refer to the User Guide.

We really appreciate your interest and feedback.