The Message Passing Interface (MPI) is an application programming interface (API) designed to help you build parallel programs. It has C, Fortran 90, and Fortran 77 interfaces, and it is the method used to parallelize most of the parallel programs in Amber.
The hardest part of parallel programming is devising parallelized algorithms (we tend to think sequentially, so designing a parallel algorithm can be rather counter-intuitive). The point of this page is not to help you become parallel programmers, but rather to help you understand (and provide a reference for) the MPI API specifically.
A book that I found particularly helpful when learning MPI programming was Peter Pacheco's text.
Implementations
MPI itself is just a standard. Programs that actually put this standard into practice are called implementations, and there are a number of different implementations that you can choose from:
- mpich
- mvapich — an MPICH implementation optimized for Infiniband interconnect
- mpich2 (this project has reverted to the 'mpich' name)
- mvapich2 — an MPICH2 implementation optimized for Infiniband interconnect
- OpenMPI
- Intel MPI
- LAM/MPI
- Microsoft MPI
Programming model
MPI programs typically adopt the SPMD model (Single Program Multiple Data), although using some new MPI-2 features, the MPMD model (Multiple Program Multiple Data) can be used, in which different programs can send messages back and forth via MPI calls.
An MPI program consists of multiple processes (referred to loosely as "threads" throughout this page) that can communicate with one another. However, unlike some parallel programming models, MPI uses a static model: the number of threads spawned is set at the very beginning of the program and never changes. This is in contrast to parallel programming models in which threads can be spawned and destroyed as needed.
MPI programs can be parallelized across distributed systems: not all threads have to run on the same physical machine (more specifically, not all threads need access to the same shared memory). MPI programs are not required to be distributed, however, and the MPI implementation itself handles how messages (memory) are transferred, hiding those details from the user.
Most parallel programs in Amber use MPI for their implementation, and all of those use the SPMD model.
Using MPI in Your Program
The MPI functions and variables are available through header files (mpi.h for C programs and mpif.h for Fortran programs), so these must be included in any source code that uses MPI constants or functions/subroutines.
When linking your program (typically the phase after compilation, in which the compiled object files are combined by the linker, which is often the compiler itself), you must also link in the MPI libraries. The location of those libraries, which of them need to be linked into the final program, and the location of the header files all depend on the particular MPI implementation you're using, as well as on how it was configured and built. To make things easier for the user, MPI implementations typically provide compiler wrappers for the C, Fortran 77, and Fortran 90 compilers, called mpicc, mpif77, and mpif90, respectively.
To show what these wrappers actually do, you can use the command
mpicc -show
for most MPICH-derived implementations (Open MPI's wrappers use the --showme flag instead), and likewise for the Fortran compiler wrappers.
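As a hedged example of how these wrappers are typically used (the file name and process count here are arbitrary), compiling and launching an MPI program usually looks something like this:

    mpicc my_program.c -o my_program
    mpirun -np 4 ./my_program

Most implementations also accept the standardized launcher name mpiexec (e.g. mpiexec -n 4 ./my_program).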
General MPI Function Characteristics
MPI exposes its API (makes its interface available to users) by providing a set of functions (in C) or subroutines (in Fortran) that give programmers access to its message-passing capabilities, along with a set of constants that hide the implementation details.
MPI functions and constants all follow a particular naming convention. All functions/subroutines and constants begin with the prefix MPI_ (capitalized wherever case sensitivity exists, so in C but NOT in Fortran).
Constants are all capitalized (like MPI_COMM_WORLD or MPI_DOUBLE_PRECISION), whereas functions have only the first letter after the MPI_ capitalized, like MPI_Init() and MPI_Send().
Differences Between C and Fortran
There is one principal difference between the C MPI functions and the Fortran MPI subroutines: the C functions return the error value (indicating whether the call succeeded or failed). Fortran subroutines have no return value, so this approach cannot be used. Instead, every Fortran subroutine we consider here takes one extra integer argument, always the last one, which is set to that error value.
The equivalent MPI_Send calls in C and then Fortran are shown below for comparison:
C: ierror = MPI_Send(&sendbuf, 1, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD)
Fortran: call mpi_send(sendbuf, 1, MPI_DOUBLE_PRECISION, 0, 1234, mpi_comm_world, ierror)
Notice the case insensitivity in Fortran, and the different data type used: the MPI datatype must match the declared type of the data being sent (MPI_DOUBLE for a C double, MPI_DOUBLE_PRECISION for a Fortran double precision).
Communicators and Communications
MPI enables parallel programming by allowing users to send and receive messages between different threads of the same program. All communication takes place within a grouping of threads called a communicator: a collection of threads in an MPI program that can send messages to one another.
An MPI program has at least one communicator: MPI_COMM_WORLD, a predefined constant that includes every thread launched at the very beginning of the program. There is no upper limit on how many communicators can be defined.
Rank in Communicator
Every communicator is composed of a subset of the threads that were initially launched. In the case of MPI_COMM_WORLD, this subset consists of every thread. In order to distinguish between the threads in a communicator, each thread in the communicator is assigned a unique integer identity from 0 to one less than the total number of threads in that communicator.
You can get the rank of a thread in a given communicator by the function MPI_Comm_rank()
C: ierror = MPI_Comm_rank(<communicator>, &rank)
Fortran: call mpi_comm_rank(<communicator>, rank, ierror)
You should never make any assumption about which rank a particular thread will be given in a particular communicator. Always use the value returned by MPI_Comm_rank() (this applies to any other communicators you may create besides MPI_COMM_WORLD as well).
"Master" thread
Very often, you will need a single thread to handle certain tasks. For instance, reading input and writing output (I/O) should typically be done only by a single thread. For this reason, it's typical to assign a "master" thread that takes care of all of these tasks.
For convenience, this "master" thread is designated as the thread with rank 0 (zero) in the communicator. Rank 0 is chosen because every communicator has at least one thread, so every communicator will always have a thread with rank 0.
This is the convention used in the Amber codes (and you will frequently see a boolean variable master set to the result of rank == 0).
Size of Communicator
It is often crucial to know not only how to uniquely identify threads within a communicator (the rank described above), but also how many threads are in a particular communicator so that the workload may be appropriately split up as evenly as possible.
Thus, each communicator has a given size which is simply the number of threads that belong to that communicator (i.e. that have a rank in that communicator).
To get the number of threads in a given communicator, use the MPI_Comm_size() function.
C: ierror = MPI_Comm_size(<communicator>, &size)
Fortran: call mpi_comm_size(<communicator>, size, ierror)
As a note, you can get the total number of threads in an MPI program by using this function with the communicator MPI_COMM_WORLD, as that communicator is defined to include every thread launched at the beginning.
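As a minimal sketch of how rank and size are used together (the variable names, n_items, and the do_work() routine are hypothetical, and MPI_Init is assumed to have already been called), a loop of n_items iterations can be split across the threads in MPI_COMM_WORLD like this:

    int my_rank, num_procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);    /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);  /* how many of us are there? */

    /* round-robin division of labor: thread my_rank handles iterations
       my_rank, my_rank + num_procs, my_rank + 2*num_procs, ... */
    for (int i = my_rank; i < n_items; i += num_procs)
        do_work(i);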
Writing an MPI program
All MPI programs need an initialization call before any other MPI function or subroutine can be used. This is called MPI_Init.
Using C, this would be
ierror = MPI_Init(&argc, &argv)
Using Fortran, this would be
call MPI_Init(ierror)
MPI_Init must be called on every thread, before any other MPI routine. In practice this rule is hard to break by accident: a thread cannot even retrieve its own identity (its rank in the global communicator encompassing all threads) without making another MPI call, and that call requires MPI_Init to have been made first.
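Putting the pieces so far together, a minimal C skeleton might look like the following (a sketch, not Amber code; the matching cleanup call MPI_Finalize must be made before the program exits):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int my_rank, num_procs;

        MPI_Init(&argc, &argv);                     /* must come before any other MPI call */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);    /* this thread's rank in MPI_COMM_WORLD */
        MPI_Comm_size(MPI_COMM_WORLD, &num_procs);  /* total number of threads launched */

        if (my_rank == 0)                           /* the "master" convention described above */
            printf("Running on %d threads\n", num_procs);

        MPI_Finalize();                             /* clean up MPI before exiting */
        return 0;
    }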
Blocking Communications
A communication is called a blocking communication if all of the threads involved in this particular communication need to call that function and return from it before any of the threads may continue.
That is, if one of the threads finishes its previous work early and arrives at a blocking communication, it must sit and wait for each thread to arrive at that communication.
Point-to-Point
Send data: MPI_Send
C: ierror = MPI_Send(buffer, int count, MPI_Datatype, int destination, int tag, Communicator)
Fortran: call mpi_send(buffer, int count, MPI_Datatype, int destination, int tag, communicator, ierror)
buffer is data that needs to be sent. count tells MPI how many numbers are being sent, and MPI_Datatype tells MPI which kind of variable is being sent.
destination is the rank, within the specified communicator, of the thread that this message should be sent to. Finally, tag is an integer identifier that lets MPI match this particular message with the corresponding receive.
Note that every MPI_Send() must be paired with a matching MPI_Recv() call on the destination thread.
Receive Data: MPI_Recv
C: ierror = MPI_Recv(&buffer, int count, MPI_Datatype, int source, int tag, Communicator, mpi_status)
Fortran: call mpi_recv(buffer, int count, MPI_Datatype, int source, int tag, Communicator, mpi_status, ierror)
buffer is the memory location in which received data will be stored. count tells MPI how many numbers are being received, and MPI_Datatype tells MPI which kind of variable is being received.
source is the rank, within the specified communicator, of the thread that this message is being sent from. The tag is the same integer identifier described for MPI_Send, used to match the message against the corresponding send, and mpi_status is a data type used to store information about the status of a received message. See the section about MPI_Status at the bottom for a description of it (and how it differs between Fortran and C).
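As a sketch of a matched pair (assuming at least two threads in MPI_COMM_WORLD, that MPI_Init has already been called, and that my_rank holds the result of MPI_Comm_rank(); the tag 1234 is arbitrary):

    double value;
    MPI_Status status;

    if (my_rank == 0) {
        value = 3.14;
        /* send one double to the thread with rank 1, tagged 1234 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 1234, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        /* the matching receive: same count, datatype, tag, and communicator */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD, &status);
    }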
Send and Receive Data: MPI_Sendrecv
C: ierror = MPI_Sendrecv(&sendbuf, int send_count, MPI_Datatype, int destination, int sendtag, &recvbuf,
int recv_count, MPI_Datatype, int source, int recvtag, <communicator>, mpi_status)
Fortran: call mpi_sendrecv(sendbuf, int send_count, MPI_Datatype, int destination, int sendtag, recvbuf,
int recv_count, MPI_Datatype, int source, int recvtag, <communicator>, mpi_status, ierror)
The above call squashes a send and a receive into a single operation. See the descriptions of MPI_Send and MPI_Recv above for what the individual arguments mean. The only restriction on an MPI_Sendrecv is that both communications take place on the same <communicator>.
A common use of this call is to trade a particular variable between two threads. However, the thread you send to does not have to be the thread you receive from. Thus, if you have to shuffle a particular variable or array around among all of the threads, you can implement that with a single MPI_Sendrecv call in which the destination and source differ, as sketched below.
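For instance, here is a sketch of a ring shift in which every thread passes a value to its right-hand neighbor and receives one from its left-hand neighbor (my_rank, num_procs, and my_value are assumed to be set already; the tag 7 is arbitrary):

    double outgoing = my_value, incoming;
    int dest   = (my_rank + 1) % num_procs;              /* neighbor to the "right" */
    int source = (my_rank - 1 + num_procs) % num_procs;  /* neighbor to the "left"  */

    /* send to dest and receive from source in one call;
       MPI_STATUS_IGNORE because we do not need the status here */
    MPI_Sendrecv(&outgoing, 1, MPI_DOUBLE, dest,   7,
                 &incoming, 1, MPI_DOUBLE, source, 7,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);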
One-to-All and All-to-One
These types of communication involve a single thread sending a message to every other thread or every thread sending a single message to a single thread. These communications are more expensive than point-to-point communications. They can be naively implemented using point-to-point communications by just sending point-to-point messages from one thread to everybody else.
However, MPI implementations can take advantage of prior knowledge of a cluster's interconnect topology, and good implementations can reduce the communication overhead of these so-called collective communications by picking the optimal path of data transfer. For this reason, I would always suggest using these collective communications where possible rather than trying to implement a version of them yourself.
All of the following functions must be called by every thread in the communicator that you're using in the MPI function calls. Failure to do so will result in your program stalling indefinitely. If you find yourself wishing that you could call any of the functions below with only a subset of your particular communicator, that is a prime indication that it's time to create a new communicator for that purpose.
Broadcast data to everyone: MPI_Bcast
If one thread has a particular variable that you want every other thread in a given communicator to have, use MPI_Bcast
C: ierror = MPI_Bcast(&message, int count, MPI_Datatype, int root, <communicator>)
Fortran: call mpi_bcast(message, int count, MPI_Datatype, int root, <communicator>, ierror)
This will take the data stored in message from the thread with rank root in the given communicator and store it in the variable message on every other thread in that communicator. Every thread in communicator must make this call, since the program will sit and wait for every thread to "show up" at the broadcast.
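A sketch of the typical pattern, in which the master reads a value and broadcasts it (read_input() is a hypothetical routine, and my_rank is assumed to hold this thread's rank in MPI_COMM_WORLD):

    double box_size;

    if (my_rank == 0)
        box_size = read_input();   /* only the master does the I/O */

    /* every thread must make this call; afterwards all of them hold box_size */
    MPI_Bcast(&box_size, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);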
Gather data to a single thread: MPI_Gather
C: ierror = MPI_Gather(sendbuf, int send_count, MPI_Datatype, &recvbuf, int recv_count, MPI_Datatype, int root, <communicator>)
Fortran: call MPI_Gather(sendbuf, int send_count, MPI_Datatype, recvbuf, int recv_count, MPI_Datatype, int root, <communicator>, ierror)
This function will take the data in sendbuf from each thread in communicator and store it in the array recvbuf on the thread with rank root in communicator. Note that this includes the value in sendbuf on the thread with rank root as well!
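A sketch (my_rank is assumed to be set already; the fixed-size receive buffer just keeps the example short and assumes no more than 64 threads):

    double my_value = (double) my_rank;   /* one number contributed by each thread */
    double all_values[64];                /* only read on the root after the call */

    /* recv_count (here 1) is the count received from EACH thread, not the total;
       on rank 0, all_values ends up holding one entry per thread, ordered by rank */
    MPI_Gather(&my_value, 1, MPI_DOUBLE, all_values, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);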
Send (Scatter) data to all threads: MPI_Scatter
C: ierror = MPI_Scatter(*sendbuf, int send_count, MPI_Datatype, *recvbuf, int recv_count, MPI_Datatype, int root, Communicator)
F90: call MPI_Scatter(sendbuf, int send_count, MPI_Datatype, recvbuf, int recv_count, MPI_Datatype, int root, Communicator, ierror)
This function takes the data in sendbuf on the thread with rank root in the specified Communicator and sends send_count elements to each thread. Therefore, the length of the sendbuf array should be the size of the Communicator times the value of send_count.
The first send_count elements of sendbuf go to the thread with rank 0, the next send_count elements go to the thread with rank 1, and so on (the root thread receives its own chunk just like everyone else), as sketched below.
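A sketch with send_count = 2 (fill_array() is a hypothetical routine; the buffer sizes assume no more than 64 threads, and my_rank and num_procs are assumed to be set already):

    double full_array[128];   /* only meaningful on the root: 2 elements per thread */
    double chunk[2];          /* each thread receives its own 2 elements here */

    if (my_rank == 0)
        fill_array(full_array, 2 * num_procs);   /* root prepares the whole array */

    /* rank 0 keeps elements 0-1, rank 1 gets elements 2-3, and so on */
    MPI_Scatter(full_array, 2, MPI_DOUBLE, chunk, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);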
Reduce (sum/difference/multiply/divide) data from all threads: MPI_Reduce
C: ierror = MPI_Reduce(*operand, *result, int count, MPI_Datatype, MPI_Operator, int root, Communicator)
F90: call MPI_Reduce(operand, result, int count, MPI_Datatype, MPI_Operator, int root, Communicator, ierror)
This routine reduces the data from all threads to a single set of data on root using the operator MPI_Operator. Allowable operators include
- MPI_SUM
- MPI_PROD
- MPI_MAX
- MPI_MIN
- A couple others that have to do with bitwise and logical operators.
In other words, the reduction takes the count elements of the operand array on every thread and combines them element-wise into a single result array of length count. With MPI_SUM, the first element of operand from every thread is added together and stored in the first element of result, the second elements are summed into the second element of result, and so on. With MPI_PROD, the same process is followed, but using multiplication instead of addition to reduce the data.
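A sketch of the most common case, summing one number per thread onto the master (compute_local_energy() is a hypothetical routine):

    double partial_energy = compute_local_energy();   /* this thread's contribution */
    double total_energy = 0.0;                        /* only meaningful on rank 0 afterwards */

    /* element-wise MPI_SUM across every thread in MPI_COMM_WORLD, delivered to rank 0 */
    MPI_Reduce(&partial_energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);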
All-to-All
These are the most expensive types of communications and should be used as infrequently as possible. Having said that, these are also the communications where the MPI implementation can help the most, so you definitely don't want to 'roll your own' here.
Most of these functions/routines are analogous to the One-to-All/All-to-One functions/routines, except that the results are sent to every process rather than just the one with rank equal to root.
MPI_Allgather
C: ierror = MPI_Allgather(*sendbuf, int send_count, MPI_Datatype, *recvbuf, int recv_count, MPI_Datatype, Communicator)
F90: call MPI_Allgather(sendbuf, int send_count, MPI_Datatype, recvbuf, int recv_count, MPI_Datatype, Communicator, ierror)
This does the same as MPI_Gather, except recvbuf is filled on every process (hence, there is no root argument).
MPI_Allreduce
C: ierror = MPI_Allreduce(*operand, *result, int count, MPI_Datatype, MPI_Operator, Communicator)
F90: MPI_Allreduce(operand, result, int count, MPI_Datatype, MPI_Operator, Communicator, ierror)
This does the same as MPI_Reduce, except result is filled on every process (hence there is no root argument).
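A sketch (compute_partial_sum() is a hypothetical routine):

    double partial_sum = compute_partial_sum();   /* this thread's piece of the total */
    double global_sum;

    /* same reduction as MPI_Reduce, but every thread ends up holding global_sum */
    MPI_Allreduce(&partial_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);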
MPI_Alltoall
C: ierror = MPI_Alltoall(*sendbuf, int send_count, MPI_Datatype sendtype, *recvbuf, int recv_count, MPI_Datatype recvtype, Communicator)
F90: call MPI_Alltoall(sendbuf, int send_count, MPI_Datatype sendtype, recvbuf, int recv_count, MPI_Datatype recvtype, Communicator, ierror)
This is actually the closest analogy we have to an MPI_Allscatter. This is the most expensive of all operations, and should be used sparingly, if at all.
Special case — MPI_Barrier
C: ierror = MPI_Barrier(Communicator)
F90: call MPI_Barrier(Communicator, ierror)
This is purely a synchronization call. All processes in Communicator will wait at this barrier until every process in Communicator has called it; then everybody is released and the program continues. Note that the collective functions listed above are blocking as well and behave much as if they carried an implied MPI_Barrier, since they all wait for each process in Communicator to call the function.
Nonblocking Communications
Miscellaneous
MPI_Status
MPI_Status is a set of integers that stores information about an MPI message (for example, the source and tag of a received message). In C, MPI_Status is its own data type (a struct) containing several integers, so it should be declared like so:
MPI_Status rec_stat;
In Fortran there is no such derived type. Instead, the status must be declared as an integer array of size MPI_STATUS_SIZE, like so:
integer, dimension(MPI_STATUS_SIZE) :: rec_stat
or some other equivalent way of declaring arrays in Fortran.
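In C, the useful fields of the struct are status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR; in Fortran, the corresponding entries are status(MPI_SOURCE), status(MPI_TAG), and status(MPI_ERROR). A sketch of why you might care, using a receive that accepts a message from anyone (my_rank and the surrounding setup are assumed):

    MPI_Status status;
    double value;

    /* accept one double from any sender with any tag... */
    MPI_Recv(&value, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* ...then use the status to find out who actually sent it, and with which tag */
    int sender = status.MPI_SOURCE;
    int tag    = status.MPI_TAG;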