Parallelization is a key feature of modern numerical libraries, since many cores are now available even on laptops. odeint currently supports parallelization with OpenMP and MPI, as described in the following sections. It should be made clear from the beginning, however, that the difficulty of efficiently distributing ODE integration across many cores/machines lies in parallelizing the system function, which remains the user's responsibility. Simply using a parallel odeint backend without parallelizing the system function will bring you almost no performance gains.
odeint's OpenMP support is implemented as an external backend, which needs to be included manually. Depending on the compiler, some additional flags may be needed, e.g. -fopenmp for GCC.
#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>
In the simplest OpenMP parallelization approach, we use a standard vector as the state type:
typedef std::vector< double > state_type;
We initialize the state with some random data:
size_t N = 131101;
state_type x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
Now we have to configure the stepper to use the OpenMP backend. This is done by explicitly providing the openmp_range_algebra as a template parameter to the stepper. This algebra requires the state type to be a model of Random Access Range and will access it from multiple threads.
typedef runge_kutta4< state_type , double , state_type , double , openmp_range_algebra > stepper_type;
In addition to providing the stepper with OpenMP parallelization, we also need a parallelized system function to exploit the available cores. Here this is shown for a simple one-dimensional chain of phase oscillators with nearest-neighbor coupling:
struct phase_chain
{
    phase_chain( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t i = 1 ; i < N - 1 ; ++i)
        {
            dxdt[i] = coupling_func( x[i+1] - x[i] ) +
                      coupling_func( x[i-1] - x[i] );
        }
        dxdt[0  ] = coupling_func( x[1  ] - x[0  ] );
        dxdt[N-1] = coupling_func( x[N-2] - x[N-1] );
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};
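Written out, the system function above computes the right-hand side of

\[ \dot{\varphi}_i = f(\varphi_{i+1} - \varphi_i) + f(\varphi_{i-1} - \varphi_i) , \qquad f(x) = \sin(x) - \gamma \, \bigl( 1 - \cos(x) \bigr) , \]

where φ_i corresponds to x[i], γ to m_gamma, and f to coupling_func; the two ends of the chain receive only a single coupling term. The #pragma omp parallel for distributes the bulk of the loop over the available threads.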
Note: In the OpenMP backends the system function will always be called sequentially from the thread used to start the integration.
Finally, we perform the integration by using one of the integrate functions from odeint. As you can see, the parallelization is completely hidden in the stepper and the system function. OpenMP takes care of distributing the work among the threads and joining them automatically.
integrate_n_steps( stepper_type() , phase_chain( 1.2 ) , x , 0.0 , 0.01 , 100 );
After integrating, the data can be accessed immediately and processed further. Note that you can specify the OpenMP scheduling by calling omp_set_schedule at the beginning of your program:
int chunk_size = N/omp_get_max_threads();
omp_set_schedule( omp_sched_static , chunk_size );
See openmp/phase_chain.cpp for the complete example.
For advanced cases odeint offers another approach to using OpenMP that allows for more precise control of the parallelization. For example, for odd-sized data where OpenMP's thread boundaries don't match cache lines and hurt performance, it might be advisable to copy the data from the continuous vector<T> into separate, individually aligned vectors. For this, odeint provides the openmp_state<T> type, essentially an alias for vector<vector<T>>.
Here, the initialization is done with a vector<double>, but then we use odeint's split function to fill an openmp_state. The splitting is done such that the sizes of the individual regions differ by at most 1, to make the computation as uniform as possible.
const size_t N = 131101;
vector<double> x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );

const size_t blocks = omp_get_max_threads();
state_type x_split( blocks );
split( x , x_split );
Of course, the system function has to be changed to deal with the openmp_state
. Note that each sub-region
of the state is computed in a single task, but at the borders read access
to the neighbouring regions is required.
struct phase_chain_omp_state
{
    phase_chain_omp_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t n = 0 ; n < N ; ++n)
        {
            const size_t M = x[n].size();
            for(size_t m = 1 ; m < M-1 ; ++m)
            {
                dxdt[n][m] = coupling_func( x[n][m+1] - x[n][m] ) +
                             coupling_func( x[n][m-1] - x[n][m] );
            }
            dxdt[n][0] = coupling_func( x[n][1] - x[n][0] );
            if( n > 0 )
            {
                dxdt[n][0] += coupling_func( x[n-1].back() - x[n].front() );
            }
            dxdt[n][M-1] = coupling_func( x[n][M-2] - x[n][M-1] );
            if( n < N-1 )
            {
                dxdt[n][M-1] += coupling_func( x[n+1].front() - x[n].back() );
            }
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};
Using the openmp_state<T>
state type automatically selects openmp_algebra
which executes odeint's
internal computations on parallel regions. Hence, no manual configuration
of the stepper is necessary. At the end of the integration, we use unsplit
to concatenate the sub-regions
back together into a single vector.
integrate_n_steps( runge_kutta4<state_type>() , phase_chain_omp_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );
Note: You don't actually need to use openmp_state<T> for advanced use cases; openmp_nested_algebra works with any Random Access Container of inner states (see the Split State concept below).
See openmp/phase_chain_omp_state.cpp for the complete example.
To expand the parallel computation across multiple machines we can use MPI. The system function implementation is similar to the OpenMP variant with split data. The main difference is that while OpenMP uses a fork/join model where everything not explicitly parallelized runs only in the main thread, in MPI's model each node enters the main() function independently, diverging based on its rank and synchronizing through message passing and explicit barriers.
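As a rough, minimal Boost.MPI sketch (not taken from the odeint examples), this is what that model looks like: every rank executes the same main(), and any rank-specific work has to be selected explicitly.

#include <iostream>
#include <boost/mpi.hpp>

int main( int argc , char **argv )
{
    boost::mpi::environment env( argc , argv );   // initializes MPI, finalizes on destruction
    boost::mpi::communicator world;               // default communicator (MPI_COMM_WORLD)

    // every rank executes this line; output order is not deterministic
    std::cout << "rank " << world.rank() << " of " << world.size() << std::endl;

    if( world.rank() == 0 )
    {
        // work done on the master node only
    }

    world.barrier();   // explicit synchronization point for all ranks
    return 0;
}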
odeint's MPI support is implemented as an external backend, too. Depending on the MPI implementation, the code might need to be compiled with a wrapper compiler, e.g. mpic++.
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>
Instead of reading another thread's data, we asynchronously send and receive the relevant data from neighbouring nodes, performing some computation in the interim to hide the latency.
struct phase_chain_mpi_state
{
    phase_chain_mpi_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t M = x().size();
        const bool have_left  = x.world.rank() > 0 ,
                   have_right = x.world.rank() < x.world.size()-1;
        double x_left , x_right;
        boost::mpi::request r_left , r_right;
        if( have_left )
        {
            x.world.isend( x.world.rank()-1 , 0 , x().front() );       // send to x_right
            r_left = x.world.irecv( x.world.rank()-1 , 0 , x_left );   // receive from x().back()
        }
        if( have_right )
        {
            x.world.isend( x.world.rank()+1 , 0 , x().back() );        // send to x_left
            r_right = x.world.irecv( x.world.rank()+1 , 0 , x_right ); // receive from x().front()
        }
        for(size_t m = 1 ; m < M-1 ; ++m)
        {
            dxdt()[m] = coupling_func( x()[m+1] - x()[m] ) +
                        coupling_func( x()[m-1] - x()[m] );
        }
        dxdt()[0] = coupling_func( x()[1] - x()[0] );
        if( have_left )
        {
            r_left.wait();
            dxdt()[0] += coupling_func( x_left - x().front() );
        }
        dxdt()[M-1] = coupling_func( x()[M-2] - x()[M-1] );
        if( have_right )
        {
            r_right.wait();
            dxdt()[M-1] += coupling_func( x_right - x().back() );
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};
Analogous to openmp_state<T>
we use mpi_state< InnerState<T> >
,
which automatically selects mpi_nested_algebra
and the appropriate MPI-oblivious inner algebra (since our inner state
is a vector
, the inner
algebra will be range_algebra
as in the OpenMP example).
typedef mpi_state< vector<double> > state_type;
In the main program we construct a communicator, which tells us the size of the cluster and the current node's rank within it. We generate the input data on the master node only, avoiding unnecessary work on the other nodes. Instead of simply copying chunks, split acts as an MPI collective function here and sends/receives the regions from the master to each slave. The input argument is ignored on the slaves, but the master node receives a region in its output and will participate in the computation.
boost::mpi::environment env( argc , argv );
boost::mpi::communicator world;

const size_t N = 131101;
vector<double> x;
if( world.rank() == 0 )
{
    x.resize( N );
    boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
    boost::random::mt19937 engine( 0 );
    generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
}

state_type x_split( world );
split( x , x_split );
Now that x_split
contains
(only) the local chunk for each node, we start the integration.
To print the result on the master node, we send the processed data back
using unsplit
.
integrate_n_steps( runge_kutta4<state_type>() , phase_chain_mpi_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );
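A minimal sketch of that last step (assuming the main program above and an additional #include <iostream>; the output format is arbitrary): after unsplit, only rank 0 holds the full vector x, so only rank 0 prints it.

// print the final state on the master node only;
// on all other ranks x is still empty
if( world.rank() == 0 )
{
    for( size_t i = 0 ; i < x.size() ; ++i )
        std::cout << x[i] << '\n';
}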
See mpi/phase_chain.cpp for the complete example.
The parallel backends build on a few concepts, described below.

The MPI State concept, as used by mpi_nested_algebra.

Notation:
InnerState — the inner state type
State — the MPI-state type
state — object of type State
world — object of type boost::mpi::communicator

Name | Expression | Type | Semantics
---|---|---|---
Construct a state with a communicator | State( world ) | State | Constructs the State.
Construct a state with the default communicator | State() | State | Constructs the State.
Get the current node's inner state | state() | InnerState | Returns a (const) reference.
Get the communicator | state.world | boost::mpi::communicator | See Boost.MPI.

This concept is modelled by mpi_state<InnerState>.
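A short, self-contained sketch of these expressions, using mpi_state< std::vector<double> > as in the example above (the names s and rank are illustrative only):

#include <vector>
#include <boost/mpi.hpp>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>

using namespace boost::numeric::odeint;

int main( int argc , char **argv )
{
    boost::mpi::environment env( argc , argv );
    boost::mpi::communicator world;

    mpi_state< std::vector<double> > s( world );  // construct a state with a communicator
    s().resize( 16 );                             // state() accesses the node-local inner state
    int rank = s.world.rank();                    // state.world is the boost::mpi::communicator
    (void) rank;
    return 0;
}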
The Split State concept, as used by openmp_nested_algebra, is essentially a Random Access Container with ValueType = InnerState.

Notation:
InnerState — the inner state type
State — the split state type
state — object of type State

Name | Expression | Type | Semantics
---|---|---|---
Construct a state for n chunks | State( n ) | State | Constructs the underlying vector.
Get a chunk | state[i] | InnerState | Accesses the underlying vector.
Get the number of chunks | state.size() | size_t | Returns the size of the underlying vector.

This concept is modelled by openmp_state<ValueType> with InnerState = vector<ValueType>.
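A short, self-contained sketch of these expressions for openmp_state<double> (the names s and n_chunks are illustrative only):

#include <vector>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using namespace boost::numeric::odeint;

int main()
{
    openmp_state< double > s( 4 );   // construct a state with 4 chunks
    s[0].resize( 100 );              // each chunk is an InnerState, here a std::vector<double>
    size_t n_chunks = s.size();      // number of chunks
    (void) n_chunks;
    return 0;
}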
The Splitter concept describes the split and unsplit functions used above.

Notation:
Container1 — the continuous-data container type
x — object of type Container1
Container2 — the chunked-data container type
y — object of type Container2

Name | Expression | Type | Semantics
---|---|---|---
Copy chunks of input to output elements | split( x , y ) | void | Calls split_impl< Container1 , Container2 >::split( x , y ).
Join chunks of input elements to output | unsplit( y , x ) | void | Calls unsplit_impl< Container2 , Container1 >::unsplit( y , x ).

Splitters are defined for Container1 being any Boost.Range and Container2 = openmp_state, as well as for Container2 = mpi_state.
To implement splitters for containers incompatible with Boost.Range,
specialize the split_impl
and unsplit_impl
types:
template< class Container1 , class Container2 , class Enabler = void >
struct split_impl
{
    static void split( const Container1 &from , Container2 &to );
};

template< class Container2 , class Container1 , class Enabler = void >
struct unsplit_impl
{
    static void unsplit( const Container2 &from , Container1 &to );
};
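For illustration, a hedged sketch of such a specialization for a hypothetical fixed-size container my_array<T, N> that is not a Boost.Range, splitting it into an openmp_state<T>; the container name and the simple even-split policy are assumptions, not part of odeint.

#include <cstddef>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

// hypothetical fixed-size container that is not compatible with Boost.Range
template< class T , std::size_t N >
struct my_array { T data[N]; };

namespace boost { namespace numeric { namespace odeint {

// distribute the N values of a my_array over the chunks of an openmp_state;
// assumes 'to' was already constructed with the desired number of chunks,
// as in the examples above
template< class T , std::size_t N >
struct split_impl< my_array<T, N> , openmp_state<T> , void >
{
    static void split( const my_array<T, N> &from , openmp_state<T> &to )
    {
        const std::size_t chunks = to.size();
        std::size_t begin = 0;
        for( std::size_t i = 0 ; i < chunks ; ++i )
        {
            // the first N % chunks chunks receive one extra element
            const std::size_t len = N / chunks + ( i < N % chunks ? 1 : 0 );
            to[i].assign( from.data + begin , from.data + begin + len );
            begin += len;
        }
    }
};

} } }

A corresponding unsplit_impl< openmp_state<T> , my_array<T, N> > specialization would copy the chunks back into the array in the same order.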