Dividing the work of a for-loop over a number of threads is a simple, mechanical process: each thread executes the loop body for a different subset of the index values.
In OpenMP, this division of labor (splitting the work of a for-loop among threads) is done through a special Parallel LOOP construct.
#pragma omp parallel
{
   ....
   #pragma omp for [parameters]
   for-statement          // Parallel Loop
   ....
}
Each iteration of the for-loop is executed exactly once, by one of the threads (the iterations are divided among the threads).
The loop variable used in the Parallel LOOP construct is PRIVATE by default (other variables are still SHARED by default).
#include <iostream>
#include <cmath>

double f(double a)
{
    return( 2.0 / sqrt(1.0 - a*a) );   // the integral of f over [0,1) is pi
}

int main(int argc, char *argv[])
{
    int N;
    double sum;      // Shared variable, updated by all threads!
    double w;

    N = ...;         // accuracy of the approximation
    w = 1.0/N;
    sum = 0.0;

    #pragma omp parallel
    {
        int i;
        double mypi, x;      // private to each thread

        mypi = 0.0;

        #pragma omp for
        for (i = 0; i < N; i = i + 1)
        {
            x = w*(i + 0.5);        // Save us the trouble of dividing
            mypi = mypi + w*f(x);   // the work up ourselves...
        }

        #pragma omp critical
        {
            sum = sum + mypi;       // combine partial sums one thread at a time
        }
    }
    std::cout << sum << "\n";
}
The C/C++ compiler will insert instructions that distribute the iterations of the for-loop over the threads - it is no longer your problem to "skip" index values to accomplish load distribution!
export OMP_NUM_THREADS=8
./a.out 50000000

Change OMP_NUM_THREADS and see the difference in performance.
setenv STACKSIZE nBytes    (csh syntax; increases the per-thread stack size. The standard OpenMP environment variable for this is OMP_STACKSIZE.)