
openMP parallel computation

openMP is a reasonably easy way to speed up your program on a computer
with multiple cores, and that is almost every computer built
today. Often you can have twice as many hardware threads running
as you have physical cores.

OMP, short for openMP, is not technically a language.
Pragmas are placed in a conventional language to cause the
compiler to generate calls to run-time library routines.
Typically, OMP causes tasks to be generated and executed.
Often the underlying tasks are provided by pthreads.

Fortran and C family compilers usually support OMP.
Some other languages may also provide support.

Specifically with  gcc  there must be a   -fopenmp  option
in order for the compiler to process the pragmas.
Each pragma has the syntax   #pragma omp ...
With no braces { } the pragma applies to the next statement.

#pragma omp parallel
{
  // many copies will be executed
}
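A compiler that is not processing openMP simply ignores the pragmas,
so the same source can usually be built with or without  -fopenmp .
Here is a minimal sketch of checking at compile time, using the
standard  _OPENMP  macro (the file name sketch1.c is just for
illustration, it is not one of the linked files):

   // sketch1.c  compile either  gcc -fopenmp sketch1.c  or  gcc sketch1.c
   #include <stdio.h>
   #ifdef _OPENMP
   #include <omp.h>
   #endif

   int main(int argc, char * argv[])
   {
   #ifdef _OPENMP
     printf("built with openMP, num_proc=%d \n", omp_get_num_procs());
   #else
     printf("built without openMP \n");
   #endif
     return 0;
   }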

test1.c is a first OMP program to run, to find out if OMP is
installed and how many processors and threads are available:

test1.c first test
test1_c.out output from one computer

   // test1.c  check if openMP available  compile gcc -fopenmp ...
   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>
   #include <omp.h>

   int main(int argc, char * argv[]) 
   {
     int myid;
     int def_num_threads, num_threads, num_proc;
     double ttime;

     num_proc = omp_get_num_procs();
     printf("test1.c check for openMP, num_proc=%d \n", num_proc);
     fflush(stdout);
     ttime = omp_get_wtime(); // wall time now, in seconds

     // myid is private so each thread gets its own copy, avoiding a race
     #pragma omp parallel private(myid) // must be at least one "parallel", else no threads
     {
       #pragma omp master // only the master thread will run { to }
       {
         def_num_threads = omp_get_num_threads(); // must be in parallel
         omp_set_num_threads(6);
         num_threads = omp_get_num_threads();
         printf("def_num_threads=%d, try 6, omp_set_num_threads =%d \n",
	        def_num_threads, num_threads);
         fflush(stdout); // needed to get clean output
       }
       #pragma omp barrier // wait until all threads reach this point
       myid = omp_get_thread_num(); // master == 0
       printf("test1.c in pragma omp parallel myid=%d \n", myid);
       fflush(stdout);
    
     }
     // total wall time is difference
     printf("test1.c ends, %f seconds \n", omp_get_wtime()-ttime);
     fflush(stdout);
     return 0;
   } // end test1.c

On one of my computers the output is:

   test1.c check for openMP, num_proc=8 
   def_num_threads=8, try 6, omp_set_num_threads =8 
   test1.c in pragma omp parallel myid=0 
   test1.c in pragma omp parallel myid=1 
   test1.c in pragma omp parallel myid=4 
   test1.c in pragma omp parallel myid=7 
   test1.c in pragma omp parallel myid=6 
   test1.c in pragma omp parallel myid=5 
   test1.c in pragma omp parallel myid=2 
   test1.c in pragma omp parallel myid=3 
   test1.c ends, 0.039330 seconds 

As you can see, my system did not honor the
   omp_set_num_threads(6);  it kept the default.
Calling  omp_set_num_threads()  inside a parallel region does not
change the team that is already running; it only affects parallel
regions that start later.
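To actually get 6 threads, the call has to be made before the parallel
region starts, or a  num_threads  clause can be put on the pragma itself.
A minimal sketch, subject to whatever limits the system imposes
(sketch2.c is just an illustration, not one of the linked files):

   // sketch2.c  set the thread count, compile gcc -fopenmp ...
   #include <stdio.h>
   #include <omp.h>

   int main(int argc, char * argv[])
   {
     omp_set_num_threads(6);  // before the region, affects the next parallel
     #pragma omp parallel
     {
       #pragma omp master
       printf("first region, %d threads \n", omp_get_num_threads());
     }

     #pragma omp parallel num_threads(4) // clause applies to this region only
     {
       #pragma omp master
       printf("second region, %d threads \n", omp_get_num_threads());
     }
     return 0;
   }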

The  fflush(stdout);  is used after every printf, otherwise
output lines from the various threads may get interleaved.

There is an on-line tutorial for openMP by an Intel developer.
This tutorial takes a few hours and has programming assignments.
https://www.youtube.com/watch?v=nE-xN4Bf8XI

One of the problems is to take a sequential program with a
loop and parallelize the program. The program does numerical
integration to compute the value of Pi, 3.14159...
The next sequence of files (click each one to see it)
is a more extensive version of the tutorial problem.

First the non-parallel program; I always check my code before parallelizing:
pi.c a non parallel version for test and timing
pi_c.out output of non parallel version
pi.make Makefile for single computer 
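
As a rough idea of the technique in pi.c, here is a minimal midpoint-rule
sketch (not the linked file; the 1000000 step count is an arbitrary choice
here).  It integrates 4/(1+x*x) from 0 to 1, which equals Pi:

   // serial sketch of the numerical integration, no openMP yet
   #include <stdio.h>

   int main(int argc, char * argv[])
   {
     long i, num_steps = 1000000;
     double x, sum = 0.0;
     double step = 1.0/(double)num_steps;

     for(i=0; i<num_steps; i++)
     {
       x = ((double)i+0.5)*step;     // midpoint of each sub interval
       sum = sum + 4.0/(1.0+x*x);
     }
     printf("computed Pi = %18.15f \n", step*sum);
     return 0;
   }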

Source code with OMP pragmas, output, and Makefile:
pi_omp.c source code with openMP pragmas
pi_omp.out output, many more cases, parallel
pi_omp.make OMP Makefile for single computer 

The same source code run on a cluster; it uses slurm and is launched by mpi:
pi_omp.slurm driver code only for cluster
pi_omp.makeslurm Makefile for cluster 

Some key points (a minimal sketch combining them is just below):
  #pragma omp parallel  // if x is declared above the region, could say  private(x)
  {
    double x;  // a different location for each thread

  #pragma omp for reduction(+:pi)  says each thread will be adding to its own
                                   copy of pi, and the copies are combined at
  the end, avoiding a race condition on the update.  It is much faster
  than  #pragma omp atomic  yet does the same thing.
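
Putting the two points together, here is a minimal sketch of the serial
integration loop with the pragmas added (only the pattern; the linked
pi_omp.c runs many more cases and does timing):

   // parallel sketch of the same integration, compile gcc -fopenmp ...
   #include <stdio.h>
   #include <omp.h>

   int main(int argc, char * argv[])
   {
     long i, num_steps = 1000000;
     double pi = 0.0;
     double step = 1.0/(double)num_steps;

     #pragma omp parallel
     {
       double x;                       // a different location for each thread
       #pragma omp for reduction(+:pi) // partial sums combined at the end
       for(i=0; i<num_steps; i++)      // loop index is private automatically
       {
         x = ((double)i+0.5)*step;
         pi = pi + 4.0/(1.0+x*x);
       }
     }
     printf("computed Pi = %18.15f \n", step*pi);
     return 0;
   }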

Now, what if you have nested loops that are smaller than the
number of available threads (tasks)?

test2.c parallelize nested loops
test2_c.out output showing which thread is running

Here the "static" schedule, rather than the "dynamic" schedule, was used
because every iteration takes about the same time.
Notice in the output that the threads, identified by myid, do not get an
equal share of the work.
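
test2.c is linked above; a common pattern when the individual loops are
small is the  collapse  clause, which merges the nested iteration spaces
so there is more work to distribute.  A minimal sketch, not the linked
file, keeping schedule(static) and using arbitrary loop bounds of 3 and 4:

   // sketch: collapse(2) turns the 3 by 4 nest into 12 iterations
   #include <stdio.h>
   #include <omp.h>

   int main(int argc, char * argv[])
   {
     int i, j;

     #pragma omp parallel for collapse(2) schedule(static)
     for(i=0; i<3; i++)    // only 3 outer iterations, fewer than most thread counts
     {
       for(j=0; j<4; j++)  // collapsed, 12 iterations shared among the threads
       {
         printf("i=%d j=%d myid=%d \n", i, j, omp_get_thread_num());
         fflush(stdout);
       }
     }
     return 0;
   }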
