Special problems that can occur when using parallel execution
 

  • Parallel execution can result in errors when:

      • The operations performed by one thread depend on the outcome of other threads

  • When the execution of a statement by thread 2 depends on the outcome of the execution of a statement by thread 1, then:

        • The threads must synchronize (= wait) with each other

  • There are different kinds of dependencies; in this intro course, I will discuss these kinds of interactions:

      1. Multiple threads update the same variable
      2. Multiple threads must cooperate (= synchronize) to compute a sum
      3. Multiple threads must share some information (= variable) in sorting

Multiple threads update the same variable

Recall the Hello World CUDA program:

#include <stdio.h>   // C programming header file
#include <unistd.h>  // C programming header file

/* ------------------------------------
   Your first kernel (= GPU function)
   ------------------------------------ */
__global__ void hello( )
{
   printf("Hello World !\n");  // You see this msg 4 times...
}

int main()
{
   hello<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize(); 

   printf("I am the CPU: Hello World ! \n");
} 

 

Multiple threads update the same variable

For our demo, I will rename the kernel function to: addOne( )



__global__ void addOne( )
{
   printf("Hello World !\n"); 

}

int main()
{


   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("I am the CPU: Hello World ! \n");
} 

The addOne( ) function will add 1 to the same variable x...

Multiple threads update the same variable

We create a shared variable x between the CPU and GPU:

__managed__ int x;

__global__ void addOne( )
{
   printf("Hello World !\n"); 

}

int main()
{


   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("I am the CPU: Hello World ! \n");
}
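
The __managed__ qualifier places x in unified (managed) memory, so the same variable is accessible from both the CPU code in main() and the GPU kernel. For reference, here is a minimal sketch of the same idea using cudaMallocManaged() instead of a __managed__ global variable (my own variant, not part of the course demo):

#include <stdio.h>

__global__ void addOne( int *x )
{
   printf("Hello World !\n");
   // later versions will update *x here
}

int main()
{
   int *x;

   // Allocate one int in unified (managed) memory: the resulting
   // pointer can be used by both the CPU and the GPU
   cudaMallocManaged(&x, sizeof(int));
   *x = 0;

   addOne<<< 1, 4 >>>( x ); // launch kernel, passing the pointer
   cudaDeviceSynchronize();

   printf("I am the CPU: x = %d\n", *x);

   cudaFree(x);
}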

 

Multiple threads update the same variable

Initialize the shared variable x to 0 (zero):

__managed__ int x;

__global__ void addOne( )
{
   printf("Hello World !\n"); 

}

int main()
{
   x = 0;

   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("I am the CPU: Hello World ! \n");
} 

 

Multiple threads update the same variable

The GPU function addOne( ) will add 1 to the shared variable x (and print a message on what it did):

__managed__ int x;

__global__ void addOne( )
{
   printf("Thread %d: Adding 1 to x\n", threadIdx.x);
   x = x + 1;
}

int main()
{
   x = 0;
   printf("Main before addOne(): x = %d\n", x);

   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("Main after addOne(): x = %d\n", x);
} 

DEMO: /home/cs355001/demo/CUDA/5-add-common-var/add-common1.cu

Multiple threads update the same variable

Output of the demo program:

cs355@ghost01 (841)> add-common1
Main before addOne(): x = 0
Thread 0: Adding 1 to x
Thread 1: Adding 1 to x
Thread 2: Adding 1 to x
Thread 3: Adding 1 to x
Main after  addOne(): x = 1 
 

In other words: the 4 threads have each performed x = x + 1 once

We expected x to be increased by 4 !!!

However, x is increased by 1 only !!!

Multiple threads update the same variable

Explanation of this inconsistent behavior:

  • One high-level program statement is translated into many machine instructions

    Example: (I'm using a simplified assembler syntax for brevity)

       C program statement      Machine instructions
       =====================    ====================== 
       x = x + 1           -->  ldr R0, x       (get x)
                                add R0, R0, #1
                                str R0, x       (update x)
      

  • The individual machine instructions are executed simultaneously by multiple ALUs (= CUDA cores) !!!

Multiple threads update the same variable

Explanation of this inconsistent behavior --- recall that:

Threads execute on CUDA cores, and each core has its own registers !!!

Multiple threads update the same variable

Explanation of this inconsistent behavior

Let's execute the GPU program using 2 (simultaneously running) threads:

 2 threads execute:  x = x + 1

 Initially: x = 0

 Thread 1                        Thread 2
 ============                    ===========
 ldr R0,x                        ldr R0,x                      
 add R0,R0,#1                    add R0,R0,#1                 
 str R0,x                        str R0,x                     

 

 

Multiple threads update the same variable

Explanation of this inconsistent behavior

Let's execute the GPU program using 2 (simultaneously running) threads:

 2 threads execute:  x = x + 1

 Initially: x = 0

 Thread 1                        Thread 2
 ============                    ===========
 ldr R0,x     (Thr 1's R0 = 0)   ldr R0,x     (Thr 2's R0 = 0) 
 add R0,R0,#1                    add R0,R0,#1                 
 str R0,x                        str R0,x                     

 

Notice that both threads will set: R0 = 0 !!!

Multiple threads update the same variable

Explanation of this inconsistent behavior

Let's execute the GPU program using 2 (simultaneously running) threads:

 2 threads execute:  x = x + 1

 Initially: x = 0

 Thread 1                        Thread 2
 ============                    ===========
 ldr R0,x     (Thr 1's R0 = 0)   ldr R0,x     (Thr 2's R0 = 0) 
 add R0,R0,#1 (Thr 1's R0 = 1)   add R0,R0,#1 (Thr 2's R0 = 1)
 str R0,x     (Thr 1 set x= 1)   str R0,x     (Thr 2 set x= 1)

 

Notice that both threads will increase R0 to 1 !!!

Multiple threads update the same variable

Explanation of this inconsistent behavior

Let's execute the GPU program using 2 (simultaneously running) threads:

 2 threads execute:  x = x + 1

 Initially: x = 0

 Thread 1                        Thread 2
 ============                    ===========
 ldr R0,x     (Thr 1's R0 = 0)   ldr R0,x     (Thr 2's R0 = 0) 
 add R0,R0,#1 (Thr 1's R0 = 1)   add R0,R0,#1 (Thr 2's R0 = 1)
 str R0,x     (Thr 1 set x= 1)   str R0,x     (Thr 2 set x= 1)

 

Notice that both threads will update x to 1 !!!

Therefore: unsynchronized simultaneous updates to the same variable result in "missing" updates
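
The effect is easy to reproduce at a larger scale. Here is a minimal sketch (my own variation, not one of the course demos) in which 1024 threads all perform the non-atomic x = x + 1; because of the missing updates, the printed value of x is usually far smaller than 1024:

#include <stdio.h>

__managed__ int x;   // shared between the CPU and the GPU

__global__ void addOne( )
{
   x = x + 1;        // NOT atomic: the load/add/store of different
                     // threads can interleave and overwrite each other
}

int main()
{
   x = 0;

   addOne<<< 1, 1024 >>>( );  // 1024 threads update the same x
   cudaDeviceSynchronize();

   printf("x = %d (we expected 1024)\n", x);
}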

Multiple threads update the same variable: atomic functions
 

  • Atomic function:

      • Atomic function = a function that is executed atomically (= in its entirety), without interference from instructions executed by other threads.

  • CUDA manual page on atomic functions: see the "Atomic Functions" section of the CUDA C++ Programming Guide

  • Examples of atomic functions:

       int    atomicAdd(int    *x, int    val);    
       float  atomicAdd(float  *x, float  val);
       double atomicAdd(double *x, double val);
      

    These atomic functions atomically add the value val to the variable pointed to by x (and return the old value of that variable)
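
Note that atomicAdd() also returns the value that the variable held just before the addition; this is a common way to hand out unique "slots" to threads. A minimal sketch (my own example, with names of my choosing):

#include <stdio.h>

__managed__ int nextSlot;    // shared counter
__managed__ int slot[4];     // slot[i] = slot number received by thread i

__global__ void claimSlot( )
{
   // Each thread atomically reserves one slot; the return value is the
   // value of nextSlot just *before* this thread's add was applied
   int mySlot = atomicAdd(&nextSlot, 1);
   slot[threadIdx.x] = mySlot;
}

int main()
{
   nextSlot = 0;

   claimSlot<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   for (int i = 0; i < 4; i++)
      printf("Thread %d received slot %d\n", i, slot[i]);
}

Because every atomicAdd() is executed in its entirety, the 4 threads always receive 4 different slot numbers (0, 1, 2 and 3), although not necessarily in thread order.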

Multiple threads update the same variable

Consider the previous CUDA program where 4 threads simultaneously add 1 to the var x:

__managed__ int x;

__global__ void addOne( )
{
   printf("Thread %d: Adding 1 to x\n", threadIdx.x);
   x = x + 1;
}

int main()
{
   x = 0;
   printf("Main before addOne(): x = %d\n", x);

   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("Main after addOne(): x = %d\n", x);
} 

 

Multiple threads update the same variable

We change it to use atomicAdd(&x, 1) to add 1 atomically:

__managed__ int x;

__global__ void addOne( )
{
   printf("Thread %d: Adding 1 to x\n", threadIdx.x);
   atomicAdd(&x, 1);
}

int main()
{
   x = 0;
   printf("Main before addOne(): x = %d\n", x);

   addOne<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("Main after addOne(): x = %d\n", x);
}

DEMO: /home/cs355001/demo/CUDA/5-add-common-var/add-common2.cu

Multiple threads update the same variable

Output of the demo program:

cs355@ghost01 (841)> add-common2
Main before addOne(): x = 0
Thread 0: Adding 1 to x
Thread 1: Adding 1 to x
Thread 2: Adding 1 to x
Thread 3: Adding 1 to x
Main after  addOne(): x = 4 
 

In other words: the 4 threads have each added 1 to x once (using atomicAdd())

We expected x to be increased by 4 !!!

It is now correct !!!
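
As a small follow-up exercise (my own variation, not one of the course demos), each thread can atomically add its own thread index instead of 1; with 4 threads the final value of x should always be 0 + 1 + 2 + 3 = 6:

#include <stdio.h>

__managed__ int x;

__global__ void addMyId( )
{
   // Every thread atomically contributes its own thread index
   atomicAdd(&x, (int) threadIdx.x);
}

int main()
{
   x = 0;

   addMyId<<< 1, 4 >>>( ); // launch kernel
   cudaDeviceSynchronize();

   printf("x = %d (expected 0+1+2+3 = 6)\n", x);
}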