Recall the Hello World CUDA program:
#include <stdio.h>     // C programming header file
#include <unistd.h>    // C programming header file

/* ------------------------------------
   Your first kernel (= GPU function)
   ------------------------------------ */
__global__ void hello( )
{
    printf("Hello World !\n");      // You see this msg 4 times...
}

int main()
{
    hello<<< 1, 4 >>>( );           // launch kernel
    cudaDeviceSynchronize();

    printf("I am the CPU: Hello World ! \n");
}
For our demo, I will rename the kernel function to: addOne( )
__global__ void addOne( )
{
    printf("Hello World !\n");
}

int main()
{
    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("I am the CPU: Hello World ! \n");
}
Every thread running the addOne( ) function will add 1 to the same variable x...
We create a shared variable x between the CPU and GPU:
__managed__ int x;

__global__ void addOne( )
{
    printf("Hello World !\n");
}

int main()
{
    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("I am the CPU: Hello World ! \n");
}
Initialize the shared variable x to 0 (zero):
__managed__ int x;

__global__ void addOne( )
{
    printf("Hello World !\n");
}

int main()
{
    x = 0;

    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("I am the CPU: Hello World ! \n");
}
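Aside: __managed__ works here because x is declared at file scope (global scope). If you ever need the shared variable to be allocated at run time instead, unified memory can also be obtained with cudaMallocManaged( ). A minimal sketch of that alternative (the kernel name showX and the pointer x are my own, not part of the demo program):

#include <stdio.h>

__global__ void showX( int *x )                 // kernel receives a pointer to unified memory
{
    printf("Thread %d sees x = %d\n", threadIdx.x, *x);
}

int main()
{
    int *x;
    cudaMallocManaged( &x, sizeof(int) );       // allocate memory visible to both CPU and GPU
    *x = 0;                                     // CPU initializes the shared variable

    showX<<< 1, 4 >>>( x );                     // launch kernel, passing the pointer
    cudaDeviceSynchronize();

    printf("I am the CPU: x = %d\n", *x);
    cudaFree( x );                              // release the unified memory
}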
Make the GPU function addOne( ) add 1 to x (and print a message on what it did):
__managed__ int x;

__global__ void addOne( )
{
    printf("Thread %d: Adding 1 to x\n", threadIdx.x);
    x = x + 1;
}

int main()
{
    x = 0;
    printf("Main before addOne(): x = %d\n", x);

    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("Main after addOne(): x = %d\n", x);
}
DEMO: /home/cs355001/demo/CUDA/5-add-common-var/add-common1.cu
Output of the demo program:
cs355@ghost01 (841)> add-common1
Main before addOne(): x = 0
Thread 0: Adding 1 to x
Thread 1: Adding 1 to x
Thread 2: Adding 1 to x
Thread 3: Adding 1 to x
Main after addOne(): x = 1
In other words: the 4 threads have each performed x = x + 1 once
We expected x to be increased by 4 !!!
However, x was increased by only 1 !!!
Explanation of this inconsistent behavior --- recall that:
Threads execute on CUDA cores and each core has its own registers !!!
Let's trace the execution of the GPU program using 2 (simultaneously running) threads:
2 threads execute:  x = x + 1          Initially: x = 0

      Thread 1                 Thread 2
   ============             ===========
   ldr  R0,x                ldr  R0,x
   add  R0,R0,#1            add  R0,R0,#1
   str  R0,x                str  R0,x
Tracing the execution instruction by instruction (remember: each thread has its own copy of register R0):

      Thread 1                                 Thread 2
   ============                             ===========
   ldr  R0,x      (Thr 1's R0 = 0)          ldr  R0,x      (Thr 2's R0 = 0)
   add  R0,R0,#1  (Thr 1's R0 = 1)          add  R0,R0,#1  (Thr 2's R0 = 1)
   str  R0,x      (Thr 1 set x = 1)         str  R0,x      (Thr 2 set x = 1)
Notice that both threads will set R0 = 0 !!!
Notice that both threads will increase their R0 to 1 !!!
Notice that both threads will update x to 1 !!!
Therefore: unsynchronized simultaneous updates to the same variable result in "missing" (lost) updates
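The more threads try to update x at the same time, the more updates can get lost. A minimal sketch (a separate program, not one of the demo files; the 32 x 256 launch configuration is just an example) that makes the problem hard to miss:

#include <stdio.h>

__managed__ int x;

__global__ void addOne( )
{
    x = x + 1;                         // unsynchronized read-modify-write: updates can be lost
}

int main()
{
    x = 0;

    addOne<<< 32, 256 >>>( );          // 32 * 256 = 8192 threads each add 1 to x
    cudaDeviceSynchronize();

    // Expected 8192, but the printed value will almost certainly be much smaller
    printf("x = %d (expected %d)\n", x, 32 * 256);
}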
Consider the previous CUDA program where 4 threads simultaneously add 1 to the var x:
__managed__ int x;

__global__ void addOne( )
{
    printf("Thread %d: Adding 1 to x\n", threadIdx.x);
    x = x + 1;
}

int main()
{
    x = 0;
    printf("Main before addOne(): x = %d\n", x);

    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("Main after addOne(): x = %d\n", x);
}
We change it to use atomicAdd(&x, 1) to add 1 to x atomically:
__managed__ int x;

__global__ void addOne( )
{
    printf("Thread %d: Adding 1 to x\n", threadIdx.x);
    atomicAdd( &x, 1 );
}

int main()
{
    x = 0;
    printf("Main before addOne(): x = %d\n", x);

    addOne<<< 1, 4 >>>( );          // launch kernel
    cudaDeviceSynchronize();

    printf("Main after addOne(): x = %d\n", x);
}
DEMO: /home/cs355001/demo/CUDA/5-add-common-var/add-common2.cu
Output of the demo program:
cs355@ghost01 (841)> add-common2
Main before addOne(): x = 0
Thread 0: Adding 1 to x
Thread 1: Adding 1 to x
Thread 2: Adding 1 to x
Thread 3: Adding 1 to x
Main after addOne(): x = 4
In other words: the 4 threads have each performed x = x + 1 once (atomically this time)
We expected x to be increased by 4 !!!
It is now correct !!!
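atomicAdd( ) keeps giving the correct answer when we scale the launch up. A minimal sketch (again a separate program, not one of the demo files; the 32 x 256 launch configuration is just an example):

#include <stdio.h>

__managed__ int x;

__global__ void addOne( )
{
    atomicAdd( &x, 1 );                // atomic read-modify-write: no update is lost
}

int main()
{
    x = 0;

    addOne<<< 32, 256 >>>( );          // 32 * 256 = 8192 threads each add 1 to x
    cudaDeviceSynchronize();

    printf("x = %d (expected %d)\n", x, 32 * 256);   // now prints 8192
}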