Synchronous vs. Asynchronous CUDA function calls
 

There are 2 kinds of CUDA function calls (a short code sketch contrasting the two follows this list):

  • Synchronous CUDA function call:

      • The CPU will launch (= start) an operation on the GPU and then

      • The CPU's program execution will wait (= pause) until that GPU operation completes before executing the next CPU program statement

  • Asynchronous CUDA function call:

      • The CPU will launch (= start the execution of) a kernel function to run on the GPU and then
      • The CPU will continue right away with the execution of the next CPU program statement
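
The difference is easiest to see side by side. The sketch below is my own illustration (not one of the demo programs): cudaMemcpy() in its default form is synchronous with respect to the CPU, while the kernel launch is asynchronous. The kernel name addOne() and the array size are made up for this example.

#include <stdio.h>

__global__ void addOne( int *a )        // hypothetical kernel: add 1 to each element
{
   a[ threadIdx.x ] = a[ threadIdx.x ] + 1;
}

int main()
{
   int  h[4] = { 0, 0, 0, 0 };
   int *d;

   cudaMalloc( (void **) &d, sizeof(h) );

   /* Synchronous CUDA call:
      the CPU waits here until the copy to the GPU has completed */
   cudaMemcpy( d, h, sizeof(h), cudaMemcpyHostToDevice );

   /* Asynchronous CUDA call (kernel launch):
      the CPU continues right away with the next statement */
   addOne<<< 1, 4 >>>( d );

   /* Synchronous CUDA call:
      this copy first waits for the kernel to finish (same stream),
      and the CPU waits here until the copy back has completed */
   cudaMemcpy( h, d, sizeof(h), cudaMemcpyDeviceToHost );

   printf("h[0] = %d\n", h[0]);   // prints: h[0] = 1

   cudaFree( d );
}
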

The launching of user-defined kernels is asynchronous

Important fact in CUDA programming:

  • The launching of a (user-defined) kernel function is always asynchronous !!!         

Example:

#include <stdio.h>

__global__ void hello( )     // kernel: runs on the GPU
{
   printf("Hello World !\n");
}

int main()
{
   /* ------------------------------------
      Call the hello( ) kernel function
      ------------------------------------ */
   hello<<< 1, 4 >>>( );  // Asynchronous !!!

   // Exec next statement without any waiting:
   printf("I am the CPU: Hello World ! \n");
}
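
Because the kernel launch is asynchronous, main() can finish (and the program can exit) before the hello() kernel has produced any output, so the GPU's "Hello World !" lines may never appear on the screen.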

Forcing the CPU execution to wait for work on the GPU
 

To force the CPU execution to wait for the termination of all activities on the GPU, use:

   cudaDeviceSynchronize( );

       When the CPU executes

          cudaDeviceSynchronize( );

       the CPU execution will wait until all kernels 
       running on the GPU have completed
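
       Note: cudaDeviceSynchronize( ) also returns an error code (a cudaError_t),
       so it is a convenient place to detect that an earlier, asynchronously
       launched kernel has failed. A minimal sketch (this error-handling style is
       just one possibility, not taken from the demo programs):

          cudaError_t err;

          hello<<< 1, 4 >>>( );            // asynchronous kernel launch

          err = cudaDeviceSynchronize();   // wait for the kernel to finish
          if ( err != cudaSuccess )
             printf("CUDA error: %s\n", cudaGetErrorString(err));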
  
 

 

I will re-write the hello2.cu program properly using cudaDeviceSynchronize() next

The correct way to write the Hello World CUDA program

The Hello World program in CUDA:

#include <stdio.h>
#include <unistd.h>

__global__ void hello( )
{
   printf("Hello World !\n");
}

int main()
{
   hello<<< 1, 4 >>>( );  // Launch 1 block of 4 threads (asynchronous)

   printf("I am the CPU: Hello World ! \n");
      // Try moving the "printf" statement        
      // after cudaDeviceSynchronize()

   cudaDeviceSynchronize();   // Wait until the hello() kernel has finished
}
  

DEMO: /home/cs355001/demo/CUDA/1-intro/hello-sync.cu
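
To compile and run the program (assuming the source file is named hello-sync.cu and nvcc is on your PATH):

   nvcc -o hello-sync hello-sync.cu
   ./hello-sync

Typical output: the CPU's line is printed first (its printf runs before cudaDeviceSynchronize()), and the 4 GPU lines appear when the synchronization completes:

   I am the CPU: Hello World !
   Hello World !
   Hello World !
   Hello World !
   Hello World !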