CUDA program: kernel

Kernel = the program (function) that is executed by the GPU

Example:

__global__ void hello( )
{
   printf("Hello World\n");   // CUDA C code: uses printf( ) in the CUDA C library
}

A kernel (= a GPU function/program) is executed by a grid (of threads)

Note:

All threads execute the same kernel code, but different threads will (usually) use different operands
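A minimal sketch of how threads can pick different operands: each thread reads the built-in variable threadIdx.x and uses it as an array index. The kernel name addVectors, the array names and the size N = 8 are hypothetical, chosen only for illustration; the <<< >>> launch syntax is the one described at the end of these notes, and unified memory (cudaMallocManaged) is assumed to be available on the device.

```cuda
#include <stdio.h>

// Every thread runs this same code, but threadIdx.x differs per thread,
// so each thread adds a different pair of elements.
__global__ void addVectors(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;          // index of this thread within its block
    c[i] = a[i] + b[i];
}

int main(void)
{
    const int N = 8;              // hypothetical problem size
    int *a, *b, *c;

    // Managed (unified) memory is accessible from both the host and the GPU
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 10 * i;
    }

    addVectors<<<1, N>>>(a, b, c);   // 1 thread block of N threads
    cudaDeviceSynchronize();         // wait until the GPU has finished

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

Thread 0 computes c[0], thread 1 computes c[1], and so on: same code, different operands.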

Threads (terminology)

Thread = a single execution unit that runs CUDA code ("kernel") on the GPU

Each thread is executed by 1 CUDA core (= processor)
Multiple threads can be assigned to the same CUDA core
(A CUDA core will switch execution between different threads!)

Thread organization: thread block

Multiple threads are organized (= grouped) into a "thread block"

Threads in the same thread block are run on the same streaming multiprocessor (SM)
Multiple thread blocks can be assigned to one multiprocessor

Organization:

A (thread) block has 3 dimensions:

x ≤ 1024
y ≤ 1024      and x * y * z ≤ 1024 (i.e.: max 1024 threads in a block)
z ≤ 64
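As an illustration of the 3 block dimensions, the following sketch launches a single block of 4 x 2 x 2 threads using the dim3 type and lets every thread print its own coordinates. The kernel name showThread and the block shape are hypothetical; the linear ID formula assumes the usual convention that x varies fastest.

```cuda
#include <stdio.h>

__global__ void showThread(void)
{
    // Linear ID of this thread within its block (x varies fastest)
    unsigned int id = threadIdx.x
                    + threadIdx.y * blockDim.x
                    + threadIdx.z * blockDim.x * blockDim.y;

    printf("thread (%u,%u,%u) has linear id %u\n",
           threadIdx.x, threadIdx.y, threadIdx.z, id);
}

int main(void)
{
    dim3 block(4, 2, 2);          // 4 * 2 * 2 = 16 threads, well below 1024

    showThread<<<1, block>>>();   // 1 thread block with the shape above
    cudaDeviceSynchronize();      // wait for the kernel (and its output)
    return 0;
}
```

A shape such as dim3 block(32, 32, 2) would satisfy each per-dimension limit but not the product limit (32 * 32 * 2 = 2048 > 1024), so that launch would be expected to fail.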

Thread organization: grid

Multiple thread blocks are organized (= grouped) into a "grid"
Organization:

A grid also has 3 dimensions:

x ≤ 2³¹-1
y ≤ 65535
z ≤ 65535
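The block and grid limits depend on the GPU's compute capability, so it can be useful to query them instead of hard-coding them. The sketch below reads them with cudaGetDeviceProperties( ); it assumes that device 0 is the GPU of interest.

```cuda
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;

    // Query the properties of device 0 (assumed to be the GPU in use)
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions  : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```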

CUDA program execution: "launching" the kernel on a grid

Grid = all the threads that execute the same CUDA kernel function

A grid can have any number of threads
A grid will (therefore) consist of 1 or more thread blocks
(A thread block contains up to 1024 threads)
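Because a block holds at most 1024 threads, the number of blocks is commonly obtained by rounding up: NBlocks = ceil(N / T). The sketch below shows this calculation for a hypothetical N = 3000 elements and T = 1024 threads per block. The kernel name fill and the use of unified memory are assumptions made for the example.

```cuda
#include <stdio.h>

__global__ void fill(int *x, int n)
{
    // Global index of this thread within the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)        // the last block may contain surplus threads
        x[i] = i;
}

int main(void)
{
    const int N = 3000;                 // hypothetical number of elements
    const int T = 1024;                 // threads per block (the maximum)
    int nBlocks = (N + T - 1) / T;      // integer ceiling of N / T

    int *x;
    cudaMallocManaged(&x, N * sizeof(int));

    fill<<<nBlocks, T>>>(x, N);
    cudaDeviceSynchronize();            // wait until the GPU has finished

    printf("%d blocks of %d threads = %d threads for %d elements\n",
           nBlocks, T, nBlocks * T, N);
    printf("x[%d] = %d\n", N - 1, x[N - 1]);

    cudaFree(x);
    return 0;
}
```

Here 3 blocks are needed (3 * 1024 = 3072 threads); the test i < n keeps the 72 surplus threads from writing outside the array.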

A grid is created by the host program when it "launches" (= calls) a kernel function

Kernel launching syntax:

KernelFunction <<<NBlocks, NThreads>>> (params);

Run KernelFunction on the GPU using a grid that consists of:

   NBlocks thread blocks with
   NThreads threads in each thread block
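Putting the pieces together, a complete program that launches the hello kernel might look like the sketch below. The grid size (2 blocks of 4 threads) is an arbitrary choice for illustration, and the kernel is extended to print the built-in variables blockIdx.x and threadIdx.x so that the individual threads can be told apart.

```cuda
#include <stdio.h>

__global__ void hello(void)
{
    printf("Hello World from block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // grid of 2 thread blocks with 4 threads each

    // A kernel launch is asynchronous: wait for the GPU to finish
    // so that the output is flushed before the program exits
    cudaDeviceSynchronize();
    return 0;
}
```

The program can be compiled with nvcc (for example: nvcc hello.cu -o hello). It prints 8 lines, one per thread; the order in which the blocks print is not guaranteed.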