CUDA program: kernel
 

Kernel = the program (function) that is executed by the GPU

Example:

 
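  #include <stdio.h>           // declares printf( ), which CUDA also supports in kernel code

  __global__ void hello( )     // __global__ marks a kernel: runs on the GPU, launched by the host
  {
     printf("Hello World\n");  // device-side printf( ) provided by CUDA
  }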
 

A kernel (= a GPU function/program) is executed by a grid (of threads)

 

Note:

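  • All threads in the grid execute the same kernel code, but different threads will (typically) use different operands, selected with the thread's index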
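
A minimal sketch of how threads end up with different operands. The kernel name addOne and the host helper runAddOne are hypothetical names chosen for this illustration; threadIdx.x is CUDA's built-in index of a thread inside its block. The launch syntax <<<1, n>>> requests 1 block of n threads. Error checking is left out to keep the sketch short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread runs this same code, but
// threadIdx.x differs per thread, so each one updates a different element.
__global__ void addOne(int *data)
{
    int i = threadIdx.x;          // index of this thread inside its block
    data[i] = data[i] + 1;
}

// Hypothetical host helper: copy n ints to the GPU, run addOne with
// one thread per element, and copy the result back. Assumes 1 <= n <= 1024.
void runAddOne(int *h, int n)
{
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    addOne<<<1, n>>>(d);          // 1 block of n threads

    // This copy is issued after the kernel and waits for it to finish.
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main()
{
    int h[8] = {0, 10, 20, 30, 40, 50, 60, 70};
    runAddOne(h, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", h[i]);      // each element was incremented by a different thread
    printf("\n");
    return 0;
}
```

Thread 0 updates data[0], thread 1 updates data[1], and so on: same code, different operand.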

Threads (terminology)

Thread = a single execution unit that runs CUDA code (a "kernel") on the GPU

Each thread is executed by 1 CUDA core (= processor)
Multiple threads can be assigned to the same CUDA core
(A CUDA core will switch execution between the threads assigned to it!)

Thread organization: thread block

  • Multiple threads are organized (= grouped) into a "thread block"

      • Threads in the same thread block are run on the same streaming multiprocessor (SM)

      • Multiple thread blocks can be assigned to one multiprocessor

  • Organization:

  • A (thread) block has 3 dimensions:

         x  ≤  1024
         y  ≤  1024
         z  ≤  64

         and x * y * z ≤ 1024 (i.e., at most 1024 threads in a block)
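
A small sketch of how a 3-dimensional block shape can be written with CUDA's dim3 type and checked against the limits listed above. The helper fitsInBlock and the kernel showShape are hypothetical names for this illustration, and the constants are simply the limits from these notes (the values for an actual device can be queried at run time).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Limits as stated in the notes above (assumed for this sketch).
const unsigned MAX_X = 1024, MAX_Y = 1024, MAX_Z = 64, MAX_THREADS = 1024;

// Hypothetical helper: true if the block shape respects every limit.
// The per-dimension checks come first, so the product cannot overflow.
bool fitsInBlock(dim3 b)
{
    return b.x >= 1 && b.y >= 1 && b.z >= 1 &&
           b.x <= MAX_X && b.y <= MAX_Y && b.z <= MAX_Z &&
           b.x * b.y * b.z <= MAX_THREADS;
}

// Hypothetical kernel: one thread of the block reports the block's shape.
__global__ void showShape()
{
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
        printf("block shape: %u x %u x %u\n", blockDim.x, blockDim.y, blockDim.z);
}

int main()
{
    dim3 block(16, 8, 4);             // 16 * 8 * 4 = 512 threads: allowed
    if (fitsInBlock(block)) {
        showShape<<<1, block>>>();    // 1 block with the 3-dimensional shape above
        cudaDeviceSynchronize();      // wait so the kernel's printf output appears
    }

    dim3 tooBig(32, 32, 2);           // every dimension is legal, but 2048 threads is too many
    printf("32 x 32 x 2 fits: %s\n", fitsInBlock(tooBig) ? "yes" : "no");
    return 0;
}
```

Note that a shape can satisfy all three per-dimension limits and still be rejected because the total thread count exceeds 1024.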
    

Thread organization: grid

  • Multiple thread blocks are organized (= grouped) into a "grid"

      • Threads in the same grid are run on the same GPU

  • Organization:

  • A grid also has 3 dimensions:

         x  ≤  2^31 - 1
         y  ≤  65535   
         z  ≤  65535
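
The block and grid limits can also be read from the device itself rather than taken from a table. The sketch below uses the CUDA runtime call cudaGetDeviceProperties; the wrapper printLimits is a hypothetical name, and using device number 0 is an assumption (the first GPU in the system).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the block and grid limits of one device.
// Returns 0 on success, 1 if the device properties could not be read.
int printLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions  : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

int main()
{
    return printLimits(0);   // device 0: assumed to be the GPU of interest
}
```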
    

CUDA program execution: "launching" the kernel on a grid

  • Grid = all the threads that execute the same CUDA kernel function

      • A grid can have any number of threads (up to the grid dimension limits)

      • A grid will (therefore) consist of 1 or more thread blocks

        (A thread block contains up to 1024 threads)

  • A grid is created by the host program when it "launches" (= calls) a kernel function

  • Kernel launching syntax:

        KernelFunction <<<NBlocks, NThreads>>> (params);         
      
        Run KernelFunction on GPU using a grid that consists of:
      
                   NBlocks thread blocks with 
                   NThreads threads in each thread block
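
A minimal sketch of a complete launch, under stated assumptions: helloId is a hypothetical variant of the hello kernel that takes the number of wanted threads n, and blocksNeeded is a hypothetical helper that rounds up, because a grid always consists of whole thread blocks. blockIdx, blockDim and threadIdx are CUDA's built-in index variables.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks that provides at least n threads.
int blocksNeeded(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
}

// Hypothetical variant of the hello kernel: each thread prints its own position.
__global__ void helloId(int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread number in the grid
    if (id < n)                                       // surplus threads in the last block do nothing
        printf("Hello World from thread %d (block %u, thread %u)\n",
               id, blockIdx.x, threadIdx.x);
}

int main()
{
    int n = 10;                                // number of threads we actually want
    int NThreads = 4;                          // threads in each thread block
    int NBlocks = blocksNeeded(n, NThreads);   // 3 blocks = 12 threads, 2 of them idle

    helloId<<<NBlocks, NThreads>>>(n);         // the host creates the grid here
    cudaDeviceSynchronize();                   // wait for the kernel so its output is printed
    return 0;
}
```

The launch returns to the host immediately; without the cudaDeviceSynchronize( ) call the program could exit before the kernel's output appears. The order of the printed lines is not guaranteed, since the threads run in parallel.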