Review: thread, thread block and grid

  • Thread = a unit of execution on the GPU that executes the instructions of a kernel function

  • Block (or thread block) = a group of threads that can cooperate with each other

      • In order for different threads to cooperate, the threads must be able to communicate with each other

      • Therefore, threads in a block are executed on the same streaming multiprocessor (SM)

        (Because thread communication can only happen using the shared memory inside a streaming multiprocessor)

    A block can contain at most 1024 threads (see the query sketch after this list)

  • Grid = a group of (thread) blocks that execute the same kernel function (i.e., the same instructions, but using different data !)
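
    As an aside (a minimal sketch, not from the original slides): the 1024-thread limit
    can be queried at run time from the device properties. Assuming device 0:

     #include <stdio.h>

     int main()
     {
        cudaDeviceProp prop;                        /* filled in by the CUDA runtime    */
        cudaGetDeviceProperties( &prop, 0 );        /* query the properties of device 0 */

        /* maxThreadsPerBlock is 1024 on current GPUs */
        printf("Max threads per block = %d\n", prop.maxThreadsPerBlock);
        return 0;
     }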

Execution configuration
 

  • The problem space can be:

      1. 1-dimensional (line),
      2. 2-dimensional (plane), or
      3. 3-dimensional (space)

  • The shape of the CUDA grid that we create to solve a problem should match the dimensionality of the problem space

  • When we launch a kernel function with:

       Kernel<<< gridShape, blockShape >>> ( params );    
      

    we can specify the "execution configuration" that specifies the shape of the (thread) grid and of the (thread) blocks !

Defining the shape of a grid or a (thread) block
 

  • CUDA provides the dim3 data type to allow the programmer to define the shape of the execution configuration

    Syntax:

         dim3 myShape = dim3( xDim, yDim, zDim );   
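
     A minimal sketch (not from the original slides) of how such a shape could be
     used in the launch from the previous slide; gridShape and blockShape are
     hypothetical names, and any omitted dimension defaults to 1:

        dim3 gridShape  = dim3( 4, 2, 1 );    /* 4 x 2 x 1 =  8 blocks in the grid  */
        dim3 blockShape = dim3( 8, 8, 1 );    /* 8 x 8 x 1 = 64 threads per block   */

        Kernel<<< gridShape, blockShape >>>( params );   /* 8 blocks of 64 threads  */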
      



  • I will discuss the general usage of dim3 in the next set of slides...

  • In this set of slides, I will discuss a simplified shape:

      • The 1-dimensional execution configuration        

The 1-dimensional execution configuration
 

  • A 1-dimensional execution configuration is defined using the following syntax:

     <<< NBlocks, NThreads >>>
    
     NBlocks  = # blocks in the grid
     NThreads = # threads in each (thread) block   
    

    Examples:

     <<< 1, 4 >>>: use 1 block, with 4 threads in each block     
    
     <<< 3, 4 >>>: use 3 blocks, with 4 threads in each block
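
    In practice (a hedged sketch, not from the slides), NBlocks is usually computed
    from the problem size N and the chosen NThreads with a ceiling division, so that
    the grid contains at least N threads:

     int N        = 1000;                            /* problem size (hypothetical)      */
     int NThreads = 256;                             /* threads per block (chosen by us) */
     int NBlocks  = (N + NThreads - 1) / NThreads;   /* = ceil(N/NThreads) = 4           */

     Kernel<<< NBlocks, NThreads >>>( params );      /* 4 blocks x 256 threads >= 1000   */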
    

How each thread in the grid can create a unique identifier for itself

  • CUDA provides the following built-in variables to allow each thread to identify its position within the grid:

       gridDim   = # blocks in the grid
       blockIdx  = index of a block within the grid
      
       blockDim  = # threads in the block
       threadIdx = index of a thread within its block   
      

  • Each of these (4) variables has 3 components:

       .x  = the value in the first dimension
       .y  = the value in the second dimension    
       .z  = the value in the third dimension
      

In the next slides, I will illustrate their use with the first dimension only

How each thread in the grid can create a unique identifier for itself - DEMO

The following hello program makes each thread print its own set of identifying variables:

#include <stdio.h>
#include <unistd.h>

__global__ void hello( )
{
  /* Each thread prints its own built-in index variables */
  printf("gridDim.x=%d, blockIdx.x=#%d, "
         "blockDim.x=%d, threadIdx.x=#%d\n",
         gridDim.x, blockIdx.x, blockDim.x, threadIdx.x);
}

int main()
{
   hello<<< 2, 4 >>>( );        /* launch 2 blocks of 4 threads each    */

   printf("I am the CPU: Hello World ! \n");
   cudaDeviceSynchronize();     /* wait for the GPU threads to finish   */
}

DEMO: /home/cs355001/demo/CUDA/1-intro/hello-thrIndex.cu

How each thread in the grid can create a unique identifier for itself

  • We can use the built-in variables to compute a unique identifier for each thread as follows:

       uniqueID = blockIdx.x × blockDim.x + threadIdx.x
      
                 
      
       gridDim.x   = # blocks in the grid (in the first dimension)
       blockIdx.x  = block index of the (current) thread within the grid

       blockDim.x  = # threads in each (thread) block
                     (in the first dimension)
       threadIdx.x = thread index of the (current) thread within its block
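
    The unique ID is typically used as an array index, so that each thread works on a
    different element. A minimal sketch (the kernel addOne and the arrays a and dev_a
    are hypothetical, not from these slides):

     __global__ void addOne( float *a, int N )
     {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* unique ID of this thread */

        if ( i < N )      /* guard: the grid may have more threads than elements */
           a[i] = a[i] + 1.0f;
     }

     /* launched (for example) as:  addOne<<< NBlocks, NThreads >>>( dev_a, N );
        where dev_a is an array in device memory                                 */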
      

How each thread in the grid can create a unique identifier for itself - DEMO

The following hello program makes each thread print its unique ID:

#include <stdio.h>
#include <unistd.h>

__global__ void hello( )
{
  printf("gridDim.x=%d, blockIdx.x=#%d, "
         "blockDim.x=%d, threadIdx.x=#%d -> ID=%d of %d\n",
         gridDim.x, blockIdx.x, blockDim.x, threadIdx.x,
         blockIdx.x*blockDim.x+threadIdx.x,     /* unique ID        */
         gridDim.x*blockDim.x);                 /* total # threads  */
}

int main()
{
   hello<<< 2, 4 >>>( );

   printf("I am the CPU: Hello World ! \n");
   cudaDeviceSynchronize();
} 

DEMO: /home/cs355001/demo/CUDA/1-intro/hello-thrIndex2.cu