The vector addition algorithm for the CPU

Vector addition adds the corresponding elements of two vectors (each stored as an array):

I will first show you an algorithm for a CPU

Then I will show you an algorithm for a GPU

A CPU algorithm for vector addition

#include <stdlib.h>

int main(int argc, char *argv[])
{
    int i, N = 1000;     // N = vector length (example value)
    float *x, *y, *z;    // Base addresses of the arrays

    /* ==========================================
       Allocate arrays to store vectors x, y and z
       ========================================== */
    x = malloc( N*sizeof(float) );   // Allocate array of N floats
    y = malloc( N*sizeof(float) );   // Allocate array of N floats
    z = malloc( N*sizeof(float) );   // Allocate array of N floats

    /* Initialization of arrays x and y omitted */

    for (i = 0; i < N; i++)
        z[i] = x[i] + y[i];          // CPU's array addition

    free(x); free(y); free(z);       // Release the arrays
    return 0;
}

DEMO: /home/cs355001/demo/CUDA/3-add-vector/cpu-add-vector.c

Execution of the CPU algorithm of vector addition

Initial state: no element of z has been computed yet.

After 1 iteration: z[0] = x[0] + y[0] has been computed.

After 2 iterations: z[0] and z[1] have been computed.

And so on: the CPU computes one element per iteration, until z[N-1] is done.
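The step-by-step execution above can be reproduced with a small sketch. The helper names add_first_steps and print_vector are hypothetical (they are not part of the demo program); the sketch only assumes that iteration i of the CPU loop fills in z[i].

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: run only the first `steps` iterations of the
   CPU loop, so the intermediate state of z can be inspected. */
void add_first_steps(const float *x, const float *y, float *z,
                     int n, int steps)
{
    assert(n >= 0 && steps >= 0);
    for (int i = 0; i < n && i < steps; i++)
        z[i] = x[i] + y[i];          /* iteration i fills in z[i] only */
}

/* Hypothetical helper: print z[0..n-1] on one line. */
void print_vector(const char *label, const float *z, int n)
{
    printf("%s:", label);
    for (int i = 0; i < n; i++)
        printf(" %g", z[i]);
    printf("\n");
}
```

Calling add_first_steps with steps = 1, then 2, and printing z after each call shows one more element of z filled in per iteration, in index order.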

GPU's execution of the vector addition algorithm

Recall that a grid has B blocks with T (T ≤ 1024) threads per block:

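#include <stdio.h>
#include <unistd.h>

#define B 2    // # thread blocks in the grid (example value)
#define T 4    // # threads per block, T ≤ 1024 (example value)

__global__ void hello( )
{
    printf("gridDim.x=%d, blockIdx.x=#%d, blockDim.x=%d, threadIdx.x=#%d -> ID=%d\n",
           gridDim.x, blockIdx.x, blockDim.x, threadIdx.x,
           blockIdx.x*blockDim.x+threadIdx.x);
}

int main()
{
    hello<<< B, T >>>( );                     // Launch B blocks of T threads

    printf("I am the CPU: Hello World ! \n");

    cudaDeviceSynchronize();                  // Wait for the GPU threads to finish
    return 0;
}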

Each thread can compute its unique ID using blockIdx.x*blockDim.x+threadIdx.x.
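To see why this formula gives every thread a different ID, here is a minimal sketch in plain C that replays the formula on the CPU for every (block, thread) pair. The function names global_id and ids_are_unique are hypothetical, and the nested loops only stand in for the GPU's threads; nothing here runs in parallel.

```c
#include <assert.h>
#include <stdlib.h>

/* Same formula as blockIdx.x*blockDim.x + threadIdx.x */
int global_id(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Returns 1 if every (block, thread) pair in a grid of B blocks with
   T threads each maps to a distinct ID in 0 .. B*T-1, and 0 otherwise. */
int ids_are_unique(int B, int T)
{
    assert(B > 0 && T > 0);
    int total = B * T;
    char *seen = (char *)calloc((size_t)total, 1);   /* seen[id] = 1 once used */
    if (seen == NULL)
        return 0;

    int ok = 1;
    for (int b = 0; b < B && ok; b++)
        for (int t = 0; t < T && ok; t++) {
            int id = global_id(b, T, t);
            if (id < 0 || id >= total || seen[id])
                ok = 0;              /* out of range or duplicate ID */
            else
                seen[id] = 1;
        }

    free(seen);
    return ok;
}
```

For example, with T = 4 threads per block, thread 3 of block 2 gets ID 2*4 + 3 = 11.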

How to set the parameters B (# thread blocks) and T (# threads/block) ?

Suppose we must add 2 vectors (= arrays) of size N:

   x[0]  x[1]  x[2]  ...  x[N-1]
    +     +     +           +
   y[0]  y[1]  y[2]  ...  y[N-1]

We create N threads and use thread i to compute x[i] + y[i]:

   x[0]  x[1]  x[2]  ...  x[i]  ...  x[N-1]
    +     +     +          +           +
   y[0]  y[1]  y[2]  ...  y[i]  ...  y[N-1]
                           ^
                           |
                        thread i
   <-------------------------------------->
                  N threads

Problem: what if N > 1024 ?

Important restriction: 1 block can have at most 1024 threads

To create N threads, we divide them into blocks of T threads each, with T ≤ 1024:

   x[0]  x[1]  x[2]  ...  x[i]  ...  x[N-1]
    +     +     +          +           +
   y[0]  y[1]  y[2]  ...  y[i]  ...  y[N-1]
   <---------->  <---------->  ...  <---------->
    T threads     T threads          T threads     (T ≤ 1024)
    (1 block)     (1 block)    ...   (1 block)
   <------------------------------------------->
                    N threads

We can choose any value for T, as long as T ≤ 1024 (the choice of T can affect performance!)

The number of blocks in the grid is B = ⌈N/T⌉:

   x[0]  x[1]  x[2]  ...  x[i]  ...  x[N-1]
    +     +     +          +           +
   y[0]  y[1]  y[2]  ...  y[i]  ...  y[N-1]
   <---------->  <---------->  ...  <---------->
    T threads     T threads          T threads     (T ≤ 1024)
    (1 block)     (1 block)    ...   (1 block)
   |                                           |
   +-------------------------------------------+
              # blocks B = ⌈ N/T ⌉
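In C, the ceiling ⌈N/T⌉ can be computed with integer arithmetic only, which avoids floating-point rounding. The sketch below is one common way to do it; the names num_blocks and MAX_THREADS_PER_BLOCK are hypothetical, and the constant 1024 is the per-block limit stated above.

```c
#include <assert.h>

#define MAX_THREADS_PER_BLOCK 1024   /* limit on T stated in the text */

/* Hypothetical helper: number of blocks B = ceil(N/T).
   Adding T-1 before the integer division rounds the quotient up
   whenever N is not a multiple of T. */
int num_blocks(int N, int T)
{
    assert(N >= 0 && T > 0 && T <= MAX_THREADS_PER_BLOCK);
    return (N + T - 1) / T;
}
```

When N is an exact multiple of T, the result is simply N/T; otherwise it is one block more than the truncated quotient.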

Notice that we may create more than N threads:

   # blocks B = ⌈ N/T ⌉

   Example:   N = 3500
              T = 1000   -->   B = ⌈ 3500/1000 ⌉ = 4 thread blocks

   We will create 4 × 1000 = 4000 threads !

Warning: threads with ID ≥ N must not execute the vector addition code, because x[ID], y[ID] and z[ID] would lie outside the arrays!
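The effect of this check can be sketched on the CPU. The function below is a hypothetical sequential simulation, not GPU code: two nested loops play the role of the B blocks and the T threads per block, and each simulated thread applies the test ID < N before touching the arrays.

```c
#include <assert.h>

/* Hypothetical sequential simulation of a grid of B = ceil(N/T) blocks
   with T threads each. Every simulated thread computes its ID and adds
   one pair of elements only if ID < N.
   Returns the number of threads that had nothing to do (ID >= N). */
int simulate_grid_add(const float *x, const float *y, float *z,
                      int N, int T)
{
    assert(N >= 0 && T > 0 && T <= 1024);
    int B = (N + T - 1) / T;                 /* B = ceil(N/T) blocks */
    int idle = 0;

    for (int block = 0; block < B; block++)          /* plays blockIdx.x  */
        for (int thread = 0; thread < T; thread++) { /* plays threadIdx.x */
            int id = block * T + thread;     /* blockIdx.x*blockDim.x + threadIdx.x */
            if (id < N)
                z[id] = x[id] + y[id];       /* valid index: do the addition */
            else
                idle++;                      /* ID >= N: skip, stay inside the arrays */
        }

    return idle;
}
```

With N = 3500 and T = 1000, this simulation visits 4000 thread IDs and returns 500: the last 500 threads of the fourth block skip the addition.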