Review: matrix multiplication (Linear Algebra)

Review: the matrix multiplication operation

+- -+ +- -+ |A₁₁ A₁₂ A₁₃| |B₁₁ B₁₂ B₁₃| A = |A₂₁ A₂₂ A₂₃| B = |B₂₁ B₂₂ B₂₃| |A₃₁ A₃₂ A₃₃| |B₃₁ B₃₂ B₃₃| +- -+ +- -+ Then: +- -+ |C₁₁ C₁₂ C₁₃| C = A*B = |C₂₁ C₂₂ C₂₃| |C₃₁ C₃₂ C₃₃| +- -+ where: C_ij = ∑_{_k=0..N-1} A_ikB_kj (for i = 0, 1, .., N-1 and j = 0, 1, ..., N-1)

The matrix multiplication algorithm for the CPU using 2-dim array

The matrix multiplication with two (2) N×N matrices (C = A × B):

int N = ...; // Some pre-defined value float A[N][N]; // Matrix 1 float B[N][N]; // Matrix 2 float C[N][N]; // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // "inner" vector product }

Problems with 2 dimensional arrays

Limitations on using 2-dimensional arrays in C:

2 dimensional arrays in C must be defined with constant sizes

Better solution:

Allocate the 2-dimensional array dynamically with malloc:

float *A; A = malloc( N*N*size(float) ); // N×N matrix

Problem: we cannot use expression A[i][j] (2 dim index) to access the allocated elements
We must use A[i] (1 dim index) !!!

How use use a dynamically allocated array as a 2-dim array

The memory space allocated by the malloc( ) function:

There is no difference between:

A = malloc( 4×4×sizeof(float) ); and A = malloc( 16×sizeof(float) );

In other words:

The memory space allocated by malloc( ) has a one-dimensional structure !!
(The memory cells are consecutive in memory, so they are line up on a "straight line")

How use use a dynamically allocated array as a 2-dim array

Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):

If we want to store a matrix using a dynamically allocated array, then:
"Mapping" the matrix elements onto an one-dimensional array:

We will discuss the mapping function next

How use use a dynamically allocated array as a 2-dim array

Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):

We will map the matrix elements row-wise onto the 1-dim array next

How use use a dynamically allocated array as a 2-dim array

We map the 1st row of the matrix to the first N elements in 1-dim array:

Relationship: w (1-dim index) = 0×N + j

How use use a dynamically allocated array as a 2-dim array

We map the 2nd row of the matrix to the second N elements in 1-dim array:

Relationship: w (1-dim index) = 1×N + j

How use use a dynamically allocated array as a 2-dim array

We map the 3rd row of the matrix to the third N elements in 1-dim array:

Relationship: w (1-dim index) = 2×N + j --- and so on..

How use use a dynamically allocated array as a 2-dim array

The mapping function from a 2-dimensional index (i,j) to a 1-dimensional index w is:

How to linearize a 2 dim index (i,j) to a 1-dim index w: w (1-dim index) = i×N + j

How use use a dynamically allocated array as a 2-dim array

The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j):

w (1-dim index) = i×N + j --- Question: what is i = f(w) and j = g(w) ??

How use use a dynamically allocated array as a 2-dim array

The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j) is:

Relationship: i = w/N and j = w%N

The matrix multiplication algorithm for the CPU using allocated arrays

Recall: The matrix multiplication with 2 dimensional arrays

int N = ...; // Some pre-defined value float A[N][N]; // Matrix 1 float B[N][N]; // Matrix 2 float C[N][N]; // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // 1 vector product }

The matrix multiplication algorithm for the CPU using allocated arrays

We change it to use allocated arrays:

int N = ...; // Some pre-defined value float *A = malloc(N*N*sizeof(float)); // Matrix 1 float *B = malloc(N*N*sizeof(float)); // Matrix 2 float *C = malloc(N*N*sizeof(float)); // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // 1 vector product }

The matrix multiplication algorithm for the CPU using allocated arrays

We map the 2-dim array index [i][j] to a 1-dim index [i*N+j]:

int N = ...; // Some pre-defined value float *A = malloc(N*N*sizeof(float)); // Matrix 1 float *B = malloc(N*N*sizeof(float)); // Matrix 2 float *C = malloc(N*N*sizeof(float)); // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[0*N+0] = 0.0; for (int k = 0; k < N; k++) C[i*N+j] = C[i*N+j] + A[i*N+k]*B[k*N+j]; //1 vec product }

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/cpu-mult-matrix.c

GPU's execution of the matrix multiplication algorithm

Recall that we can create B blocks with T (T ≤ 1024) threads/block:

#include <stdio.h> #include <unistd.h> __global__ void hello( ) { printf("gridDim.x=%d, blockIdx.x=#%d, blockDim.x=%d, threadIdx.x=#%d -> ID=%d\n", gridDim.x, blockIdx.x, blockDim.x, threadIdx.x, blockIdx.x*blockDim.x+threadIdx.x); } int main() { hello<<< B, T >>>( ); printf("I am the CPU: Hello World ! \n"); cudaDeviceSynchronize(); }

Each thread can compute its unique ID using blockIdx.x*blockDim.x+threadIdx.x.
Thread k will compute C[i][j] where i = k/N and j = k%N

The CUDA matrix multiplication algorithm - CPU part

The CPU code: (1) allocate arrays and (2) then launches the threads (as <<< B,T >>>):

int main(int argc, char *argv[]) { int N = Matrix size }

N will be specified by the user input

The CUDA matrix multiplication algorithm - CPU part

The variable T is the number of threads in a block:

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) }

We can pick a value for T later (with the condition that T ≤ 1024)

The CUDA matrix multiplication algorithm - CPU part

We will create N² threads to perform the matrix multiplication:

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) N*N }

Thread (i,j) will compute C[i][j]

The CUDA matrix multiplication algorithm - CPU part

The number of blocks that we must create is:

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed }

We will create the shared array variables next...

The CUDA matrix multiplication algorithm - CPU part

We define the reference variables to help us allocate the shared arrays:

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed float *a, *b, *c; }

(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)

The CUDA matrix multiplication algorithm - CPU part

Allocate the 3 shared matrices (as 1-dim arrays):

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed float *a, *b, *c; /* ======================================= Allocate 3 shared matrices (as arrays) ======================================= */ cudaMallocManaged(&a, N*N*sizeof(float)); cudaMallocManaged(&b, N*N*sizeof(float)); cudaMallocManaged(&c, N*N*sizeof(float)); // initialize x and y arrays (code omitted) }

(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)

The CUDA matrix multiplication algorithm - CPU part

Launch (at least) N² threads as <<< B,T >>> to perform the matrix multiplication:

int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed float *a, *b, *c; /* ======================================= Allocate 3 shared matrices (as arrays) ======================================= */ cudaMallocManaged(&a, N*N*sizeof(float)); cudaMallocManaged(&b, N*N*sizeof(float)); cudaMallocManaged(&c, N*N*sizeof(float)); // initialize x and y arrays (code omitted) matrixMult<<< B, T >>>(a, b, c, N); // Launch kernel }

We will write the kernel code for matrixMult( ) next...

The CUDA matrix multiplication algorithm - GPU part

First, compute a unique ID for each thread within the <<< B, T >>> configuration:

__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; // w < n² /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ }

We first find the matrix element C[i][j] (what's i and j ??) that thread w must compute

The CUDA matrix multiplication algorithm - GPU part

The thread w on the GPU must compute the element C[w/N][w%N] in the output matrix:

__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ }

(We have discussed this mapping in this slide set before)

The CUDA matrix multiplication algorithm - GPU part

The algorithm to compute C[i][j] is:

__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[i*n+j] = 0.0; // We could use: C[w] = 0.0 for ( int k = 0; k < n; k++ ) C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w] }

Notice: we created more than N² threads with <<< B, T >>> !!!

The CUDA matrix multiplication algorithm - GPU part

To prevent the computation of non-existing matrix elements by the extra threads, we use:

__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; if ( w < n*n ) { int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element C_ij = ∑_{_k=0..N-1} A_ikB_kj --------------------------------- */ C[i*n+j] = 0.0; // We could use: C[w] = 0.0 for ( int k = 0; k < n; k++ ) C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w] } }

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix2.cu