Review: the matrix multiplication operation
+- -+ +- -+ |A11 A12 A13| |B11 B12 B13| A = |A21 A22 A23| B = |B21 B22 B23| |A31 A32 A33| |B31 B32 B33| +- -+ +- -+ Then: +- -+ |C11 C12 C13| C = A*B = |C21 C22 C23| |C31 C32 C33| +- -+ where: Cij = ∑k=0..N-1 AikBkj(for i = 0, 1, .., N-1 and j = 0, 1, ..., N-1) |
The matrix multiplication with two (2) N×N matrices (C = A × B):
int N = ...; // Some pre-defined value float A[N][N]; // Matrix 1 float B[N][N]; // Matrix 2 float C[N][N]; // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // "inner" vector product } |
Limitations on using 2-dimensional arrays in C:
|
Better solution:
|
Problem: we
cannot
use expression
A[i][j]
(2 dim index) to
access the
allocated elements
We must use
A[i]
(1 dim index) !!!
The memory space allocated by the malloc( ) function:
|
The memory space
allocated by
malloc( ) has
a one-dimensional structure !!
(The memory cells are
consecutive in
memory, so they are
line up on
a "straight line")
Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):
|
We will discuss the mapping function next
Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):
We will map the matrix elements row-wise onto the 1-dim array next
We map the 1st row of the matrix to the first N elements in 1-dim array:
Relationship: w (1-dim index) = 0×N + j
We map the 2nd row of the matrix to the second N elements in 1-dim array:
Relationship: w (1-dim index) = 1×N + j
We map the 3rd row of the matrix to the third N elements in 1-dim array:
Relationship: w (1-dim index) = 2×N + j --- and so on..
The mapping function from a 2-dimensional index (i,j) to a 1-dimensional index w is:
How to linearize a 2 dim index (i,j) to a 1-dim index w: w (1-dim index) = i×N + j
The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j):
w (1-dim index) = i×N + j --- Question: what is i = f(w) and j = g(w) ??
The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j) is:
Relationship: i = w/N and j = w%N
Recall: The matrix multiplication with 2 dimensional arrays
int N = ...; // Some pre-defined value float A[N][N]; // Matrix 1 float B[N][N]; // Matrix 2 float C[N][N]; // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // 1 vector product } |
We change it to use allocated arrays:
int N = ...; // Some pre-defined value float *A = malloc(N*N*sizeof(float)); // Matrix 1 float *B = malloc(N*N*sizeof(float)); // Matrix 2 float *C = malloc(N*N*sizeof(float)); // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; // 1 vector product } |
We map the 2-dim array index [i][j] to a 1-dim index [i*N+j]:
int N = ...; // Some pre-defined value float *A = malloc(N*N*sizeof(float)); // Matrix 1 float *B = malloc(N*N*sizeof(float)); // Matrix 2 float *C = malloc(N*N*sizeof(float)); // Output matrix //CPU matrix mult alg for static 2-dimensional arrays for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[0*N+0] = 0.0; for (int k = 0; k < N; k++) C[i*N+j] = C[i*N+j] + A[i*N+k]*B[k*N+j]; //1 vec product } |
DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/cpu-mult-matrix.c
Recall that we can create B blocks with T (T ≤ 1024) threads/block:
#include <stdio.h> #include <unistd.h> __global__ void hello( ) { printf("gridDim.x=%d, blockIdx.x=#%d, blockDim.x=%d, threadIdx.x=#%d -> ID=%d\n", gridDim.x, blockIdx.x, blockDim.x, threadIdx.x, blockIdx.x*blockDim.x+threadIdx.x); } int main() { hello<<< B, T >>>( ); printf("I am the CPU: Hello World ! \n"); cudaDeviceSynchronize(); } |
Each thread can
compute its
unique ID using
blockIdx.x*blockDim.x+threadIdx.x.
Thread k will
compute
C[i][j]
where i =
k/N and
j =
k%N
The CPU code: (1) allocate arrays and (2) then launches the threads (as <<< B,T >>>):
int main(int argc, char *argv[]) { int N = Matrix size } |
N will be specified by the user input
The variable T is the number of threads in a block:
int main(int argc, char *argv[])
{
int N = Matrix size
int T = # threads per thread block (user choice)
}
|
We can pick a value for T later (with the condition that T ≤ 1024)
We will create N2 threads to perform the matrix multiplication:
int main(int argc, char *argv[])
{
int N = Matrix size
int T = # threads per thread block (user choice)
N*N
}
|
Thread (i,j) will compute C[i][j]
The number of blocks that we must create is:
int main(int argc, char *argv[])
{
int N = Matrix size
int T = # threads per thread block (user choice)
int B = ceil((float) N*N / T ); // # thread blocks needed
}
|
We will create the shared array variables next...
We define the reference variables to help us allocate the shared arrays:
int main(int argc, char *argv[])
{
int N = Matrix size
int T = # threads per thread block (user choice)
int B = ceil((float) N*N / T ); // # thread blocks needed
float *a, *b, *c;
}
|
(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)
Allocate the 3 shared matrices (as 1-dim arrays):
int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed float *a, *b, *c; /* ======================================= Allocate 3 shared matrices (as arrays) ======================================= */ cudaMallocManaged(&a, N*N*sizeof(float)); cudaMallocManaged(&b, N*N*sizeof(float)); cudaMallocManaged(&c, N*N*sizeof(float)); // initialize x and y arrays (code omitted) } |
(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)
Launch (at least) N2 threads as <<< B,T >>> to perform the matrix multiplication:
int main(int argc, char *argv[]) { int N = Matrix size int T = # threads per thread block (user choice) int B = ceil((float) N*N / T ); // # thread blocks needed float *a, *b, *c; /* ======================================= Allocate 3 shared matrices (as arrays) ======================================= */ cudaMallocManaged(&a, N*N*sizeof(float)); cudaMallocManaged(&b, N*N*sizeof(float)); cudaMallocManaged(&c, N*N*sizeof(float)); // initialize x and y arrays (code omitted) matrixMult<<< B, T >>>(a, b, c, N); // Launch kernel } |
We will write the kernel code for matrixMult( ) next...
First, compute a unique ID for each thread within the <<< B, T >>> configuration:
__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; // w < n2 /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ } |
We first find the matrix element C[i][j] (what's i and j ??) that thread w must compute
The thread w on the GPU must compute the element C[w/N][w%N] in the output matrix:
__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ } |
(We have discussed this mapping in this slide set before)
The algorithm to compute C[i][j] is:
__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[i*n+j] = 0.0; // We could use: C[w] = 0.0 for ( int k = 0; k < n; k++ ) C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w] } |
Notice: we created more than N2 threads with <<< B, T >>> !!!
To prevent the computation of non-existing matrix elements by the extra threads, we use:
__global__ void matrixMult(float *A, float *B, float *C, int n) { int w = blockIdx.x*blockDim.x + threadIdx.x; if ( w < n*n ) { int i = w/n; // Row index int j = w%n; // Column index /* --------------------------------- Compute the matrix element Cij = ∑k=0..N-1 AikBkj --------------------------------- */ C[i*n+j] = 0.0; // We could use: C[w] = 0.0 for ( int k = 0; k < n; k++ ) C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w] } } |
DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix2.cu