Review: matrix multiplication (Linear Algebra)

Review: the matrix multiplication operation

       +-              -+          +-              -+
       |A11    A12    A13|          |B11    B12    B13|
   A = |A21    A22    A23|      B = |B21    B22    B23|
       |A31    A32    A33|          |B31    B32    B33|
       +-              -+          +-              -+

 Then:

              +-              -+
              |C11    C12    C13|
   C =  A*B = |C21    C22    C23|
              |C31    C32    C33|
              +-              -+

 where:

    Cij = k=0..N-1 AikBkj      
 
        (for i = 0, 1, .., N-1 and j = 0, 1, ..., N-1)
  

The matrix multiplication algorithm for the CPU using 2-dim array

The matrix multiplication with two (2) N×N matrices (C = A × B):

   int   N = ...;       // Some pre-defined value
   float A[N][N];       // Matrix 1
   float B[N][N];	// Matrix 2
   float C[N][N];	// Output matrix


   //CPU matrix mult alg for static 2-dimensional arrays
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
      {
         /* ---------------------------------
            Compute the matrix element
                    Cij = k=0..N-1 AikBkj     
            --------------------------------- */
         C[i][j] = 0.0;

         for (int k = 0; k < N; k++)            
            C[i][j] = C[i][j] + A[i][k]*B[k][j];  // "inner" vector product
      }
  

Problems with 2 dimensional arrays

Limitations on using 2-dimensional arrays in C:

  • 2 dimensional arrays in C must be defined with constant sizes

      • 2 dimensional arrays are therefore not flexible       

Better solution:

  • Allocate the 2-dimensional array dynamically with malloc:

        float *A;
      
        A = malloc( N*N*size(float) ); // N×N matrix  
      

Problem: we cannot use expression A[i][j] (2 dim index) to access the allocated elements
We must use A[i] (1 dim index) !!!

How use use a dynamically allocated array as a 2-dim array

The memory space allocated by the malloc( ) function:

  • There is no difference between:

         A = malloc( 4×4×sizeof(float) );    
      
      and
      
         A = malloc( 16×sizeof(float) );
      

  • In other words:

      • malloc( ) will only dynamically allocate one-dimensional arrays

The memory space allocated by malloc( ) has a one-dimensional structure !!
(The memory cells are consecutive in memory, so they are line up on a "straight line")

How use use a dynamically allocated array as a 2-dim array

Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):

  • If we want to store a matrix using a dynamically allocated array, then:

      • We must map the 2-dimensional matrix elements onto an one-dimensional array !!!

  • "Mapping" the matrix elements onto an one-dimensional array:

      • "Map the elements" = find a relationship between the indices (i,j) of the matrix element A[i][j] and the indices w of an one-dimensional array

 

We will discuss the mapping function next

How use use a dynamically allocated array as a 2-dim array

Mapping a 2-dimensional matrix to a 1-dimensional array (allocated by malloc( )):

 
 

We will map the matrix elements row-wise onto the 1-dim array next

How use use a dynamically allocated array as a 2-dim array

We map the 1st row of the matrix to the first N elements in 1-dim array:

 
 

Relationship:   w (1-dim index) = 0×N + j

How use use a dynamically allocated array as a 2-dim array

We map the 2nd row of the matrix to the second N elements in 1-dim array:

 
 

Relationship:   w (1-dim index) = 1×N + j

How use use a dynamically allocated array as a 2-dim array

We map the 3rd row of the matrix to the third N elements in 1-dim array:

 
 

Relationship:   w (1-dim index) = 2×N + j   ---   and so on..

How use use a dynamically allocated array as a 2-dim array

The mapping function from a 2-dimensional index (i,j) to a 1-dimensional index w is:

 
 

How to linearize a 2 dim index (i,j) to a 1-dim index w:    w (1-dim index) = i×N + j

How use use a dynamically allocated array as a 2-dim array

The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j):

 
 

w (1-dim index) = i×N + j --- Question:   what is i = f(w) and j = g(w) ??

How use use a dynamically allocated array as a 2-dim array

The reverse mapping function from a 1-dimensional index w to a 2-dimensional index (i,j) is:

 
 

Relationship:   i = w/N   and   j = w%N

The matrix multiplication algorithm for the CPU using allocated arrays

Recall:   The matrix multiplication with 2 dimensional arrays

   int   N = ...;       // Some pre-defined value
   float A[N][N];       // Matrix 1
   float B[N][N];	// Matrix 2
   float C[N][N];	// Output matrix


   //CPU matrix mult alg for static 2-dimensional arrays
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
      {
         /* ---------------------------------
            Compute the matrix element
                    Cij = k=0..N-1 AikBkj     
            --------------------------------- */
         C[i][j] = 0.0;

         for (int k = 0; k < N; k++)            
            C[i][j] = C[i][j] + A[i][k]*B[k][j];  // 1 vector product
      }
  

The matrix multiplication algorithm for the CPU using allocated arrays

We change it to use allocated arrays:

   int   N = ...;       // Some pre-defined value
   float *A = malloc(N*N*sizeof(float)); // Matrix 1
   float *B = malloc(N*N*sizeof(float)); // Matrix 2
   float *C = malloc(N*N*sizeof(float)); // Output matrix


   //CPU matrix mult alg for static 2-dimensional arrays
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
      {
         /* ---------------------------------
            Compute the matrix element
                    Cij = k=0..N-1 AikBkj     
            --------------------------------- */
         C[i][j] = 0.0;

         for (int k = 0; k < N; k++)            
            C[i][j] = C[i][j] + A[i][k]*B[k][j];  // 1 vector product
      }

The matrix multiplication algorithm for the CPU using allocated arrays

We map the 2-dim array index [i][j] to a 1-dim index [i*N+j]:

   int   N = ...;       // Some pre-defined value
   float *A = malloc(N*N*sizeof(float)); // Matrix 1
   float *B = malloc(N*N*sizeof(float)); // Matrix 2
   float *C = malloc(N*N*sizeof(float)); // Output matrix


   //CPU matrix mult alg for static 2-dimensional arrays
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
      {
         /* ---------------------------------
            Compute the matrix element
                    Cij = k=0..N-1 AikBkj     
            --------------------------------- */
         C[0*N+0] = 0.0;

         for (int k = 0; k < N; k++)            
            C[i*N+j] = C[i*N+j] + A[i*N+k]*B[k*N+j];  //1 vec product
      } 

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/cpu-mult-matrix.c

GPU's execution of the matrix multiplication algorithm

Recall that we can create B blocks with T (T ≤ 1024) threads/block:

#include <stdio.h>
#include <unistd.h>

__global__ void hello( )
{
  printf("gridDim.x=%d, blockIdx.x=#%d, 
          blockDim.x=%d, threadIdx.x=#%d -> ID=%d\n", 
	  gridDim.x, blockIdx.x, blockDim.x, threadIdx.x,
          blockIdx.x*blockDim.x+threadIdx.x);  
}

int main()
{
   hello<<< B, T >>>( );

   printf("I am the CPU: Hello World ! \n");
   cudaDeviceSynchronize();
} 

Each thread can compute its unique ID using blockIdx.x*blockDim.x+threadIdx.x.
Thread k will compute C[i][j] where i = k/N and j = k%N

The CUDA matrix multiplication algorithm - CPU part

The CPU code:   (1) allocate arrays and (2) then launches the threads (as <<< B,T >>>):

int main(int argc, char *argv[])
{
  int N = Matrix size

  

 

  
    
    
 
 
  

  

  
} 

N will be specified by the user input

The CUDA matrix multiplication algorithm - CPU part

The variable T is the number of threads in a block:

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
 

 


  
    
  
 
 

  

  
} 

We can pick a value for T later (with the condition that T ≤ 1024)

The CUDA matrix multiplication algorithm - CPU part

We will create N2 threads to perform the matrix multiplication:

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
                       N*N 

 

  
    
    
 
  
  

  

  
} 

Thread (i,j) will compute C[i][j]

The CUDA matrix multiplication algorithm - CPU part

The number of blocks that we must create is:

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
  int B = ceil((float) N*N / T ); // # thread blocks needed 




    
    
 
 


 

 
} 

We will create the shared array variables next...

The CUDA matrix multiplication algorithm - CPU part

We define the reference variables to help us allocate the shared arrays:

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
  int B = ceil((float) N*N / T ); // # thread blocks needed 

  float *a, *b, *c;


    
    
 
  
  

 

 
} 

(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)

The CUDA matrix multiplication algorithm - CPU part

Allocate the 3 shared matrices (as 1-dim arrays):

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
  int B = ceil((float) N*N / T ); // # thread blocks needed 

  float *a, *b, *c;

  /* =======================================
     Allocate 3 shared matrices (as arrays)
     ======================================= */
  cudaMallocManaged(&a, N*N*sizeof(float));
  cudaMallocManaged(&b, N*N*sizeof(float));
  cudaMallocManaged(&c, N*N*sizeof(float));

  // initialize x and y arrays (code omitted)


} 

(This is CUDA syntax... i.e, how CUDA provide the dynamic shared array to us)

The CUDA matrix multiplication algorithm - CPU part

Launch (at least) N2 threads as <<< B,T >>> to perform the matrix multiplication:

int main(int argc, char *argv[])
{
  int N = Matrix size
  int T = # threads per thread block (user choice)
  int B = ceil((float) N*N / T ); // # thread blocks needed 

  float *a, *b, *c;

  /* =======================================
     Allocate 3 shared matrices (as arrays)
     ======================================= */
  cudaMallocManaged(&a, N*N*sizeof(float));
  cudaMallocManaged(&b, N*N*sizeof(float));
  cudaMallocManaged(&c, N*N*sizeof(float));

  // initialize x and y arrays (code omitted)

  matrixMult<<< B, T >>>(a, b, c, N);   // Launch kernel
} 

We will write the kernel code for matrixMult( ) next...

The CUDA matrix multiplication algorithm - GPU part

First, compute a unique ID for each thread within the <<< B, T >>> configuration:

__global__ void matrixMult(float *A, float *B, float *C, int n)
{
   int w = blockIdx.x*blockDim.x + threadIdx.x; // w < n2


  
      
     

      /* ---------------------------------
         Compute the matrix element
            Cij = k=0..N-1 AikBkj     
         --------------------------------- */


    
     
  
} 

We first find the matrix element C[i][j] (what's i and j ??) that thread w must compute

The CUDA matrix multiplication algorithm - GPU part

The thread w on the GPU must compute the element C[w/N][w%N] in the output matrix:

__global__ void matrixMult(float *A, float *B, float *C, int n)
{
   int w = blockIdx.x*blockDim.x + threadIdx.x;



      int i = w/n;       // Row index
      int j = w%n;       // Column index

      /* ---------------------------------
         Compute the matrix element
            Cij = k=0..N-1 AikBkj     
         --------------------------------- */




   
} 

(We have discussed this mapping in this slide set before)

The CUDA matrix multiplication algorithm - GPU part

The algorithm to compute C[i][j] is:

__global__ void matrixMult(float *A, float *B, float *C, int n)
{
   int w = blockIdx.x*blockDim.x + threadIdx.x;

 
  
      int i = w/n;       // Row index
      int j = w%n;       // Column index

      /* ---------------------------------
         Compute the matrix element
            Cij = k=0..N-1 AikBkj     
         --------------------------------- */
      C[i*n+j] = 0.0;  // We could use: C[w] = 0.0

      for ( int k = 0; k < n; k++ )              
         C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w]
  
} 

Notice:   we created more than N2 threads with <<< B, T >>> !!!

The CUDA matrix multiplication algorithm - GPU part

To prevent the computation of non-existing matrix elements by the extra threads, we use:

__global__ void matrixMult(float *A, float *B, float *C, int n)
{
   int w = blockIdx.x*blockDim.x + threadIdx.x;

   if ( w < n*n )
   {
      int i = w/n;       // Row index
      int j = w%n;       // Column index

      /* ---------------------------------
         Compute the matrix element
            Cij = k=0..N-1 AikBkj     
         --------------------------------- */
      C[i*n+j] = 0.0;  // We could use: C[w] = 0.0

      for ( int k = 0; k < n; k++ )              
         C[i*n+j] += A[i*n+k] * B[k*n+j]; // C[i*n+j] = C[w]
   }
} 

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix2.cu