|
K = number of threads per thread block
|
int K = atoi( argv[2] ); // Read K (threads per block) in from the command line
                         // NOTE(review): atoi is unchecked — K <= 0 or non-numeric
                         // input would break the grid-size computation; verify caller
/* ==============================================
   Find # blocks needed to launch N*N threads
   where each block contains K threads
   ============================================== */
// Ceiling division: round up so every one of the N*N elements gets a thread
// (the cast binds to the first N, giving (float)N * N before the divide)
int NBlks = ceil((float) N*N / K );
// ==================================================================
// Run kernel on the GPU using NBlks blocks, K threads per block
matrixMult<<<NBlks, K>>>(N, C, A, B);
|
Program: /home/cs355001/demo/CUDA/4-mult-matrix2.cu |
|
From CUDA manual pages:
__host__ cudaError_t
cudaOccupancyMaxPotentialBlockSize ( int* minGridSize, int* blockSize,
T func, size_t dynamicSMemSize = 0, int blockSizeLimit = 0 )
Returns grid and block size that achieves maximum potential occupancy for a device function.
Parameters
minGridSize
- Returned minimum grid size needed to achieve the best potential occupancy
blockSize
- Returned block size
func
- Device function symbol
dynamicSMemSize
- Per-block dynamic shared memory usage intended, in bytes
blockSizeLimit
- The maximum block size func is designed to work with. 0 means no limit.
Function return values:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidDeviceFunction,
cudaErrorInvalidValue, cudaErrorUnknown,
Function description:
Returns in *minGridSize and *blockSize a suggested grid / block size pair
that achieves the best potential occupancy (i.e. the maximum number
of active warps with the smallest number of blocks).
|
int minGridSize; // Returned: minimum grid size needed to achieve
                 // the maximum occupancy for a full device launch
int BlkSize;     // Returned: block size to use for max occupancy
// Let the runtime suggest the block size for best occupancy of matrixMult
// (dynamic shared mem = 0 bytes, no block-size limit).
// NOTE(review): the returned cudaError_t status is not checked — consider
// wrapping in an error-check macro.
cudaOccupancyMaxPotentialBlockSize( &minGridSize, &BlkSize, matrixMult, 0, 0);
printf("Computed: minGridSize = %d, BlkSize = %d\n", minGridSize, BlkSize);
// Ceiling division: grid must cover all N*N elements, so round up
int GridSize = ceil((float) N*N / BlkSize ); // Round up to integral grid size
printf("N*N = %d/BlkSize = %lf ---> GridSize = %d\n",
N*N, (float) N*N/BlkSize, GridSize);
// Run kernel on the GPU using GridSize blocks, BlkSize threads per block
matrixMult<<<GridSize, BlkSize>>>(N, C, A, B);
// Kernel launches are asynchronous: block the host until the GPU finishes
// before reading results back on the host
cudaDeviceSynchronize();
|
Program: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix-auto.cu
Output:
cs355@ghost01 (2095)> mult-matrix-auto 1000
Computed: minGridSize = 10, BlkSize = 1024
N*N = 1000000/BlkSize = 976.562500 ---> GridSize = 977
Elasped time = 42972 micro secs
|