The NVidia GPU architecture
 

This set of slides goes into more detail on the GPU architecture made by NVidia:

 

The NVidia GPU architecture

The NVidia GPU consists of N "stream" multiprocessors that can access a shared device memory:

The device memory (also called global memory) can be accessed by all multiprocessors, i.e., it is shared among them
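The point above can be illustrated with a minimal CUDA sketch (not from the slides; the kernel name and sizes are illustrative). Memory allocated with cudaMalloc lives in device memory, so every thread block can access it, no matter which multiprocessor the block is scheduled on:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Every thread block, regardless of which multiprocessor it runs on,
// can read and write the same array in device (global) memory.
__global__ void incrementAll(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;                         // access to the shared device memory
}

int main(void)
{
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));     // allocate device (global) memory
    cudaMemset(d_data, 0, n * sizeof(int));
    incrementAll<<<4, 256>>>(d_data, n);      // 4 blocks, possibly on different MPs
    cudaDeviceSynchronize();

    int h[4];
    cudaMemcpy(h, d_data, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);
    cudaFree(d_data);
    return 0;
}
```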

The NVidia GPU architecture

Each multiprocessor has M processors (a.k.a. CUDA cores ~= ALUs):

Note: a processor or "CUDA core" is essentially a floating point unit (comparable to an ALU)
Each processor or CUDA core (= ALU) uses its own registers
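As a small sketch of the register usage described above (a hypothetical kernel, not from the slides): a thread's local scalar variables normally live in the registers of the CUDA core executing it, and the core's floating point unit carries out the arithmetic:

```cuda
// Hypothetical "a*x + y" kernel. The locals i and xi are ordinarily
// kept in the registers of the CUDA core running this thread.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register
    if (i < n) {
        float xi = x[i];        // register-resident temporary
        y[i] = a * xi + y[i];   // the core's FPU performs the multiply-add
    }
}
```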

Inside the NVidia "Ampere" Multiprocessor

  • The NVidia Ampere streaming multiprocessor

    (It was succeeded by Hopper (2022) and, more recently, Blackwell (2024))

Specification of the Ampere multiprocessor

  • Specifications of the Ampere MP:

Execution of threads on a Multi-processor

  • Threads are divided into groups of 32 threads (each group is called a warp)

    Up to 4 warps run on a multiprocessor at the same time
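The warp grouping above can be made concrete with a small CUDA sketch (a hypothetical kernel, not from the slides). Within a block, threads 0..31 form warp 0, threads 32..63 form warp 1, and so on; the built-in constant warpSize is 32:

```cuda
// Hypothetical kernel recording which warp each thread belongs to.
// Launch with one block of, e.g., 128 threads => warps 0..3.
__global__ void recordWarpIds(int *warpIds)
{
    int lane   = threadIdx.x % warpSize;   // position inside the warp (0..31)
    int warpId = threadIdx.x / warpSize;   // warp number within the block
    if (lane == 0)                         // one thread per warp reports
        warpIds[warpId] = warpId;
}
```

All 32 threads of a warp execute the same instruction at the same time; the multiprocessor's warp schedulers pick which resident warps issue next.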

The NVidia GPU architecture

All cores in the same multiprocessor can also access a (faster) shared memory:

This shared memory enables threads running on different CUDA cores of the same multiprocessor to communicate with one another
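A classic minimal sketch of such communication (a hypothetical kernel, not from the slides): each thread writes one element into the on-chip shared memory, all threads in the block synchronize, and then each thread reads back a value written by a different thread:

```cuda
// Hypothetical kernel: reverse a 256-element array within one block
// using the multiprocessor's fast on-chip shared memory.
// Launch with exactly 256 threads per block.
__global__ void reverseBlock(int *d)
{
    __shared__ int buf[256];            // shared memory: one copy per block
    int t = threadIdx.x;
    buf[t] = d[t];                      // each thread writes one element
    __syncthreads();                    // wait until every thread has written
    d[t] = buf[blockDim.x - 1 - t];     // read a value another thread wrote
}
```

The __syncthreads() barrier is essential: without it, a thread might read a shared-memory slot before the thread responsible for it has written.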

Postscript: CPU cores vs GPU cores

  • Intel and AMD manufacture multi-core processors:

    • A core in an Intel/AMD processor is a "complete" CPU

  • A CPU core can perform:

    • Instruction fetch
    • Operand fetch
    • ALU operations on operands
    • Storing result to memory


  • In contrast:

    • A core in a GPU (a.k.a. a CUDA core) can only perform:

      • ALU operations on operands