The NVidia GPU architecture
 

This set of slides goes into more detail on the GPU architecture made by NVidia:

 

The NVidia GPU architecture

The NVidia GPU consists of N "stream" multiprocessors that can access a shared device memory:

The device memory (also called global memory) can be accessed by all multiprocessors, i.e., it is shared among them
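The point above can be illustrated with a minimal CUDA sketch (not from the slides; the kernel name and sizes are illustrative). Memory allocated with cudaMalloc lives in device memory, so every thread block can access it, no matter which multiprocessor the block is scheduled on:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Every thread block, regardless of which multiprocessor it runs on,
// can read and write the same array in device (global) memory.
__global__ void incrementAll(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;                         // access to the shared device memory
}

int main(void)
{
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));     // allocate device (global) memory
    cudaMemset(d_data, 0, n * sizeof(int));
    incrementAll<<<4, 256>>>(d_data, n);      // 4 blocks, possibly on different MPs
    cudaDeviceSynchronize();

    int h[4];
    cudaMemcpy(h, d_data, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);
    cudaFree(d_data);
    return 0;
}
```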

The NVidia GPU architecture

Each multiprocessor has M processors (a.k.a. CUDA cores ~= ALUs):

Note: a processor or "CUDA core" is essentially a floating point unit (comparable to an ALU)
Each processor or CUDA core (= ALU) uses its own registers
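As a small sketch of the register usage described above (a hypothetical kernel, not from the slides): a thread's local scalar variables normally live in the registers of the CUDA core executing it, and the core's floating point unit carries out the arithmetic:

```cuda
// Hypothetical "a*x + y" kernel. The locals i and xi are ordinarily
// kept in the registers of the CUDA core running this thread.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register
    if (i < n) {
        float xi = x[i];        // register-resident temporary
        y[i] = a * xi + y[i];   // the core's FPU performs the multiply-add
    }
}
```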

Inside the NVidia "Ampere" Multiprocessor

  • The NVidia Ampere streaming multiprocessor

    (It was succeeded by Hopper (2022) and, more recently, Blackwell (2024))

Specification of the Ampere multiprocessor

  • Specifications of the Ampere MP:

Execution of threads on a Multi-processor

  • Threads are divided into groups of 32 threads (each group is called a warp)

    Up to 4 warps run on a multiprocessor at the same time
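The warp grouping above can be made concrete with a small CUDA sketch (a hypothetical kernel, not from the slides). Within a block, threads 0..31 form warp 0, threads 32..63 form warp 1, and so on; the built-in constant warpSize is 32:

```cuda
// Hypothetical kernel recording which warp each thread belongs to.
// Launch with one block of, e.g., 128 threads => warps 0..3.
__global__ void recordWarpIds(int *warpIds)
{
    int lane   = threadIdx.x % warpSize;   // position inside the warp (0..31)
    int warpId = threadIdx.x / warpSize;   // warp number within the block
    if (lane == 0)                         // one thread per warp reports
        warpIds[warpId] = warpId;
}
```

All 32 threads of a warp execute the same instruction at the same time; the multiprocessor's warp schedulers pick which resident warps issue next.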

The NVidia GPU architecture

All cores in the same multiprocessor can also access a (faster) shared memory:

This shared memory enables threads running on different CUDA cores of the same multiprocessor to communicate with one another
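A classic minimal sketch of such communication (a hypothetical kernel, not from the slides): each thread writes one element into the on-chip shared memory, all threads in the block synchronize, and then each thread reads back a value written by a different thread:

```cuda
// Hypothetical kernel: reverse a 256-element array within one block
// using the multiprocessor's fast on-chip shared memory.
// Launch with exactly 256 threads per block.
__global__ void reverseBlock(int *d)
{
    __shared__ int buf[256];            // shared memory: one copy per block
    int t = threadIdx.x;
    buf[t] = d[t];                      // each thread writes one element
    __syncthreads();                    // wait until every thread has written
    d[t] = buf[blockDim.x - 1 - t];     // read a value another thread wrote
}
```

The __syncthreads() barrier is essential: without it, a thread might read a shared-memory slot before the thread responsible for it has written.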

Postscript: CPU cores vs GPU cores

  • Intel and AMD manufacture multi-core processors:

    • A core in an Intel/AMD processor is a "complete" CPU

  • A CPU core can perform:

    • Instruction fetch
    • Operand fetch
    • ALU operations on operands
    • Storing result to memory


  • In contrast:

    • A core in a GPU (a.k.a. a CUDA core) can only perform:

      • ALU operations on operands