NVIDIA Fermi Architecture

Overview and characteristics

The Fermi architecture is the NVIDIA GPU architecture that immediately preceded Kepler.
Fermi Graphics Processing Units (GPUs) feature 3.0 billion transistors; a schematic is sketched in Fig. 1.

Streaming Multiprocessor (SM): composed of 32 CUDA cores (see Streaming Multiprocessor and CUDA core sections).
GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).
Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8GB/s).
DRAM: supports up to 6GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability (see Memory Architecture section).
Clock frequency: 1.5GHz (not released by NVIDIA, but estimated by Insight 64).
Peak performance: 1.5 TFlops.
Global memory clock: 2GHz.
DRAM bandwidth: 192GB/s.

Convention in figures:
orange - scheduling and dispatch;
green - execution;
light blue - registers and caches.

Fig. 01 - NVIDIA Fermi Architecture



Streaming Multiprocessor (SM)

Each SM (see Fig. 2) features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64KB block of high speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection).

Fig. 2. The Fermi SM.


Load/Store Units

The load/store units allow source and destination addresses to be calculated for 16 threads per clock, loading and storing data from/to cache or DRAM.
Data can be converted from one format to another (for example, from integer to floating point or vice-versa) as it passes between DRAM and the core registers at the full rate.
These formatting and converting features are examples of optimizations unique to GPUs.
They would not be worthwhile in general-purpose CPUs, but in GPUs they are used often enough to justify their inclusion.

Special Function Units (SFUs)

Execute transcendental instructions such as sine, cosine, reciprocal, and square root. The device intrinsics (e.g., __log2f(), __sinf(), __cosf()) expose in hardware the instructions implemented by the SFU.
The hardware implementation is based on quadratic interpolation in ROM tables using fixed-point arithmetic, as described in [5].

If --use_fast_math is passed to nvcc, the compiler automatically uses the intrinsic versions of the transcendentals; otherwise, they have to be called explicitly.
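As a sketch (assuming a CUDA toolchain; the kernel name is ours), the two ways of computing a transcendental in device code look like this:

```cuda
// sinf() is the accurate library version; __sinf() maps to the SFU-based
// hardware approximation. With nvcc --use_fast_math, sinf() calls are
// themselves replaced by __sinf().
__global__ void sines(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);   // slower, near-IEEE accuracy
        fast[i]     = __sinf(x[i]); // SFU path: faster, reduced accuracy
    }
}
```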

Four of these operations can be issued per cycle in each SM. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
Note that CUDA intrinsics rarely map to a single SFU instruction; they usually map to sequences of multiple SFU and non-SFU instructions.
Different GPUs have different throughputs for the various operations involved, so if one needs to know the throughput of a particular intrinsic on a particular GPU, it is best to simply measure it.
The performance of single-precision intrinsics can also vary with compilation mode, in particular -ftz={true|false}.

CUDA core

Fig. 3. The Fermi core.


A CUDA core handles integer and floating point operations (see Fig. 3).

Integer Arithmetic Logic Unit (ALU)

Supports full 32-bit precision for all usual mathematical and logical instructions, including multiplication, consistent with standard programming language requirements. It is also optimized to efficiently support 64-bit and extended precision operations.

Floating Point Unit (FPU)

Implements the IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction (see Fused Multiply-Add subsection) for both single- and double-precision arithmetic. Up to 16 double-precision fused multiply-add operations can be performed per SM, per clock.

Fused Multiply-Add

Fig. 4. Multiply-Add (MAD) instruction.


Fused Multiply-Add (FMA) instructions perform Multiply-Add (MAD) operations (i.e., A*B+C) with a single final rounding step, with no loss of precision in the addition, for both 32-bit single-precision and 64-bit double-precision floating point numbers.
FMA improves the accuracy upon MAD (see Fig. 4) by retaining full precision in the intermediate stage (see Fig. 5).
In Fermi, this intermediate result carries a full 106-bit mantissa; in fact, 161 bits of precision are maintained during the add operation to handle worst-case denormalized numbers before the final double-precision result is computed.

Fig. 5. Fused Multiply-Add (FMA) instruction.


Prior GPUs accelerated these calculations with the MAD instruction (multiplication with truncation, followed by an addition with round-to-nearest even) that allowed both operations to be performed in a single clock.
A FMA counts as two operations when estimating performance, resulting in a peak performance rate of 1024 operations per clock (32 cores x 16 SM x 2 operations).

Rounding and subnormal numbers

Subnormal numbers are small numbers that lie between zero and the smallest normalized number of a given floating point number system.
In the Fermi architecture, single precision floating point instructions support subnormal numbers by default in hardware, allowing values to gradually underflow to zero with no performance penalty, as well as all four IEEE 754-2008 rounding modes (nearest, zero, positive infinity, and negative infinity).
This is a relevant point, since prior-generation GPUs flushed subnormal operands and results to zero, incurring a loss of accuracy, while CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles.

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.
The GigaThread engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units.
Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic of Fig. 6. For example, the SM can mix 16 operations from the 16 cores of the first column with 16 operations from the 16 cores of the second column, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies.

Fig. 6. Relevant to warp scheduling.


Note that 64-bit floating point operations consume both of the first two execution columns. This implies that an SM can issue up to 32 single-precision (32-bit) floating point operations or 16 double-precision (64-bit) floating point operations at a time.

Dual Warp Scheduler

Threads are scheduled in groups of 32 threads called warps. Each SM features two warp schedulers (Fig. 7) and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
The dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs.
Most instructions can be dual issued: two integer instructions, two floating-point instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently.
Double precision instructions do not support dual dispatch with any other operation.
In each cycle, a total of 32 instructions can be dispatched from one or two warps to these blocks.
It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units.
A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs (see Fig. 8).

Context Switching

Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time.
Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU.
For example, a program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors.
On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources.

Fig. 7. Illustrating the dual warp scheduler.


Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.
Switching from one application to another takes just 25 microseconds.
This time is short enough that a Fermi GPU can still maintain high utilization even when running multiple applications, like a mix of compute code and graphics code.

Fig. 8. Issuing of the instructions to the execution blocks.


Efficient multitasking is important for consumers (e.g., for video games using physics-based effects) and professional users (who often need to run computationally intensive simulations and simultaneously visualize the results). As mentioned, this switching is managed by the GigaThread hardware thread scheduler.


Memory Architecture

The memory hierarchy comprises an L1 cache per SM and a unified L2 cache that services all operations (load, store and texture).
Register files, shared memories, L1 caches, the L2 cache, and DRAM memory are Error Correcting Code protected (see the Error Correcting Code subsection).


Registers

Each SM has 32K (32,768) 32-bit registers. Each thread has access to its own registers and not to those of other threads. A CUDA kernel can use at most 63 registers per thread. The number of registers available per thread degrades gracefully from 63 to 21 as the workload (and hence the resource requirements) increases with the number of threads. Registers have a very high bandwidth: about 8,000 GB/s.


Fig. 9. The Fermi memory hierarchy.


L1+Shared Memory

On-chip memory that can be used either to cache data for individual threads (register spilling/L1 cache) or to share data among several threads (shared memory).
This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache.
Prior generation GPUs spilled registers directly to DRAM, increasing access latency. Also, shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic.
Shared memory is accessible by the threads in the same thread block. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.).
Because the access latency to this memory is also completely predictable, algorithms can be written to interleave loads, calculations, and stores with maximum efficiency.
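A sketch of both uses of this 64 KB block, assuming the CUDA runtime API (the kernel and host function names are ours): the tile below lives in shared memory, and the host call requests the 48 KB shared / 16 KB L1 split for that kernel.

```cuda
#include <cuda_runtime.h>

// Each thread stages one element into shared memory; after the barrier,
// neighboring threads in the same block reuse each other's loads.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256];  // carved out of the 64 KB on-chip block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();             // tile now visible to the whole block
    if (i < n && threadIdx.x > 0)
        out[i] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x - 1]);
}

void preferShared(void)
{
    // Request the 48 KB shared memory / 16 KB L1 cache configuration.
    cudaFuncSetCacheConfig(smooth, cudaFuncCachePreferShared);
}
```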

Local memory

Local memory is meant as a memory location used to hold “spilled” registers.
Register spilling occurs when a thread block requires more register storage than is available on an SM.
Pre-Fermi GPUs spilled registers to global memory, causing a dramatic drop in performance.
Compute ≥ 2.0 devices spill registers to the L1 cache, which minimizes the performance impact of register spills.
Register spilling increases the importance of the L1 cache: pressure from spilled registers and the stack (both of which consume L1 storage) can increase the cache miss rate through data eviction.
Local memory is used only for some automatic variables (which are declared in the device code without any of the __device__, __shared__, or __constant__ qualifiers).
Generally, an automatic variable resides in a register except for the following:

  • Arrays for which the compiler cannot determine that they are indexed with constant quantities;
  • Large structures or arrays that would consume too much register space;
  • Any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.
The nvcc compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option.
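An illustrative kernel (names are ours) where a per-thread array is indexed with a value known only at run time, so the compiler cannot map the accesses to fixed registers and places the array in local memory; compiling it with nvcc --ptxas-options=-v reports the resulting lmem usage.

```cuda
__global__ void perThreadBins(const unsigned char *in, int *out, int n)
{
    int bins[16];                   // dynamically indexed -> local memory
    for (int b = 0; b < 16; ++b)
        bins[b] = 0;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bins[in[i] & 15] += 1;      // index depends on run-time data
        out[i] = bins[in[i] & 15];
    }
}
```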

L2 Cache

A 768 KB unified L2 cache, shared among the 16 SMs, services all loads and stores from/to global memory, including copies to/from the CPU host, and also texture requests.
As an example, if one copies 512KB from CPU to GPU, those data will reside both in the global memory and in L2; a kernel needing those data immediately after the CPU->GPU copy will find them in L2.
The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.

Global memory

Global memory is accessible by all threads as well as by the host (CPU). It has high latency (400-800 cycles).

Error Correcting Code

The Fermi architecture supports Error Correcting Code (ECC) based protection of data in memory.
ECC was requested by GPU computing users to enhance data integrity in high performance computing environments (such as medical imaging and large-scale cluster computing).
Naturally occurring radiation can cause a bit stored in memory to be altered, resulting in a soft error.
ECC technology detects and corrects single-bit soft errors before they affect the system. Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes that correct any single bit error in hardware as the data is accessed.
In addition, SECDED ECC ensures that all double-bit errors and many multi-bit errors are also detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.
ECC is enabled by default. If ECC is not needed, it can be disabled for improved performance using the nvidia-smi utility (or via Control Panel on Microsoft Windows systems) [7].
Note that toggling ECC on or off requires a reboot to take effect.
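On Linux, the nvidia-smi invocations look like the following sketch (administrative privileges are required, and the change takes effect only after the next reboot):

```shell
# Show the current and pending ECC state.
nvidia-smi -q | grep -A2 -i "ecc mode"

# Disable ECC for improved performance (-e / --ecc-config: 0 = off, 1 = on).
nvidia-smi -e 0
```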

[1] NVIDIA’s Next Generation CUDA Compute Architecture: Fermi.

[2] N. Brookwood, NVIDIA Solves the GPU Computing Puzzle.

[3] P.N. Glaskowsky, NVIDIA’s Fermi: The First Complete GPU Computing Architecture.

[4] N. Whitehead, A. Fit-Florea, Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs, 2011.

[5] S.F. Oberman, M. Siu, "A high-performance area-efficient multifunction interpolator," Proc. of the 17th IEEE Symposium on Computer Arithmetic, Cap Cod, MA, USA, Jul. 27-29, 2005, pp. 272–279.

[6] R. Farber, "CUDA Application Design and Development," Morgan Kaufmann, 2011.

[7] NVIDIA Application Note, "Tuning CUDA Applications for Fermi".

Note: All the images are owned by NVIDIA Corporation and distributed under Creative Commons Attribution Share Alike 3.0 License
