Module 3 · Lesson 1

RAM vs. VRAM: Where Models Actually Live

Understanding why memory bandwidth — not clock speed — is the true bottleneck of local inference.

Why did Meta's LLaMA release in February 2023 suddenly make consumer GPU memory specs a front-page hardware topic?

When Meta AI released the weights for LLaMA 1 on February 24, 2023, the research community gained access to models ranging from 7 billion to 65 billion parameters. Within days, hobbyists discovered that the 7B model could be quantized to run on a single consumer GPU with 6 GB of VRAM — the kind found in millions of gaming laptops. Subreddits and Discord servers erupted with benchmark threads comparing RTX 3060 Ti against RTX 4090 performance. For the first time, the phrase "does your GPU have enough VRAM?" entered mainstream tech conversation.

The answer depended almost entirely on one thing: how much video RAM was soldered to the graphics card, and how fast data could move across it.

System RAM vs. VRAM: A Functional Distinction

When a language model runs inference, every layer of the network must be loaded into addressable memory before any arithmetic happens. System RAM (DDR5, LPDDR5, etc.) connects to the CPU. VRAM — video random-access memory, typically GDDR6 or HBM — connects directly to the GPU's compute cores. The distinction matters enormously because modern LLM inference is bottlenecked not by arithmetic speed but by memory bandwidth: how many gigabytes per second can be fed to the compute units.

A consumer CPU can address 64–128 GB of DDR5 RAM with bandwidth around 50–90 GB/s. A mid-range GPU (RTX 4070) holds only 12 GB of VRAM but moves data at roughly 504 GB/s, about six to ten times faster. A high-end GPU like the H100 SXM reaches 3.35 TB/s across 80 GB of HBM3. This is why a model that barely fits in VRAM runs dramatically faster than the same model spilling into system RAM.

Key Principle

Inference throughput (tokens per second) is primarily limited by how fast the GPU can read model weights from VRAM, not by raw FLOP count. A model that fits entirely in VRAM will typically generate 5–15× more tokens per second than the same model offloaded to system RAM, on comparable hardware.
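The principle can be turned into a back-of-envelope ceiling. The sketch below assumes that generating one token requires reading every weight byte once, so tokens per second cannot exceed bandwidth divided by model size. The function name is illustrative, the 4 GB model size is an approximate figure for a Q4 7B model, and real throughput lands below this bound.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed, assuming all weights are read once per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures quoted in this lesson; 4.0 GB is an assumed Q4 7B weight size.
for name, bandwidth in [("DDR5 system RAM", 75.0), ("RTX 4070 VRAM", 504.0)]:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, 4.0)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

The ratio between the two ceilings equals the ratio between the two bandwidths, which is why the memory system, rather than the compute units, sets the pace.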

How Model Size Translates to Memory Requirement

Every parameter in a neural network is stored as a floating-point number. The precision format determines byte size per parameter, and thus the raw memory footprint:

Precision	Bytes / Param	7B Model	13B Model	70B Model
FP32 (full)	4 bytes	~28 GB	~52 GB	~280 GB
BF16 / FP16	2 bytes	~14 GB	~26 GB	~140 GB
Q8 (8-bit int)	1 byte	~7 GB	~13 GB	~70 GB
Q4 (4-bit)	0.5 bytes	~4 GB	~7 GB	~35 GB

These are weight-only figures. Runtime buffers and quantization metadata typically add 10–20%, and the KV cache (which stores attention context and grows with sequence length) adds more on top. A Q4 7B model that appears to need 4 GB can therefore require 5.5–6 GB in practice under Ollama or llama.cpp once a few thousand tokens of context are in use.
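The table's arithmetic can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, 1 GB is taken as 10^9 bytes, and the overhead fraction is a tunable assumption rather than a measured value.

```python
# Bytes per parameter for each precision format in the table above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint in GB (billions of params times bytes per param)."""
    return params_billion * BYTES_PER_PARAM[precision]


def with_overhead_gb(params_billion: float, precision: str, overhead: float = 0.2) -> float:
    """Footprint after adding an assumed fractional overhead for buffers and cache."""
    return weight_gb(params_billion, precision) * (1.0 + overhead)


for size in (7, 13, 70):
    row = ", ".join(f"{p}: {weight_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

Comparing the result of with_overhead_gb against your card's VRAM gives a first-pass fit check before downloading anything.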

In January 2024, systematic tests published on Hugging Face Spaces reported that Mistral 7B Q4_K_M required approximately 5.1 GB of VRAM at a 2048-token context, rising to 6.4 GB at 8192 tokens due to KV cache growth. This was one of the first widely cited empirical measurements helping consumers gauge fit.

The KV Cache: Memory That Grows With Context

The key-value cache stores intermediate attention computations for each token already processed in the conversation. It grows linearly with context length multiplied by the number of attention layers and heads. For a Llama 2 13B model with 40 layers and 40 heads at FP16, each token in context consumes roughly 0.8 MB of VRAM. A 4096-token conversation therefore adds ~3.3 GB on top of model weights.
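The per-token figure follows from a short formula: one key vector and one value vector per layer, each as wide as heads × head dimension. The sketch below assumes a head dimension of 128 (not stated in this lesson) and no grouped-query attention; the function names are illustrative.

```python
def kv_cache_bytes_per_token(n_layers: int, n_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: keys plus values for every layer."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, n_layers: int, n_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Total KV cache size in GB (10^9 bytes) for a given context length."""
    per_token = kv_cache_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_value)
    return per_token * context_tokens / 1e9


# 40 layers and 40 heads come from the text; head_dim=128 is an assumption.
per_token = kv_cache_bytes_per_token(n_layers=40, n_heads=40, head_dim=128)
print(f"per token: {per_token / 1e6:.2f} MB")
print(f"4096 tokens: {kv_cache_gb(4096, 40, 40, 128):.2f} GB")
```

Halving bytes_per_value (for example, an 8-bit cache) or shortening the context halves the total, which is the trade the context-length setting exposes.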

This is why "context length" is listed as a VRAM constraint in every serious local inference guide — not merely as a capability limit, but as a physical memory budget. Tools like llama.cpp expose a --ctx-size flag explicitly to let users trade context depth against memory fit.

Real Case: Apple M2 Unified Memory (2023)

Apple's M2 Ultra chip, released in June 2023, offered up to 192 GB of unified memory shared between CPU and GPU at 800 GB/s bandwidth. Researchers at Hugging Face demonstrated running a full FP16 70B model on a single Mac Studio, something that otherwise required two A100 80 GB GPUs. The trade-off is bandwidth: the M2 Ultra delivers 800 GB/s, while each A100 achieves about 2 TB/s. Inference was roughly 2–3× slower, but it was achievable on consumer hardware for the first time without quantization.

VRAM: Video RAM. High-bandwidth memory mounted on the graphics card (GDDR) or on the GPU package (HBM), used to store model weights, activations, and KV cache during inference. Typically GDDR6 (consumer) or HBM2/HBM3 (data center).

Memory Bandwidth: The rate at which data can be transferred between memory and compute units, measured in GB/s. The primary throughput limiter for autoregressive LLM generation.

KV Cache: Key-value attention cache. Stored intermediate attention computations that grow with sequence length, consuming VRAM proportional to context depth.

Quantization: Reducing parameter precision (e.g., FP16 → Q4) to shrink memory footprint, at a small cost to output quality that varies by method and bit depth.

Quiz: RAM & VRAM Fundamentals

Four questions — select the best answer for each.

1. A 7B parameter model stored in Q4 (4-bit) quantization uses approximately how many bytes per parameter?

Correct. Q4 stores two 4-bit values per byte, so each parameter costs 0.5 bytes — giving roughly 3.5–4 GB for weights alone on a 7B model before overhead.

Not quite. Q4 (4-bit integer) packs two parameters per byte, yielding 0.5 bytes per parameter. FP32 = 4 bytes, FP16 = 2 bytes, Q8 = 1 byte.

2. Why does increasing context length consume more VRAM during inference?

Correct. The key-value cache stores attention state for every processed token. It grows proportionally with context length × layers × heads × head dimension, so each additional token costs a fixed amount of VRAM.

Incorrect. Weights stay fixed regardless of context. The KV cache — which stores intermediate attention computations — is what grows with each additional token in the prompt or conversation.

3. A consumer GPU with 12 GB VRAM at 504 GB/s bandwidth is compared to CPU inference using 64 GB DDR5 RAM at 75 GB/s. Which statement best describes inference performance?

Correct. Memory bandwidth — not capacity or FLOP count — determines autoregressive generation speed. At 504 GB/s vs. 75 GB/s, the GPU can feed its compute units roughly 6–7× faster, producing proportionally more tokens per second.

Incorrect. LLM inference is memory-bandwidth-bound. The GPU's 504 GB/s vastly outpaces DDR5's ~75 GB/s. More total RAM doesn't help if data can't be delivered to compute units fast enough.

4. What was historically significant about Apple's M2 Ultra chip (2023) for running large language models locally?

Correct. M2 Ultra's up to 192 GB unified memory (shared CPU/GPU) at 800 GB/s enabled full-precision 70B inference that otherwise required dual A100 80 GB data center GPUs — a genuine milestone for accessible local inference.

Incorrect. The M2 Ultra's significance was its unified memory — up to 192 GB accessible by both CPU and GPU at high bandwidth — enabling unquantized 70B model inference on consumer hardware for the first time.

Module 3 · Lab 1

Memory Sizing Calculator

Conversation lab — ask the AI assistant to help you size memory requirements for specific models and use cases.

Lab Objective

Practice estimating VRAM requirements for different model sizes and quantization formats. Ask the assistant to walk you through calculations for your own hardware scenario.

Try asking: "I have an RTX 3070 with 8 GB VRAM. What's the largest model I can run, and at what quantization level?" — or describe your actual machine and target model.

Memory Sizing Assistant

L1 Lab

Hello! I'm here to help you calculate VRAM and RAM requirements for running language models locally. Tell me your GPU model (or say "CPU only"), how much memory it has, and which model you're hoping to run — and I'll work through the sizing math with you.

Module 3 · Lesson 2

Choosing a GPU: Consumer Tiers Explained

From the RTX 3060 to the RTX 4090 — what each tier actually buys you in tokens per second.

When llama.cpp published its first GPU benchmark tables in mid-2023, which consumer cards delivered the best tokens-per-second per dollar?

In June 2023, llama.cpp creator Georgi Gerganov (ggerganov) merged CUDA support into the project, opening GPU-accelerated inference to anyone with an NVIDIA card. Within weeks, the project's GitHub wiki filled with community benchmark tables. A pattern emerged quickly: the RTX 3090 with 24 GB VRAM dominated mid-range local inference, not because of its CUDA core count, but because it was the cheapest way to get 24 GB of GDDR6X bandwidth on the second-hand market. Meanwhile the RTX 4090, with 24 GB at 1 TB/s, became the new ceiling: it could keep more than half of a ~40 GB Q4 70B model in VRAM and offload the remaining ~16 GB to system RAM while still decoding at a usable speed.

The community discovered that for models that fit entirely in VRAM, the generation speed scaled almost perfectly with memory bandwidth — not CUDA core count, not tensor core generation, just raw GB/s to the compute units.

Consumer GPU Tiers: A Practical Hierarchy

The following tiers reflect the state of consumer NVIDIA GPUs as of 2024. AMD Radeon RX 7900 XTX (24 GB VRAM) provides a competitive alternative, though ROCm software support for inference tools lags CUDA in maturity. Apple Silicon is addressed separately in Lesson 3.

Tier	GPU	VRAM	Bandwidth	Best Fit Model
Entry	RTX 3060	12 GB GDDR6	360 GB/s	Q4 7B–13B
Entry	RTX 4060 Ti	16 GB GDDR6	288 GB/s	Q4 13B; Q8 7B
Mid	RTX 3090	24 GB GDDR6X	936 GB/s	Q4 34B; Q8 13B
Mid	RTX 4090	24 GB GDDR6X	1,008 GB/s	Q4 34B; Q8 13B; fast 70B offload
Pro	A100 80 GB	80 GB HBM2e	2,000 GB/s	Q8 70B; FP16 70B across two cards
Pro	H100 SXM	80 GB HBM3	3,350 GB/s	Q8 70B; FP16 70B across two cards at high throughput

Why the RTX 4060 Ti 16 GB is Slower Than the RTX 3060 12 GB for Inference

NVIDIA cut memory bandwidth on the RTX 4060 Ti to reduce cost. Its 288 GB/s is actually slower than the RTX 3060's 360 GB/s — meaning inference on models that fit in either card runs faster on the older 3060, despite the 4060 Ti being a newer, more expensive product. The 16 GB version is still valuable for fitting larger quantized models, but token generation rate per dollar favors the 3060 for smaller models. This counter-intuitive result was documented extensively in the r/LocalLLaMA community benchmarks of July 2023.

Multi-GPU and Offloading Strategies

When a model exceeds single-GPU VRAM, two strategies apply: splitting the model across multiple GPUs (for example with tensor parallelism, which shards each layer's weight matrices across cards, optionally linked by NVLink) and CPU offloading (storing excess layers in system RAM). llama.cpp exposes the --n-gpu-layers flag to control how many transformer layers are kept in VRAM; layers not on the GPU are computed on the CPU at much lower speed.
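A rough way to pick a starting value for --n-gpu-layers is to divide usable VRAM by the size of one layer. The sketch below is an estimate under stated assumptions: all layers are treated as equal in size, the 80-layer count for a 70B model and the 2 GB reserve for KV cache and buffers are assumed values, and the helper name is hypothetical.

```python
def gpu_layers_that_fit(model_size_gb: float, n_layers: int, vram_gb: float,
                        reserved_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# ~40 GB Q4 70B model, assumed 80 layers, on a 24 GB card.
layers = gpu_layers_that_fit(model_size_gb=40.0, n_layers=80, vram_gb=24.0)
print(f"try --n-gpu-layers {layers}")
```

In practice the value is then nudged down if the runtime reports an out-of-memory error, or up if VRAM is left unused.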

In October 2023, the Ollama project released documentation showing that a 70B Q4 model on a machine with one RTX 4090 (24 GB) and 64 GB DDR5 RAM ran at approximately 8 tokens/second with as many layers as would fit held in VRAM, versus 1.2 tokens/second when run entirely on the CPU. This ~6.7× penalty quantified the real cost of VRAM overflow for practical users.

Two RTX 3090s linked via NVLink provide 48 GB of combined VRAM, with each card reading its share of the weights at full local bandwidth. That is enough for Q8 34B (~38 GB) or Q4 70B (~40 GB) inference at high speed, at a cost substantially below enterprise alternatives. This configuration became popular among researchers who published local inference work throughout 2023–2024.

AMD ROCm: The Alternative Path

AMD's RX 7900 XTX offers 24 GB GDDR6 at 960 GB/s — competitive with the RTX 3090 — at generally lower price points. However, ROCm (Radeon Open Compute) software support in inference tools has historically been one to two major versions behind CUDA support. As of early 2024, llama.cpp ROCm builds required manual compilation and were not distributed in standard Ollama binaries on Windows. Linux users reported functional but occasionally unstable ROCm inference. The gap is narrowing as AMD invests in ROCm tooling.

Tokens Per Second: What to Expect

Community benchmarks (primarily from r/LocalLLaMA and the llama.cpp GitHub issues tracker) established approximate token generation rates for common hardware/model combinations as of late 2023. These figures assume Q4_K_M quantization, default context, and llama.cpp or Ollama as the inference engine:

GPU	7B Q4	13B Q4	34B Q4	70B Q4
RTX 3060 12 GB	~85 t/s	~45 t/s	offload	offload
RTX 4070 12 GB	~110 t/s	~58 t/s	offload	offload
RTX 3090 24 GB	~120 t/s	~70 t/s	~32 t/s	offload
RTX 4090 24 GB	~145 t/s	~80 t/s	~38 t/s	~8 t/s (offload)
2× RTX 3090 NVLink	~115 t/s	~70 t/s	~35 t/s	~22 t/s

Tensor Parallelism: Splitting each layer's weight matrices across multiple GPUs so each holds a portion of the weights in VRAM simultaneously, enabling models larger than any single GPU's memory.

CPU Offloading: Storing transformer layers that don't fit in VRAM in system RAM and computing them on CPU. Functional but typically 5–10× slower than full-VRAM inference.

NVLink: NVIDIA's high-speed GPU interconnect, which lets multiple GPUs exchange data directly without going through PCIe. Among the consumer cards covered here, the RTX 3090 supports 2-way NVLink at roughly 112 GB/s bidirectional; the RTX 4090 has no NVLink connector.

Quiz: GPU Tiers for Local Inference

Four questions — select the best answer for each.

1. Why does the RTX 4060 Ti 16 GB sometimes generate tokens more slowly than the RTX 3060 12 GB for models that fit in either card's VRAM?

Correct. NVIDIA reduced memory bandwidth on the 4060 Ti to cut costs. At 288 GB/s vs. the 3060's 360 GB/s, the older card feeds its compute units faster for memory-bandwidth-bound workloads like LLM token generation.

Incorrect. LLM inference is memory-bandwidth-bound. The RTX 4060 Ti's 288 GB/s bandwidth is actually lower than the RTX 3060's 360 GB/s — meaning the older card generates tokens faster when both models fit in VRAM.

2. According to Ollama's 2023 documentation, approximately how much faster is GPU-accelerated inference (as many layers as fit in VRAM) versus running entirely on the CPU for a 70B Q4 model on an RTX 4090 system?

Correct. Ollama documentation showed ~8 tokens/s with GPU layers versus ~1.2 tokens/s fully CPU-offloaded — a ~6.7× difference, reflecting the bandwidth gap between GDDR6X and DDR5.

Incorrect. Ollama's data showed approximately 8 t/s with GPU inference versus 1.2 t/s CPU-offloaded — roughly a 6.7× difference. This quantifies the real cost of VRAM overflow on a model too large for any single consumer GPU.

3. What primary advantage does a two-RTX 3090 NVLink configuration offer over a single RTX 4090 for local LLM inference?

Correct. Two RTX 3090s via NVLink combine their VRAM into 48 GB of GPU memory, enabling Q8 34B or Q4 70B inference that a single 24 GB card cannot accommodate without CPU offloading.

Incorrect. The key benefit is combined VRAM: 48 GB vs. 24 GB. This allows models that don't fit in a single 24 GB card to run entirely in GPU memory at full bandwidth — the primary bottleneck for inference speed.

4. As of early 2024, which statement most accurately describes AMD ROCm support for local LLM inference tools compared to NVIDIA CUDA?

Correct. As of early 2024, ROCm builds of llama.cpp required manual compilation and were not in standard Ollama Windows binaries. Linux ROCm support was functional but occasionally unstable, one to two versions behind CUDA parity.

Incorrect. ROCm support has historically lagged CUDA in inference tools. Standard Ollama distributions did not include ROCm Windows binaries as of early 2024, and Linux users needed manual compilation for reliable use.

Module 3 · Lab 2

GPU Selection Advisor

Conversation lab — get personalized GPU recommendations based on your budget, use case, and target models.

Lab Objective

Practice evaluating GPU options for local inference. Ask the assistant to compare specific cards, estimate tokens-per-second for your target model, or evaluate whether upgrading from your current hardware is worthwhile.

Try asking: "I'm deciding between an RTX 3090 for $400 used and an RTX 4070 Ti 12 GB for $600 new. I want to run 13B models. Which should I buy?" — or describe your own decision.

GPU Selection Advisor

L2 Lab

I can help you choose the right GPU for local LLM inference. Tell me your budget, which models you want to run, and your current hardware — I'll compare the relevant options including VRAM capacity, bandwidth, and expected tokens per second.

Module 3 · Lesson 3

CPU Inference & Apple Silicon

When dedicated GPU VRAM isn't the bottleneck — how unified memory architectures changed the local inference landscape.

How did Apple's M1/M2 chip architecture accidentally become one of the best consumer platforms for running language models in 2023?

In early June 2023, developer Georgi Gerganov merged Metal GPU support into llama.cpp, roughly three months after the project's initial CPU-only release in March 2023. Apple Silicon Macs, dismissed by many as gaming-inferior, suddenly had a structural advantage: their unified memory meant there was no bandwidth penalty for the GPU accessing model weights. An M2 Pro MacBook Pro with 32 GB of RAM could run a Q4 13B model at 20–25 tokens per second, faster than many gaming desktops doing CPU offload inference at the time.

Within weeks, threads appeared comparing M2 Pro MacBook Pro inference performance to RTX 3060 gaming PCs. For models that fit in unified memory, the MacBook often won — or tied — at a fraction of the power draw and without active cooling noise.

Apple Silicon Unified Memory Architecture

Traditional laptops have discrete RAM (CPU) and VRAM (GPU) connected by a PCIe bus with limited bandwidth. Apple Silicon eliminates this boundary: all memory is unified — a single pool accessible by CPU cores, GPU cores, and the Neural Engine at full chip bandwidth. This means llama.cpp's Metal backend can run GPU inference on model weights stored in what appears to be ordinary RAM, without copying data between separate memory pools.

Bandwidth figures for the M-series chips:

Chip	Max Unified RAM	Memory Bandwidth	Approx 7B Q4 Speed
M1	16 GB	68 GB/s	~16 t/s
M1 Pro	32 GB	200 GB/s	~35 t/s
M2	24 GB	100 GB/s	~22 t/s
M2 Pro	32 GB	200 GB/s	~38 t/s
M2 Max	96 GB	400 GB/s	~55 t/s
M2 Ultra	192 GB	800 GB/s	~70 t/s (FP16 70B feasible)
M3 Max	128 GB	400 GB/s	~60 t/s

The Power Efficiency Advantage

An M2 Pro MacBook Pro running a 13B Q4 model draws approximately 20–30 watts at the wall. An equivalent RTX 3090 gaming PC running the same model draws 350–450 watts. For users running inference continuously (coding assistants, document processing), the Apple Silicon power advantage compounds significantly over time and enables battery-powered inference that no discrete GPU solution can match.
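To see how the gap compounds, the wattage can be converted to energy per day. This sketch uses the midpoints of the ranges above (25 W and 400 W) and a hypothetical 8 hours of inference per day; the function name is illustrative.

```python
def daily_kwh(watts: float, hours_per_day: float) -> float:
    """Energy used per day in kilowatt-hours."""
    return watts * hours_per_day / 1000.0


# Midpoints of the ranges in the text; 8 hours/day is an assumed workload.
for name, watts in [("M2 Pro MacBook Pro", 25.0), ("RTX 3090 gaming PC", 400.0)]:
    per_day = daily_kwh(watts, 8.0)
    print(f"{name}: {per_day:.1f} kWh/day, {per_day * 365:.0f} kWh/year")
```

Multiplying the yearly figure by your local electricity rate gives the running-cost difference between the two setups.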

CPU-Only Inference: AVX2, AVX-512, and x86 Performance

For users without a discrete GPU, llama.cpp can run entirely on CPU using SIMD extensions. The key extensions that accelerate quantized inference:

Extension

AVX2

Available on Intel Haswell (2013+) and AMD Zen (2017+). Processes 256-bit vectors. Minimum recommended for practical CPU inference. Most consumer PCs from the past decade have AVX2.

Extension

AVX-512

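Intel Skylake-X and later server/HEDT chips; AMD Zen 4 (Ryzen 7000+). 512-bit vectors process twice as many values per instruction as AVX2, which speeds up quantized matmul most during compute-heavy prompt processing. The gain during token generation is smaller, because that phase is limited by memory bandwidth.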

Architecture

AMD Zen 4

Ryzen 7000 series. First AMD consumer CPUs with AVX-512. Benchmarks from 2023 show Zen 4 achieving ~12–15 t/s on 7B Q4, competitive with entry discrete GPU setups for small models.

Architecture

Intel Core Ultra

Meteor Lake (launched December 2023) includes a dedicated NPU for AI workloads. Intel's OpenVINO runtime can use the NPU for specific model formats, though llama.cpp support is partial as of 2024.

CPU inference speed scales roughly with memory bandwidth (DDR5 > DDR4) and the number of channels. AMD's Threadripper platform with 8-channel DDR5 reaches ~175 GB/s system bandwidth — approaching lower-tier GPU bandwidth and enabling surprisingly capable CPU-only inference for 7B models.
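Theoretical peak bandwidth for system RAM follows directly from the channel count and transfer rate: each DDR channel is 64 bits (8 bytes) wide. The sketch below shows that calculation for two example dual-channel configurations; the memory speeds are illustrative choices, the function name is hypothetical, and sustained real-world bandwidth is lower than the peak.

```python
def ddr_peak_gb_s(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


# Example configurations (speeds chosen for illustration only).
for label, channels, speed in [("dual-channel DDR4-3200", 2, 3200),
                               ("dual-channel DDR5-5600", 2, 5600)]:
    print(f"{label}: {ddr_peak_gb_s(channels, speed):.1f} GB/s peak")
```

Doubling the channel count doubles the peak, which is why workstation platforms with more memory channels pull ahead for CPU-only inference.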

In November 2023, Andrej Karpathy benchmarked llama2.c (a minimal C implementation) on an Apple M2 Air and various x86 machines, publishing results that showed the M2 Air at 89 t/s for a small Llama 2-architecture model, faster than a Threadripper at full multi-thread. The comparison highlighted unified memory bandwidth as the decisive factor over raw compute.

Real Case: Ollama on Apple Silicon — December 2023

When Ollama 0.1.0 released in December 2023 with native Apple Silicon support and automatic Metal acceleration, it became the most-downloaded AI tool in the Mac App Store within its launch week. The combination of llama.cpp Metal backend, automatic model management, and a clean API made Apple Silicon Macs the most accessible platform for local inference — no driver configuration, no CUDA, no ROCm. A user with a MacBook Pro M2 Pro could run ollama run llama2:13b and get responsive output in under two minutes from cold start.

Unified Memory: A single physical memory pool accessible at full bandwidth by both CPU and GPU compute cores, eliminating the PCIe transfer penalty seen in discrete GPU systems.

Metal: Apple's GPU programming API (analogous to CUDA for NVIDIA), used by llama.cpp and Ollama to accelerate inference on Apple Silicon's integrated GPU.

AVX-512: Advanced Vector Extensions 512-bit. A CPU instruction set extension enabling 512-bit wide SIMD operations, twice the vector width of AVX2, which speeds up quantized LLM inference on compatible x86 CPUs.

Quiz: CPU Inference & Apple Silicon

Four questions — select the best answer for each.

1. What structural advantage does Apple Silicon's unified memory provide specifically for LLM inference compared to discrete GPU systems?

Correct. Unified memory eliminates the PCIe copy bottleneck — the GPU reads model weights at full chip memory bandwidth without first transferring data across a slower interconnect, as discrete GPU systems must do.

Incorrect. The structural advantage is unified memory bandwidth: GPU compute cores access model weights in the same high-bandwidth pool as the CPU, eliminating PCIe transfer overhead that limits discrete GPU systems.

2. Andrej Karpathy's 2023 llama2.c benchmarks showed an M2 Air outperforming many x86 CPUs at small-model inference. What hardware factor best explains this result?

Correct. M2 Air's 100 GB/s unified bandwidth — utilized by both CPU and GPU — greatly exceeds typical x86 laptop DDR5 bandwidth (~50–70 GB/s), making memory-bandwidth-bound inference proportionally faster.

Incorrect. The key factor is memory bandwidth. M2 Air provides 100 GB/s accessible to its GPU cores via unified memory, while most x86 laptops offer 50–70 GB/s DDR5 accessible only to CPU. For bandwidth-bound LLM inference, the Apple chip wins on throughput per watt.

3. Which AMD CPU family was the first mainstream consumer platform from AMD to support AVX-512, providing a meaningful inference speedup over earlier Ryzen generations?

Correct. AMD Zen 4 (Ryzen 7000, launched September 2022) was the first mainstream consumer AMD CPU with AVX-512. Intel's consumer 12th/13th Gen Alder/Raptor Lake notably disabled AVX-512 in their hybrid core design.

Incorrect. AMD Zen 4 (Ryzen 7000 series, 2022) was the first mainstream AMD consumer platform with AVX-512. Zen 3 lacked it. Intel's Alder Lake (12th Gen) actually disabled AVX-512 due to hybrid core architecture issues.

4. Approximately how much power does an M2 Pro MacBook running 13B Q4 inference draw compared to an RTX 3090 gaming PC running the same task?

Correct. An M2 Pro MacBook draws approximately 20–30W at the wall during inference. An RTX 3090 gaming PC draws 350–450W. This ~15× gap in power draw makes Apple Silicon compelling for continuous or battery-powered inference workloads.

Incorrect. An M2 Pro draws only 20–30 watts during LLM inference while an RTX 3090 desktop draws 350–450 watts — approximately 15× more. This power efficiency advantage is one of Apple Silicon's most significant practical benefits for local AI workloads.

Module 3 · Lab 3

Platform Comparison Advisor

Conversation lab — compare Apple Silicon, x86 CPU, and discrete GPU platforms for your specific inference use case.

Lab Objective

Explore the trade-offs between different inference platforms based on your workload, portability needs, and budget. Ask the assistant to compare options or help you evaluate your existing hardware.

Try asking: "I use my MacBook Air M2 with 16 GB RAM for coding. Can I run a 13B model, and will it affect battery life significantly?" — or describe your actual machine and workload.

Platform Comparison Advisor

L3 Lab

I can help you compare inference platforms — Apple Silicon, x86 CPU-only, and NVIDIA GPU setups. Tell me about your hardware, how you plan to use local models (coding assistant, document processing, batch tasks), and any portability or power constraints. I'll give you an honest comparison.

Module 3 · Lesson 4

Storage, Cooling & System Integration

The hardware requirements that don't show up in benchmark tables — and why they determine long-term inference reliability.

Why does NVMe SSD speed directly affect how quickly you can switch between local models, and how much does it actually matter in practice?

In January 2024, the LocalLLaMA subreddit produced a systematic thread comparing model load times on NVMe Gen4, NVMe Gen3, SATA SSD, and spinning HDD. The results were striking: a 7B Q4 model (approximately 4 GB file) loaded in 3.2 seconds from a Samsung 990 Pro (Gen4 NVMe), 5.8 seconds from a Gen3 NVMe, 14 seconds from a SATA SSD, and over 90 seconds from a mechanical hard drive. For users switching between a coding assistant and a general-purpose model dozens of times per day, the gap between Gen4 NVMe and HDD came to nearly a minute and a half per model swap, easily an hour of lost productivity across a working week.

The same thread noted that the 70B Q4 model at ~40 GB took 28 seconds from Gen4 NVMe — and over 12 minutes from spinning disk. That benchmark effectively ended the discussion of whether storage tier mattered for local AI workloads.

Storage Requirements for Model Libraries

Local inference requires storing model weights on disk, which are loaded into VRAM/RAM at startup. Model file sizes scale directly with parameter count and quantization:

Model	Q4_K_M	Q8_0	FP16
7B	~4.1 GB	~7.7 GB	~14 GB
13B	~7.4 GB	~14 GB	~26 GB
34B	~19 GB	~38 GB	~68 GB
70B	~40 GB	~78 GB	~140 GB

A modest local model library (say five 7B models, two 13B models, and one 34B model) occupies roughly 55 GB at Q4_K_M and about 105 GB at Q8_0, based on the file sizes above. Serious researchers maintaining multiple quantizations and experimental checkpoints routinely accumulate 500 GB to 2 TB of model files. The minimum recommended configuration for practical local inference is a 1 TB NVMe SSD dedicated to or shared with the model library.
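Library size is a straightforward sum over the file-size table. The sketch below totals an example library for a chosen quantization; the dictionary simply copies the approximate figures from the table, and the function name is illustrative.

```python
# Approximate file sizes in GB, copied from the table above.
FILE_SIZE_GB = {
    "7B": {"Q4_K_M": 4.1, "Q8_0": 7.7, "FP16": 14.0},
    "13B": {"Q4_K_M": 7.4, "Q8_0": 14.0, "FP16": 26.0},
    "34B": {"Q4_K_M": 19.0, "Q8_0": 38.0, "FP16": 68.0},
    "70B": {"Q4_K_M": 40.0, "Q8_0": 78.0, "FP16": 140.0},
}


def library_size_gb(counts: dict, quant: str) -> float:
    """Total disk space for a library, given how many models of each size it holds."""
    return sum(FILE_SIZE_GB[size][quant] * n for size, n in counts.items())


modest = {"7B": 5, "13B": 2, "34B": 1}
for quant in ("Q4_K_M", "Q8_0"):
    print(f"{quant}: {library_size_gb(modest, quant):.1f} GB")
```

Keeping several quantizations of the same model multiplies the total quickly, which is how libraries grow into the hundreds of gigabytes.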

Ollama stores downloaded models in ~/.ollama/models on Linux/macOS and C:\Users\[user]\.ollama\models on Windows. A dedicated secondary NVMe drive mounted at this path is a common configuration for power users to separate model storage from OS/application storage without performance penalty.

Storage Tier Comparison: Model Load Speed

Gen4 NVMe (7 GB/s): 7B model loads in ~3 seconds. Recommended baseline.
Gen3 NVMe (3.5 GB/s): 7B model loads in ~6 seconds. Acceptable for casual use.
SATA SSD (550 MB/s): 7B model loads in ~14 seconds. Functional but noticeable delay.
HDD (150 MB/s): 7B model loads in ~90 seconds. Effectively unusable for interactive work.
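The sequential read speeds above set a floor on load time: file size divided by throughput. The sketch below computes that floor for a ~4.1 GB Q4 7B file. It is an idealized lower bound, so measured times such as those listed above come out higher once filesystem overhead and model initialization are included; the names are illustrative.

```python
# Sequential read speeds in GB/s, taken from the tier list above.
TIER_GB_PER_S = {"Gen4 NVMe": 7.0, "Gen3 NVMe": 3.5, "SATA SSD": 0.55, "HDD": 0.15}


def min_load_seconds(file_gb: float, read_gb_per_s: float) -> float:
    """Idealized lower bound on load time: size divided by sequential read speed."""
    return file_gb / read_gb_per_s


for tier, speed in TIER_GB_PER_S.items():
    print(f"{tier}: at least {min_load_seconds(4.1, speed):.1f} s for a 4.1 GB file")
```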

Thermal Management: Sustained Load Behavior

Consumer GPUs are designed to run at maximum performance for gaming workloads (typically 15–30 minute sessions). Local LLM inference, by contrast, can sustain near-100% GPU memory bandwidth utilization for hours at a time — a thermal profile most consumer cooling solutions were not designed for.

NVIDIA's GPU Boost algorithm dynamically reduces clock speed when junction temperature exceeds a threshold (typically 83°C for RTX 30/40 series). An RTX 3090 running a 34B model for four hours in a poorly-ventilated case will thermal-throttle significantly, reducing effective memory bandwidth and token throughput by 15–30%. This phenomenon was documented by the EleutherAI team during extended inference runs in 2023, prompting guidance recommending aftermarket cooling and chassis airflow evaluation before committing to sustained inference workloads.

Cooling Recommendation

Desktop GPU: Case Airflow

Ensure positive front-to-back airflow. RTX 3090/4090 triple-fan coolers require 120–180mm of clearance below the GPU for intake. Target GPU junction temp: below 80°C under sustained inference load.

Cooling Recommendation

Desktop GPU: Aftermarket

For 24/7 inference servers, replacing the RTX 3090's stock cooler with an Arctic Accelero Xtreme IV (a popular community recommendation as of 2023) reduces junction temperature by 10–15°C under sustained load.

Cooling Recommendation

Apple Silicon

M-series MacBooks throttle more aggressively than desktops under sustained load. Mac Studio and Mac Pro with M2/M3 chips use larger heatsinks and maintain performance significantly better than MacBook Pro for multi-hour inference sessions.

Cooling Recommendation

CPU-Only Systems

CPU inference at full load generates 65–170W of heat depending on TDP. Standard 240mm AIO or large tower coolers handle sustained inference well. DDR5 heat spreaders rarely need active cooling even at sustained memory bandwidth.

PCIe Bandwidth and System Bus Considerations

When a GPU is installed in a PCIe x16 slot, the system bus becomes relevant for initial model loading and for CPU↔GPU data transfers during hybrid inference. PCIe 4.0 x16 provides ~32 GB/s in each direction; PCIe 3.0 x16 provides ~16 GB/s. For loading a 40 GB 70B Q4 model from RAM to VRAM, PCIe 4.0 completes the transfer in roughly 1.25 seconds versus 2.5 seconds on PCIe 3.0. This is a one-time cost per model load, not an ongoing inference bottleneck.

The PCIe bus is not the bottleneck during token generation: once weights are in VRAM, inference uses the GPU's internal memory bus (GDDR6X at 936–1008 GB/s), which is 30–60× faster than PCIe. PCIe speed therefore matters primarily for model switching speed, not generation throughput.
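Both claims in this section reduce to simple division, sketched below with the section's own figures. The function names are illustrative, and the transfer times are idealized (they ignore protocol overhead and assume the source can keep the link saturated).

```python
def transfer_seconds(size_gb: float, link_gb_s: float) -> float:
    """Idealized time to move a model across a link of the given bandwidth."""
    return size_gb / link_gb_s


def vram_to_pcie_ratio(vram_gb_s: float, pcie_gb_s: float) -> float:
    """How many times faster the GPU's own memory bus is than the PCIe link."""
    return vram_gb_s / pcie_gb_s


print(f"PCIe 4.0 x16: {transfer_seconds(40.0, 32.0):.2f} s for a 40 GB model")
print(f"PCIe 3.0 x16: {transfer_seconds(40.0, 16.0):.2f} s for a 40 GB model")
print(f"RTX 3090 VRAM vs PCIe 4.0: {vram_to_pcie_ratio(936.0, 32.0):.0f}x")
print(f"RTX 4090 VRAM vs PCIe 3.0: {vram_to_pcie_ratio(1008.0, 16.0):.0f}x")
```

The two ratios bracket the 30–60× range quoted above, which is why PCIe generation shows up in load times but not in tokens per second.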

Real Case: Meta's Internal LLaMA 2 Inference Clusters (2023)

Meta's infrastructure team documented in a 2023 technical blog post that their LLaMA 2 inference clusters used custom liquid-cooled GPU nodes to sustain maximum memory bandwidth continuously. Consumer GPU thermal throttling under sustained load — reducing effective bandwidth by 20–30% — was cited as a key reason consumer hardware cannot replicate cluster-level throughput even when raw specifications appear comparable. This distinction is relevant to anyone building a home inference server expecting data-center-equivalent performance.

Minimum and Recommended System Configurations

Component	Minimum	Recommended	Power User
GPU VRAM	8 GB (RTX 3070)	16–24 GB (RTX 4060 Ti 16 GB / 3090)	48 GB (2× RTX 3090) or 80 GB A100
System RAM	16 GB DDR4	32 GB DDR5	64–128 GB DDR5
Storage	500 GB SATA SSD	1 TB Gen3 NVMe	2 TB Gen4 NVMe
CPU	Any AVX2-capable (2013+)	Ryzen 5600X / i7-12700	Ryzen 7950X (AVX-512) or Threadripper
Power Supply	550W 80+ Bronze	750W 80+ Gold	1000W+ 80+ Platinum (dual GPU)
Cooling	Stock GPU cooler, decent case	Good airflow case, 240mm AIO CPU	High-airflow case, aftermarket GPU cooler

NVMe Gen4: PCIe 4.0-based M.2 SSD with sequential read speeds up to 7 GB/s. The recommended storage standard for model libraries due to fast load times when switching between models.

Thermal Throttling: Automatic reduction of GPU clock speed to prevent overheating under sustained load. Reduces effective memory bandwidth and token throughput; particularly impactful in multi-hour inference sessions.

GPU Boost: NVIDIA's dynamic clock frequency algorithm that adjusts GPU speed based on temperature, power consumption, and utilization, and reduces it when sustained inference pushes temperature toward junction limits.

Module 3 · Lesson 4

Quiz: Storage, Cooling & System Integration

Four questions — select the best answer for each.

1. According to the r/LocalLLaMA benchmark thread from January 2024, approximately how long does a 7B Q4 model take to load from a Gen4 NVMe SSD versus a mechanical hard drive?

Correct. The January 2024 community benchmark showed ~3.2 seconds from Gen4 NVMe vs. over 90 seconds from spinning HDD for a ~4 GB model file — a 28× difference that makes HDD effectively unusable for interactive model switching.

Incorrect. The benchmark showed approximately 3.2 seconds from Gen4 NVMe versus over 90 seconds from mechanical HDD — roughly a 28× difference driven by sequential read speeds (7 GB/s vs. ~150 MB/s).
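
The sequential read speeds mentioned above give a lower bound on load time: file size divided by read throughput. The sketch below computes only that idealized bound. Measured load times, such as the ones in the benchmark, come out higher because of file-system overhead, memory mapping, and the copy into VRAM. The function name is an illustrative assumption.

```python
def min_load_seconds(file_gb: float, read_gb_per_s: float) -> float:
    """Idealized lower bound on load time: size divided by sequential read speed."""
    return file_gb / read_gb_per_s


MODEL_GB = 4.0  # approximate size of a 7B Q4 model file, as stated in the lesson

nvme_s = min_load_seconds(MODEL_GB, 7.0)    # Gen4 NVMe at up to 7 GB/s
hdd_s = min_load_seconds(MODEL_GB, 0.150)   # mechanical HDD at about 150 MB/s

print(f"Gen4 NVMe lower bound: {nvme_s:.2f} s")
print(f"HDD lower bound:       {hdd_s:.1f} s")
```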

2. Why does PCIe bandwidth (3.0 vs. 4.0) affect model loading speed but NOT token generation throughput once the model is in VRAM?

Correct. PCIe transfers weights from system storage/RAM to VRAM once at load time. During inference, the GPU reads weights from its own VRAM at 936–1008 GB/s (GDDR6X), which completely bypasses PCIe — so PCIe speed has no effect on token generation rate.

Incorrect. During token generation, the GPU reads model weights from VRAM over its internal GDDR6X bus at 936+ GB/s. PCIe (32 GB/s for Gen4) is only used during the initial model load from system RAM/storage — a one-time cost, not an ongoing inference bottleneck.

3. Under sustained inference load (4+ hours), what hardware phenomenon primarily degrades token generation performance in consumer GPU systems?

Correct. NVIDIA's GPU Boost algorithm throttles clock speed when junction temperature approaches the ~83°C threshold. Under sustained inference in poorly ventilated cases, this reduces effective memory bandwidth and token throughput by 15–30%, as documented by EleutherAI's 2023 inference infrastructure guidance.

Incorrect. Thermal throttling is the primary culprit: as GPU junction temperature approaches ~83°C, NVIDIA's GPU Boost algorithm reduces clock speed to protect the chip. Under sustained 4+ hour inference, poorly-cooled consumer GPUs lose 15–30% of their effective memory bandwidth through this mechanism.

4. What is the recommended minimum storage configuration for a practical local inference setup that includes several 7B and 13B models?

Correct. A modest collection of five 7B models (~20 GB), two 13B models (~15 GB), and one 34B model (~19 GB) already occupies ~55 GB. With multiple quantizations and updates, 1 TB NVMe provides both adequate capacity and fast load speeds for practical use.

Incorrect. A realistic local model library quickly grows to 80–200 GB across multiple models and quantizations. The recommended minimum is 1 TB NVMe SSD — both for capacity and for the fast load times (vs. SATA or HDD) needed for interactive model switching.

Module 3 · Lab 4

System Build Advisor

Conversation lab — get a complete hardware recommendation for your local inference use case and budget.

Lab Objective

Practice integrating all hardware considerations — VRAM, system RAM, storage, cooling, and power — into a coherent system recommendation. Ask the assistant to help you spec out a build or evaluate your current system's weaknesses.

Try asking: "I want to build a dedicated local inference machine for $1,500. I want to run 34B models for coding assistance 6–8 hours a day. What should I buy?" — or describe your own situation.

I can help you design a complete local inference system — covering GPU, RAM, storage, cooling, and power supply. Tell me your budget, target models, daily usage duration, and any constraints (e.g., noise level, form factor, existing components to reuse). I'll give you a specific, reasoned recommendation.

Module 3

Module Test: Hardware Requirements

15 questions across all four lessons — score 80% or higher to pass this module.

1. What is the primary bottleneck that determines tokens-per-second throughput during LLM inference on a GPU?

Correct. LLM autoregressive generation is memory-bandwidth-bound: throughput scales with how fast the GPU can read weights from VRAM, not with arithmetic compute capacity.

Incorrect. Memory bandwidth is the binding constraint. LLM generation reads model weights once per token — the rate of that read determines speed, not arithmetic core count or clock frequency.
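
Because each generated token requires reading the weights once, a rough ceiling on throughput is memory bandwidth divided by model size. The sketch below is a back-of-the-envelope estimate under that assumption. It ignores KV cache reads, compute time, and software overhead, so real numbers will be lower. The function name is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound: every generated token reads all weights once from memory."""
    return bandwidth_gb_s / model_gb


# Illustrative inputs: a ~4 GB 7B Q4 model on three memory systems from this module.
for label, bandwidth in [("DDR5 system RAM (~75 GB/s)", 75.0),
                         ("RTX 3060 (360 GB/s)", 360.0),
                         ("GDDR6X (1008 GB/s)", 1008.0)]:
    ceiling = max_tokens_per_second(bandwidth, 4.0)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```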

2. A 13B parameter model stored in BF16 precision requires approximately how much VRAM for weights alone?

Correct. BF16 uses 2 bytes per parameter: 13B × 2 bytes = 26 GB. Add KV cache and buffers for total runtime VRAM usage.

Incorrect. BF16 = 2 bytes per parameter. 13 billion × 2 bytes = 26 GB for weights alone, before KV cache overhead.
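
The same arithmetic generalizes to other precisions. The sketch below multiplies parameter count by bytes per parameter. The table of precisions is an illustrative assumption, and real quantized formats carry a little extra overhead for scales and metadata.

```python
# Nominal bytes per parameter for common precisions (quantized formats add small overhead).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB: billions of parameters times bytes each."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in ("bf16", "int8", "int4"):
    print(f"13B at {precision}: {weight_gb(13, precision):.1f} GB for weights alone")
```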

3. The KV cache in LLM inference stores what, and how does it affect VRAM requirements?

Correct. The KV cache stores key-value attention computations for every previously processed token. It grows proportionally with context length × layers × attention heads, consuming significant VRAM for long conversations.

Incorrect. The KV cache stores intermediate attention computations (keys and values) for each processed token, enabling attention to previous context without recomputation. It grows linearly with context length and can add several GB of VRAM for long conversations.
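
The linear growth with context length can be made concrete with the usual sizing formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a 7B-class layout of 32 layers, 32 KV heads, and a head dimension of 128 at 2 bytes per element. These architecture numbers are illustrative assumptions and are not given in this lesson.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: keys + values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed 7B-class configuration (illustrative only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for context in (2048, 4096, 8192):
    size_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"context {context}: {size_gib:.1f} GiB of KV cache")
```

Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks this figure proportionally.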

4. The RTX 3060 12 GB outperforms the RTX 4060 Ti 16 GB in token generation speed for 7B models that fit in either card's VRAM. Why?

Correct. NVIDIA cut bandwidth on the 4060 Ti to reduce manufacturing cost. At 360 GB/s against the 4060 Ti's 288 GB/s, the older RTX 3060 feeds its compute units faster for memory-bound workloads like LLM token generation.

Incorrect. The RTX 3060 has 360 GB/s memory bandwidth vs. the 4060 Ti's 288 GB/s. Since LLM generation is bandwidth-bound, the older card generates tokens faster for models that fit in both cards' VRAM — a counter-intuitive result well-documented in 2023 community benchmarks.
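
The two bandwidth figures follow from bus width and per-pin data rate. The sketch below assumes a 192-bit bus at 15 Gbps for the RTX 3060 12 GB and a 128-bit bus at 18 Gbps for the RTX 4060 Ti. Those bus and data-rate values are not stated in this lesson and are supplied here as assumptions.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bytes transferred per cycle times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# (bus width in bits, per-pin data rate in Gbps) -- assumed card specifications.
CARDS = {
    "RTX 3060 12 GB": (192, 15.0),
    "RTX 4060 Ti 16 GB": (128, 18.0),
}

for name, (bus, rate) in CARDS.items():
    print(f"{name}: {memory_bandwidth_gb_s(bus, rate):.0f} GB/s")
```

The newer card runs its memory faster per pin, but the narrower bus more than cancels that out.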

5. What is the performance penalty for running a 70B Q4 model fully CPU-offloaded (no GPU layers) versus running it with GPU layers on an RTX 4090 system, per Ollama's 2023 documentation?

Correct. Ollama documented approximately 8 t/s with GPU layers versus 1.2 t/s fully CPU-offloaded — a ~6.7× penalty reflecting the bandwidth gap between GDDR6X (~1 TB/s) and DDR5 (~75 GB/s).

Incorrect. Ollama's benchmark showed ~8 t/s GPU versus ~1.2 t/s CPU for a 70B Q4 model — approximately 6.7× slower on CPU, directly reflecting the memory bandwidth difference between GDDR6X and DDR5.

6. Two RTX 3090s connected via NVLink provide what primary advantage over a single RTX 4090 for local LLM inference?

Correct. NVLink pools VRAM: 2× 24 GB = 48 GB unified GPU memory. This enables models like Q4 70B (roughly 40 GB) or Q8 34B (roughly 36 GB), which don't fit in any single 24 GB consumer GPU, to run entirely in fast VRAM.

Incorrect. The key benefit of 2× RTX 3090 NVLink is 48 GB of unified VRAM — large enough for models that would otherwise overflow to slower CPU offload on any single 24 GB GPU.

7. What enabled Apple Silicon M-series Macs to become competitive inference platforms when llama.cpp added Metal GPU support in June 2023?

Correct. Unified memory eliminated the PCIe copy bottleneck: Apple Silicon's GPU accesses all system RAM at full chip bandwidth (100–800 GB/s depending on chip tier), making large total memory addressable at GPU speeds.

Incorrect. The enabling factor was unified memory architecture: Apple Silicon's GPU accesses the full system RAM pool at chip-level bandwidth without PCIe transfers, allowing GPU inference on all available memory at competitive bandwidth.

8. The M2 Ultra chip offers up to 192 GB of unified memory at 800 GB/s. Compared to an A100 80 GB SXM (2 TB/s), how does M2 Ultra inference performance on a FP16 70B model compare?

Correct. M2 Ultra's 800 GB/s vs. A100's 2 TB/s makes it approximately 2–3× slower for bandwidth-bound inference. However, M2 Ultra was the first consumer device that could run FP16 70B inference at all — previously requiring two A100 80 GB GPUs.

Incorrect. M2 Ultra runs at ~800 GB/s vs. the A100's ~2 TB/s, making it 2–3× slower for bandwidth-bound FP16 70B inference. Its significance was enabling consumer-accessible FP16 70B at all, not matching data-center throughput.

9. Which CPU instruction set extension provides approximately double the throughput over AVX2 for quantized LLM inference on compatible x86 hardware?

Correct. AVX-512 processes 512-bit vectors versus AVX2's 256-bit, roughly doubling throughput for quantized matrix multiply operations in CPU inference. AMD Zen 4 (Ryzen 7000) brought it to AMD's mainstream consumer CPUs.

Incorrect. AVX-512 doubles vector width to 512 bits from AVX2's 256 bits, approximately doubling throughput for quantized GEMM operations. AMD Zen 4 (2022) brought it to AMD's mainstream consumer CPUs.
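
On Linux, the kernel lists supported instruction-set extensions on the `flags` line of `/proc/cpuinfo`. The sketch below parses that text to report AVX2 and AVX-512 Foundation support. It assumes an x86 Linux system, and the function name is hypothetical.

```python
def simd_support(cpuinfo_text: str) -> dict:
    """Report AVX2 / AVX-512F support from the contents of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # The first 'flags' line lists the feature flags for CPU 0.
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
            break
    return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}


# Usage on a Linux x86 machine:
#   with open("/proc/cpuinfo") as f:
#       print(simd_support(f.read()))
```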

10. How does AMD ROCm support for local LLM inference tools compare to NVIDIA CUDA as of early 2024?

Correct. As of early 2024, ROCm builds required manual compilation for tools like llama.cpp, were not included in standard Ollama Windows distributions, and were one to two versions behind CUDA feature parity on Linux.

Incorrect. ROCm support in inference tools lagged CUDA significantly as of early 2024 — requiring manual compilation on Linux, with no standard Windows binaries in Ollama. The gap is narrowing but remained material for most users.

11. Approximately how much storage does a modest local model library occupy (five 7B, two 13B, one 34B model — all Q4_K_M)?

Correct. Five 7B Q4 models (~4 GB each = 20 GB) + two 13B Q4 (~7.4 GB each = 14.8 GB) + one 34B Q4 (~19 GB) = ~54 GB for weights, rising to ~65–75 GB with metadata and multiple quantization variants.

Incorrect. Five 7B at ~4 GB each (20 GB) + two 13B at ~7.4 GB each (~15 GB) + one 34B at ~19 GB = approximately 54 GB for core weights, reaching ~65–75 GB with associated files — making 1 TB NVMe the practical minimum.
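
The library total is a simple sum of count times file size. The sketch below reproduces it using the per-model sizes quoted in the answer. The data structure and function name are illustrative.

```python
# (label, number of models, approximate size in GB per model) from the answer above.
LIBRARY = [
    ("7B Q4_K_M", 5, 4.0),
    ("13B Q4_K_M", 2, 7.4),
    ("34B Q4_K_M", 1, 19.0),
]


def library_size_gb(library) -> float:
    """Total storage for core weight files, before metadata or extra quantizations."""
    return sum(count * size_gb for _, count, size_gb in library)


print(f"Core weights: about {library_size_gb(LIBRARY):.1f} GB")
```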

12. During multi-hour sustained inference on a consumer GPU in a poorly-ventilated case, what happens to effective token generation speed due to thermal throttling?

Correct. NVIDIA's GPU Boost throttles clocks when junction temperature approaches ~83°C. Under sustained inference load in poor thermal conditions, this reduces effective memory bandwidth and token throughput by 15–30% — as documented by EleutherAI's infrastructure team.

Incorrect. GPU Boost reduces clock speeds as temperature rises toward the ~83°C junction threshold. Under sustained inference in poorly-cooled cases, this causes 15–30% throughput degradation — not emergency shutdown, but significant performance loss for continuous workloads.

13. Why does PCIe generation (3.0 vs. 4.0) affect model loading time but have no impact on ongoing token generation speed?

Correct. PCIe moves weights from system storage to VRAM once at load time. Inference then reads from VRAM's GDDR6X bus at 936+ GB/s — entirely internal to the GPU, bypassing PCIe completely during token generation.

Incorrect. The data path changes after loading: PCIe (32 GB/s for Gen4) transfers weights to VRAM at startup, but inference reads from VRAM's internal GDDR6X bus at 936–1008 GB/s — 30–60× faster and completely independent of PCIe bandwidth.

14. For a power user building a dedicated inference server that will run 34B models for 8+ hours daily, what cooling recommendation is most appropriate for the GPU?

Correct. For sustained 8+ hour inference, aftermarket cooling (e.g., Arctic Accelero Xtreme IV on RTX 3090) and a high-airflow chassis reduce junction temperatures by 10–15°C, keeping GPU Boost from throttling and maintaining consistent throughput.

Incorrect. For sustained 8+ hour inference, stock cooling and average cases lead to thermal throttling that reduces throughput 15–30%. High-airflow case with aftermarket GPU cooler targeting below 80°C junction temperature is needed to maintain full performance through extended sessions.

15. Which of the following hardware configurations would best support running Q8 70B model inference entirely within GPU memory (no CPU offloading)?

Correct. A Q8 70B model requires ~78 GB of VRAM for weights plus KV cache overhead. Only an A100 80 GB provides sufficient GPU memory, and even then with little headroom. 2× RTX 3090 over NVLink (48 GB), single 24 GB cards, and 32 GB unified memory cannot accommodate it without CPU offloading.

Incorrect. Q8 70B requires approximately 78 GB of VRAM for weights plus KV cache overhead. Only an A100 80 GB provides enough GPU memory. 2× RTX 3090 over NVLink (48 GB) and single 24 GB consumer GPUs both require CPU offloading for this model size.
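
A quick capacity check makes the comparison explicit. The sketch below sums the VRAM of each configuration and compares it with the ~78 GB figure used in this question. It assumes memory can be split perfectly across GPUs, which is optimistic, and the configuration list is illustrative.

```python
REQUIRED_GB = 78.0  # the lesson's ~78 GB figure for Q8 70B

# Candidate configurations: list of per-GPU memory in GB (illustrative set).
CONFIGS = {
    "1x RTX 4090 (24 GB)": [24.0],
    "2x RTX 3090 NVLink (48 GB)": [24.0, 24.0],
    "1x A100 (80 GB)": [80.0],
}


def fits_in_vram(required_gb: float, gpu_memory_gb: list) -> bool:
    """True if total GPU memory covers the requirement (assumes ideal splitting)."""
    return sum(gpu_memory_gb) >= required_gb


for name, gpus in CONFIGS.items():
    verdict = "fits" if fits_in_vram(REQUIRED_GB, gpus) else "needs CPU offloading"
    print(f"{name}: {verdict}")
```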