When Meta AI released the weights for LLaMA 1 on February 24, 2023, the research community gained access to models ranging from 7 billion to 65 billion parameters. Within days, hobbyists discovered that the 7B model could be quantized to run on a single consumer GPU with 6 GB of VRAM — the kind found in millions of gaming laptops. Subreddits and Discord servers erupted with benchmark threads comparing RTX 3060 Ti against RTX 4090 performance. For the first time, the phrase "does your GPU have enough VRAM?" entered mainstream tech conversation.
The answer depended almost entirely on the graphics card's memory: how much video RAM was soldered to it, and how fast data could move across it.
When a language model runs inference, every layer of the network must be loaded into addressable memory before any arithmetic happens. System RAM (DDR5, LPDDR5, etc.) connects to the CPU. VRAM — video random-access memory, typically GDDR6 or HBM — connects directly to the GPU's compute cores. The distinction matters enormously because modern LLM inference is bottlenecked not by arithmetic speed but by memory bandwidth: how many gigabytes per second can be fed to the compute units.
A consumer CPU can address 64–128 GB of DDR5 RAM with bandwidth around 50–90 GB/s. A mid-range GPU (RTX 4070) holds only 12 GB of VRAM but moves data at roughly 504 GB/s, about six to ten times faster. A high-end GPU like the H100 SXM reaches 3.35 TB/s across 80 GB of HBM3. This is why a model that barely fits in VRAM runs dramatically faster than the same model spilling into system RAM.
Inference throughput (tokens per second) is primarily limited by how fast the GPU can read model weights from VRAM, not by raw FLOP count. A model that fits entirely in VRAM will typically generate 5–15× more tokens per second than the same model offloaded to system RAM, on comparable hardware.
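The bandwidth argument can be turned into a rough ceiling: if generating one token requires streaming every weight from memory once, decode speed cannot exceed bandwidth divided by model size. The sketch below assumes single-stream (batch size 1) decoding of a dense model and ignores KV-cache reads and compute time. The function name is illustrative, and real throughput lands below this bound.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when every weight is read once per token.

    Assumption: batch size 1, dense model, no KV-cache or compute cost.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures quoted in the text; 4.0 GB approximates a 7B model at Q4.
for name, bandwidth in [("DDR5 system RAM", 90.0),
                        ("RTX 4070", 504.0),
                        ("H100 SXM", 3350.0)]:
    ceiling = max_tokens_per_second(bandwidth, 4.0)
    print(f"{name}: at most {ceiling:.0f} tokens/s")
```

For the RTX 4070 this gives a ceiling of 126 tokens/s on a 4 GB model. Measured rates are lower, but the ordering between devices usually follows the bandwidth figures.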
Every parameter in a neural network is stored as a floating-point number. The precision format determines byte size per parameter, and thus the raw memory footprint:
| Precision | Bytes / Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 (full) | 4 bytes | ~28 GB | ~52 GB | ~280 GB |
| BF16 / FP16 | 2 bytes | ~14 GB | ~26 GB | ~140 GB |
| Q8 (8-bit int) | 1 byte | ~7 GB | ~13 GB | ~70 GB |
| Q4 (4-bit) | 0.5 bytes | ~4 GB | ~7 GB | ~35 GB |
These are weight-only figures. The KV cache (which stores attention context and grows with sequence length), runtime buffers, and quantization metadata come on top: budget at least 10–20% extra at short contexts, and more as context grows. A Q4 7B model that appears to need 4 GB often requires 5.5–6 GB in practice under Ollama or llama.cpp.
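The table values follow from one multiplication: parameter count times bytes per parameter. A minimal sketch of that arithmetic is shown here. The flat overhead fraction is an assumed placeholder rather than a measured value, and the function names are illustrative.

```python
# Bytes per parameter for each precision, as listed in the table.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weight_footprint_gb(params_billions: float, precision: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def estimated_total_gb(params_billions: float, precision: str,
                       overhead: float = 0.2) -> float:
    """Weights plus a flat overhead fraction (assumed, not measured)."""
    return weight_footprint_gb(params_billions, precision) * (1.0 + overhead)


for size in (7, 13, 70):
    row = ", ".join(f"{p}: {weight_footprint_gb(size, p):.1f} GB"
                    for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

The exact results (3.5 GB for a Q4 7B model, 6.5 GB for Q4 13B) are slightly below the rounded table entries, which also reflect the metadata that real quantized files carry.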
In January 2024, the open-source benchmark site Hugging Face Spaces ran systematic tests confirming that Mistral 7B Q4_K_M required approximately 5.1 GB VRAM at 2048-token context, rising to 6.4 GB at 8192 tokens due to KV cache growth. This was one of the first widely-cited empirical measurements helping consumers gauge fit.
The key-value cache stores intermediate attention computations for each token already processed in the conversation. It grows linearly with context length multiplied by the number of attention layers and heads. For a Llama 2 13B model with 40 layers and 40 heads at FP16, each token in context consumes roughly 0.8 MB of VRAM. A 4096-token conversation therefore adds ~3.3 GB on top of model weights.
This is why "context length" is listed as a VRAM constraint in every serious local inference guide — not merely as a capability limit, but as a physical memory budget. Tools like llama.cpp expose a --ctx-size flag explicitly to let users trade context depth against memory fit.
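The per-token figure for Llama 2 13B can be reproduced from the model's shape. The sketch below assumes the cache holds one key vector and one value vector per layer for every token, and that the head dimension is 128 (hidden size 5120 divided by 40 heads). The function names are illustrative.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, **model_shape) -> float:
    """Total KV cache in GB (1 GB = 1e9 bytes) for a given context length."""
    return context_tokens * kv_cache_bytes_per_token(**model_shape) / 1e9


# Llama 2 13B: 40 layers, 40 heads; head_dim = 128 is an assumption (5120 / 40).
llama2_13b = dict(n_layers=40, n_kv_heads=40, head_dim=128)
per_token = kv_cache_bytes_per_token(**llama2_13b)
print(f"{per_token / 1e6:.2f} MB per token")
print(f"{kv_cache_gb(4096, **llama2_13b):.2f} GB at 4096 tokens")
```

Models that use grouped-query attention have fewer key-value heads than attention heads, which is why the parameter is named n_kv_heads; for such models the cache is correspondingly smaller.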
Apple's M2 Ultra chip, released in June 2023, offered up to 192 GB of unified memory shared between CPU and GPU at 800 GB/s bandwidth. Researchers at Hugging Face demonstrated running a full FP16 70B model on a single Mac Studio, something that otherwise required two A100 80 GB GPUs. The trade-off is bandwidth: 800 GB/s on the M2 Ultra against roughly 2 TB/s on each A100. Inference was roughly 2–3× slower, but for the first time it was achievable on consumer hardware without quantization.
Practice estimating VRAM requirements for different model sizes and quantization formats. Ask the assistant to walk you through calculations for your own hardware scenario.
In June 2023, llama.cpp author Georgi Gerganov (ggerganov) merged CUDA support into the project, opening GPU-accelerated inference to anyone with an NVIDIA card. Within weeks, the project's GitHub wiki filled with community benchmark tables. A pattern emerged quickly: the RTX 3090 with 24 GB VRAM dominated mid-range local inference, not because of its CUDA core count, but because it was the cheapest way to get 24 GB of high-bandwidth GDDR6X on the second-hand market. Meanwhile the RTX 4090, with 24 GB at 1 TB/s, became the new ceiling: it could run a Q4 70B model (about 40 GB) by spilling roughly 16 GB into system RAM while keeping decode speed usable.
The community discovered that for models that fit entirely in VRAM, the generation speed scaled almost perfectly with memory bandwidth — not CUDA core count, not tensor core generation, just raw GB/s to the compute units.
The following tiers reflect the state of consumer NVIDIA GPUs as of 2024. AMD Radeon RX 7900 XTX (24 GB VRAM) provides a competitive alternative, though ROCm software support for inference tools lags CUDA in maturity. Apple Silicon is addressed separately in Lesson 3.
| Tier | GPU | VRAM | Bandwidth | Best Fit Model |
|---|---|---|---|---|
| Entry | RTX 3060 | 12 GB GDDR6 | 360 GB/s | Q4 7B–13B |
| Entry | RTX 4060 Ti | 16 GB GDDR6 | 288 GB/s | Q4 13B; Q8 7B |
| Mid | RTX 3090 | 24 GB GDDR6X | 936 GB/s | Q4 34B; Q8 13B |
| Mid | RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | Q4 34B; Q8 13B; fast 70B offload |
| Pro | A100 80 GB | 80 GB HBM2e | 2,000 GB/s | Q8 70B; FP16 70B across two cards |
| Pro | H100 SXM | 80 GB HBM3 | 3,350 GB/s | Q8 70B at high throughput; FP16 70B across two cards |
NVIDIA cut memory bandwidth on the RTX 4060 Ti to reduce cost. Its 288 GB/s is actually slower than the RTX 3060's 360 GB/s — meaning inference on models that fit in either card runs faster on the older 3060, despite the 4060 Ti being a newer, more expensive product. The 16 GB version is still valuable for fitting larger quantized models, but token generation rate per dollar favors the 3060 for smaller models. This counter-intuitive result was documented extensively in the r/LocalLLaMA community benchmarks of July 2023.
When a model exceeds single-GPU VRAM, two strategies apply: multi-GPU splitting (distributing whole layers, or in the case of tensor parallelism slices of each layer's weight matrices, across several GPUs) and CPU offloading (storing excess layers in system RAM). llama.cpp exposes the --n-gpu-layers flag to control how many transformer layers are kept in VRAM; layers not on the GPU are computed on the CPU at much lower speed.
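A starting value for --n-gpu-layers can be estimated by dividing usable VRAM by the size of one layer. The sketch below is a rough heuristic, not how llama.cpp itself decides anything: it assumes all layers are the same size and reserves a fixed, hypothetical 2 GB for the KV cache and runtime buffers.

```python
def layers_that_fit(model_size_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumptions: layers are equal in size, and reserved_gb covers the KV cache
    and runtime buffers. Both are simplifications.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A 70B Q4 model (~40 GB, 80 layers) on a 24 GB card.
n = layers_that_fit(model_size_gb=40.0, n_layers=80, vram_gb=24.0)
print(f"--n-gpu-layers {n}")
```

In practice the estimate is a first guess: if loading fails or VRAM runs out at long context, lower the value and retry.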
In October 2023, the Ollama project released documentation showing that a 70B Q4 model on a machine with one RTX 4090 (24 GB) and 64 GB DDR5 RAM ran at approximately 8 tokens/second with as many layers as would fit placed on the GPU, versus 1.2 tokens/second when run entirely on the CPU. This ~6.7× penalty quantified the real cost of VRAM overflow for practical users.
Two RTX 3090s linked via NVLink provide 48 GB of pooled VRAM (24 GB per card), enough for Q8 34B or Q4 70B inference at high speed, at a cost substantially below enterprise alternatives. This configuration became popular among researchers who published local inference work throughout 2023–2024.
AMD's RX 7900 XTX offers 24 GB GDDR6 at 960 GB/s — competitive with the RTX 3090 — at generally lower price points. However, ROCm (Radeon Open Compute) software support in inference tools has historically been one to two major versions behind CUDA support. As of early 2024, llama.cpp ROCm builds required manual compilation and were not distributed in standard Ollama binaries on Windows. Linux users reported functional but occasionally unstable ROCm inference. The gap is narrowing as AMD invests in ROCm tooling.
Community benchmarks (primarily from r/LocalLLaMA and the llama.cpp GitHub issues tracker) established approximate token generation rates for common hardware/model combinations as of late 2023. These figures assume Q4_K_M quantization, default context, and llama.cpp or Ollama as the inference engine:
| GPU | 7B Q4 | 13B Q4 | 34B Q4 | 70B Q4 |
|---|---|---|---|---|
| RTX 3060 12 GB | ~85 t/s | ~45 t/s | offload | offload |
| RTX 4070 12 GB | ~110 t/s | ~58 t/s | offload | offload |
| RTX 3090 24 GB | ~120 t/s | ~70 t/s | ~32 t/s | offload |
| RTX 4090 24 GB | ~145 t/s | ~80 t/s | ~38 t/s | ~8 t/s (offload) |
| 2× RTX 3090 NVLink | ~115 t/s | ~70 t/s | ~35 t/s | ~22 t/s |
Practice evaluating GPU options for local inference. Ask the assistant to compare specific cards, estimate tokens-per-second for your target model, or evaluate whether upgrading from your current hardware is worthwhile.
In early June 2023, about three months after llama.cpp's initial CPU-only release on March 10, developer Georgi Gerganov merged Metal GPU support into the project. Apple Silicon Macs, dismissed by many as gaming-inferior, suddenly had a structural advantage: because CPU and GPU share one memory pool, the GPU could read model weights without any transfer across a PCIe bus. A MacBook Pro M2 with 24 GB of RAM could run a 7B Q4 model at 20–25 tokens per second, faster than many gaming desktops doing CPU offload inference at the time.
Within weeks, threads appeared comparing M2 Pro MacBook Pro inference performance to RTX 3060 gaming PCs. For models that fit in unified memory, the MacBook often won — or tied — at a fraction of the power draw and without active cooling noise.
Traditional laptops have discrete RAM (CPU) and VRAM (GPU) connected by a PCIe bus with limited bandwidth. Apple Silicon eliminates this boundary: all memory is unified — a single pool accessible by CPU cores, GPU cores, and the Neural Engine at full chip bandwidth. This means llama.cpp's Metal backend can run GPU inference on model weights stored in what appears to be ordinary RAM, without copying data between separate memory pools.
Bandwidth figures for the M-series chips:
| Chip | Max Unified RAM | Memory Bandwidth | Approx 7B Q4 Speed |
|---|---|---|---|
| M1 | 16 GB | 68 GB/s | ~16 t/s |
| M1 Pro | 32 GB | 200 GB/s | ~35 t/s |
| M2 | 24 GB | 100 GB/s | ~22 t/s |
| M2 Pro | 32 GB | 200 GB/s | ~38 t/s |
| M2 Max | 96 GB | 400 GB/s | ~55 t/s |
| M2 Ultra | 192 GB | 800 GB/s | ~70 t/s (FP16 70B feasible) |
| M3 Max | 128 GB | 400 GB/s | ~60 t/s |
An M2 Pro MacBook Pro running a 13B Q4 model draws approximately 20–30 watts at the wall. An equivalent RTX 3090 gaming PC running the same model draws 350–450 watts. For users running inference continuously (coding assistants, document processing), the Apple Silicon power advantage compounds significantly over time and enables battery-powered inference that no discrete GPU solution can match.
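How far the power gap compounds depends on duty cycle and electricity price. The sketch below works through one scenario using the wattage ranges quoted above. The eight hours per day and the price per kWh are assumptions chosen for illustration, not figures from the text.

```python
def energy_kwh(watts: float, hours_per_day: float, days: int = 365) -> float:
    """Energy used over a period, in kilowatt-hours."""
    return watts * hours_per_day * days / 1000.0


HOURS_PER_DAY = 8      # assumed duty cycle, not from the text
PRICE_PER_KWH = 0.30   # hypothetical electricity price; substitute your own

for label, watts in [("M2 Pro laptop", 30), ("RTX 3090 desktop", 400)]:
    kwh = energy_kwh(watts, HOURS_PER_DAY)
    print(f"{label}: {kwh:.0f} kWh/year, "
          f"about {kwh * PRICE_PER_KWH:.0f} per year in electricity")
```

Under these assumptions the desktop uses a little over 13 times the energy of the laptop for the same hours of inference.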
For users without a discrete GPU, llama.cpp can run entirely on CPU using SIMD extensions. The key extensions that accelerate quantized inference are AVX2 and AVX-512 on x86 processors and NEON on ARM processors.
CPU inference speed scales roughly with memory bandwidth (DDR5 > DDR4) and the number of channels. AMD's Threadripper platform with 8-channel DDR5 reaches ~175 GB/s system bandwidth — approaching lower-tier GPU bandwidth and enabling surprisingly capable CPU-only inference for 7B models.
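The dependence on channel count follows from how peak memory bandwidth is computed: channels times transfers per second times bytes per transfer. The sketch below assumes the conventional 64-bit (8-byte) channel width and gives theoretical peaks only; sustained bandwidth is lower. The function name is illustrative.

```python
def peak_memory_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                               bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth: channels x transfers/s x bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Typical desktop configurations (speed grades chosen as examples).
for label, channels, speed in [("dual-channel DDR4-3200", 2, 3200),
                               ("dual-channel DDR5-5600", 2, 5600)]:
    print(f"{label}: {peak_memory_bandwidth_gb_s(channels, speed):.1f} GB/s")
```

Doubling the channel count doubles the peak, which is why workstation platforms with many memory channels narrow the gap to entry-level GPUs.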
In November 2023, Andrej Karpathy benchmarked llama2.c (a minimal C implementation) on an Apple M2 Air and various x86 machines, publishing results that showed the M2 Air at 89 t/s on a small Llama 2 architecture model, faster than a Threadripper at full multi-thread. The comparison highlighted unified memory bandwidth as the decisive factor over raw compute.
When Ollama 0.1.0 was released in December 2023 with native Apple Silicon support and automatic Metal acceleration, it became the most-downloaded AI tool in the Mac App Store within its launch week. The combination of llama.cpp Metal backend, automatic model management, and a clean API made Apple Silicon Macs the most accessible platform for local inference: no driver configuration, no CUDA, no ROCm. A user with a MacBook Pro M2 Pro could type ollama run llama2:13b and get responsive output in under two minutes from cold start.
Explore the trade-offs between different inference platforms based on your workload, portability needs, and budget. Ask the assistant to compare options or help you evaluate your existing hardware.
In January 2024, the LocalLLaMA subreddit produced a systematic thread comparing model load times on NVMe Gen4, NVMe Gen3, SATA SSD, and spinning HDD. The results were striking: a 7B Q4 model (approximately 4 GB file) loaded in 3.2 seconds from a Samsung 990 Pro (Gen4 NVMe), 5.8 seconds from a Gen3 NVMe, 14 seconds from a SATA SSD, and over 90 seconds from a mechanical hard drive. For users switching between a coding assistant and a general-purpose model dozens of times per day, the gap between Gen4 NVMe and HDD came to nearly a minute and a half per model swap, easily an hour of lost productivity across a working week.
The same thread noted that the 70B Q4 model at ~40 GB took 28 seconds from Gen4 NVMe — and over 12 minutes from spinning disk. That benchmark effectively ended the discussion of whether storage tier mattered for local AI workloads.
Local inference requires storing model weights on disk, which are loaded into VRAM/RAM at startup. Model file sizes scale directly with parameter count and quantization:
| Model | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|
| 7B | ~4.1 GB | ~7.7 GB | ~14 GB |
| 13B | ~7.4 GB | ~14 GB | ~26 GB |
| 34B | ~19 GB | ~38 GB | ~68 GB |
| 70B | ~40 GB | ~78 GB | ~140 GB |
A modest local model library (say five 7B models, two 13B models, and one 34B model) occupies roughly 55 GB at Q4_K_M and about 105 GB at Q8_0. Serious researchers maintaining multiple quantizations and experimental checkpoints routinely accumulate 500 GB to 2 TB of model files. A sensible baseline for practical local inference is a 1 TB NVMe SSD, either dedicated to the model library or shared with it.
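Library size depends heavily on which quantization is stored. The sketch below tallies the example library using the approximate file sizes from the table above. The dictionary layout and function name are illustrative.

```python
# Approximate file sizes in GB, copied from the table above.
FILE_SIZE_GB = {
    ("7B", "Q4_K_M"): 4.1,  ("7B", "Q8_0"): 7.7,   ("7B", "FP16"): 14.0,
    ("13B", "Q4_K_M"): 7.4, ("13B", "Q8_0"): 14.0, ("13B", "FP16"): 26.0,
    ("34B", "Q4_K_M"): 19.0, ("34B", "Q8_0"): 38.0, ("34B", "FP16"): 68.0,
    ("70B", "Q4_K_M"): 40.0, ("70B", "Q8_0"): 78.0, ("70B", "FP16"): 140.0,
}


def library_size_gb(library: dict, quant: str) -> float:
    """Total disk space for a library given as {model size: number of models}."""
    return sum(FILE_SIZE_GB[(model, quant)] * count
               for model, count in library.items())


library = {"7B": 5, "13B": 2, "34B": 1}
for quant in ("Q4_K_M", "Q8_0", "FP16"):
    print(f"{quant}: {library_size_gb(library, quant):.1f} GB")
```

Keeping two quantizations of each model roughly adds the corresponding totals together, which is how libraries grow into the hundreds of gigabytes.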
Ollama stores downloaded models in ~/.ollama/models on Linux/macOS and C:\Users\[user]\.ollama\models on Windows. A dedicated secondary NVMe drive mounted at this path is a common configuration for power users to separate model storage from OS/application storage without performance penalty.
Gen4 NVMe (7 GB/s): 7B model loads in ~3 seconds. Recommended baseline.
Gen3 NVMe (3.5 GB/s): 7B model loads in ~6 seconds. Acceptable for casual use.
SATA SSD (550 MB/s): 7B model loads in ~14 seconds. Functional but noticeable delay.
HDD (150 MB/s): 7B model loads in ~90 seconds. Effectively unusable for interactive work.
Consumer GPUs are designed to run at maximum performance for gaming workloads (typically 15–30 minute sessions). Local LLM inference, by contrast, can sustain near-100% GPU memory bandwidth utilization for hours at a time — a thermal profile most consumer cooling solutions were not designed for.
NVIDIA's GPU Boost algorithm dynamically reduces clock speed when junction temperature exceeds a threshold (typically 83°C for RTX 30/40 series). An RTX 3090 running a 34B model for four hours in a poorly-ventilated case will thermal-throttle significantly, reducing effective memory bandwidth and token throughput by 15–30%. This phenomenon was documented by the EleutherAI team during extended inference runs in 2023, prompting guidance recommending aftermarket cooling and chassis airflow evaluation before committing to sustained inference workloads.
When a GPU is installed in a PCIe x16 slot, the system bus becomes relevant for initial model loading and for CPU↔GPU data transfers during hybrid inference. PCIe 4.0 x16 provides ~32 GB/s in each direction; PCIe 3.0 x16 provides ~16 GB/s. For loading a 40 GB 70B Q4 model from RAM to VRAM, PCIe 4.0 completes the transfer in roughly 1.25 seconds versus 2.5 seconds on PCIe 3.0. That is a one-time cost per model load, not an ongoing inference bottleneck.
The PCIe bus is not the bottleneck during token generation: once weights are in VRAM, inference uses the GPU's internal memory bus (GDDR6X at 936–1008 GB/s), which is 30–60× faster than PCIe. PCIe speed therefore matters primarily for model switching speed, not generation throughput.
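Both the transfer times and the 30–60× ratio can be checked with simple division. The sketch below uses the bandwidth figures quoted in the text and treats them as idealized peaks, ignoring protocol overhead. The function names are illustrative.

```python
def transfer_seconds(size_gb: float, link_gb_s: float) -> float:
    """Idealized time to move size_gb over a link, ignoring protocol overhead."""
    return size_gb / link_gb_s


def bandwidth_ratio(vram_gb_s: float, pcie_gb_s: float) -> float:
    """How many times faster the GPU's own memory bus is than the PCIe link."""
    return vram_gb_s / pcie_gb_s


PCIE_GB_S = {"PCIe 3.0 x16": 16.0, "PCIe 4.0 x16": 32.0}
VRAM_GB_S = {"RTX 3090": 936.0, "RTX 4090": 1008.0}

for link, speed in PCIE_GB_S.items():
    print(f"{link}: 40 GB in {transfer_seconds(40.0, speed):.2f} s")
    for gpu, vram in VRAM_GB_S.items():
        print(f"  {gpu} VRAM is {bandwidth_ratio(vram, speed):.0f}x faster")
```

The extremes are an RTX 3090 against PCIe 4.0 (about 29×) and an RTX 4090 against PCIe 3.0 (63×), which is where the 30–60× range comes from.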
Meta's infrastructure team documented in a 2023 technical blog post that their LLaMA 2 inference clusters used custom liquid-cooled GPU nodes to sustain maximum memory bandwidth continuously. Consumer GPU thermal throttling under sustained load — reducing effective bandwidth by 20–30% — was cited as a key reason consumer hardware cannot replicate cluster-level throughput even when raw specifications appear comparable. This distinction is relevant to anyone building a home inference server expecting data-center-equivalent performance.
| Component | Minimum | Recommended | Power User |
|---|---|---|---|
| GPU VRAM | 8 GB (RTX 3070) | 16–24 GB (RTX 4070 Ti Super / 3090) | 48 GB (2× RTX 3090) or 80 GB A100 |
| System RAM | 16 GB DDR4 | 32 GB DDR5 | 64–128 GB DDR5 |
| Storage | 500 GB SATA SSD | 1 TB Gen3 NVMe | 2 TB Gen4 NVMe |
| CPU | Any AVX2-capable (2014+) | Ryzen 5600X / i7-12700 | Ryzen 7950X (AVX-512) or Threadripper |
| Power Supply | 550W 80+ Bronze | 750W 80+ Gold | 1000W+ 80+ Platinum (dual GPU) |
| Cooling | Stock GPU cooler, decent case | Good airflow case, 240mm AIO CPU | High-airflow case, aftermarket GPU cooler |
Practice integrating all hardware considerations — VRAM, system RAM, storage, cooling, and power — into a coherent system recommendation. Ask the assistant to help you spec out a build or evaluate your current system's weaknesses.