In September 2023, the llama.cpp project merged its first widely used 4-bit quantization kernels. Within weeks, users were running Meta's LLaMA 2 70B model, which needs roughly 140 GB at 16-bit precision (and twice that in FP32), on a single consumer Mac with 96 GB of unified memory. The quality loss on standard benchmarks was often under two percentage points.
Every weight in a neural network is a number. In the standard training format — float32 (FP32) — each weight occupies 32 bits: a sign bit, 8 exponent bits, and 23 mantissa bits. This gives roughly seven decimal digits of precision and a dynamic range spanning ±3.4 × 10³⁸.
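The 1 + 8 + 23 bit layout can be inspected directly. The following is a minimal sketch using only the Python standard library; the helper name `fp32_fields` is made up for this illustration.

```python
import struct

def fp32_fields(value: float):
    """Split a number, stored as IEEE 754 float32, into (sign, exponent, mantissa)."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

print(fp32_fields(1.0))       # (0, 127, 0)
print(fp32_fields(-0.15625))  # (1, 124, 2097152)
```

The second value is -1.25 × 2⁻³, so the stored exponent is 127 - 3 = 124 and the mantissa holds the fractional 0.25.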
For inference, most developers already accept float16 (FP16) or bfloat16 (BF16), each using 16 bits. A 7-billion-parameter model in FP16 weighs approximately 14 GB, which still exceeds the VRAM of most consumer GPUs.
| Format | Bits / weight | 7B model size | Typical use |
|---|---|---|---|
| FP32 | 32 | ~28 GB | Training |
| FP16 / BF16 | 16 | ~14 GB | Inference on large GPUs |
| Q8_0 | ~8.5 | ~7.7 GB | High-quality local use |
| Q4_K_M | ~4.8 | ~4.4 GB | Most popular local choice |
| Q2_K | ~2.6 | ~2.7 GB | Extreme compression |
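The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. Here is a rough sketch of that calculation. The function name is illustrative, and it assumes exactly 7 × 10⁹ parameters and ignores file metadata, so real files tend to come out somewhat larger.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4.5-bit", 4.5)]:
    print(f"{name:8s} {model_size_gb(7e9, bits):5.1f} GB")

# The same formula reproduces the 70B figure: about 140 GB at 16 bits per weight.
print(model_size_gb(70e9, 16))
```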
Quantization compresses weights by mapping a range of floating-point values onto a much smaller set of integer codes. In 4-bit quantization, you have only 16 distinct values (0–15) to represent what was originally a continuous distribution. The key insight is that you don't map the entire FP32 range — you map a local window of values per small block of weights (typically 32 or 64 consecutive weights), keeping a small set of scale and zero-point metadata in higher precision to reconstruct approximate values at inference time.
This per-block approach, called block quantization, is what makes modern schemes dramatically better than naive global quantization. Each block of weights gets its own scale factor tuned to that block's distribution, so outlier weights in one block don't corrupt the representation of all the others.
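To make the effect of an outlier concrete, here is a small pure-Python sketch of 4-bit quantization with a scale and minimum per block. It is an illustration of the idea, not the llama.cpp implementation; the function names and the synthetic weights are made up for the example.

```python
import math

def quantize_block(block, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] plus (scale, minimum)."""
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard against a constant block
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the codes and block metadata."""
    return [lo + c * scale for c in codes]

def roundtrip(weights, block_size, bits=4):
    """Quantize and dequantize the weights block by block."""
    out = []
    for start in range(0, len(weights), block_size):
        codes, scale, lo = quantize_block(weights[start:start + block_size], bits)
        out.extend(dequantize_block(codes, scale, lo))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# 64 small synthetic weights, with one large outlier in the second half.
weights = [0.1 * math.sin(i) for i in range(64)]
weights[40] = 8.0

global_q = roundtrip(weights, block_size=64)  # one scale for everything
block_q = roundtrip(weights, block_size=32)   # one scale per 32 weights

# Compare the error on the first 32 weights, which contain no outlier.
err_global = mean_abs_error(weights[:32], global_q[:32])
err_block = mean_abs_error(weights[:32], block_q[:32])
print(err_global, err_block)
```

With a single global scale, the outlier stretches the range so far that all the small weights collapse onto one code. With 32-weight blocks, the first block is unaffected by the outlier in the second.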
Research since 2022 (most influentially, the GPTQ and AWQ papers) demonstrated that large language model weights cluster in relatively narrow ranges per layer. Because most weights in a block fall within a narrow range, approximating each block's values with 16 levels loses surprisingly little information: the model's outputs are far more sensitive to a small number of salient weights than to the exact floating-point value of a typical one.
When Mistral 7B was released in late September 2023, the Q4_K_M version distributed through Ollama fit comfortably in 6 GB of VRAM. Users benchmarking on MMLU found an accuracy drop of less than 1.5 percentage points versus the FP16 baseline. The model ran at 30–50 tokens per second on an RTX 3060, hardware available for under $300.
Quantization leaves the model's structure unchanged: architecture, attention mechanism, vocabulary, and context window all remain identical. It affects only the numerical precision of the stored weight tensors. The forward pass (matrix multiplications and activations) runs on the dequantized, approximate values, often using FP16 arithmetic to maintain throughput.
This means a quantized model and its FP16 parent respond to identical prompts in structurally identical ways. The outputs differ only to the degree the approximated weights change the distributions over the vocabulary at each token position.
In June 2023, llama.cpp contributor ikawrakow introduced the K-quant formats (Q2_K through Q6_K). In August 2023, the project adopted the new GGUF container format, which replaced the earlier GGML binary. GGUF added structured metadata: model architecture, tokenizer vocabulary, RoPE scaling, and quantization parameters are all embedded in the file header. Hugging Face hosted GGUF files from TheBloke's repositories; by late 2023 GGUF had become the dominant distribution format for local inference, with tens of millions of downloads.
A typical filename looks like: mistral-7b-instruct-v0.2.Q4_K_M.gguf
The quantization tag follows a consistent pattern: Q[bits]_[type]_[size]. The bits are the nominal bit-depth. The type is either a legacy letter (0, 1) or K for K-quant. The size suffix — S (small), M (medium), L (large) — indicates how much of the model is quantized at the higher versus lower precision within that tier.
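The naming pattern is regular enough to decode mechanically. The sketch below handles only the Q[bits]_[type]_[size] pattern described here; the regular expression and the `parse_quant_tag` helper are illustrative, and other tag families found in the wild are deliberately not covered.

```python
import re

# Matches tags such as Q4_K_M, Q6_K, Q8_0 immediately before the .gguf extension.
QUANT_TAG = re.compile(r"\.(Q(?P<bits>\d)_(?P<type>[01K])(?:_(?P<size>[SML]))?)\.gguf$")

def parse_quant_tag(filename: str) -> dict:
    """Extract the quantization tag from a GGUF filename."""
    match = QUANT_TAG.search(filename)
    if match is None:
        raise ValueError(f"no recognizable quantization tag in {filename!r}")
    return {
        "tag": match.group(1),
        "bits": int(match.group("bits")),       # nominal bit depth
        "kquant": match.group("type") == "K",   # False for legacy 0 / 1 types
        "size": match.group("size"),            # "S", "M", "L", or None
    }

print(parse_quant_tag("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))
# {'tag': 'Q4_K_M', 'bits': 4, 'kquant': True, 'size': 'M'}
```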
| Format | Avg bits/w | 7B size | Quality vs FP16 | Best for |
|---|---|---|---|---|
| Q8_0 | 8.5 | ~7.7 GB | Near-identical | Maximum fidelity, plenty of RAM |
| Q6_K | 6.6 | ~6.1 GB | Excellent | Quality-first on 8 GB VRAM |
| Q5_K_M | 5.7 | ~5.2 GB | Very good | 8 GB VRAM with headroom |
| Q4_K_M | 4.8 | ~4.4 GB | Good (recommended) | 6–8 GB VRAM, most users |
| Q4_K_S | 4.4 | ~4.1 GB | Slightly less than _M | Tight VRAM budgets |
| Q3_K_M | 3.9 | ~3.5 GB | Noticeable degradation | 4 GB VRAM only |
| Q2_K | 2.6 | ~2.7 GB | Significant loss | Extreme constraint |
The K-quant system applies mixed quantization within a single file: not every tensor in the model is quantized identically. The K in the name marks the K-quant block scheme itself, in which weights are grouped into super-blocks whose per-sub-block scales are themselves quantized. Within a nominal tier, the tensors most sensitive to quantization error (in llama.cpp's presets, chiefly the attention value projection and the feed-forward down projection) are stored at a higher bit depth, while the remaining tensors use the lower bit depth. This is why Q4_K_M outperforms a naive Q4_0 despite both using roughly 4 bits on average.
The S/M/L size suffixes encode the exact assignment: in Q4_K_M, some tensors are kept at Q6_K internally while the majority sit at Q4_K. In Q4_K_S, the higher-precision tensors are reduced, squeezing out extra size at a slight quality cost.
Q4_K_M emerged as the de facto community recommendation through late 2023 and 2024 because it hits the sweet spot: for models up to 13B parameters, it fits in consumer GPU VRAM while benchmark degradation (measured on MMLU, ARC, and HellaSwag) stays under 2% compared to FP16 in most documented cases. Ollama's default pulls almost always fetch Q4_K_M when available.
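Choosing a format for a given card comes down to finding the highest-quality file that still leaves room for runtime buffers. The sketch below uses the 7B sizes from the comparison table. The 1 GB default allowance for the KV cache and scratch buffers is an assumption for illustration only; real overhead varies with context length and runtime.

```python
# (tag, approximate 7B file size in GB), ordered from highest to lowest quality,
# copied from the comparison table.
SIZES_7B_GB = [
    ("Q8_0", 7.7), ("Q6_K", 6.1), ("Q5_K_M", 5.2), ("Q4_K_M", 4.4),
    ("Q4_K_S", 4.1), ("Q3_K_M", 3.5), ("Q2_K", 2.7),
]

def best_fit(vram_gb: float, overhead_gb: float = 1.0):
    """Return the highest-quality tag whose weights plus overhead fit, or None."""
    for tag, size_gb in SIZES_7B_GB:
        if size_gb + overhead_gb <= vram_gb:
            return tag
    return None

print(best_fit(6.0))  # Q4_K_M
print(best_fit(8.0))  # Q6_K
print(best_fit(2.0))  # None
```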
Unlike its predecessor GGML, GGUF embeds all metadata needed to run the model in the file itself, so no separate config.json is required. The file begins with magic bytes and a format version number, followed by key-value metadata pairs (architecture name, context length, RoPE parameters, tokenizer type, vocabulary) and a tensor index with the name, shape, and quantization type of every tensor.
This self-contained design is why ollama pull can configure a model's runtime parameters automatically. The runtime reads the header, allocates appropriate buffers, and selects SIMD or GPU kernels matched to the quantization type of each tensor before a single weight is loaded into compute memory.
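The fixed-size start of the header is easy to read by hand. The sketch below assumes the layout used by GGUF versions 2 and 3 as I understand the specification: a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. Check the official specification before relying on it. The example parses a synthetic header built in memory rather than a real model file, and the counts in it are arbitrary.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key-value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size prefix of a GGUF file (assumed v2/v3 layout)."""
    magic, version, n_tensors, n_kv = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header for demonstration; the counts are arbitrary.
fake_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

For a real file, the same function could be given the first 24 bytes, for example `open(path, "rb").read(HEADER_SIZE)`. The variable-length metadata and tensor index that follow need a fuller parser.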
You will encounter Q4_0 and Q4_1 in older repositories. These are the pre-K-quant formats: each 32-weight block carries a single higher-precision scale, with no super-block structure and no mixing of precisions across tensors. Q4_1 adds a per-block minimum (offset) alongside the scale. Q4_0 is slightly smaller than Q4_K_M and Q4_1 is slightly larger, but both are measurably lower in quality. Prefer K-quant variants whenever available; they are almost always the better choice for the same nominal bit depth.
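The per-block metadata explains why a nominally 4-bit format costs more than 4 bits per weight. The arithmetic below assumes the commonly described llama.cpp layouts: 32 weights per block, with one 16-bit scale for Q4_0 and Q8_0, and a 16-bit scale plus a 16-bit minimum for Q4_1. The function name is illustrative.

```python
def block_bits_per_weight(block_size: int, code_bits: int, fp16_fields: int) -> float:
    """Effective bits per weight: integer codes plus 16-bit metadata per block."""
    total_bits = block_size * code_bits + 16 * fp16_fields
    return total_bits / block_size

print(block_bits_per_weight(32, 4, 1))  # Q4_0: 4.5
print(block_bits_per_weight(32, 4, 2))  # Q4_1: 5.0
print(block_bits_per_weight(32, 8, 1))  # Q8_0: 8.5
```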
In November 2023, the llm-benchmark repository on GitHub published a systematic comparison of Llama 2 13B across eight quantization levels on MMLU (57-subject academic QA) and ARC-Challenge (science questions). The Q4_K_M version scored 55.4% on MMLU versus the FP16 baseline of 56.8% — a 1.4 percentage point gap. Q2_K dropped to 50.1%, a 6.7 point gap. The lesson was clear: the damage is not uniform across bit depths.
Perplexity (PPL) measures how well a language model predicts a held-out text corpus; lower is better. It is computed as the exponentiated average negative log-likelihood per token. In quantization comparisons, researchers typically measure PPL on the WikiText-2 or C4 datasets and report the difference from, or the ratio to, the FP16 baseline.
A perplexity increase of 0.1–0.3 points (typical for Q4_K_M on Llama-family models) is generally imperceptible in conversation. An increase of 2+ points (typical for Q2_K) produces noticeably less coherent text — models start repeating, losing thread, or making factual errors more frequently.
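The definition translates directly into code. In this minimal sketch the per-token log-probabilities are invented for illustration; in practice they would come from running each model variant over the same evaluation text.

```python
import math

def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check: a model that assigns probability 1/4 to every token has PPL of about 4.
print(perplexity([math.log(0.25)] * 5))

# Hypothetical log-probabilities for the same four tokens under two model variants.
fp16_logprobs = [-1.2, -0.4, -2.1, -0.9]
q4_logprobs = [-1.3, -0.4, -2.2, -1.0]

delta = perplexity(q4_logprobs) - perplexity(fp16_logprobs)
print(f"PPL increase from quantization: {delta:.3f}")
```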
The degradation from quantization is not random: it concentrates in predictable areas, and research and community testing have observed the same patterns across multiple model families.
MMLU (Massive Multitask Language Understanding) covers 57 subject areas from elementary math to professional law. It is the most commonly cited benchmark in quantization comparisons because it stresses recall of precise factual information — which is where quantization error accumulates. However, MMLU does not test generation quality, coherence, or instruction-following.
A model that drops 1.5 MMLU points from FP16 to Q4_K_M may produce outputs that are indistinguishable to a human reader for most use cases. The benchmark measures systematic probability shifts, not qualitative experience. This disconnect is why community testers consistently report that Q4_K_M "feels" close to the full-precision model in practice.
The quantization format matters far less than the base model quality. A Q4_K_M of Llama 3 70B consistently outperforms a Q8_0 of Llama 2 7B on almost every benchmark and subjective evaluation: the larger model's stronger representations survive quantization better than a weaker small model fares at near-full precision. When choosing between a finer format on a smaller model and a coarser format on a larger model, the larger model almost always wins, provided it fits within your hardware constraints.
There is a practical cliff around Q2_K and below. At roughly 2.6 bits per weight, each weight's 2-bit code can take only four values within its block, which is so coarse that attention weight matrices lose their ability to differentiate between token relationships that depend on subtle magnitude differences. The model begins to produce repetitive loops, miss instruction constraints, and confabulate more frequently. For any serious use case, Q3_K_M is typically the absolute floor, and only when hardware genuinely cannot accommodate Q4_K_M.
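The cliff is easier to see in numbers. The sketch below assumes a uniform grid spanning a block's minimum to maximum and reports how many levels each bit depth offers and the worst-case rounding error for a block whose values span a range of 1.0. The function names are illustrative.

```python
def levels(bits: int) -> int:
    """Number of distinct codes available at a given bit depth."""
    return 2 ** bits

def worst_case_error(value_range: float, bits: int) -> float:
    """Half the grid step: the largest possible rounding error on a uniform grid."""
    step = value_range / (levels(bits) - 1)
    return step / 2

for bits in (2, 3, 4, 8):
    print(bits, levels(bits), round(worst_case_error(1.0, bits), 4))
```

Going from 4 bits to 2 bits cuts the number of levels from 16 to 4 and makes the worst-case error five times larger for the same block range.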
In October 2022, researchers at IST Austria and ETH Zürich posted GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, demonstrating that 3- and 4-bit quantization of GPT-3-scale models was achievable with under 1% accuracy loss by using second-order weight update corrections. In June 2023, Ji Lin et al. at MIT posted AWQ: Activation-aware Weight Quantization, which, instead of correcting errors after the fact, identified the roughly 1% of weights that matter most (as determined by activation magnitudes) and protected them during quantization. Both papers enabled GPU-accelerated inference at 4-bit depth that GGUF/llama.cpp, running primarily on CPU-accessible memory, could not yet match in raw throughput.
GPTQ's key innovation is using a small calibration dataset (typically 128 text samples) to compute, for each layer, the Hessian (second derivative) of that layer's output reconstruction error with respect to its weights. This Hessian information tells the algorithm which quantization errors will compound most severely, and the algorithm adjusts the weights that have not yet been quantized to absorb the error.
The result: at 4 bits, GPTQ often achieves lower perplexity than K-quant formats at a comparable bit depth, particularly for models above 30B parameters. The trade-off is that GPTQ quantization requires a GPU and a calibration pass. A safe planning assumption is roughly the VRAM needed to load the FP16 model, although implementations that process one layer at a time can get by with less.
For models above 30B parameters being served with GPU inference, GPTQ typically shows a measurable quality advantage over K-quants at the same bit depth. For 7B–13B models on consumer hardware, the difference is often within measurement noise on standard benchmarks, and the simpler GGUF workflow usually wins on practicality.
Activation-aware Weight Quantization (AWQ) takes a different approach from GPTQ. Instead of computing error corrections across all weights, AWQ first runs the calibration data through the model and measures activation magnitudes at each layer. It identifies the roughly 1% of weight channels that correspond to large activations: those that, when quantized, cause the most damage to output quality.
AWQ then applies a per-channel scaling to those sensitive weights before quantization, effectively shifting precision budget toward the channels where it matters most. The scaling parameters are stored efficiently and inverted at inference time. AWQ is faster to quantize than GPTQ, requires less calibration data, and produces comparable quality on most evaluated models.
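The scaling trick rests on a simple identity: multiplying a weight by s and dividing its input activation by s leaves the product unchanged, but it moves the weight to a position on the quantization grid where its relative rounding error is smaller. The toy example below illustrates that idea only. It is not the AWQ algorithm, which searches for scales using calibration data. The numbers and the scale s = 2 are hand-picked, and the quantizer is a plain symmetric 4-bit rounding made up for the sketch.

```python
def quantize_symmetric(weights, bits=4):
    """Round each weight to a symmetric integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

weights = [0.2, 1.0, -0.7, 0.5]
acts = [10.0, 0.1, 0.1, 0.1]   # channel 0 carries a large activation
s = 2.0                        # hand-picked protective scale for channel 0

exact = dot(weights, acts)

# Plain quantization: channel 0's rounding error is multiplied by its large activation.
err_plain = abs(dot(quantize_symmetric(weights), acts) - exact)

# Scale the salient weight up before quantizing and scale its activation down to match.
scaled_weights = [weights[0] * s] + weights[1:]
scaled_acts = [acts[0] / s] + acts[1:]
err_protected = abs(dot(quantize_symmetric(scaled_weights), scaled_acts) - exact)

print(err_plain, err_protected)  # roughly 0.57 versus 0.15
```

Because the scaled weight (0.4) is still below the largest weight in the group, the grid step is unchanged. The rounding error on channel 0 is then divided by s when the activation is scaled back.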
Both GPTQ and AWQ are GPU-native: inference is performed entirely in VRAM using optimized CUDA kernels (ExLlamaV2 for GPTQ, vLLM and AutoAWQ for AWQ). They do not support CPU offloading or Apple Silicon in the way GGUF does.
EXL2 (the format used by ExLlamaV2) takes the mixed-precision concept further than K-quants. Rather than applying predefined precision assignments to tensor types, EXL2 quantizes each individual row of each weight matrix at the bit depth that minimizes reconstruction error for that row specifically — then packs the mixed-precision rows into a compact format. The user specifies a target average bits-per-weight and the algorithm allocates bits optimally across all rows.
EXL2 benchmarks typically show it as the highest-quality 4-bit format available for pure-GPU inference, outperforming both GPTQ and AWQ at the same average bit depth. The format is supported by the ExLlamaV2 inference engine and is available for many popular models on Hugging Face.
| Format | Primary advantage | Limitation | Inference engine |
|---|---|---|---|
| GGUF K-quant | CPU+GPU, Apple Silicon, widest compatibility | Lower throughput than pure-GPU formats | llama.cpp, Ollama |
| GPTQ | High quality via Hessian calibration; mature ecosystem | Requires CUDA GPU; no CPU offload | ExLlamaV2, vLLM |
| AWQ | Fast quantization, good quality, efficient kernels | CUDA only; less flexible than EXL2 | vLLM, AutoAWQ, TGI |
| EXL2 | Best quality at 4-bit for GPU inference; row-level mixed precision | ExLlamaV2 only; no CPU fallback | ExLlamaV2 |
Use GGUF (K-quant) if you are running on a Mac, a CPU, a machine where the model exceeds your VRAM (requiring CPU offloading), or you simply want the simplest workflow via Ollama.
Use GPTQ or AWQ if you have a dedicated NVIDIA GPU that can hold the model entirely in VRAM, you need high throughput for serving, and you want a mature ecosystem with tools like vLLM or Text Generation Inference.
Use EXL2 if you are running ExLlamaV2 specifically, want the highest quality 4-bit output for GPU inference, and your use case justifies the narrower tooling ecosystem.
For most individuals running models locally for personal productivity, GGUF Q4_K_M remains the correct default. The GPU-native formats become relevant when throughput matters (serving multiple users) or when you are optimizing quality at the edge of what 4-bit can achieve for a specific application.
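The guidance above can be condensed into a small decision helper. This is only one way of encoding the rules of thumb in this section. The function and parameter names are made up, and a real decision would also weigh tooling, model availability, and measured quality.

```python
def recommend_format(nvidia_gpu: bool, fits_in_vram: bool,
                     serving_many_users: bool = False,
                     uses_exllamav2: bool = False) -> str:
    """Encode this section's rules of thumb for picking a quantization format."""
    # Macs, CPU-only machines, and models that spill out of VRAM point to GGUF.
    if not nvidia_gpu or not fits_in_vram:
        return "GGUF (K-quant)"
    # Committing to ExLlamaV2 unlocks its native mixed-precision format.
    if uses_exllamav2:
        return "EXL2"
    # Throughput-oriented serving favors the GPU-native formats.
    if serving_many_users:
        return "GPTQ or AWQ"
    # Personal local use: the simple default.
    return "GGUF (K-quant)"

print(recommend_format(nvidia_gpu=False, fits_in_vram=False))                           # GGUF (K-quant)
print(recommend_format(nvidia_gpu=True, fits_in_vram=True, serving_many_users=True))    # GPTQ or AWQ
```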
A separate quantization ecosystem, bitsandbytes by Tim Dettmers, provides 8-bit and 4-bit quantization integrated directly into the Hugging Face Transformers training and inference pipeline. Its primary use case is enabling fine-tuning of large models on limited VRAM by keeping the frozen base weights quantized while only a small set of added parameters is trained in higher precision. QLoRA (2023) combined bitsandbytes 4-bit quantization with LoRA adapters, allowing a 65B model to be fine-tuned on a single 48 GB GPU. For local inference, bitsandbytes is generally slower than GGUF or GPTQ but valuable in fine-tuning workflows.