Module 4 · Lesson 1

What Quantization Actually Is

Shrinking the numbers that hold intelligence — without losing the intelligence.

Why can a 4-bit model feel nearly as sharp as a 16-bit original?

The llama.cpp project has shipped 4-bit quantization since its first release in March 2023. After Meta released LLaMA 2 in July 2023, users were soon running the 70B model, which needs roughly 140 GB in FP16, on a single consumer Mac with 96 GB of unified memory. The reported quality loss on standard benchmarks was often under two percentage points.

Floating-Point Weights: The Starting Point

Every weight in a neural network is a number. In the standard training format — float32 (FP32) — each weight occupies 32 bits: a sign bit, 8 exponent bits, and 23 mantissa bits. This gives roughly seven decimal digits of precision and a dynamic range spanning ±3.4 × 10³⁸.
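
The FP32 layout described above can be inspected directly. The following is a minimal sketch using only Python's standard library; the helper name fp32_fields is made up for this example.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a float32 into its (sign, exponent, mantissa) bit fields."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits (the leading 1 is implicit)
    return sign, exponent, mantissa

print(fp32_fields(1.0))
print(fp32_fields(-0.15625))
```

Every one of those 32 bits is stored for every weight, which is the cost that lower-precision formats attack.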

For inference, most developers already accept float16 (FP16) or bfloat16 (BF16), each using 16 bits per weight. A 7-billion-parameter model in FP16 weighs approximately 14 GB (7 × 10⁹ parameters × 2 bytes), which still exceeds the VRAM of most consumer GPUs.

Format	Bits / weight	7B model size	Typical use
FP32	32	~28 GB	Training
FP16 / BF16	16	~14 GB	Inference on large GPUs
Q8_0	~8.5	~7.7 GB	High-quality local use
Q4_K_M	~4.8	~4.4 GB	Most popular local choice
Q2_K	~2.6	~2.7 GB	Extreme compression
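
The sizes in this table follow from simple arithmetic: parameters × bits per weight ÷ 8. Here is a rough sketch, assuming exactly 7 × 10⁹ parameters and decimal gigabytes. The function name is illustrative, and real files differ slightly because of per-block metadata and actual parameter counts.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata and runtime buffers."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {model_size_gb(7e9, bits):.1f} GB")
```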

The Core Idea: Mapping Ranges to Integers

Quantization compresses weights by mapping a range of floating-point values onto a much smaller set of integer codes. In 4-bit quantization, you have only 16 distinct values (0–15) to represent what was originally a continuous distribution. The key insight is that you don't map the entire FP32 range — you map a local window of values per small block of weights (typically 32 or 64 consecutive weights), keeping a small set of scale and zero-point metadata in higher precision to reconstruct approximate values at inference time.
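
To make the mapping concrete, here is a minimal sketch of asymmetric 4-bit quantization for a single block of 32 weights. It is a simplified illustration, not the exact scheme any particular runtime uses: the zero point is stored here as a float offset (the block minimum), and the weights are random stand-ins.

```python
import random

def quantize_block(weights, bits=4):
    """Asymmetric quantization of one block: returns (codes, scale, zero_point)."""
    levels = 2 ** bits - 1                 # 15 for 4-bit, so codes span 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # step size; guard against a constant block
    zero_point = lo                        # float offset stored alongside the block
    codes = [round((w - zero_point) / scale) for w in weights]
    return codes, scale, zero_point

def dequantize_block(codes, scale, zero_point):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + zero_point for c in codes]

random.seed(0)
block = [random.gauss(0.0, 0.05) for _ in range(32)]   # stand-in weights
codes, scale, zp = quantize_block(block)
restored = dequantize_block(codes, scale, zp)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes[:8], f"scale={scale:.5f}", f"max error={max_err:.5f}")
```

The worst-case error per weight is half a step (scale / 2), which is why a scale fitted to a narrow local range matters so much.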

This per-block approach, called block quantization, is what makes modern schemes dramatically better than naive global quantization. Each block of weights gets its own scale factor tuned to that block's distribution, so an outlier weight only coarsens the grid of its own block instead of stretching the scale used for every other weight.
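
The effect of an outlier can be seen in a small experiment. This sketch compares one global scale against per-block scales on synthetic weights with a single injected outlier. The data and the symmetric rounding scheme are assumptions chosen for illustration.

```python
import random

def quant_error(weights, block_size, bits=4):
    """Mean absolute reconstruction error using one symmetric scale per block."""
    qmax = 2 ** (bits - 1) - 1             # 7 for 4-bit symmetric codes in -7..7
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / qmax or 1.0
        for w in block:
            total += abs(w - round(w / scale) * scale)
    return total / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[100] = 1.5                          # a single large outlier

global_err = quant_error(weights, block_size=len(weights))
block_err = quant_error(weights, block_size=32)
print(f"global scale: {global_err:.5f}   per-block scale: {block_err:.5f}")
```

With a global scale, the outlier sets the step size for all 1,024 weights; with blocks of 32, it only degrades the 31 weights that share its block.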

Why It Works

Research since 2022 (most influentially, the GPTQ and AWQ papers) demonstrated that large language model weights cluster in relatively narrow ranges, with only a small number of outliers. Because the distribution within a block is smooth and narrow, approximating each block's values with 16 levels loses surprisingly little information: the small rounding errors on individual weights largely average out across the thousands of terms in each matrix multiplication.

Key Terms

Scale factor: A per-block multiplier stored in higher precision that converts an integer code back to an approximate float. It represents the step size between adjacent quantized values.

Zero point: An offset that allows the integer range to be shifted so asymmetric weight distributions are centered correctly. Not all schemes use it; symmetric quantization omits it.

Quantization error: The difference between the original weight and its reconstructed approximation. Lower bits = larger possible error per weight, but the aggregated effect on model output is often small.

Post-training quantization (PTQ): Quantizing a model after training is complete, without further gradient updates. This is what Ollama and llama.cpp do, so no retraining is required.

Real Impact

When Mistral 7B was released in September 2023, Ollama's Q4_K_M version fit comfortably in 6 GB of VRAM. Users benchmarking on MMLU found less than a 1.5 percentage point accuracy drop versus the FP16 baseline. The model ran at 30–50 tokens per second on an RTX 3060, hardware available for under $300.

What Quantization Does Not Change

Architecture, attention mechanism, vocabulary, and context window all remain identical. The model's structure is unchanged. Quantization only affects the numerical precision of the stored weight tensors. The forward pass computation — matrix multiplications and activations — uses the dequantized (approximate) values, often in FP16 for the actual arithmetic to maintain throughput.

This means a quantized model and its FP16 parent respond to identical prompts in structurally identical ways. The outputs differ only to the degree the approximated weights change the distributions over the vocabulary at each token position.

Lesson 1 Quiz

What quantization actually is — check your understanding

A 7B parameter model stored in FP16 occupies approximately how much memory?

Correct. 7 billion parameters × 2 bytes per FP16 value = 14 GB.

Not quite. FP16 uses 2 bytes per parameter. 7B × 2 = 14 GB. FP32 would give 28 GB; INT8 gives ~7 GB.

What is the purpose of the "scale factor" in block quantization?

Correct. The scale factor is stored in higher precision per block and acts as the step size that maps integer codes back to approximate float values at inference.

Not quite. Scale factors are an inference-time reconstruction mechanism, not a training hyperparameter or architectural setting.

Why does per-block (rather than global) quantization produce better results?

Correct. Local scale factors mean each small group of weights is mapped optimally, so a few large-magnitude weights don't compress everything else poorly.

Incorrect. Per-block quantization uses the same or fewer bits but allocates them more intelligently by fitting scale factors to local distributions.

What does post-training quantization (PTQ) mean in the context of llama.cpp / Ollama?

Correct. PTQ compresses a finished model without any retraining — which is why you can download and use quantized models without GPU clusters.

Not correct. PTQ requires no new training. It's a compression step applied to the existing weights, not a re-training procedure.

Lab 1 — Quantization Fundamentals

Chat with your AI lab assistant · Complete 3 exchanges to finish

Your Task

Explore the core mechanics of quantization through conversation. Ask about bits, blocks, scale factors, and the trade-offs between precision and model size.

Suggested starter: "If I have a weight of 0.7324 and I'm using 4-bit quantization with a block of 64 weights, walk me through what actually happens to that number."

Module 4 · Lesson 2

GGUF Formats and the K-Quant System

Decoding what Q4_K_M, Q5_K_S, and their siblings actually mean.

How do you read a GGUF filename and know what you're getting?

In June 2023, llama.cpp contributor ikawrakow introduced the K-quant formats (Q2_K through Q6_K). In August 2023, the project adopted the new GGUF container format, which replaced the earlier GGML binary. GGUF added structured metadata: model architecture, tokenizer vocabulary, RoPE scaling, and quantization parameters are all embedded in the file header. TheBloke's repositories on Hugging Face began hosting GGUF files, and by late 2023 GGUF had become the dominant distribution format for local inference, with tens of millions of downloads.

Reading a GGUF Filename

A typical filename looks like: mistral-7b-instruct-v0.2.Q4_K_M.gguf

The quantization tag follows a consistent pattern: Q[bits]_[type]_[size]. The bits value is the nominal bit depth. The type is either a legacy digit (0 or 1) or K for K-quant. The size suffix, S (small), M (medium), or L (large), appears only on K-quants and indicates how many tensors are kept at a higher precision than the nominal one: S keeps the fewest, L the most.
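
The naming pattern can be decoded mechanically. This is a small sketch that handles only the Q[bits]_[type]_[size] tags discussed in this lesson; other tags that appear in the wild (for example F16) are out of scope, and the function name is made up for this example.

```python
import re

# Matches tags such as Q4_K_M, Q6_K, Q8_0 immediately before the .gguf extension.
QUANT_TAG = re.compile(r"\.(Q(\d)_(K|\d)(?:_([SML]))?)\.gguf$", re.IGNORECASE)

def parse_quant_tag(filename: str):
    """Return the decoded quantization tag, or None if the name does not match."""
    m = QUANT_TAG.search(filename)
    if m is None:
        return None
    tag, bits, kind, size = m.groups()
    return {
        "tag": tag,
        "nominal_bits": int(bits),
        "family": "k-quant" if kind.upper() == "K" else f"legacy type {kind}",
        "size_variant": size.upper() if size else None,
    }

print(parse_quant_tag("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))
print(parse_quant_tag("model.Q8_0.gguf"))
```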

Format	Avg bits/weight	7B size	Quality vs FP16	Best for
Q8_0	8.5	~7.7 GB	Near-identical	Maximum fidelity, plenty of RAM
Q6_K	6.6	~6.1 GB	Excellent	Quality-first on 8 GB VRAM
Q5_K_M	5.7	~5.2 GB	Very good	8 GB VRAM with headroom
Q4_K_M	4.8	~4.4 GB	Good (recommended)	6–8 GB VRAM, most users
Q4_K_S	4.4	~4.1 GB	Slightly less than _M	Tight VRAM budgets
Q3_K_M	3.9	~3.5 GB	Noticeable degradation	4 GB VRAM only
Q2_K	2.6	~2.7 GB	Significant loss	Extreme constraint
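
One way to use this table is as a lookup: pick the highest-quality format whose file size plus some headroom fits in memory. The sketch below uses the 7B sizes from the table; the 1.5 GB headroom for the KV cache and runtime buffers is an assumed figure, not a fixed rule, and it grows with context length.

```python
# (format, approximate 7B file size in GB), ordered from highest to lowest quality.
FORMATS_7B = [
    ("Q8_0", 7.7), ("Q6_K", 6.1), ("Q5_K_M", 5.2), ("Q4_K_M", 4.4),
    ("Q4_K_S", 4.1), ("Q3_K_M", 3.5), ("Q2_K", 2.7),
]

def pick_format(vram_gb: float, headroom_gb: float = 1.5):
    """Return the first (highest-quality) format that fits with headroom, else None."""
    for name, size_gb in FORMATS_7B:
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None

for vram in (6.0, 8.0, 12.0):
    print(f"{vram:>4} GB -> {pick_format(vram)}")
```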

What K-Quant Actually Does Differently

The K-quant system changes two things. First, the K block types group weights into 256-weight super-blocks made of smaller sub-blocks, and the per-sub-block scales are themselves quantized to a few bits, which keeps metadata overhead low. Second, the S/M/L variants apply mixed quantization within a single file, so not every tensor in the model is quantized identically. In llama.cpp's original K-quant mixes, the most sensitive tensors (part of the attention value projections and the feed-forward down projections) are stored at a higher bit depth, while the remaining tensors use the nominal type. This is why Q4_K_M outperforms a naive Q4_0 despite both being nominally 4-bit.

The S/M/L size suffixes encode the exact assignment: in Q4_K_M, some tensors are kept at Q6_K internally while the majority sit at Q4_K. In Q4_K_S, the higher-precision tensors are reduced, squeezing out extra size at a slight quality cost.
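
The average bits per weight of a mixed file can be worked out from the block layouts. The byte counts below reflect my understanding of llama.cpp's block structures and should be treated as assumptions. The 15% share of Q6_K weights is a hypothetical figure chosen only to show how a mix lands near 4.8 bits.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Storage cost per weight, including the block's scale metadata."""
    return block_bytes * 8 / weights_per_block

# Assumed layouts (bytes per block):
Q4_0 = bits_per_weight(2 + 16, 32)               # FP16 scale + 32 packed 4-bit codes
Q4_K = bits_per_weight(2 + 2 + 12 + 128, 256)    # two FP16 values, packed 6-bit scales/mins, 4-bit codes
Q6_K = bits_per_weight(128 + 64 + 16 + 2, 256)   # 6-bit codes split low/high, 8-bit scales, FP16 scale

def mixed_average(fraction_high: float, high: float, low: float) -> float:
    """Average bits per weight when a fraction of weights uses the higher-precision type."""
    return fraction_high * high + (1 - fraction_high) * low

print(Q4_0, Q4_K, Q6_K)
print(mixed_average(0.15, Q6_K, Q4_K))   # hypothetical 15% of weights at Q6_K
```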

The Community Standard

Q4_K_M emerged as the de facto community recommendation through late 2023 and 2024 because it hits the sweet spot: for models up to 13B parameters, it fits in consumer GPU VRAM while benchmark degradation (measured on MMLU, ARC, and HellaSwag) stays under 2 percentage points compared to FP16 in most documented cases. Ollama's default tags typically pull a 4-bit build: Q4_K_M for many newer library entries and Q4_0 for many older ones.

GGUF Container Structure

Unlike its predecessor GGML, GGUF embeds all metadata needed to run the model in the file itself, with no separate config.json required. The file begins with magic bytes identifying the format and a version number, followed by key-value metadata pairs (architecture name, context length, rope parameters, tokenizer type, vocabulary) and a tensor index with name, shape, and quantization type for every tensor.

This self-contained design is why ollama pull can configure a model's runtime parameters automatically. The runtime reads the header, allocates appropriate buffers, and selects SIMD or GPU kernels matched to the quantization type of each tensor before a single weight is loaded into compute memory.
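
Reading the start of the header takes only a few lines. This sketch assumes the layout as I understand it from the GGUF specification: a 4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. Verify against the specification before relying on it. The bytes parsed here are synthetic, not taken from a real model file.

```python
import struct

# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER = struct.Struct("<4sIQQ")

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size fields at the start of a GGUF file."""
    magic, version, n_tensors, n_kv = HEADER.unpack_from(data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header for demonstration only; the counts are made-up values.
fake = HEADER.pack(b"GGUF", 3, 291, 24)
print(read_gguf_header(fake))
```

The key-value pairs and tensor index follow these fixed fields, which is what lets a runtime plan its buffers before loading any weights.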

Legacy Formats Still Exist

You will encounter Q4_0 and Q4_1 in older repositories. These are the pre-K-quant formats: each block of 32 weights stores a single FP16 scale, with no super-block structure. Q4_1 adds a per-block minimum value (an offset), which costs extra bits per weight. Q4_0 files are slightly smaller than Q4_K_M, but both legacy formats give measurably lower quality. Prefer K-quant variants whenever available, since they are almost always the better choice at the same nominal bit depth.

Key Terms

GGUF: GPT-Generated Unified Format, the self-contained binary file format used by llama.cpp and Ollama since August 2023, replacing GGML.

K-quant: A family of quantization schemes in llama.cpp that apply mixed precision internally (critical tensors kept at higher precision, others lower) for better quality at the same average bit depth.

Q4_K_M: The most widely recommended local model format: ~4.8 bits/weight average, mixing Q6_K for sensitive tensors and Q4_K for the rest, with medium (M) precision assignments.

Lesson 2 Quiz

GGUF formats and K-quant naming — check your understanding

In the filename "llama-3-8b.Q5_K_M.gguf", what does the "K" indicate?

Correct. K-quant formats apply mixed quantization — some tensors are stored at higher precision than others within the same file.

Incorrect. The K in K-quant stands for the K-quant scheme introduced in llama.cpp, which uses mixed precision internally across tensor types.

What is the main advantage of Q4_K_M over Q4_0 at roughly the same 4-bit depth?

Correct. The Q4_K_M mix keeps a subset of high-impact tensors (such as attention value projections and feed-forward down projections) at Q6_K internally even in a "4-bit" file, preserving quality.

Not quite. The quality advantage comes from mixed-precision tensor assignment — not speed, context length, or parameter count.

What key advantage does GGUF have over the older GGML format?

Correct. GGUF's self-contained header means no separate config files are needed — the runtime can configure itself entirely from the file.

Incorrect. GGUF's defining feature is its rich self-contained metadata header, not hardware restrictions, bit depth limits, or guaranteed size reduction.

For a user with exactly 6 GB of VRAM running a 7B model, which format is the most practical primary choice?

Correct. Q8_0 and Q6_K exceed 6 GB leaving no room for the KV cache. Q4_K_M at ~4.4 GB leaves ~1.5 GB for runtime buffers, making it the practical choice.

Incorrect. You need to account for KV cache and runtime buffers beyond the raw model size. Q8_0 at 7.7 GB won't fit; Q6_K at 6.1 GB leaves almost no headroom.

Lab 2 — Reading GGUF Filenames

Chat with your AI lab assistant · Complete 3 exchanges to finish

Your Task

Practice decoding GGUF filenames and choosing appropriate quantization formats for given hardware scenarios. Bring real-world constraints and let the assistant help you reason through the trade-offs.

Suggested starter: "I have 12 GB of VRAM and I want to run a 13B parameter model. Walk me through which GGUF format I should pick and why."

Module 4 · Lesson 3

Quality Loss: Measuring What You Actually Lose

Benchmarks, perplexity, and the gap between numbers and real-world output.

When a benchmark says you lost 2%, what does that mean for actual conversations?

In November 2023, the llm-benchmark repository on GitHub published a systematic comparison of Llama 2 13B across eight quantization levels on MMLU (57-subject academic QA) and ARC-Challenge (science questions). The Q4_K_M version scored 55.4% on MMLU versus the FP16 baseline of 56.8% — a 1.4 percentage point gap. Q2_K dropped to 50.1%, a 6.7 point gap. The lesson was clear: the damage is not uniform across bit depths.

Perplexity: The Primary Technical Metric

Perplexity (PPL) measures how well a language model predicts a held-out text corpus; lower is better. It is computed as the exponentiated average negative log-likelihood per token. In quantization comparisons, researchers typically measure PPL on the WikiText-2 or C4 datasets and report the difference (delta) from the FP16 baseline.
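
The definition translates directly into code. In this sketch the per-token probabilities are invented for illustration; in practice they come from running the model over the evaluation corpus.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical probabilities each model assigned to the correct next token.
fp16_probs = [0.25, 0.5, 0.125, 0.25]
quant_probs = [0.25, 0.4, 0.125, 0.2]

ppl_fp16 = perplexity([math.log(p) for p in fp16_probs])
ppl_quant = perplexity([math.log(p) for p in quant_probs])
print(f"FP16 PPL={ppl_fp16:.3f}  quantized PPL={ppl_quant:.3f}  delta={ppl_quant - ppl_fp16:+.3f}")
```

A model that assigns slightly lower probability to the correct tokens shows up as a higher perplexity, which is the delta that quantization comparisons report.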

A perplexity increase of 0.1–0.4 points (typical for Q4_K_M on Llama-family models) is generally imperceptible in conversation. An increase of 2+ points (typical for Q2_K) produces noticeably less coherent text: models start repeating themselves, losing the thread, or making factual errors more frequently.

Format	PPL delta vs FP16
Q8_0	+0.01–0.05
Q6_K	+0.05–0.15
Q4_K_M	+0.1–0.4
Q3_K_M	+0.5–1.5
Q2_K	+2.0–5.0

Where Quality Loss Actually Shows Up

The degradation from quantization is not random — it concentrates in predictable areas. Research and community testing have identified the following patterns across multiple model families:

Areas More Affected

Precise numerical reasoning (arithmetic, counting)
Rare-word and low-frequency token prediction
Long-range coherence in extended generation
Instruction-following on highly specific prompts
Code generation correctness at edge cases

Areas Largely Unaffected

General conversation and summarization
Common factual recall (major facts, dates)
Translation between high-resource language pairs
Creative writing and stylistic tasks
Classification of broad categories

The MMLU Benchmark in Context

MMLU (Massive Multitask Language Understanding) covers 57 subject areas from elementary math to professional law. It is the most commonly cited benchmark in quantization comparisons because it stresses recall of precise factual information — which is where quantization error accumulates. However, MMLU does not test generation quality, coherence, or instruction-following.

A model that drops 1.5 MMLU points from FP16 to Q4_K_M may produce outputs that are indistinguishable to a human reader for most use cases. The benchmark measures systematic probability shifts, not qualitative experience. This disconnect is why community testers consistently report that Q4_K_M "feels" close to the full-precision model in practice.

The Bigger Lever

The quantization format matters far less than the base model quality. A Q4_K_M of Llama 2 70B consistently outperforms a Q8_0 of Llama 2 7B on almost every benchmark and subjective evaluation, because the larger model's stronger representations survive quantization better than a weaker small model at near-full precision. When choosing between "better format on smaller model" and "coarser format on larger model," the larger model almost always wins, provided it fits within your hardware constraints.

When Q2 Crosses Into Unusable

There is a practical cliff around Q2_K and below. At roughly 2.6 bits per weight, each weight has only about four representable values within its block (2² codes, compared with 16 at 4-bit), which is so coarse that attention weight matrices lose their ability to differentiate between token relationships that depend on subtle magnitude differences. The model begins to produce repetitive loops, miss instruction constraints, and confabulate more often. For any serious use case, Q3_K_M is typically the absolute floor, and only when hardware genuinely cannot accommodate Q4_K_M.
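
How quickly error grows as bits are removed can be seen with a naive experiment. The sketch below uses plain per-block rounding on synthetic Gaussian weights. It is not the actual Q2_K or Q4_K algorithm, so the absolute numbers are only indicative.

```python
import random

def rms_error(weights, bits, block_size=32):
    """RMS reconstruction error with naive asymmetric per-block quantization."""
    levels = 2 ** bits - 1
    squared = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / levels or 1.0
        for w in block:
            restored = round((w - lo) / scale) * scale + lo
            squared += (w - restored) ** 2
    return (squared / len(weights)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]   # stand-in weights
errors = {bits: rms_error(weights, bits) for bits in (2, 3, 4, 6, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit: {2 ** bits:>3} levels, RMS error {err:.6f}")
```

Each bit removed roughly doubles the step size, so the jump from 4-bit to 2-bit costs far more than the jump from 8-bit to 6-bit.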

Key Terms

Perplexity (PPL): A measure of how well a language model predicts text; lower is better. The primary technical metric for quantization quality evaluation. Computed on standard corpora like WikiText-2.

MMLU: Massive Multitask Language Understanding, a 57-subject benchmark commonly used to compare quantized model quality. Measures precise recall; doesn't capture generation fluency.

Perplexity delta: The increase in perplexity of a quantized model versus its FP16 baseline. Under 0.5 is generally imperceptible; above 2.0 is often noticeable in generation quality.

Lesson 3 Quiz

Quantization quality loss — check your understanding

What does a perplexity increase of +0.2 points (Q4_K_M vs FP16) typically indicate about real-world output?

Correct. A +0.2 PPL delta is a statistical signal of compression, but falls well within the range where human readers cannot reliably distinguish quantized from full-precision output.

Incorrect. Small perplexity deltas (under 0.5) consistently test as imperceptible in human evaluations. Visible degradation typically begins around +2.0 or above.

Which task is MOST likely to show measurable quality loss from quantization?

Correct. Numerical reasoning is highly sensitive to weight precision — quantization error accumulates in the magnitude differences that distinguish correct from incorrect computation paths.

Not quite. General tasks like summarization, creative writing, and high-resource translation are largely unaffected. Quantization error concentrates in precision-dependent tasks like arithmetic.

If you must choose between a Q8_0 of a 7B model and a Q4_K_M of a 70B model, and both fit in your available memory, which is generally preferable?

Correct. Scale dominates format. A 70B model has far more representational capacity. Even at Q4_K_M its stronger base representations outperform a 7B model at full FP16 precision on nearly all tasks.

Incorrect. The research consensus is clear: within your memory budget, use the largest model that fits. A 70B at Q4_K_M beats a 7B at Q8_0 on almost every benchmark.

Why is MMLU not a complete picture of quantization quality?

Correct. MMLU stresses factual precision — exactly where quantization hurts most. But most real-world use involves generation tasks where quantized models often feel indistinguishable from their FP16 parents.

Incorrect. MMLU is broad (57 subjects) and reproducible. Its limitation is that it tests multiple-choice recall, not the open-ended generation quality that matters most in practice.

Lab 3 — Evaluating Quality Loss

Chat with your AI lab assistant · Complete 3 exchanges to finish

Your Task

Explore how to evaluate and reason about quantization quality loss for your specific use case. The assistant can help you think through which benchmarks matter, what perplexity deltas mean practically, and how to design simple tests for your own models.

Suggested starter: "I'm building a code assistant using a locally quantized model. What should I actually test to know if my quantization level is good enough for that use case?"

Module 4 · Lesson 4

Advanced Schemes: GPTQ, AWQ, and EXL2

Beyond K-quants — GPU-native compression that pushes quality further.

When does it make sense to go beyond GGUF, and what do you gain?

In October 2022, researchers at IST Austria and ETH Zürich released GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (presented at ICLR 2023), demonstrating that 3- and 4-bit quantization of GPT-3-scale models was achievable with minimal accuracy loss by using second-order weight update corrections. In June 2023, Ji Lin et al. at MIT released AWQ: Activation-aware Weight Quantization, which, instead of correcting errors after the fact, identified the roughly 1% of weights that matter most, as determined by activation magnitudes, and protected them during quantization. Both papers enabled GPU-accelerated inference at 4-bit depth that GGUF/llama.cpp, running primarily on CPU-accessible memory, could not yet match in raw throughput.

GPTQ: Second-Order Calibration

GPTQ's key innovation is using a small calibration dataset (typically 128 samples of representative text) to compute, layer by layer, an approximate Hessian (second-derivative matrix) of each layer's output error with respect to its weights. This Hessian tells the algorithm which quantization errors will compound most severely, and after each weight is quantized, the algorithm applies compensating adjustments to the not-yet-quantized weights so that they absorb the error.

The result: GPTQ at 4-bit often achieves lower perplexity than simpler round-to-nearest formats at the same bit depth, and the advantage over K-quants is reported mainly for models above 30B parameters. The trade-off is that GPTQ quantization requires a GPU and a calibration pass. Because the method works through the model one layer at a time, the memory needed is usually well below what loading the full FP16 model would take, but quantizing a large model can still take hours.
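
The error-compensation idea can be sketched on a single row of weights. This is a toy version written in plain Python for readability, not the GPTQ implementation: it quantizes weights in a fixed order, uses one scale for the whole row, and works on made-up calibration inputs. All names and constants here are assumptions.

```python
import random

def invert(m):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def quantize(x, scale, qmax=7):
    """Round to the nearest point on a symmetric 4-bit grid."""
    return max(-qmax, min(qmax, round(x / scale))) * scale

def gptq_row(w, hessian, scale):
    """Quantize one weight at a time, pushing each rounding error onto later weights."""
    n, w = len(w), list(w)
    hinv = invert(hessian)
    out = [0.0] * n
    for i in range(n):
        out[i] = quantize(w[i], scale)
        err = (w[i] - out[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]            # compensate on unquantized weights
        for a in range(i + 1, n):               # drop weight i from the inverse Hessian
            for b in range(i + 1, n):
                hinv[a][b] -= hinv[a][i] * hinv[i][b] / hinv[i][i]
    return out

def output_error(w, w_hat, inputs):
    """Squared error of the layer output over the calibration samples."""
    return sum(sum((a - b) * xk for a, b, xk in zip(w, w_hat, x)) ** 2 for x in inputs)

random.seed(0)
n, samples = 8, 128
inputs = []                                     # hypothetical correlated calibration inputs
for _ in range(samples):
    base = random.gauss(0, 1)
    inputs.append([base + 0.3 * random.gauss(0, 1) for _ in range(n)])
w = [random.gauss(0, 0.1) for _ in range(n)]
scale = max(abs(v) for v in w) / 7

# Hessian of the squared output error (up to a constant), with small diagonal damping.
H = [[sum(x[i] * x[j] for x in inputs) + (0.01 if i == j else 0.0) for j in range(n)]
     for i in range(n)]

rtn = [quantize(v, scale) for v in w]
compensated = gptq_row(w, H, scale)
print("round-to-nearest output error: ", output_error(w, rtn, inputs))
print("error-compensated output error:", output_error(w, compensated, inputs))
```

Both results sit on the same 4-bit grid; the compensated version simply chooses grid points with the layer's output in mind rather than rounding each weight in isolation.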

When GPTQ Is Worth It

For models above 30B parameters being served with GPU inference, GPTQ typically shows a measurable quality advantage over K-quants at the same bit depth. For 7B–13B models on consumer hardware, the difference is often within measurement noise on standard benchmarks, and the simpler GGUF workflow usually wins on practicality.

AWQ: Protecting the 1% That Matters

Activation-aware Weight Quantization (AWQ) takes a different approach than GPTQ. Instead of computing error corrections across all weights, AWQ first runs the calibration data through the model and measures activation magnitudes at each layer. It identifies the roughly 1% of weight channels that correspond to large activations — those that, when quantized, cause the most damage to output quality.

AWQ then applies a per-channel scaling to those sensitive weights before quantization, effectively shifting precision budget toward the channels where it matters most. The scaling parameters are stored efficiently and inverted at inference time. AWQ is faster to quantize than GPTQ, requires less calibration data, and produces comparable quality on most evaluated models.
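
The two ingredients described above, ranking channels by activation magnitude and rescaling without changing the layer's output, can be sketched as follows. This is a simplified illustration on synthetic data. The fixed scale of 2.0 is an assumption (AWQ searches for its scales), and the quantization step itself is omitted.

```python
import random

random.seed(0)
n_in, n_out, samples = 64, 4, 32

# Hypothetical calibration activations: input channel 5 has a much larger magnitude.
acts = [[random.gauss(0, 1) * (20.0 if c == 5 else 1.0) for c in range(n_in)]
        for _ in range(samples)]
W = [[random.gauss(0, 0.05) for _ in range(n_in)] for _ in range(n_out)]   # W[out][in]

# 1. Rank input channels by mean absolute activation and keep the top ~1% (at least one).
mean_abs = [sum(abs(x[c]) for x in acts) / samples for c in range(n_in)]
k = max(1, round(0.01 * n_in))
salient = sorted(range(n_in), key=lambda c: mean_abs[c], reverse=True)[:k]

# 2. Scale salient weight columns up by s; the matching activations are scaled down by s.
s = 2.0                                         # illustrative constant
scales = [s if c in salient else 1.0 for c in range(n_in)]
W_scaled = [[w * scales[c] for c, w in enumerate(row)] for row in W]

def forward(weights, x, ch_scales=None):
    """Linear layer output, dividing each input channel by its scale first."""
    ch_scales = ch_scales or [1.0] * len(x)
    return [sum(w * (xc / sc) for w, xc, sc in zip(row, x, ch_scales)) for row in weights]

y_ref = forward(W, acts[0])
y_scaled = forward(W_scaled, acts[0], scales)
print("salient channels:", salient)
print("max output difference:", max(abs(a - b) for a, b in zip(y_ref, y_scaled)))
```

Because the rescaling cancels out mathematically, the unquantized output is unchanged; the benefit appears only once the scaled weights are rounded, since the salient channels then occupy more of the quantization grid.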

Both GPTQ and AWQ are GPU-native: inference is performed entirely on VRAM using optimized CUDA kernels (ExLlamaV2 for GPTQ, vLLM and AutoAWQ for AWQ). They do not support CPU offloading or Apple Silicon in the way GGUF does.

EXL2: Flexible Mixed-Precision

EXL2 (the format used by ExLlamaV2) takes the mixed-precision concept further than K-quants. Rather than applying predefined precision assignments to tensor types, EXL2 measures quantization error on calibration data and mixes bit depths (from 2 to 8 bits) both across layers and within each weight matrix, giving more bits to the weights where error hurts most, then packs the result into a compact format. The user specifies a target average bits-per-weight, and the converter distributes the bit budget to minimize the measured error.
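
The idea of spending a fixed bit budget where it helps most can be sketched with a greedy allocator. This is not ExLlamaV2's actual algorithm: the sensitivity scores, the error model (error halves per extra bit), and the function name are all assumptions made for illustration.

```python
def allocate_bits(sensitivity, target_avg, options=(2, 3, 4, 5, 6, 8)):
    """Start every group at the lowest bit depth, then repeatedly upgrade the group
    with the largest estimated error until the average-bit budget is spent."""
    n = len(sensitivity)
    level = [0] * n                                  # index into options, per group
    budget = target_avg * n - options[0] * n         # spare bits above the minimum

    def est_error(g):
        # Toy error model (assumption): error halves with each extra bit.
        return sensitivity[g] / (2 ** options[level[g]])

    while True:
        candidates = [g for g in range(n)
                      if level[g] + 1 < len(options)
                      and options[level[g] + 1] - options[level[g]] <= budget]
        if not candidates:
            break
        g = max(candidates, key=est_error)
        budget -= options[level[g] + 1] - options[level[g]]
        level[g] += 1
    return [options[lv] for lv in level]

# Hypothetical per-group sensitivity scores (higher means quantization hurts more).
sens = [8.0, 1.0, 0.5, 4.0, 0.25, 2.0, 0.1, 1.5]
bits = allocate_bits(sens, target_avg=4.0)
print(bits, "average:", sum(bits) / len(bits))
```

The most sensitive groups end up with more bits than the nominal average and the least sensitive with fewer, while the overall average stays within the target.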

EXL2 benchmarks typically show it as the highest-quality 4-bit format available for pure-GPU inference, outperforming both GPTQ and AWQ at the same average bit depth. The format is supported by the ExLlamaV2 inference engine and is available for many popular models on Hugging Face.

Format	Primary advantage	Limitation	Inference engine
GGUF K-quant	CPU+GPU, Apple Silicon, widest compatibility	Lower throughput than pure-GPU formats	llama.cpp, Ollama
GPTQ	High quality via Hessian calibration; mature ecosystem	Requires CUDA GPU; no CPU offload	ExLlamaV2, vLLM
AWQ	Fast quantization, good quality, efficient kernels	CUDA only; less flexible than EXL2	vLLM, AutoAWQ, TGI
EXL2	Best quality at 4-bit for GPU inference; fine-grained mixed precision	ExLlamaV2 only; no CPU fallback	ExLlamaV2

Choosing Your Format: A Decision Tree

Use GGUF (K-quant) if you are running on a Mac, a CPU, a machine where the model exceeds your VRAM (requiring CPU offloading), or you simply want the simplest workflow via Ollama.

Use GPTQ or AWQ if you have a dedicated NVIDIA GPU with sufficient VRAM to hold the model entirely in VRAM, you need high throughput for serving, and you want a mature ecosystem with tools like vLLM or Text Generation Inference.

Use EXL2 if you are running ExLlamaV2 specifically, want the highest quality 4-bit output for GPU inference, and your use case justifies the narrower tooling ecosystem.

For most individuals running models locally for personal productivity, GGUF Q4_K_M remains the correct default. The GPU-native formats become relevant when throughput matters (serving multiple users) or when you are optimizing quality at the edge of what 4-bit can achieve for a specific application.
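
The decision tree above can be written as a small function. This is one possible encoding of the guidance in this section, with parameter names invented for the example. It is a starting point for reasoning, not a substitute for testing on your own hardware.

```python
def recommend_format(has_nvidia_gpu: bool,
                     model_fits_in_vram: bool,
                     serving_many_users: bool = False,
                     uses_exllamav2: bool = False) -> str:
    """Return the format family suggested by the decision tree in this lesson."""
    # Macs, CPU-only machines, and setups that need CPU offloading all point to GGUF.
    if not has_nvidia_gpu or not model_fits_in_vram:
        return "GGUF (K-quant)"
    if uses_exllamav2:
        return "EXL2"
    if serving_many_users:
        return "GPTQ or AWQ"
    # Personal, single-user workloads: the simplest workflow wins.
    return "GGUF Q4_K_M"

print(recommend_format(has_nvidia_gpu=False, model_fits_in_vram=False))
print(recommend_format(has_nvidia_gpu=True, model_fits_in_vram=True, serving_many_users=True))
```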

bitsandbytes: The Training Integration Path

A separate quantization ecosystem, bitsandbytes by Tim Dettmers, provides 8-bit and 4-bit quantization integrated directly into the Hugging Face Transformers training and inference pipeline. Its primary use case is memory-efficient fine-tuning of large models on limited VRAM: the frozen base weights stay quantized while only a small set of added parameters trains in higher precision. QLoRA (2023) combined bitsandbytes 4-bit with LoRA adapters, allowing 65B model fine-tuning on a single 48 GB GPU. For local inference, bitsandbytes is generally slower than GGUF or GPTQ but valuable in fine-tuning workflows.

Key Terms

GPTQ: A post-training quantization method using Hessian-based weight corrections from a calibration dataset. Produces high-quality 4-bit models for pure-GPU inference. IST Austria and ETH Zürich; presented at ICLR 2023.

AWQ: Activation-aware Weight Quantization. Identifies and protects the ~1% of weight channels with high activation magnitudes before quantization. Fast to apply; competitive quality with GPTQ. MIT, 2023.

EXL2: A mixed-precision quantization format for ExLlamaV2 that mixes bit depths across layers and within each weight matrix to minimize measured error under a target average bits-per-weight constraint.

bitsandbytes: A Hugging Face-integrated quantization library enabling 4-bit and 8-bit inference and fine-tuning within the Transformers pipeline. Key component of QLoRA fine-tuning workflows.

Lesson 4 Quiz

GPTQ, AWQ, EXL2, and advanced quantization — check your understanding

What distinguishes GPTQ from simpler PTQ methods like K-quants?

Correct. GPTQ uses second-order (Hessian) information from a calibration dataset to distribute quantization error intelligently across weights, reducing the net impact on model quality.

Incorrect. GPTQ is distinguished by its use of Hessian-based error correction during quantization — not by bit depth, retraining, or model size restrictions.

AWQ identifies which weights to protect by examining what signal?

Correct. AWQ runs calibration data through the model and identifies channels with large activation magnitudes — these are the ~1% of weights whose quantization causes the most damage, so they receive protective scaling.

Incorrect. AWQ uses activation magnitudes from a forward pass on calibration data to identify sensitive weights — not training gradients, perplexity, or matrix rank.

What is the primary limitation of GPTQ and AWQ compared to GGUF for local use?

Correct. GPTQ and AWQ inference uses CUDA kernels exclusively — they cannot fall back to CPU or run on Apple Silicon, unlike GGUF which works across hardware backends.

Incorrect. GPTQ and AWQ often achieve higher quality than K-quants at the same bit depth, and are widely available on Hugging Face. Their limitation is requiring CUDA-capable NVIDIA GPUs.

For most individuals running models locally on a Mac or in Ollama, which format remains the recommended default?

Correct. GGUF Q4_K_M works on CPU, GPU, and Apple Silicon, integrates directly with Ollama, and delivers excellent quality for the vast majority of local use cases without requiring CUDA infrastructure.

Incorrect. EXL2, GPTQ, and bitsandbytes all require NVIDIA CUDA infrastructure. GGUF Q4_K_M is the right default for Mac, CPU-based, and Ollama workflows.

Lab 4 — Choosing Your Quantization Format

Chat with your AI lab assistant · Complete 3 exchanges to finish

Your Task

Work through real hardware and use-case scenarios to decide between GGUF, GPTQ, AWQ, and EXL2. The assistant can help you reason through throughput requirements, VRAM constraints, tooling compatibility, and quality trade-offs.

Suggested starter: "I have an RTX 4090 with 24 GB VRAM and I want to serve a quantized 34B model to 3–5 concurrent users via an API. Walk me through which format I should use and why."

Module 4 Test — Quantization Explained

15 questions · 80% required to pass · All lessons covered

1. How many distinct values can a 4-bit quantization scheme represent per weight position?

Correct. 2⁴ = 16 distinct integer values (0–15).

Incorrect. 4-bit gives 2⁴ = 16 possible integer codes.

2. A 7B parameter model in FP32 occupies approximately how much memory?

Correct. 7B × 4 bytes (FP32) = 28 GB.

Incorrect. FP32 = 4 bytes per parameter. 7B × 4 = 28 GB.

3. What does the "zero point" in quantization accomplish?

Correct. The zero point is an offset that centers the integer representation on the actual distribution of weights in each block.

Incorrect. The zero point is an offset parameter that handles asymmetric weight distributions, not a threshold or initialization parameter.

4. In block quantization, how large is a typical quantization block?

Correct. Typical block sizes are 32 or 64 weights, each with its own scale factor to adapt to local distribution.

Incorrect. Block quantization uses small local groups — typically 32 or 64 weights — not entire layers or the whole model.

5. What year did GGUF replace GGML as the standard llama.cpp container format?

Correct. GGUF was introduced in August 2023 as the successor to the GGML binary format.

Incorrect. GGUF was introduced in August 2023, replacing the older GGML binary format.

6. In a Q4_K_M file, some tensors are stored internally at which higher precision?

Correct. In Q4_K_M, the most sensitive tensors (part of the attention value and feed-forward down projections) are kept at Q6_K internally, while the majority use Q4_K.

Incorrect. Within a Q4_K_M file, the higher-precision internal format for sensitive tensors is Q6_K — not FP16 or Q8_0.

7. Which benchmark is most commonly used to compare quantized model quality and why?

Correct. MMLU's 57-subject factual recall format directly targets the area where quantization error compounds, making it the standard comparison benchmark.

Incorrect. MMLU is the standard because its precision-dependent recall tasks expose quantization error more clearly than generation or conversation benchmarks.

8. At what perplexity delta does quantization-induced quality loss typically become noticeable to human readers?

Correct. Community and research evaluations consistently find that perplexity deltas below ~0.5 are imperceptible in practice. The +2.0 threshold is where coherence degradation becomes reliably noticeable.

Incorrect. Small deltas (+0.1 to +0.5 typical for Q4_K_M) are statistically real but qualitatively imperceptible. Human-noticeable degradation typically begins around +2.0.

9. Which of the following tasks is LEAST affected by quantization from FP16 to Q4_K_M?

Correct. Summarization draws on broad semantic patterns rather than precise numerical or rare-word distinctions — exactly the kind of task that survives quantization well.

Incorrect. Counting, arithmetic, and rare-vocabulary prediction are the areas most sensitive to quantization error. Broad summarization tasks are largely unaffected.

10. What does GPTQ use to determine which quantization errors to compensate for?

Correct. GPTQ computes the Hessian to identify which weight perturbations cause the most loss increase, then applies compensating adjustments to neighboring weights.

Incorrect. Activation magnitudes are AWQ's signal. GPTQ uses the Hessian — the second-order gradient information — to guide its error compensation.

11. What fraction of weight channels does AWQ identify as "sensitive" and apply protective scaling to?

Correct. AWQ's key finding (and efficiency) is that only ~1% of weight channels — those with large activation magnitudes — are responsible for most quantization-induced quality loss.

Incorrect. AWQ's efficiency comes from identifying that only ~1% of channels need protection. Protecting more would be wasteful; these 1% produce the large activations that matter.

12. EXL2's primary advantage over GPTQ at the same average bit depth is:

Correct. EXL2 mixes bit depths across layers and within each weight matrix, spending more bits where measured error is highest, and so achieves better quality than fixed-type assignments at the same average bits-per-weight.

Incorrect. EXL2 requires CUDA like GPTQ. Its advantage is fine-grained precision allocation, not hardware support or calibration data requirements.

13. Which quantization format is most appropriate for fine-tuning large models with QLoRA?

Correct. QLoRA specifically combines bitsandbytes 4-bit quantization with LoRA adapters — allowing fine-tuning of large models within the Hugging Face Transformers training pipeline.

Incorrect. GGUF, GPTQ, and EXL2 are inference-focused formats. QLoRA uses bitsandbytes for training-time quantization, keeping the frozen base weights at 4-bit while training LoRA adapters.

14. The Q2_K format begins to produce qualitatively problematic output primarily because:

Correct. With so few representable values, the weight matrices that implement attention lose granularity needed to differentiate token relationships, causing repetitive or incoherent generation.

Incorrect. Q2_K does work in Ollama and uses per-block scaling. The fundamental problem is that 2.6 bits provides too few distinct values for attention weight matrices to maintain quality.

15. A user on an M2 MacBook Pro with 32 GB unified memory wants to run a 13B model at the highest feasible quality. Which format and approximate variant is most appropriate?

Correct. GPTQ and EXL2 require CUDA; bitsandbytes is a training tool. With 32 GB unified memory, a 13B Q6_K (~10.7 GB) or Q8_0 (~14 GB) fits comfortably in Ollama, giving near-FP16 quality on Apple Silicon.

Incorrect. GPTQ, EXL2, and bitsandbytes all require NVIDIA CUDA and won't run on Apple Silicon. With 32 GB unified memory, GGUF Q6_K or Q8_0 gives the best quality via Ollama on this hardware.