Running Models Locally

1. Open WebUI (formerly Ollama WebUI) is primarily designed to serve what function?

Correct. Open WebUI is the Layer 3 application interface — a Docker-deployed web app that provides a polished user experience on top of a running Ollama installation.

Incorrect. Open WebUI is a web-based chat interface — Layer 3 — that runs over Ollama and adds conversation history, model switching, RAG, and multi-user support for teams.

2. llama-server's continuous batching feature processes multiple inference slots together. What is the primary performance benefit?

Correct. Continuous batching lets each forward pass serve every active slot at once, so the fixed per-step cost (chiefly reading the model weights) is spread across N requests instead of one.

Incorrect. The key benefit is that each GPU forward pass serves all active slots together, amortising the fixed per-step cost across them.

3. Which of the four system prompt pillars is most directly responsible for preventing confident hallucination on unknown facts?

Correct. The Uncertainty pillar — e.g., "If you are unsure, say so explicitly" — directly addresses hallucination by giving the model an alternative to filling knowledge gaps with confident fabrication.

Incorrect. The Uncertainty pillar handles unknown information. Without explicit uncertainty instructions, local models default to their training priors — producing confident, plausible-sounding but wrong answers.

4. What is the primary advantage of Chroma for local RAG compared to Qdrant?

Correct. Chroma is an embedded database — it runs inside your Python process. Qdrant requires a separate Docker container or server process.

Incorrect. Chroma's key advantage is being embedded: it runs in-process with no Docker container or server management, unlike Qdrant, which requires a separate service.

5. In Llama 3's chat template, what token marks the beginning of a role header?

Incorrect. <|start_header_id|> is Llama 3's role header token. Each model family uses distinct special tokens — mixing them produces format mismatch failures.
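A minimal sketch of how the role header tokens fit together, assuming the commonly documented Llama 3 layout (a header, a blank line, the message, then <|eot_id|>). The function name and message format are hypothetical; in practice the runtime's own chat template should be preferred, since the special tokens must be tokenised as special tokens rather than plain text.

```python
def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts as a Llama 3 style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn starts with a role header wrapped in the two header tokens.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # The turn body is closed by the end-of-turn token.
        parts.append(f"{message['content']}<|eot_id|>")
    # Leave an open assistant header so the model continues with its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(format_llama3_prompt([
        {"role": "system", "content": "Respond in plain text only."},
        {"role": "user", "content": "What is a KV cache?"},
    ]))
```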

6. What is the primary bottleneck that determines tokens-per-second throughput during LLM inference on a GPU?

Correct. LLM autoregressive generation is memory-bandwidth-bound: throughput scales with how fast the GPU can read weights from VRAM, not with arithmetic compute capacity.

Incorrect. Memory bandwidth is the binding constraint. LLM generation reads model weights once per token — the rate of that read determines speed, not arithmetic core count or clock frequency.
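The bandwidth argument can be turned into a rough ceiling: if every generated token requires one full read of the weights, decode speed cannot exceed memory bandwidth divided by model size. The sketch below assumes a single request and ignores KV cache reads and compute time; the function name and the example figures (400 GB/s, a 4 GB model file) are illustrative, not measurements.

```python
def estimate_decode_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes one full pass over the weights per generated token.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Hypothetical GPU with 400 GB/s of memory bandwidth and a 4 GB quantised model.
    ceiling = estimate_decode_tokens_per_second(400, 4)
    print(f"theoretical ceiling: {ceiling:.0f} tokens/sec")
```

Real throughput lands below this ceiling, but the ratio explains why a smaller quantised file usually generates faster on the same card.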

7. A local model produces literal asterisks and pound signs in its terminal output. What is the most direct fix?

Correct. Local models default to markdown because their training data included it. An explicit system prompt instruction — "plain text only, no markdown" — is the direct fix. Post-processing is a workaround, not a solution.

Incorrect. The direct fix is an explicit system prompt instruction: "Respond in plain text only. Do not use markdown, headers, or bullet points." Temperature does not control format token generation in this way.

8. Grammar-constrained generation in llama.cpp uses what kind of grammar specification?

Correct. llama.cpp's BNF-style grammars (GBNF) constrain the token sampling distribution at each step, making it structurally impossible to generate output that violates the grammar.

Incorrect. llama.cpp uses BNF-style context-free grammars (GBNF) that constrain token sampling at generation time, not post-processing filters.
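The idea of constraining sampling rather than filtering afterwards can be shown with a toy decoder. This is not llama.cpp's implementation: the function names, the string "tokens", and the tiny yes/no grammar are all hypothetical, chosen only to show that disallowed tokens are removed before a choice is made.

```python
def constrained_greedy_decode(step_scores, allowed_next):
    """Greedy decoding where each step may only pick a token the grammar allows.

    step_scores: list of {token: score} dicts, one per generation step.
    allowed_next: function mapping the tokens emitted so far to the set of
                  tokens the grammar permits next (empty set means "stop").
    """
    output = []
    for scores in step_scores:
        allowed = allowed_next(tuple(output))
        # Mask out every token the grammar does not permit at this position.
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
        if not candidates:
            break
        output.append(max(candidates, key=candidates.get))
    return output


def yes_no_grammar(prefix):
    """Toy grammar: the whole output is exactly one token, "yes" or "no"."""
    return {"yes", "no"} if len(prefix) == 0 else set()


if __name__ == "__main__":
    # "maybe" has the highest raw score but is never eligible.
    steps = [{"maybe": 0.9, "yes": 0.6, "no": 0.2}, {"yes": 0.5}]
    print(constrained_greedy_decode(steps, yes_no_grammar))
```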

9. For Mistral 7B in raw completion mode, where should system instructions be placed?

Correct. Mistral 7B's instruct template has no dedicated system role token, so the usual convention is to place system instructions inside the first [INST] block, before the user's query.

Incorrect. Mistral's template has no <system> tag. System instructions must be placed inside the first [INST] block. Using non-existent template tokens is one of the most common causes of Mistral prompt failures.
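A minimal sketch of that placement. It assumes the runtime adds the beginning-of-sequence token itself, and the blank line separating the system text from the user query is an illustrative choice rather than something the template mandates; the function name is hypothetical.

```python
def format_mistral_prompt(system, user):
    """Place system instructions inside the first [INST] block, before the query."""
    # There is no system tag in this template: both parts share one [INST] block.
    return f"[INST] {system.strip()}\n\n{user.strip()} [/INST]"


if __name__ == "__main__":
    print(format_mistral_prompt(
        "Respond in plain text only. Do not use markdown.",
        "Explain what a Modelfile is in two sentences.",
    ))
```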

10. The four elements a well-structured system prompt should specify are:

Correct. Role (who the model is), task (what it does), format (how to return results), and constraints (what to avoid) are the four core elements of an application system prompt.

The four elements are: role, task, format, and constraints. These cover who the model is, what it should do, how to format output, and what to avoid.

11. What does the "zero point" in quantization accomplish?

Correct. The zero point is an offset that shifts the integer range so that it covers the actual, often asymmetric, distribution of weights in each block.

Incorrect. The zero point is an offset parameter that handles asymmetric weight distributions, not a threshold or initialization parameter.
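To make the offset concrete, here is a sketch of generic affine (asymmetric) quantisation for one block of weights. It is a textbook formulation, not the storage layout of any particular GGUF quantisation type; the function names and sample weights are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Affine quantisation: map [min, max] of the block onto integers 0..2^bits - 1."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # The zero point is the integer code that represents the real value 0.0.
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Invert the mapping: subtract the offset, then rescale."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    # An asymmetric block: the range extends further above zero than below it.
    block = [-0.5, 0.0, 0.25, 1.0]
    codes, scale, zero_point = quantize_block(block)
    print(codes, scale, zero_point)
    print(dequantize_block(codes, scale, zero_point))
```

Without the offset, a symmetric scheme would have to spend half of its integer codes on negative values that this block barely uses.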

12. LM Studio's GPU Offload slider at maximum means:

Correct. At maximum offload every layer is held in VRAM, which gives the highest GPU utilisation and tokens-per-second, provided the model fits.

Incorrect. Maximum GPU offload means all transformer layers reside in VRAM, minimising memory transfer overhead.

13. The llama-bench binary measures two primary metrics abbreviated "pp" and "tg". What do they stand for?

Correct. pp = prompt processing (prefill phase, reading the input prompt) measured in tokens/sec; tg = token generation (autoregressive decode phase) also in tokens/sec.

Incorrect. pp = prompt processing (prefill) and tg = token generation (decode). These are the two phases of LLM inference and have very different performance characteristics.

14. The KV cache in LLM inference stores what, and how does it affect VRAM requirements?

Correct. The KV cache stores the key and value vectors computed for every previously processed token. It grows in proportion to context length × layers × KV heads × head dimension, consuming significant VRAM for long conversations.

Incorrect. The KV cache stores intermediate attention computations (keys and values) for each processed token, enabling attention to previous context without recomputation. It grows linearly with context length and can add several GB of VRAM for long conversations.
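The linear growth can be estimated with simple arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses an illustrative configuration (32 layers, 32 KV heads, head dimension 128, 16-bit cache entries), assumed here to resemble a 7B-class model with full multi-head attention; models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Illustrative configuration, not taken from any specific model card.
    for context_len in (2048, 4096, 8192):
        size = kv_cache_bytes(32, 32, 128, context_len)
        print(f"{context_len:>5} tokens: {size / 1024**3:.1f} GiB")
```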

15. EleutherAI's GPT-NeoX 20B (2022) was significant because, at its release, it was the largest publicly available autoregressive language model with openly published weights. What kind of organization is EleutherAI?

Correct. EleutherAI is a volunteer research collective (not corporate, not government-funded) that has published landmark open models including GPT-J and GPT-NeoX, as well as the Pile training dataset.

Incorrect. EleutherAI is a volunteer research collective that formed organically online, committed to open AI research as a counterweight to closed corporate development.

16. Which command checks whether the Ollama service is reachable?

Correct. Curling localhost:11434 returns "Ollama is running" when the service is live. Ollama has no status, ping, or health subcommand.

Incorrect. Ollama doesn't have a status, ping, or health command. Checking the HTTP endpoint with curl is the standard verification method.
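The same check can be scripted. This sketch uses only the Python standard library and assumes the default address localhost:11434 and the "Ollama is running" response body described above; the function name is hypothetical.

```python
import http.client
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the root endpoint answers with the expected banner text."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or a broken HTTP reply.
        return False
    return "Ollama is running" in body


if __name__ == "__main__":
    print("reachable" if ollama_is_reachable() else "not reachable")
```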

17. What is the recommended default quantisation tier for a 7B model when VRAM is not the limiting factor?

Correct. Q4_K_M is the established default recommendation — near-F16 quality with a file size suitable for consumer hardware.

Incorrect. Q4_K_M is the recommended starting point: perplexity degradation is minimal and file size fits most consumer VRAM budgets.

18. Why does few-shot prompting produce more reliable structured output than prose format instructions for smaller local models?

Correct. Transformers excel at continuing patterns they observe in context. A worked example is immediate, concrete pattern context — far more reliable than an instruction that must activate a learned association with variable strength.

Incorrect. Few-shot works through pattern continuation — the core transformer mechanism. The model has just "seen" JSON being produced and continues that pattern. This is more direct and reliable than instruction-following, which activates fuzzier learned associations.
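A small sketch of assembling such a prompt. The "Input:" / "Output:" labels, the function name, and the sentiment examples are illustrative assumptions; the point is that the prompt ends on an open "Output:" line, so the most natural continuation is another JSON object in the same shape.

```python
import json


def build_few_shot_prompt(instruction, examples, query):
    """Build a prompt from worked (input, output) pairs followed by the real query."""
    lines = [instruction, ""]
    for text, output in examples:
        lines.append(f"Input: {text}")
        # Serialise each example exactly as the model is expected to answer.
        lines.append(f"Output: {json.dumps(output)}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    examples = [
        ("Great product!", {"sentiment": "positive"}),
        ("Arrived broken.", {"sentiment": "negative"}),
    ]
    print(build_few_shot_prompt(
        "Classify the sentiment. Reply with JSON only.",
        examples,
        "Works as described.",
    ))
```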

19. Which benchmark is most commonly used to compare quantized model quality and why?

Correct. MMLU's 57-subject factual recall format directly targets the area where quantization error compounds, making it the standard comparison benchmark.

Incorrect. MMLU is the standard because its precision-dependent recall tasks expose quantization error more clearly than generation or conversation benchmarks.

20. What is the purpose of the PARAMETER instruction in a Modelfile?

Correct. PARAMETER sets inference-time settings like temperature (randomness), num_ctx (context window), top_p (sampling), and repeat_penalty.

Incorrect. PARAMETER in a Modelfile sets inference settings such as temperature, context window, and sampling parameters. It has nothing to do with Python packages, model size declarations, or environment variables.
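As an illustration, the sketch below renders a small Modelfile from Python. The base model name "llama3" and the parameter values are placeholders rather than recommendations, and the helper function is hypothetical; only the FROM, PARAMETER, and SYSTEM instruction names come from the Modelfile format itself.

```python
def render_modelfile(base_model, parameters, system_prompt=None):
    """Render Modelfile text with one PARAMETER line per inference setting."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3",  # placeholder base model name
        {"temperature": 0.2, "num_ctx": 4096, "top_p": 0.9, "repeat_penalty": 1.1},
        system_prompt="Respond in plain text only.",
    )
    print(text)
```

The resulting text can be saved as a file named Modelfile and built with ollama create, passing the file via -f.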

Final Exam