Module 6 · Lesson 1

LM Studio — A GUI for Local LLMs

From download to first inference in under ten minutes — no terminal required.

What makes LM Studio the on-ramp that brought local AI to non-engineers?

When LM Studio 0.2 shipped in early 2024, its download counter crossed one million installs within weeks. Forum posts on Reddit's r/LocalLLaMA read like conversion stories: users who had never touched a terminal were chatting with Mistral 7B on a gaming laptop the same afternoon they heard about the tool. The GUI had solved a real barrier.

What LM Studio Is

LM Studio is a desktop application (macOS, Windows, Linux) that bundles a model browser, a downloader, a chat interface, and a local OpenAI-compatible REST server into a single installer. Under the hood it shells out to llama.cpp binaries — but the user never sees that layer unless they want to.

The project is developed by a small team led by Ahmet Öztürk and remains free for personal use. Its commercial licensing terms were updated in late 2024 to require a paid plan for business deployments generating revenue, a move that mirrored similar decisions by Ollama and other tooling projects as the space matured.

The Four Panels

LM Studio's interface organises work into four primary areas:

DiscoverBrowsable catalogue of GGUF models sourced from Hugging Face. Search by name, filter by parameter count or quantisation. One-click download with progress bar.

ChatConversation interface. Load a model, set a system prompt, adjust context length and temperature via sliders. Inference runs locally; nothing leaves the machine.

Local ServerStarts an HTTP server on localhost:1234. Exposes /v1/chat/completions and /v1/completions endpoints that are drop-in replacements for the OpenAI API. Any OpenAI SDK client works with a one-line base-URL change.

My ModelsLibrary of downloaded GGUF files. Shows file size, quantisation tier, and last-used date. Models can be deleted or re-downloaded here.

Hardware Detection and Backend Selection

On launch, LM Studio probes the host GPU and selects an appropriate llama.cpp backend: Metal on Apple Silicon, CUDA on NVIDIA cards, Vulkan on AMD/Intel. The number of layers offloaded to the GPU is surfaced as a simple slider labelled GPU Offload — moving it from 0 to max shifts weight matrices from RAM to VRAM, accelerating inference at the cost of VRAM budget.

In March 2024 LM Studio added MLX backend support for Apple Silicon, providing a second high-performance path alongside Metal-accelerated llama.cpp. Users on M-series Macs can choose whichever backend yields higher tokens-per-second for a given model.

Real Benchmark — M3 Max, 2024

Independent benchmarks published on the r/LocalLLaMA wiki in mid-2024 showed Llama 3 8B Q4_K_M achieving roughly 55–65 tokens/sec in LM Studio on an M3 Max MacBook Pro with 36 GB unified memory — fast enough for comfortable real-time conversation. The same model on a mid-range RTX 4070 laptop hit approximately 70–80 tokens/sec with full GPU offload.

The Local Server in Practice

The OpenAI-compatible server is LM Studio's most consequential feature for developers. Starting the server and pointing the Python openai library at http://localhost:1234/v1 requires only a single line change:

client = openai.OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # any string; LM Studio ignores it
)

This compatibility allowed developers to test prompt logic against local models at zero marginal cost before paying for cloud API calls — a workflow that spread rapidly through the indie developer community throughout 2024.

Key Takeaway

LM Studio is not a novel inference engine — it is a polished GUI skin over llama.cpp with a bundled model marketplace. Its value proposition is accessibility: it removed the terminal entirely from the path to running a local LLM, unlocking a user base that would never have attempted a manual llama.cpp build.

Lesson 1 Quiz

LM Studio — A GUI for Local LLMs

What inference engine does LM Studio use under the hood?

Correct. LM Studio is fundamentally a GUI front-end over llama.cpp. It ships the binaries inside the installer and manages them automatically.

Not quite. LM Studio wraps llama.cpp binaries. The abstraction is intentional — users never need to interact with llama.cpp directly.

On what port does LM Studio's local server listen by default?

Correct. LM Studio defaults to localhost:1234. (Port 11434 is Ollama's default — a common confusion.)

Not quite. LM Studio uses port 1234. Port 11434 belongs to Ollama; 8080 is a common generic web server default.

What is the practical purpose of LM Studio's GPU Offload slider?

Correct. Each layer pushed to VRAM reduces RAM usage and increases inference speed, up to the limit of available VRAM.

Incorrect. The GPU Offload slider controls how many model layers reside in VRAM versus system RAM, directly affecting tokens-per-second throughput.

Which additional backend for Apple Silicon did LM Studio add in March 2024?

Correct. Apple's MLX framework, developed at Apple Research, provides a high-performance array computation library optimised for unified memory on M-series chips.

Not quite. The answer is MLX — Apple's own machine learning array framework released in late 2023 and integrated into LM Studio as an optional backend in March 2024.

Lab 1 — Setting Up LM Studio

Guided practice with your AI lab assistant · 3 exchanges to complete

Your Task

You are setting up LM Studio on a Windows machine with an RTX 3060 (12 GB VRAM) and 32 GB system RAM. You want to run Mistral 7B Instruct for a local chatbot project. Walk through the key decisions with the assistant.

Start by asking: which quantisation tier should I download for Mistral 7B on my RTX 3060, and how do I set the GPU offload correctly?

LM Studio Setup Advisor

Lab 1

Ready to help you configure LM Studio. Ask me about quantisation selection, GPU offload settings, or anything else about getting Mistral 7B running on your RTX 3060.

Module 6 · Lesson 2

llama.cpp — The Engine Underneath

How a C++ project became the foundation of consumer-grade local AI inference.

Why did Georgi Gerganov's side project become the most forked inference library in AI history?

On 10 March 2023, Georgi Gerganov — the Bulgarian developer who had already built the whisper.cpp speech-recognition port — pushed the first commit of llama.cpp to GitHub. His stated goal was to run Meta's LLaMA model on a MacBook without a GPU. Within 48 hours the repository had thousands of stars. Within a week, contributors had added Windows support, ARM optimisations, and the first quantisation routines. The project's pace never slowed.

Architecture Philosophy

llama.cpp is written in C and C++ with zero mandatory external dependencies beyond the C standard library. This choice was deliberate: it makes the binary portable, compilable on any platform with a C compiler, and free of the Python packaging complexity that plagued early PyTorch-based local inference attempts.

The core computation is done by ggml — Gerganov's own tensor library, also written in pure C. ggml implements the attention mechanism, matrix multiplications, and activation functions needed for transformer inference, with hand-tuned SIMD paths for x86 AVX, ARM NEON, and Apple AMX.

GGUF: The Model Format

llama.cpp originally used a format called GGML, then transitioned in August 2023 to GGUF (GGML Unified Format). GGUF is a binary container format that stores:

MetadataModel architecture, tokenizer vocabulary, hyperparameters, and author-supplied fields — all in a self-describing header. A GGUF file is fully self-contained; no separate config.json is needed.

Tensor DataWeight tensors stored in the chosen quantisation format. Multiple tensors can be mixed-precision — attention matrices may use Q5_K_M while feed-forward layers use Q4_K_M.

Alignment PaddingTensors are page-aligned so memory-mapped loading (mmap) works without copying, enabling near-instant model loading even for 70B parameter files.

Ecosystem Impact

By mid-2024 Hugging Face hosted over 120,000 GGUF model files, making it the dominant distribution format for quantised open-weight models. LM Studio, Ollama, GPT4All, and Jan all consume GGUF natively.

Building llama.cpp from Source

While most users consume llama.cpp through wrappers, building from source unlocks the latest features and platform-specific optimisations. The canonical build for CUDA-enabled systems:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

The -DGGML_CUDA=ON flag compiles CUDA kernels for NVIDIA GPUs. Equivalent flags exist for GGML_METAL (Apple), GGML_VULKAN (cross-platform GPU), and GGML_OPENCL.

Key Binaries

A llama.cpp build produces several executables in the build/bin/ directory:

llama-cliInteractive command-line inference. Replaced the older main binary in mid-2024. Supports single-turn and multi-turn conversation with full parameter control.

llama-serverHTTP server exposing OpenAI-compatible endpoints. Replaces the older server binary. Production-grade with continuous batching and multi-slot support added in 2024.

llama-quantizeConverts full-precision GGUF (F16/F32) to any quantisation tier. Essential for creating custom quants from freshly converted model weights.

llama-benchBenchmarks prompt processing (pp) and token generation (tg) speed across batch sizes. Used to compare backends and quantisation tiers.

Why It Matters

llama.cpp's decision to depend on nothing and compile anywhere made local LLM inference achievable on hardware that never appeared in AI research papers — Raspberry Pis, old Core i5 laptops, smartphones via Android NDK builds. Every GUI tool in this module is ultimately a wrapper around this single C++ project.

Lesson 2 Quiz

llama.cpp — The Engine Underneath

Who created llama.cpp and in what month/year was its first commit?

Correct. Georgi Gerganov pushed the first commit on 10 March 2023, days after Meta's LLaMA weights leaked. He had previously built whisper.cpp using the same ggml tensor library.

Incorrect. llama.cpp was created by Georgi Gerganov with the first commit on 10 March 2023.

What does GGUF stand for and what replaced it before it?

Correct. GGUF (GGML Unified Format) replaced the original GGML format in August 2023. The transition added a self-describing header, eliminating the need for separate architecture config files.

Not quite. GGUF stands for GGML Unified Format and it replaced the original GGML file format in August 2023.

Which llama.cpp binary would you use to measure tokens-per-second performance across different quantisation levels?

Correct. llama-bench runs systematic benchmarks of prompt processing and token generation speed, making it ideal for comparing quant tiers or backends.

Incorrect. llama-bench is the dedicated benchmarking binary. llama-quantize converts models; llama-cli is for interactive use; llama-server runs an HTTP API.

What CMake flag enables NVIDIA CUDA acceleration when building llama.cpp?

Correct. The flag is -DGGML_CUDA=ON, reflecting that CUDA kernels are part of the ggml tensor library layer. (Prior to mid-2024 the flag was -DLLAMA_CUBLAS=1.)

Not quite. The correct flag is -DGGML_CUDA=ON. It compiles optimised CUDA kernels for the ggml tensor operations that underpin llama.cpp inference.

Lab 2 — llama.cpp CLI Inference

Guided practice with your AI lab assistant · 3 exchanges to complete

Your Task

You have just built llama.cpp with CUDA support and downloaded a Llama 3 8B Q4_K_M GGUF file. You want to run inference from the command line with specific parameters: 4096 context, 512 max tokens, temperature 0.7, and all layers on the GPU.

Ask the assistant to help you construct the correct llama-cli command with those parameters, and explain what each flag does.

llama.cpp CLI Advisor

Lab 2

I can help you build the right llama-cli command. Tell me your model path, desired parameters, and any special requirements — I'll walk through each flag and what it controls.

Module 6 · Lesson 3

Quantisation Tiers in Practice

Navigating the Q4 through Q8 spectrum — where quality breaks and speed gains are real.

At what quantisation level does perplexity degradation become noticeable in real tasks?

The r/LocalLLaMA community ran hundreds of informal perplexity and evals comparisons throughout 2023 and 2024. The consensus that emerged — validated by more rigorous benchmarks from TheBloke and later by the llama.cpp team's own tooling — was that Q4_K_M represented the practical sweet spot: it fit 7B models comfortably in 6 GB VRAM while preserving enough precision for the model's reasoning to remain coherent across most tasks.

The K-Quant System

llama.cpp uses a family of quantisation methods generically called k-quants, introduced by community contributor ikawrakow in mid-2023. K-quants apply different bit-widths to different parts of the weight matrix — specifically, using higher precision for the scales that control quantisation buckets while using lower precision for the individual weight values.

The naming convention encodes two properties: the bit-width (Q4, Q5, Q6, Q8) and the variant (_K_S = small, _K_M = medium, _K_L = large). Higher letter means more bits allocated to scales, meaning better quality at slightly larger file size.

Tier-by-Tier Comparison

Tier	Bits/weight	7B VRAM	Quality vs F16	Best Use Case
Q2_K	~2.6	~3.0 GB	Noticeably degraded	Extreme RAM constraints only
Q3_K_M	~3.3	~3.9 GB	Usable but lossy	Very tight VRAM budgets
Q4_K_M	~4.8	~5.2 GB	Very close to F16	General purpose — recommended default
Q5_K_M	~5.7	~6.1 GB	Near-identical to F16	When VRAM allows; coding/reasoning tasks
Q6_K	~6.6	~7.0 GB	Effectively lossless	Maximum quality with quantisation savings
Q8_0	8.0	~8.2 GB	Perceptually identical to F16	Reference quality; 8 GB VRAM cards

Perplexity Data — llama.cpp Wiki

The official llama.cpp wiki's perplexity table (measured on wikitext-2) shows Q4_K_M adding approximately 0.1–0.3 perplexity points over F16 for 7B Llama 2 — a degradation smaller than the variance between different random seeds of the same prompt. Q3_K_M adds 0.5–1.2 points, which corresponds to detectable but not catastrophic quality loss on structured tasks.

Mixed Quantisation and imatrix

In 2024 llama.cpp added importance matrix (imatrix) quantisation. The idea: run a calibration dataset through the model at full precision, measure which weights have the highest activation variance (are most "important"), and protect those with higher bit allocation during quantisation.

imatrix quants are labeled with -imat suffixes on Hugging Face. Benchmarks show imatrix Q4_K_M matching or exceeding standard Q5_K_M on many tasks — effectively a free quality upgrade that many model uploaders now apply by default.

Creating a Custom Quant

Starting from a freshly converted F16 GGUF, quantising to Q4_K_M takes one command:

./build/bin/llama-quantize \
    models/my-model-f16.gguf \
    models/my-model-q4_k_m.gguf \
    Q4_K_M

With imatrix, the process requires a calibration step first:

# Step 1: generate importance matrix
./build/bin/llama-imatrix \
    -m models/my-model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.dat

# Step 2: quantise using the imatrix
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    models/my-model-f16.gguf \
    models/my-model-q4_k_m-imat.gguf \
    Q4_K_M

Decision Rule

For most users: pick Q4_K_M as a default, upgrade to Q5_K_M if VRAM allows, and only go below Q4 if the model literally will not fit in RAM at any higher tier. imatrix variants are preferable to non-imatrix at the same tier when available.

Lesson 3 Quiz

Quantisation Tiers in Practice

What does the "_M" in Q4_K_M indicate?

Correct. The S/M/L suffix in k-quants refers to how much precision is allocated to the quantisation scales. M (medium) is a widely recommended balance of quality and file size.

Not quite. The "_M" stands for Medium — it controls how many bits are used for the scales in the k-quant scheme, with more bits giving better reconstruction quality.

What is the primary benefit of imatrix quantisation over standard k-quant quantisation at the same tier?

Correct. imatrix quantisation measures activation variance on a calibration corpus and allocates higher precision to the most sensitive weight dimensions, recovering quality without increasing file size.

Incorrect. The benefit is quality, not size reduction or speed. imatrix uses a calibration pass to identify which weights matter most and protects them during quantisation.

For a 7B model on a GPU with exactly 6 GB VRAM, which quantisation tier is the practical maximum that will fit?

Correct. Q4_K_M at ~5.2 GB for a 7B model fits within a 6 GB VRAM budget. Q5_K_M at ~6.1 GB would exceed the hard 6 GB limit when accounting for KV cache and runtime overhead.

Not quite. Q4_K_M (~5.2 GB) is the highest tier that safely fits in 6 GB. Q5_K_M at ~6.1 GB would exceed the limit once KV cache and runtime overhead are added.

Which llama.cpp binary is used to convert a full-precision GGUF to a quantised GGUF?

Correct. llama-quantize takes an input GGUF, output path, and quantisation type as positional arguments and produces the quantised file.

Incorrect. The correct binary is llama-quantize. It takes the source F16/F32 GGUF, the destination path, and the quantisation type (e.g., Q4_K_M) as arguments.

Lab 3 — Choosing and Creating Quantisations

Guided practice with your AI lab assistant · 3 exchanges to complete

Your Task

You have access to a freshly converted Mistral 7B F16 GGUF (14 GB) and want to create quantised versions for two targets: a 6 GB VRAM GPU workstation, and a 16 GB RAM-only laptop. You also want to understand whether an imatrix quant is worth the extra effort.

Ask which quantisation tier to create for each target, what commands to run, and whether imatrix quantisation is worthwhile for these use cases.

Quantisation Planning Advisor

Lab 3

I can help you plan your quantisation strategy. Tell me about your target hardware and use case and I'll recommend the right tier plus the exact commands to produce each file.

Module 6 · Lesson 4

llama-server and the OpenAI-Compatible API

Running llama.cpp as a production-ready local API server.

How close is llama-server's API to OpenAI's — and where do the gaps matter?

Throughout 2024, llama-server was quietly upgraded from a demo tool to a serious inference server. The addition of continuous batching, multi-slot support, and speculative decoding made it competitive with purpose-built serving systems for moderate-throughput workloads. Several companies processing hundreds of requests per hour switched from managed API calls to self-hosted llama-server deployments, reporting cost reductions of 80–90% against commercial API pricing.

Starting llama-server

The minimal command to start a server serving a local model:

./build/bin/llama-server \
    -m models/llama3-8b-q4_k_m.gguf \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99

Key flags: -c sets the context window size (affects KV cache VRAM usage), -ngl sets GPU layers (99 offloads all layers to GPU), and --host 0.0.0.0 makes the server accessible on the local network rather than only localhost.

API Endpoints and Compatibility

llama-server exposes both its native API and OpenAI-compatible endpoints:

POST /v1/chat/completionsOpenAI Chat Completions-compatible. Accepts the same JSON schema including messages array, temperature, max_tokens, stream, and stop sequences. Works with any OpenAI SDK client.

POST /v1/completionsOpenAI Legacy Completions-compatible. Takes a prompt string rather than a messages array. Used by older tooling expecting the text-davinci-003 interface.

POST /completionNative llama.cpp endpoint. Exposes additional parameters like mirostat sampling, repeat penalties, and grammar-constrained generation not available in the OpenAI-compatible endpoints.

GET /healthReturns server status. Used by load balancers and orchestration tools to confirm the server is loaded and ready.

POST /tokenizeReturns the token IDs for an input string. Useful for accurately measuring prompt length before sending a request.

Multi-Slot Concurrent Requests

By default llama-server handles one request at a time. Adding the --parallel flag enables multiple simultaneous inference slots, each maintaining its own KV cache state:

./build/bin/llama-server \
    -m models/llama3-8b-q4_k_m.gguf \
    -c 16384 \
    --parallel 4 \
    -ngl 99

With --parallel 4, the context window is divided across 4 slots (4096 tokens each at the context setting above). VRAM usage for the KV cache scales linearly with parallel count. Continuous batching means all active slots are processed together in each forward pass, amortising the fixed per-step compute cost.

Real Deployment — 2024

The Ollama project's architecture (which wraps llama.cpp) uses a similar multi-slot approach and reported in mid-2024 that a single RTX 4090 running llama-server with 4 parallel slots could handle approximately 20–30 concurrent users at conversational response speeds for a 7B model — sufficient for small team deployments.

Grammar-Constrained Generation

One of llama-server's most powerful non-OpenAI features is grammar-constrained generation: providing a BNF-style grammar to force the model output to conform to a specific structure. This enables reliable JSON extraction without post-processing:

# Force valid JSON output via the native /completion endpoint
{
  "prompt": "Extract the name and age from: John Smith is 34 years old.",
  "grammar": "root ::= object\nobject ::= \"{\" ws members ws \"}\"\nmembers ::= pair (\",\" ws pair)*\npair ::= string \":\" ws value\nvalue ::= string | number\nstring ::= \"\\\"\" [^\\\""]* \"\\\"\"\nnumber ::= [0-9]+\nws ::= [ \\t\\n]*"
}

In practice, llama.cpp ships a grammars/ directory with pre-built grammars for JSON, SQL, and other common formats that can be loaded directly rather than written by hand.

LM Studio vs Direct llama-server

LM Studio's built-in server is llama-server with a fixed configuration UI. Launching llama-server directly gives access to every flag — continuous batching depth, slot count, grammar files, speculative decoding, and custom sampling parameters — making it the right tool whenever LM Studio's GUI surface is insufficient.

Lesson 4 Quiz

llama-server and the OpenAI-Compatible API

What llama-server flag enables multiple simultaneous inference slots for concurrent users?

Correct. --parallel N creates N inference slots each with its own KV cache segment, enabling concurrent request handling with continuous batching across all active slots.

Not quite. The correct flag is --parallel. It divides the context window across N simultaneous slots and enables continuous batching across all of them.

Which llama-server endpoint exposes grammar-constrained generation that the OpenAI-compatible endpoints do not?

Correct. The native POST /completion endpoint exposes llama.cpp-specific features like grammar-constrained generation, mirostat sampling, and repeat penalties that the OpenAI-compatible endpoints do not surface.

Incorrect. Grammar constraints are available only on the native POST /completion endpoint, not the OpenAI-compatible /v1/ endpoints.

If you start llama-server with -c 16384 and --parallel 4, how many context tokens does each slot receive?

Correct. The total context window is divided evenly across parallel slots — 16384 / 4 = 4096 tokens per slot. Each slot maintains independent KV cache state within its allocation.

Not quite. The total context is split across slots — 16384 ÷ 4 = 4096 per slot. This is why VRAM usage scales with both context size and parallel count.

What does the POST /tokenize endpoint on llama-server return?

Correct. /tokenize returns the token IDs corresponding to an input string, letting you precisely measure prompt token count before submitting a full inference request.

Incorrect. POST /tokenize converts an input string into its token IDs — the reverse of detokenisation. This is used to check prompt lengths against context window limits before sending requests.

Lab 4 — Deploying llama-server

Guided practice with your AI lab assistant · 3 exchanges to complete

Your Task

You are deploying llama-server on a Linux machine with an RTX 4090 (24 GB VRAM) to serve a team of 8 people running a Llama 3 70B Q4_K_M model. You need to configure parallel slots, appropriate context size, and understand how to use the grammar endpoint to reliably extract JSON from model outputs.

Ask how to configure llama-server for 8 concurrent users with a 70B model on an RTX 4090, then ask about using grammar-constrained generation for JSON extraction.

llama-server Deployment Advisor

Lab 4

Ready to help you configure a production llama-server deployment. Tell me your hardware, model, team size, and use case requirements — I'll work through the optimal server configuration with you.

Module 6 Test

LM Studio and llama.cpp · 15 questions · 80% to pass

1. LM Studio primarily adds value over raw llama.cpp by providing which of the following?

Correct.

Incorrect. LM Studio wraps llama.cpp with a polished GUI and marketplace — it does not replace or improve the inference engine.

2. Which programming language is llama.cpp written in?

Correct. C/C++ with zero mandatory dependencies was a deliberate portability decision.

Incorrect. llama.cpp is C/C++ — this portability choice was fundamental to its rapid adoption across platforms.

3. The ggml tensor library inside llama.cpp provides SIMD paths for which processor instruction sets?

Correct. ggml has hand-tuned SIMD paths for all three major CPU instruction families.

Incorrect. ggml includes optimised paths for x86 AVX, ARM NEON, and Apple AMX.

4. GGUF files use memory-mapped loading. What property of the format makes mmap efficient?

Correct. Page-aligned tensors let the OS memory-map them directly, enabling near-instant model loading regardless of file size.

Incorrect. GGUF uses page alignment to enable zero-copy memory mapping via mmap.

5. LM Studio's GPU Offload slider at maximum means:

Correct. Maximum offload = all layers in VRAM = maximum GPU utilisation and tokens-per-second.

Incorrect. Maximum GPU offload means all transformer layers reside in VRAM, minimising memory transfer overhead.

6. Which k-quant variant uses the most bits for quantisation scales, yielding the best quality at a given bit-width?

Correct. _K_L allocates the most precision to scales, yielding the highest quality within the k-quant family at a given bit-width.

Incorrect. Within the S/M/L spectrum, _K_L uses the most bits for scales and produces the best quality, at the cost of slightly larger file size.

7. What is the imatrix calibration step in llama.cpp quantisation used for?

Correct. The calibration pass measures which weights have high activation variance on representative text, then protects them with higher precision during quantisation.

Incorrect. imatrix calibration identifies high-variance (high-importance) weights so they can be allocated more precision during quantisation.

8. What is the recommended default quantisation tier for a 7B model when VRAM is not the limiting factor?

Correct. Q4_K_M is the established default recommendation — near-F16 quality with a file size suitable for consumer hardware.

Incorrect. Q4_K_M is the recommended starting point: perplexity degradation is minimal and file size fits most consumer VRAM budgets.

9. Which MLX backend was added to LM Studio in March 2024 as an alternative to Metal-accelerated llama.cpp on Apple Silicon?

Correct. Apple's MLX array framework, developed at Apple Research and open-sourced in late 2023, was integrated as an optional inference backend in LM Studio for M-series Macs.

Incorrect. The addition was the Apple MLX framework — a numpy-like array library optimised for unified memory, released by Apple Research.

10. llama-server's continuous batching feature processes multiple inference slots together. What is the primary performance benefit?

Correct. Continuous batching means each forward pass serves multiple slots simultaneously, spreading the fixed matrix multiplication cost across N requests instead of 1.

Incorrect. The key benefit is that each GPU forward pass serves all active slots together, amortising the fixed per-step compute cost.

11. To connect the Python openai library to a local LM Studio server, what base_url should you use?

Correct. LM Studio defaults to port 1234 and exposes OpenAI-compatible routes under /v1.

Incorrect. LM Studio's default is localhost:1234/v1. Port 11434 is Ollama.

12. What does the llama-server --parallel 4 flag do to the KV cache?

Correct. Each parallel slot has its own KV cache segment, so VRAM for KV cache scales with N parallel slots × context size × precision.

Incorrect. Each slot maintains an independent KV cache. VRAM for KV cache scales linearly with the number of parallel slots.

13. Grammar-constrained generation in llama.cpp uses what kind of grammar specification?

Correct. BNF-style grammars constrain the token sampling distribution at each step, making it structurally impossible to generate invalid output.

Incorrect. llama.cpp uses BNF-style context-free grammars that constrain token sampling at generation time, not post-processing filters.

14. The llama-bench binary measures two primary metrics abbreviated "pp" and "tg". What do they stand for?

Correct. pp = prompt processing (prefill phase, reading the input prompt) measured in tokens/sec; tg = token generation (autoregressive decode phase) also in tokens/sec.

Incorrect. pp = prompt processing (prefill) and tg = token generation (decode). These are the two phases of LLM inference and have very different performance characteristics.

15. Approximately how many GGUF model files were hosted on Hugging Face by mid-2024?

Correct. By mid-2024 Hugging Face hosted over 120,000 GGUF files, making it by far the largest repository of quantised local models.

Incorrect. By mid-2024 the count exceeded 120,000 GGUF files on Hugging Face, reflecting the format's dominance in the consumer local AI ecosystem.