Module 8 · Lesson 1

The Local API Layer

Ollama, LM Studio, and llama.cpp all expose HTTP endpoints. Here's how every serious local app begins.

What does it actually take to turn a running model into something a real application can call?

In late 2023, the Mozilla Foundation launched llamafile—a single-executable packaging of llama.cpp that anyone could download and run, including a self-hosted HTTP server on port 8080. Within weeks, open-source developers were pointing standard OpenAI client libraries at localhost:8080 and building fully offline chatbots. No cloud account. No rate limit. The same pattern had already appeared in LM Studio's local server feature and Ollama's REST API—each one deliberately mimicking the OpenAI chat completions format so existing code would just work.

Why an HTTP API?

A language model is a function: tokens in, tokens out. Wrapping that function in an HTTP server is the simplest way to make it available to any programming language, any framework, and any tool that can make a web request. This is not a new idea—it is how every cloud AI provider works—but running the server on localhost changes the security and latency equation completely.

The three dominant local runtimes each expose slightly different native APIs, but all of them also support the OpenAI chat completions format. This means a Python script that calls http://localhost:11434/v1/chat/completions (Ollama) can be pointed at http://localhost:1234/v1/chat/completions (LM Studio) by changing the URL string and, where necessary, the model name.

The Three Runtimes and Their Endpoints

Runtime	Default Port	Primary Endpoint	OpenAI-Compatible?
Ollama	11434	/api/chat	Yes (also /v1/chat/completions)
LM Studio	1234	/v1/chat/completions	Yes (exact format)
llama.cpp server	8080	/v1/chat/completions	Yes

Ollama also exposes /api/generate for single-turn completion and /api/tags to list installed models programmatically—useful for building model-picker UIs.
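As a rough sketch of how a model picker might get its data, the snippet below reads /api/tags and extracts the model names. It assumes the response is a JSON object with a `models` array whose entries carry a `name` field; the helper names `parse_model_names` and `list_models` are illustrative, not part of any library.

```python
import json
import urllib.request


def parse_model_names(tags_response: dict) -> list[str]:
    """Pull model names out of an /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", []) if "name" in m]


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch installed model names from a running Ollama instance."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return parse_model_names(json.load(resp))
```

Keeping the parsing separate from the network call makes the picker logic easy to test without a running daemon.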

The Minimal Python Call

The requests library is all you need to start. The pattern below targets Ollama's native endpoint; the other two runtimes need only a different URL and response key:

import requests

url = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Summarise the water cycle in two sentences."}
    ],
    "stream": False  # return one complete JSON object instead of a stream
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()  # surface HTTP errors such as a missing model
print(resp.json()["message"]["content"])

For LM Studio or llama.cpp server, swap the URL and change the response key to choices[0].message.content—the exact OpenAI response shape.
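One way to hide that difference from the rest of an application is a small normalising helper. This is a sketch only: `extract_content` is a hypothetical name, and it assumes the two response shapes described above are the only ones in play.

```python
def extract_content(body: dict) -> str:
    """Return the assistant text from either response shape."""
    if "choices" in body:   # OpenAI-compatible shape
        return body["choices"][0]["message"]["content"]
    if "message" in body:   # Ollama native /api/chat shape
        return body["message"]["content"]
    raise ValueError("Unrecognised response shape")
```

With this in place, switching runtimes means changing the URL while the calling code keeps using `extract_content(resp.json())`.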

The OpenAI Python Client Shortcut

Because the /v1/ routes of LM Studio, Ollama, and llama.cpp server are OpenAI-compatible, you can use openai-python directly with a custom base_url. The same client code then covers streaming and, where the runtime and model support them, function calling and structured outputs:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio
    api_key="not-needed"              # required by client, ignored locally
)

completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(completion.choices[0].message.content)
    

Streaming Responses

For interactive applications, streaming is essential. Instead of waiting seconds for a full response, tokens arrive incrementally and the UI updates in real time. Set "stream": true in the payload and read the response body line by line as it arrives. The wire format depends on the endpoint: Ollama's native API streams newline-delimited JSON objects, while the OpenAI-compatible endpoints stream server-sent events as data: lines.
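The two stream formats can be parsed with a few lines each. The sketch below assumes the lines have already been decoded to `str` (for example by an HTTP client's line iterator), that Ollama marks the final object with `"done": true`, and that the OpenAI-style stream ends with a `data: [DONE]` line; the function names are illustrative.

```python
import json


def iter_ollama_chunks(lines):
    """Yield text pieces from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        event = json.loads(line)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break


def iter_sse_chunks(lines):
    """Yield text pieces from an OpenAI-style server-sent events stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece
```

Both generators can feed the same UI code, for instance `for piece in iter_sse_chunks(lines): print(piece, end="", flush=True)`.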

Key Insight

The local API layer is not a toy interface—it is the same HTTP/JSON contract that cloud providers use. Mastering it locally means your code is already portable to hosted endpoints when you need to scale.

OpenAI-compatible: An API that accepts and returns the same JSON schema as OpenAI's chat completions endpoint, allowing existing OpenAI client code to work unchanged by pointing at a different base URL.

SSE (Server-Sent Events): A web standard for pushing data from server to client over a single HTTP connection, used by streaming LLM APIs to deliver tokens incrementally.

base_url: A configuration parameter in OpenAI client libraries that redirects all API calls to a custom host, enabling local model use without changing application logic.

Lesson 1 Quiz

The Local API Layer — four questions

1. Ollama's default HTTP port is:

Correct. Ollama listens on port 11434 by default. LM Studio uses 1234 and llama.cpp server uses 8080.

Not quite. Ollama's default is port 11434. LM Studio uses 1234; llama.cpp server defaults to 8080.

2. What is the key benefit of an OpenAI-compatible API endpoint on a local runtime?

Correct. The OpenAI-compatible format means you only need to change the base_url—all other application code stays the same.

The primary benefit is code reuse: your existing OpenAI client code works against local runtimes by pointing at a different URL.

3. To use the openai-python library against LM Studio, you set the api_key parameter to:

Correct. The openai-python client requires api_key to be set, but local runtimes ignore its value entirely — any string works.

Local runtimes don't authenticate. You still need to supply api_key because the client library requires a non-None value, but any string works.

4. Streaming responses from a local model are useful because:

Correct. Streaming lets your application display tokens as they are generated rather than waiting for the full response, which is critical for good UX.

Streaming is about latency perception. Tokens are shown to the user as they arrive, making the model feel faster and more interactive.

Lab 1: Designing Local API Integrations

Chat with an AI assistant about connecting applications to local model endpoints

Your Task

You are building a small Python script that calls a local Ollama instance. Use the AI assistant below to work through the design: how to structure the request, handle the response, and decide between streaming and non-streaming modes for your use case.

Suggested opening: "I'm building a Python script that summarises log files using a local Ollama model. How should I structure the API call, and when should I use streaming vs non-streaming?"

AI Lab Assistant

Local API Design

Welcome to Lab 1. I'm here to help you design integrations with local model APIs — Ollama, LM Studio, or llama.cpp server. Describe your application and we'll work through the API structure, request format, and response handling together. What are you building?

Module 8 · Lesson 2

Prompt Engineering for Application Contexts

A chatbot prompt and a document-processing prompt are completely different problems. Here is how to design each.

How do you move beyond "just ask the model" to engineering prompts that reliably produce machine-readable output?

When Hugging Face released text-generation-inference (TGI) in 2022, production teams quickly discovered that the same model produced dramatically different output quality depending on how they framed the system prompt. Teams at companies including Mistral AI and Stability AI published internal findings showing that structured output prompts — asking the model to return JSON with explicit field names — reduced post-processing errors by over 60% compared to freeform text prompts. The lesson propagated through the open-source community and is now standard practice in any local model application stack.

System Prompts as Application Configuration

The system message in the chat format is not a polite suggestion—it is your application's primary control surface. Everything about how the model behaves in your app flows from this single string. Treat it with the same care you would treat configuration code.

A well-structured system prompt specifies: role (who the model is), task (what it should do), format (how to return results), and constraints (what to avoid). Omitting any of these leaves the model to make its own choices—which are often wrong for your context.
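Treating the prompt as configuration can be as simple as assembling it from the four parts. The following is a minimal sketch; `build_system_prompt` and the labelled layout it produces are assumptions for illustration, not a required format.

```python
def build_system_prompt(role: str, task: str, output_format: str,
                        constraints: list[str]) -> str:
    """Assemble a system prompt from role, task, format, and constraints."""
    parts = [
        f"Role: {role}",
        f"Task: {task}",
        f"Format: {output_format}",
        "Constraints:",
        *[f"- {c}" for c in constraints],
    ]
    return "\n".join(parts)


prompt = build_system_prompt(
    role="You are a contract classifier.",
    task="Assign each contract to exactly one category.",
    output_format="JSON with keys category and rationale.",
    constraints=["Do not invent categories.", "No prose outside the JSON."],
)
```

Because each part is a separate argument, a missing element shows up as a missing parameter rather than as a silent gap in a long string.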

The Four Prompt Architectures

Architecture 1

Conversational

Multi-turn dialogue. System prompt defines persona and scope. History grows with each turn. Use for chatbots, assistants, customer support.

Architecture 2

Single-Shot Processing

One input, one structured output. System prompt specifies exact JSON/CSV format. Use for document parsing, classification, extraction.

Architecture 3

Chain-of-Thought

Multi-step reasoning with intermediate outputs. The model reasons step by step before answering; when the work is split across several calls, each step's output becomes the next step's input. Use for complex analysis, code review, multi-stage decisions.

Architecture 4

Tool-Augmented

Model calls functions/tools defined in the API request. System prompt explains available tools. Use for search, calculations, database queries.
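For the tool-augmented architecture, a request and a dispatcher might look like the sketch below. It uses the OpenAI-style `tools` field; whether a given local runtime and model honour that field is an assumption you need to verify. The tool `lookup_order`, the `REGISTRY` dict, and `run_tool_call` are hypothetical names invented for this example.

```python
import json


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: a stand-in for a real database query."""
    return {"order_id": order_id, "status": "shipped"}


# Tool description in the OpenAI-style function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

REGISTRY = {"lookup_order": lookup_order}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call taken from an assistant message."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # some APIs deliver arguments as a JSON string
        args = json.loads(args)
    return REGISTRY[fn["name"]](**args)


payload = {
    "model": "llama3",  # assumption: substitute a model that supports tool calls
    "messages": [{"role": "user", "content": "Where is order 1042?"}],
    "tools": TOOLS,
}
```

The dispatcher only runs functions listed in `REGISTRY`, so a model that names an unknown tool raises a `KeyError` instead of executing arbitrary code.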

Forcing Structured Output

When your application needs to parse the model's response programmatically, freeform text is a liability. Two approaches work reliably with local models:

# Approach 1: Prompt-based JSON enforcement
system = """You are a data extraction assistant.
Always respond with valid JSON only. No prose, no markdown fences.
Schema: {"entities": [{"name": str, "type": str, "confidence": float}]}"""

# Approach 2: Ollama structured output (format parameter)
payload = {
    "model": "llama3",
    "format": "json",   # forces JSON mode
    "messages": [...]
}
    

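Ollama's format: "json" parameter constrains decoding with a JSON grammar, so the sampler can only pick tokens that keep the output syntactically valid. This is more reliable than prompt-only approaches, especially for smaller models, but it has limits: the output can still be cut off if generation hits the token limit, and nothing guarantees that the fields match your schema. Keep the JSON instruction in the prompt as well, and validate the parsed result.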

Context Window Management

Local models have fixed context windows—typically 4K to 128K tokens. For conversational applications, you must manage history length explicitly. The standard strategy is a sliding window: keep the system prompt, the most recent N exchanges, and optionally a summary of earlier conversation.
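A sliding window over chat history can be sketched in a few lines. This version counts exchanges rather than tokens and omits the optional summary step; `trim_history` is an illustrative name, and it assumes each exchange is one user message followed by one assistant message.

```python
def trim_history(messages: list[dict], max_exchanges: int) -> list[dict]:
    """Keep system messages plus the most recent user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    keep = max_exchanges * 2  # one exchange = user message + assistant reply
    recent = dialogue[-keep:] if keep > 0 else []
    return system + recent
```

Call it just before each request, for example `payload["messages"] = trim_history(history, max_exchanges=6)`, so the full history can still be stored elsewhere.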

For document processing, chunk large inputs into segments that fit the context window with room for the response. A rule of thumb: use at most 70% of the context window for input, leaving 30% for the model's response.
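The 70/30 rule of thumb translates into simple arithmetic. The sketch below uses integer maths for the budget and a rough four-characters-per-token estimate to size segments; that ratio is an assumption that varies by tokenizer and language, and both function names are illustrative.

```python
def input_budget(context_tokens: int, input_percent: int = 70) -> int:
    """Tokens available for input when the rest is reserved for the response."""
    return context_tokens * input_percent // 100


def split_to_budget(text: str, budget_tokens: int,
                    chars_per_token: int = 4) -> list[str]:
    """Split text into segments sized to an approximate token budget."""
    size = budget_tokens * chars_per_token  # crude character-based estimate
    return [text[i:i + size] for i in range(0, len(text), size)]
```

For a 32,768-token window, `input_budget(32768)` gives 22,937 input tokens, which leaves roughly 9,800 for the response. For exact counts, replace the character estimate with the model's own tokenizer.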

Production Pattern

Always validate model output before using it. Even with JSON mode, wrap response parsing in try/except. Log failures with the input that caused them — these become your fine-tuning dataset.
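A validation step for the entities schema shown earlier might look like this. It is a sketch under the assumption that an empty list is an acceptable fallback; `parse_entities` is a hypothetical helper, and the logged record is what you would later collect for analysis.

```python
import json
import logging


def parse_entities(raw: str, source_input: str) -> list[dict]:
    """Parse and validate model output; return [] and log on any failure."""
    try:
        data = json.loads(raw)
        entities = data["entities"]
        if not isinstance(entities, list):
            raise TypeError("'entities' must be a list")
        for e in entities:
            if not isinstance(e["name"], str) or not isinstance(e["type"], str):
                raise TypeError("'name' and 'type' must be strings")
            conf = e["confidence"]
            if isinstance(conf, bool) or not isinstance(conf, (int, float)):
                raise TypeError("'confidence' must be a number")
        return entities
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Keep both the input and the raw output so the failure can be replayed.
        logging.warning("Invalid model output (%r). input=%r raw=%r",
                        exc, source_input, raw)
        return []
```

Returning a safe default keeps the pipeline running, while the warning preserves the evidence needed to improve the prompt or the model.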

Grammar sampling: A token-selection technique that constrains model output to a formal grammar (such as JSON schema), making syntactically invalid responses impossible rather than just unlikely.

Sliding window: A conversation history management strategy that keeps only the most recent N messages in the context, discarding older ones to stay within the model's context limit.

Chain-of-thought: A prompting technique that instructs the model to reason step by step before producing a final answer, improving accuracy on complex tasks at the cost of more tokens.

Lesson 2 Quiz

Prompt Engineering for Application Contexts — four questions

1. What does Ollama's "format": "json" parameter do that prompt-only instructions cannot guarantee?

Correct. Grammar sampling constrains which tokens can be generated, making syntactically invalid JSON structurally impossible rather than just unlikely.

Grammar sampling works at the token level — it physically prevents the model from generating tokens that would break JSON syntax, which prompt instructions alone cannot do.

2. A "sliding window" strategy in conversation history management:

Correct. A sliding window is the simplest way to keep conversation history within the context limit — you keep the system prompt and the most recent N exchanges.

A sliding window keeps the N most recent messages and drops older ones. It doesn't expand the context or compress messages.

3. Which prompt architecture is most appropriate for classifying customer support tickets into predefined categories?

Correct. Single-shot processing — one input in, one structured output — is ideal for classification tasks like ticket routing.

Classification is a single-shot task: one ticket in, one category label out. Conversational architecture would add unnecessary complexity.

4. The recommended maximum portion of a context window to use for input when processing documents is:

Correct. Leaving ~30% of the context window for the model's response prevents truncation and gives the model room to reason.

If you fill the context window entirely with input, the model has no space to generate a meaningful response. ~70% input / 30% response headroom is a standard guideline.

Lab 2: Prompt Architecture Design

Work through structured output and context management strategies with an AI assistant

Your Task

You need to build a document classification pipeline using a local model. Design the system prompt and output format with the AI assistant. Focus on: how to enforce JSON output, what fields to include, and how to handle documents longer than your context window.

Suggested opening: "I need to classify legal contracts into five categories using a local Llama 3 model. Help me design a system prompt that returns structured JSON with a category, confidence score, and brief rationale."

AI Lab Assistant

Prompt Architecture

Welcome to Lab 2. I'll help you design prompt architectures for application contexts — structured outputs, context window management, and system prompt design. What document processing or classification task are you working on?

Module 8 · Lesson 3

Building RAG Pipelines Locally

Retrieval-Augmented Generation transforms a generic model into a domain expert — without fine-tuning a single weight.

How do you connect a local language model to your own data without sending anything to the cloud?

In 2023, LlamaIndex (then GPT Index) released its local pipeline toolkit, and within months the open-source community had documented dozens of production deployments where legal, medical, and financial teams ran RAG entirely on local hardware. A team at Deutsche Telekom demonstrated an internal knowledge base assistant running on on-premise hardware with no cloud dependency, using Chroma as the vector store and Mistral 7B as the generation model. The system answered questions about internal IT policy documents with measurable accuracy superior to keyword search, processed zero external API calls, and satisfied the company's data residency requirements entirely.

What RAG Is and Why It Works Locally

RAG stands for Retrieval-Augmented Generation. The idea is simple: instead of relying on knowledge baked into model weights during training, you retrieve relevant documents at query time and include them in the prompt. The model reads the documents and answers based on them.

This works exceptionally well with local models because the bottleneck shifts from model capability to retrieval quality. A Mistral 7B model that is given the correct passage can answer questions about it as well as a much larger model that must rely on memorized training data.

The RAG Pipeline: Five Stages

1. Ingest: Load your documents (PDF, Markdown, HTML, CSV). Split into chunks of 300–800 tokens with overlap of ~10% to preserve context at boundaries.
2. Embed: Pass each chunk through an embedding model to produce a vector representation. Locally, nomic-embed-text via Ollama or sentence-transformers are standard choices.
3. Store: Insert vectors and source text into a vector database. For local use: Chroma (embedded, no server needed), Qdrant (Docker), or FAISS (in-memory, no persistence).
4. Retrieve: At query time, embed the user's question and search the vector database for the N most similar chunks. Typically N=3–5.
5. Generate: Assemble the retrieved chunks and the query into a prompt. Send to the local LLM. Return the response.
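To make the retrieve stage concrete, here is what a similarity search does in principle, written without a vector database. It is a brute-force sketch using cosine similarity over plain Python lists; real vector stores use indexed search and may use other distance metrics, and the names `cosine` and `top_n` are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, chunk_vecs, chunks, n=3):
    """Return the n chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:n]]
```

A vector database performs the same ranking, but with persistence and indexes that stay fast as the corpus grows.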

Minimal Python Implementation

from chromadb import Client
import ollama, textwrap

# --- INGEST ---
client = Client()  # in-memory Chroma instance; data is lost on exit
col = client.get_or_create_collection("docs")
with open("policy.txt", encoding="utf-8") as f:
    text = f.read()
# textwrap counts characters, not tokens: 600 characters is a crude stand-in
# for the token-based chunking described above, and it adds no overlap
chunks = textwrap.wrap(text, 600)
embeddings = [ollama.embeddings("nomic-embed-text", c)["embedding"] for c in chunks]
col.add(documents=chunks, embeddings=embeddings,
        ids=[f"c{i}" for i in range(len(chunks))])

# --- QUERY ---
query = "What is the password rotation policy?"
q_emb = ollama.embeddings("nomic-embed-text", query)["embedding"]
results = col.query(query_embeddings=[q_emb], n_results=3)
context = "\n\n".join(results["documents"][0])  # matches for the first (only) query

# --- GENERATE ---
resp = ollama.chat("llama3", messages=[
  {"role": "system", "content": f"Answer using only this context:\n{context}"},
  {"role": "user",   "content": query}
])
print(resp["message"]["content"])

Choosing a Local Embedding Model

Model	Dimensions	Context	Best For
nomic-embed-text	768	8192 tokens	General text, long documents
mxbai-embed-large	1024	512 tokens	High-accuracy short passages
all-MiniLM-L6-v2	384	256 tokens	Fast, low memory, good baseline
bge-m3	1024	8192 tokens	Multilingual documents

RAG Quality Levers

The most common RAG failure modes are poor chunk boundaries and insufficient retrieved context. Tuning these three parameters covers most quality gaps:

Chunk size: Smaller chunks (200–400 tokens) increase precision but can lose context. Larger chunks (600–1000 tokens) provide more context but dilute relevance scores.

Overlap: 10–15% overlap between adjacent chunks prevents important sentences from being split across chunk boundaries.

N results: Retrieving more chunks gives the model more material but increases prompt length. Start at N=3, increase if answers are incomplete.
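Chunk size and overlap interact through the step between chunk starts. The sketch below splits a list of words, used here as a rough proxy for tokens, which is an assumption; `chunk_with_overlap` is an illustrative helper rather than a library function.

```python
def chunk_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into chunks of `size`, each sharing `overlap` words with the previous."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With `size=500` and `overlap=50`, each chunk repeats the final 50 words of its predecessor, which matches the 10% guideline.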

Privacy Advantage

Local RAG means your documents never leave your network. This is not just a preference — for legal, medical, and financial data, it is often a compliance requirement that cloud-based RAG cannot satisfy.

RAG: Retrieval-Augmented Generation, a technique that combines a retrieval system (vector search over a document corpus) with a generation model, grounding responses in retrieved source material.

Embedding: A fixed-size vector representation of text, produced by an embedding model, that captures semantic meaning so that similar texts have similar vectors.

Chroma: An open-source, embedded vector database that runs in-process with your Python application, with no separate server required, making it ideal for local RAG prototyping.

Lesson 3 Quiz

Building RAG Pipelines Locally — four questions

1. In a RAG pipeline, the "retrieve" step involves:

Correct. Retrieval means embedding the query and performing a similarity search in the vector database to find the most relevant chunks.

Retrieval involves embedding the query and searching the vector store for semantically similar chunks — it doesn't modify the model or send all documents at once.

2. Chroma is preferred for local RAG prototyping because:

Correct. Chroma is an embedded database that runs inside your Python process — you just import and use it, with no Docker container or server management required.

Chroma's key advantage for local use is that it's embedded — it runs in the same process as your Python code with no separate server to manage.

3. What is the purpose of chunk overlap in document splitting?

Correct. Overlap ensures that sentences or paragraphs split across two chunks appear in both, so neither chunk loses critical contextual information.

Overlap is about context preservation — if a key sentence falls at a chunk boundary, overlap ensures it appears in both adjacent chunks so it's never lost entirely.

4. Which local embedding model is specifically recommended for multilingual documents?

Correct. BGE-M3 is a multilingual embedding model with strong performance across many languages, unlike the other options which are primarily English-focused.

BGE-M3 is designed for multilingual use. The other models listed are optimised for English text.

Lab 3: RAG Pipeline Design

Design a local RAG system for a specific domain with an AI assistant

Your Task

You need to build a RAG system that lets staff query a company's internal HR policy documents without sending data to the cloud. Use the AI assistant to design the pipeline: choose your embedding model, chunk strategy, vector store, and generation prompt.

Suggested opening: "I'm building a local RAG system for 200+ internal HR policy PDFs. The team needs to query them in English and Spanish. Help me choose an embedding model, set my chunk size, and design the retrieval prompt."

AI Lab Assistant

RAG Pipeline Design

Welcome to Lab 3. I'll help you design a local RAG pipeline — from document ingestion and embedding model selection to chunking strategy and generation prompt. Tell me about the documents you're working with and what questions users need to answer.

Module 8 · Lesson 4

Production Patterns: Error Handling, Logging, and Deployment

A working prototype is not a production application. Here is what separates the two.

What does it take to run a local model application reliably for weeks, not just hours?

In 2024, the team behind Open WebUI — the open-source Ollama front-end with over 40,000 GitHub stars — documented a class of failure modes that appeared only under sustained production load. Models would occasionally return truncated responses when VRAM headroom was exhausted mid-generation, the Ollama daemon would silently queue requests when all GPU slots were occupied, and client applications had no way to distinguish a slow response from a hanging one without explicit timeouts. The project responded by publishing a timeout and retry pattern that became a community standard: 30-second connection timeout, 120-second read timeout, three retries with exponential backoff, and a hard failure to a fallback message rather than a blank UI.

The Four Production Failure Modes

Failure 1

Model Not Loaded

Ollama/LM Studio is running but the requested model isn't available. Ollama returns 404 for a model that has not been pulled. Detect this and surface a clear error; don't let it appear as a generic timeout.

Failure 2

OOM / VRAM Exhausted

Model loaded but a request exceeds available memory mid-generation. Response truncates or errors. Detect via short response + error code. Retry with a reduced output limit (max_tokens on the OpenAI-compatible endpoints, num_predict in Ollama's native options).

Failure 3

Timeout / Stalled

Model is generating but too slowly for the use case. Implement read timeouts. Distinguish slow-but-progressing (stream timeout) from completely stalled (silence timeout).

Failure 4

Malformed Output

Model returns text that fails JSON parsing or schema validation. Always validate. Log the raw output and input. Retry once with stricter prompt. Fall back to default value.
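The handling for malformed output (validate, retry once with a stricter prompt, then fall back) can be sketched as follows. The `call` argument is a hypothetical callable that takes a message list and returns raw text, and the wording of `STRICT_REMINDER` is an assumption to adapt to your task.

```python
import json

STRICT_REMINDER = "Respond with a single valid JSON object and nothing else."


def get_json(call, messages: list[dict], default):
    """Try once as-is, once with a stricter reminder, then return the default."""
    attempts = [
        messages,
        messages + [{"role": "user", "content": STRICT_REMINDER}],
    ]
    for msgs in attempts:
        raw = call(msgs)
        try:
            return json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed or missing output: move to the next attempt
    return default
```

Injecting `call` keeps this logic independent of the runtime, so the same function works with any of the endpoints covered earlier.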

The Retry Pattern

import requests, time, logging

def call_model(payload, max_retries=3, backoff=2.0):
    """POST to Ollama's chat endpoint, retrying with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                "http://localhost:11434/api/chat",
                # default to non-streaming so resp.json() sees a single object
                json={"stream": False, **payload},
                timeout=(30, 120)  # (connect, read) in seconds
            )
            resp.raise_for_status()
            data = resp.json()
            content = data["message"]["content"]
            if not content.strip():
                raise ValueError("Empty response")
            return content
        except (requests.RequestException, KeyError, ValueError) as e:
            logging.warning("Attempt %d failed: %s", attempt + 1, e)
            if attempt < max_retries - 1:
                time.sleep(backoff ** attempt)  # waits 1s, then 2s
    return None  # caller handles None as graceful fallback

Structured Logging

Every production LLM application should log at minimum: the model name and version, the full prompt (or a hash of it for long prompts), the response time, the response length, and whether the response passed validation. This data becomes your evaluation and debugging corpus.

import json, time, hashlib, logging

def logged_call(model, messages, **kwargs):
    """Call the model and emit one JSON log line describing the request."""
    t0 = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = call_model({"model": model, "messages": messages, **kwargs})
    elapsed = time.perf_counter() - t0
    log_entry = {
        "model": model,
        # short fingerprint for grouping identical prompts; not a security hash
        "prompt_hash": hashlib.md5(json.dumps(messages).encode()).hexdigest()[:8],
        "elapsed_s": round(elapsed, 2),
        "response_len": len(result) if result else 0,
        "success": result is not None
    }
    logging.info(json.dumps(log_entry))
    return result

Deployment Patterns

For single-user local applications, running Ollama or LM Studio as a background process is sufficient. For team or server deployments, three patterns are common:

1. Docker Compose stack: Ollama in one container, your application in another, connected via an internal network. Add an Nginx reverse proxy with rate limiting if exposing to a LAN.
2. FastAPI wrapper: Thin Python web service that adds authentication, rate limiting, logging, and input validation before forwarding to the local model. This is the pattern used by most self-hosted AI tools.
3. Systemd service: On Linux servers, register Ollama as a systemd service so it starts on boot, restarts on crash, and logs to journald. Combine with a watchdog script that checks model availability every 60 seconds.

Health Check Endpoint

Ollama exposes GET http://localhost:11434/ which returns a 200 with the body "Ollama is running". Poll this in your deployment health checks and monitoring dashboards. For model availability, call /api/tags and verify your required model name appears in the response.
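Both checks fit in one small function. This sketch uses only the standard library and assumes /api/tags returns a `models` array with `name` fields such as `llama3:latest`; the names `model_installed` and `check_health` are illustrative.

```python
import json
import urllib.request


def model_installed(tags_response: dict, required: str) -> bool:
    """True if `required` matches an installed model, with or without its tag."""
    names = [m.get("name", "") for m in tags_response.get("models", [])]
    return any(n == required or n.split(":")[0] == required for n in names)


def check_health(required_model: str,
                 base_url: str = "http://localhost:11434") -> dict:
    """Report whether the daemon responds and the required model is installed."""
    status = {"daemon": False, "model": False}
    try:
        with urllib.request.urlopen(f"{base_url}/", timeout=5) as resp:
            status["daemon"] = resp.status == 200
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            status["model"] = model_installed(json.load(resp), required_model)
    except (OSError, ValueError):
        pass  # unreachable daemon or unreadable body: report what we have
    return status
```

A watchdog script could call `check_health("llama3")` on a schedule and raise an alert when either flag is `False`.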

Architecture Principle

Design for the model being unavailable. Every local model application should have a graceful degradation path — whether that's a cached response, a simplified fallback, or a clear "AI unavailable" message — so that a crashed runtime doesn't crash your whole application.
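One possible degradation path is a wrapper that serves the last good answer for a prompt, and a clear message when none exists. This is a minimal sketch: `DegradingClient` and `FALLBACK_MESSAGE` are hypothetical names, `call` is any callable that takes a prompt and returns text or `None`, and the cache is an unbounded in-memory dict that a real service would need to limit.

```python
FALLBACK_MESSAGE = "AI is currently unavailable. Please try again later."


class DegradingClient:
    """Wrap a model call so that failures degrade instead of propagating."""

    def __init__(self, call):
        self._call = call
        self._cache: dict[str, str] = {}

    def ask(self, prompt: str) -> str:
        try:
            answer = self._call(prompt)
        except Exception:
            answer = None  # treat any runtime failure as "model unavailable"
        if answer:
            self._cache[prompt] = answer  # remember the last good answer
            return answer
        return self._cache.get(prompt, FALLBACK_MESSAGE)
```

The broad `except` is deliberate at this outer boundary: the wrapper's only job is to stop a crashed runtime from taking the application down with it.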

Exponential backoff: A retry strategy that increases the wait time between attempts exponentially (e.g. 1s, 2s, 4s, 8s), reducing load on a struggling service while still retrying transient failures.

Graceful degradation: A design pattern where an application continues to function at reduced capability when a dependency (such as the local model) is unavailable, rather than failing completely.

Systemd service: A Linux init system unit that manages the lifecycle (start, stop, restart, logging) of a background process, ensuring it runs automatically and recovers from crashes.

Lesson 4 Quiz

Production Patterns — four questions

1. The Open WebUI community's recommended timeout values for local model calls are:

Correct. The community standard documented by Open WebUI is ~30s connection timeout and 120s read timeout, allowing for slow generation without treating it as failure.

Local models can be slow. The Open WebUI pattern uses a short connection timeout (~10-30s) to detect a down server, and a long read timeout (120s) to allow for slow generation.

2. When a local model returns a valid HTTP 200 response but the content is an empty string, your application should:

Correct. An empty response indicates a generation failure even if the HTTP status is 200. Treat it as an error, log it, and retry.

HTTP 200 with empty content is a generation failure. Your code should check for empty content explicitly and retry rather than accepting it.

3. A FastAPI wrapper around a local model is used to add:

Correct. A FastAPI wrapper is a thin web service layer that adds the production concerns (auth, rate limiting, logging, validation) that the raw model runtime doesn't provide.

A FastAPI wrapper is an application layer — it adds auth, rate limiting, logging, and validation before forwarding to the local model runtime.

4. To check if a required model is available in a running Ollama instance, you should call:

Correct. /api/tags returns a JSON list of all installed models. Parse it and verify your required model name appears before sending inference requests.

Ollama's /api/tags endpoint returns the list of all installed models. Check this to verify your model is available, not just that Ollama is running.

Lab 4: Production Hardening

Design error handling and deployment strategy for a local model application

Your Task

You have a working local model integration but need to make it production-ready. Work with the AI assistant to design: timeout and retry logic, structured logging, graceful degradation, and a deployment strategy for a small team environment.

Suggested opening: "I have a FastAPI app that calls a local Ollama instance for document summarisation. It works in testing but I need to make it reliable for 10 users. Help me add proper error handling, timeouts, and logging — and decide whether to use Docker Compose or systemd."

AI Lab Assistant

Production Hardening

Welcome to Lab 4. I'll help you harden a local model application for production — covering error handling, retry logic, timeouts, logging, and deployment patterns. Tell me about your application and its current state, and we'll build out the production layer together.

Module 8 Test

Building Applications on Local Models — 15 questions · pass at 80%

1. Which HTTP endpoint does Ollama expose for OpenAI-compatible chat completions?

Correct. Ollama supports /v1/chat/completions as its OpenAI-compatible endpoint, in addition to its native /api/chat.

Ollama exposes /v1/chat/completions for OpenAI compatibility, alongside its native /api/chat endpoint.

2. LM Studio's local server runs on which default port?

Correct. LM Studio defaults to port 1234. Ollama uses 11434; llama.cpp server uses 8080.

LM Studio's local server defaults to port 1234. Ollama uses 11434; llama.cpp uses 8080.

3. When using the openai-python client against a local model, which parameter redirects calls to localhost?

Correct. Setting base_url to your local runtime's address redirects all API calls there without changing any other application code.

base_url is the parameter that redirects the openai-python client to a custom host such as localhost:1234.

4. Streaming responses from a local model use which web standard for incremental delivery?

Correct. SSE (Server-Sent Events) delivers tokens over a single HTTP connection as data: lines, which is the standard format used by OpenAI-compatible streaming endpoints.

Streaming LLM APIs use SSE — Server-Sent Events — to push tokens over a single HTTP connection as they are generated.

5. The four elements a well-structured system prompt should specify are:

Correct. Role (who the model is), task (what it does), format (how to return results), and constraints (what to avoid) are the four core elements of an application system prompt.

The four elements are: role, task, format, and constraints. These cover who the model is, what it should do, how to format output, and what to avoid.

6. Ollama's "format": "json" parameter enforces JSON output by:

Correct. Grammar sampling operates at the token selection level — the sampler only allows tokens that are valid continuations of JSON, making syntactically invalid output impossible.

Grammar sampling constrains which tokens can be selected at each position, making invalid JSON structurally impossible rather than just unlikely.

7. For a 32K token context window model processing a document, the recommended maximum input size is:

Correct. ~70% of the context window for input leaves ~30% (roughly 10K tokens) for the model's response, preventing truncation.

The ~70% rule: use ~22K tokens of a 32K context for input, reserving ~10K for the model's response to prevent truncation.

8. In the five-stage RAG pipeline, what happens in the "embed" stage?

Correct. The embed stage runs each text chunk through an embedding model to produce a vector, which captures its semantic meaning for similarity search.

Embedding converts text chunks into vectors using an embedding model. These vectors are then stored in the vector database for similarity search.

9. The nomic-embed-text model supports a context length of:

Correct. nomic-embed-text supports 8192 token context, making it well-suited for embedding long document chunks without truncation.

nomic-embed-text has an 8192 token context window, one of the longest available in commonly used local embedding models.

10. What is the primary advantage of Chroma for local RAG compared to Qdrant?

Correct. Chroma is an embedded database — it runs inside your Python process. Qdrant requires a separate Docker container or server process.

Chroma's key advantage is being embedded: it runs in-process with no Docker container or server management, unlike Qdrant which requires a separate service.

11. RAG is particularly valuable for local model applications because:

Correct. RAG shifts the quality bottleneck from model capability to retrieval quality. A smaller local model with good retrieved context can outperform a larger model relying on memorized knowledge.

RAG shifts quality from model size to retrieval quality. A 7B model that reads the right passage can answer questions about it as well as a 70B model recalling from training data.

12. Exponential backoff in retry logic means the wait time between attempts:

Correct. Exponential backoff multiplies the wait time by a factor (usually 2) with each retry — e.g. 1s, 2s, 4s, 8s — reducing pressure on a struggling service.

Exponential backoff doubles (or multiplies) the wait time each retry: 1s, 2s, 4s, 8s. This reduces load on a struggling service while still retrying transient failures.

13. The Ollama endpoint to verify the daemon is running is:

Correct. A GET to http://localhost:11434/ returns a 200 with "Ollama is running" — this is the health check endpoint for the daemon itself.

Ollama's health check is GET / — it returns "Ollama is running" with HTTP 200 when the daemon is active.

14. A FastAPI wrapper around a local model primarily adds which capabilities?

Correct. FastAPI wraps the raw model API to add the production concerns that local runtimes don't provide: auth, rate limiting, structured logging, and input validation.

A FastAPI wrapper is an application-layer concern: it adds auth, rate limiting, logging, and validation — not model capabilities.

15. "Graceful degradation" in a local model application means:

Correct. Graceful degradation means your application has a fallback path — cached responses, simplified output, or a clear error message — so a crashed model runtime doesn't take down the whole application.

Graceful degradation means the application has a fallback when the model is unavailable — a cached response, a simplified mode, or a clear user-facing error — rather than crashing entirely.