Module 7 · Lesson 1

Why Local Models Need Different Prompts

The same sentence that works perfectly on GPT-4 may produce garbage on a local 7B model — and that gap is structural, not accidental.

What actually changes when a model runs on your machine instead of a cloud endpoint?

In early 2024, researchers at Hugging Face published benchmark results showing that switching a prompt from the model's expected instruction format to raw text caused Mistral 7B's accuracy on instruction-following tasks to drop by over 30 percentage points, without changing a single word of the actual instruction. The format was the variable. The model's fine-tuning data had burned in a specific template, and deviating from it shifted the response distribution sharply.

The Architecture Behind the Gap

Cloud models like GPT-4 and Claude expose a messages API — you pass JSON with role labels and the server's inference engine handles all token formatting, special tokens, and system prompt injection invisibly. You never see the raw text that actually reaches the model weights.

Local models run through tools like Ollama, llama.cpp, or LM Studio, which do let you use a similar messages API — but under the hood they apply a chat template defined by the model's tokenizer config. If you bypass that API and talk to the model in raw-completion mode, you must apply the template yourself or the model produces incoherent output.

Even when using the structured API, local models differ from cloud endpoints in three critical ways: they have no hidden system prompt injected by the provider, they have no output filtering layer, and their context windows are hard limits enforced by your RAM — there is no graceful degradation.

Why This Matters

Cloud providers inject thousands of tokens of hidden system context before your first message. A local model starts with an empty context. Behavior you assumed was model capability may actually have been provider scaffolding.

Chat Templates: The Invisible Layer

Every instruction-tuned local model was fine-tuned with a specific text template wrapping each turn. The model learned to recognize these markers as signal boundaries. When you use Ollama's /api/chat endpoint with role-labeled messages, Ollama applies the correct template automatically. But understanding what that template looks like explains why your prompts behave differently.

Model Family	System Token	User Token	Assistant Token
Llama 3	<|start_header_id|>system<|end_header_id|>	<|start_header_id|>user<|end_header_id|>	<|start_header_id|>assistant<|end_header_id|>
Mistral / Mixtral	No dedicated system role	[INST]	[/INST]
ChatML (e.g., Qwen)	<|im_start|>system	<|im_start|>user	<|im_start|>assistant
Gemma 2	Injected into first user turn	<start_of_turn>user	<start_of_turn>model
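To make the table concrete, here is a minimal Python sketch of what a template layer does with role-labeled messages. The function names are hypothetical, and the strings are simplified from the published Llama 3 and ChatML layouts; the authoritative template is always the one shipped in the model's tokenizer configuration.

```python
# Simplified sketch of chat-template rendering. Function names are
# illustrative; real templates live in each model's tokenizer config.

def render_llama3(messages):
    """Render role-labeled messages in the Llama 3 header format."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open an assistant header so the model continues as the assistant.
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"


def render_chatml(messages):
    """Render the same messages in ChatML."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"


messages = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is GGUF?"},
]
print(render_llama3(messages))
print(render_chatml(messages))
```

The same two messages produce visibly different raw text per family, which is why a prompt wrapped for one model cannot simply be pasted into another.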

What "No Provider Guardrails" Means for Prompts

With cloud APIs, you write a short system prompt and the provider's hidden instructions handle persona stability, refusal behavior, and safety filtering. Running locally, your system prompt is the only context shaping the model's behavior. This means local prompts must be more explicit about constraints, output format, tone, and what to do when the model is uncertain.

A system prompt that reads "You are a helpful assistant" works on GPT-4 because thousands of hidden tokens fill in the gaps. On a bare Llama 3 8B instance, that same prompt leaves the model with almost no behavioral anchoring — responses become inconsistent across similar queries.

Key Principle

Local prompting is explicit-by-default. Everything you want the model to do must be stated. Everything you want it to avoid must also be stated. The model has no ambient instructions filling the gaps.

Context Window as a Hard Resource

Hosted services often absorb context overflow for you: older messages may be truncated or summarized server-side, or the request is rejected with a clear error. Local inference leaves the handling to you. When your prompt plus conversation history exceeds the model's context window (commonly 4K–128K tokens depending on the model and your available memory), the runtime may throw an error, silently drop the oldest tokens, or produce incoherent output.

This makes token budgeting a first-class concern for local prompt design. You must account for: the system prompt, conversation history, the current user message, and the expected output — all within the hard limit set by the model and your hardware.
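The budgeting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: estimate_tokens is a crude stand-in (about four characters per token), trim_history is a hypothetical helper, and a real deployment would count with the model's own tokenizer.

```python
# Token-budget sketch. estimate_tokens is a crude stand-in (about 4
# characters per token); swap in the model's own tokenizer for real counts.

def estimate_tokens(text):
    return max(1, len(text) // 4)


def trim_history(system_prompt, history, user_message, context_limit, reserve_output):
    """Drop the oldest turns until everything fits within context_limit.

    history is a list of message strings, oldest first. Returns the kept
    turns, or raises if even an empty history does not fit.
    """
    fixed = estimate_tokens(system_prompt) + estimate_tokens(user_message) + reserve_output
    if fixed > context_limit:
        raise ValueError("system prompt, message, and output reserve exceed the window")
    kept = list(history)
    while kept and fixed + sum(estimate_tokens(t) for t in kept) > context_limit:
        kept.pop(0)  # discard the oldest turn first
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]   # roughly 100 tokens each
kept = trim_history("s" * 200, history, "u" * 80, context_limit=400, reserve_output=128)
print(len(kept))
```

Reserving output tokens up front is the important design choice here: a prompt that exactly fills the window leaves the model no room to answer.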

Context window: The total token limit a model can process at once, covering input and output combined. For local models, this is a hard RAM-enforced ceiling with no server-side mitigation.

Chat template: The specific token markers a model was fine-tuned to expect around each conversational role. Deviating from it degrades response quality significantly.

Raw completion mode: Sending plain text to a model's base completion endpoint without role formatting, requiring manual template application.

Module 7 · Lesson 1 Quiz

Why Local Models Need Different Prompts

Three questions — select the best answer for each.

1. A researcher switches Mistral 7B from its expected instruction format to raw text input without changing the instruction content. According to documented benchmarks, what is the most likely outcome?

Correct. Hugging Face benchmark data showed a 30+ percentage-point drop when Mistral 7B was switched from its expected instruction format to raw text; the model's behavior is deeply conditioned on format markers.

Incorrect. Format markers are structural signals burned in during fine-tuning. Removing them degrades instruction-following significantly — Hugging Face documented a 30+ point accuracy drop in this exact scenario.

2. What does a cloud provider's "hidden system prompt" do that a bare local model deployment does not have?

Correct. Cloud providers inject extensive hidden context that shapes behavior, refusal patterns, and persona stability. A local deployment starts with an empty context, making your explicit system prompt the sole behavioral anchor.

Incorrect. The hidden system prompt is textual context — it injects instructions and persona anchoring that make the model behave consistently. Local models start with no such scaffolding.

3. When a local model's context window is exceeded during inference, what typically happens — unlike cloud API behavior?

Correct. Local inference enforces a hard RAM-limited context window. When exceeded, you get failure or incoherent wrap-around — none of the graceful silent truncation that cloud APIs handle server-side.

Incorrect. Local models have no server-side mitigation layer. Context overflow results in hard failure or incoherent output, making token budgeting a mandatory part of local prompt design.

Module 7 · Lab 1

Template Explorer

Identify format differences and predict model behavior — minimum 3 exchanges to complete.

Your Task

You're working with a Mistral 7B model via Ollama's raw completion endpoint (bypassing the chat API). You need to decide how to format your prompts to match the model's training template.

Ask the lab assistant about chat templates, how to detect which template a model expects, what happens when you use the wrong format, and how to write a system prompt for a model with no dedicated system role (like Mistral).

Suggested opener: "I'm using Ollama's raw completion endpoint with Mistral 7B. How do I know which template to apply, and what does a correctly formatted single-turn prompt look like?"


Module 7 · Lesson 2

Writing Effective System Prompts Locally

A local system prompt must do the work that an entire provider infrastructure does in the cloud. It needs to be complete, explicit, and carefully budgeted.

How do you write a system prompt that actually controls model behavior when there is no provider safety net?

When Ollama released Llama 3.1 support in July 2024, developers on the Ollama Discord immediately noticed that the default "helpful assistant" system prompt produced wildly inconsistent behavior — sometimes the model would answer in bullet points, sometimes in paragraphs, sometimes switching languages mid-response. The fix documented in the Ollama GitHub issues thread was a detailed system prompt specifying output language, format, response length, and persona — reducing inconsistency dramatically.

The Four Pillars of a Local System Prompt

Testing across community deployments of Llama 3, Mistral, and Phi-3 models has converged on four categories of instruction that, when present, produce consistent behavior. Omitting any one typically manifests as a specific failure mode.

Pillar	What to Specify	Failure Mode If Omitted
Identity	Role, expertise domain, persona name if needed	Model adopts inconsistent personas across turns
Format	Markdown on/off, list vs. prose, response length range	Formatting oscillates unpredictably per query
Boundary	What the model will/won't do, how to handle out-of-scope queries	Model either over-refuses or over-answers
Uncertainty	What to say when unsure — "say you don't know" vs. "say based on training data"	Confident hallucination on unknown facts
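One way to keep all four pillars present is to assemble the system prompt from named parts and refuse to build it when one is empty. The sketch below is illustrative only: the SystemPrompt class and its field names are assumptions that mirror the table above, not part of any library.

```python
# Sketch: assemble a system prompt from the four pillars and refuse to
# build one with a pillar missing. Field names mirror the table above.
from dataclasses import dataclass, fields


@dataclass
class SystemPrompt:
    identity: str
    format: str
    boundary: str
    uncertainty: str

    def render(self):
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"missing pillars: {', '.join(missing)}")
        return " ".join(getattr(self, f.name).strip() for f in fields(self))


prompt = SystemPrompt(
    identity="You are a support assistant for a software product.",
    format="Reply in plain text, at most three sentences.",
    boundary="Only answer questions about the product; otherwise say it is out of scope.",
    uncertainty="If you are unsure, say you do not know.",
)
print(prompt.render())
```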

The Mistral System-Role Problem

Mistral 7B's instruction template — [INST] ... [/INST] — has no dedicated system role. The model was fine-tuned with system instructions prepended inside the first user turn. This is documented in Mistral AI's own model cards and confirmed by the tokenizer configuration on Hugging Face.

The correct pattern is to place your system instruction at the start of the first [INST] block, separated by a double newline or clearly delimited from the user's actual query. Using Ollama's chat API, this happens automatically — but if you're crafting raw completions or building API wrappers, you must implement this manually.

❌ Incorrect — Mistral Raw

<system> You are a helpful assistant. </system> [INST] What is GGUF? [/INST]

✓ Correct — Mistral Raw

[INST] You are a technical assistant. Answer concisely in plain text. If unsure, say so.

What is GGUF? [/INST]
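If you are building raw completions yourself, the placement rule can be captured in a small helper. This is a single-turn sketch under stated assumptions: build_mistral_prompt is a hypothetical function, and the beginning-of-sequence token is left out because the tokenizer or runtime typically adds it.

```python
# Sketch of single-turn raw prompt construction for Mistral-style
# [INST] templates. build_mistral_prompt is a hypothetical helper.

def build_mistral_prompt(user_query, system_instruction=""):
    """Fold the system instruction into the first [INST] block."""
    body = user_query.strip()
    if system_instruction.strip():
        # No system role: prepend the instruction, separated by a blank line.
        body = f"{system_instruction.strip()}\n\n{body}"
    return f"[INST] {body} [/INST]"


print(build_mistral_prompt(
    "What is GGUF?",
    "You are a technical assistant. Answer concisely in plain text. If unsure, say so.",
))
```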

Token Budgeting the System Prompt

On a 4K context window model (still common with quantized 7B models on 8GB VRAM), a 500-token system prompt consumes about 12% of your entire budget before the user types a word. Add a 10-turn conversation averaging 150 tokens per turn (1,500 tokens) and roughly half the window is gone before you reserve any room for few-shot examples or the reply. This is not theoretical; it is a documented failure pattern in llama.cpp GitHub issues.

The practical target: keep system prompts under 200 tokens for 4K context models, under 500 tokens for 8K models. Use concrete, terse language. Every adverb in your system prompt is a token that could have been context.

# Estimate system prompt size with tiktoken.
# cl100k_base is an OpenAI encoding, so counts for Llama or Mistral models
# are only approximate; use the model's own tokenizer for exact numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
system_prompt = """You are a concise technical assistant..."""
tokens = len(enc.encode(system_prompt))
print(f"System prompt: ~{tokens} tokens")

Practical Rule

Write the system prompt you want. Then cut it by 30%. Then cut it again. Local models respond well to terse, direct instruction — verbose system prompts do not improve behavior proportionally to their token cost.

Formatting Instructions That Actually Work

Local models often produce markdown by default because their training data included markdown. If you're sending output to a terminal, a database field, or an API consumer that doesn't render markdown, you get literal asterisks and pound signs in your output. The fix is explicit: "Respond in plain text only. Do not use markdown, headers, or bullet points."

Conversely, if you want structured output, be specific: "Format your response as a numbered list. Each item: one sentence only." Vague instructions like "be structured" produce inconsistent results across models and even across runs of the same model.

System prompt pillar: One of four categories (identity, format, boundary, uncertainty) that together produce consistent local model behavior.

Token budget: The deliberate allocation of context window tokens across system prompt, history, input, and output to avoid overflow.

Module 7 · Lesson 2 Quiz

Writing Effective System Prompts Locally

Three questions — select the best answer for each.

1. Which of the four system prompt pillars addresses how a model should respond when it does not know an answer?

Correct. The Uncertainty pillar specifies what the model should say when unsure — without it, models default to confident hallucination on unknown facts.

Incorrect. The Uncertainty pillar is specifically about handling unknown information. Omitting it leads to confident hallucination — one of the most common local deployment failure modes.

2. Mistral 7B's instruction template has no dedicated system role token. What is the documented correct approach for including system instructions?

Correct. Mistral AI's own model cards and Hugging Face tokenizer configs confirm that system instructions go inside the first [INST] block, before the user's query, when using raw completion mode.

Incorrect. Mistral has no dedicated system token. The documented approach — confirmed in Mistral AI's model cards — is to place system instructions at the start of the first [INST] block, clearly separated from the user query.

3. You are deploying a quantized 7B model on a machine with 8GB VRAM, giving you a ~4K token context window. What is the practical recommended maximum size for your system prompt?

Correct. Under 200 tokens preserves sufficient budget for conversation history and output on a 4K window. A 500-token system prompt alone consumes about 12% of the entire budget before any user input.

Incorrect. On a 4K context model, under 200 tokens is the practical guideline. A 2,000-token system prompt leaves only 2,000 tokens for conversation history and output — quickly causing context overflow in real use.

Module 7 · Lab 2

System Prompt Builder

Draft and refine a complete local system prompt covering all four pillars — minimum 3 exchanges.

Your Task

You need to write a system prompt for a local Llama 3 8B deployment serving as a customer support assistant for a software product. The model will run on 8GB VRAM (4K context window).

Work with the lab assistant to draft a system prompt that covers all four pillars (identity, format, boundary, uncertainty) while staying under the 200-token budget. Iterate based on feedback.

Suggested opener: "Help me draft a system prompt for a Llama 3 8B customer support bot. It needs to cover identity, format, boundary, and uncertainty — and stay under 200 tokens. Here's my first attempt: 'You are a helpful customer support agent. Answer questions about our software.'"


Module 7 · Lesson 3

Few-Shot and Structured Output Prompting

Local models lack the instruction-following precision of frontier models. Few-shot examples and explicit output schemas are the primary tools for closing that gap.

When a model doesn't reliably follow format instructions, what technique actually forces consistent structure?

In the llama.cpp GitHub repository, issue #5323 (opened October 2023) documented a widely reproducible problem: Llama 2 13B consistently ignored JSON format instructions given in prose — "respond only with valid JSON" — but produced correct JSON output when the system prompt included a single worked example of the expected JSON structure. The thread became one of the most-referenced guides for structured output prompting in local deployments.

Why Few-Shot Works Better Than Instructions Alone

An instruction like "respond in JSON" activates the model's learned association between that phrase and JSON-formatted outputs — but the association is fuzzy. The model may produce JSON with inconsistent key names, extra commentary before or after the JSON block, or occasional plain-text fallbacks when the query seems conversational.

A few-shot example provides a direct input-output demonstration. The model's next-token prediction machinery treats the example as immediate context: it has just "seen" JSON being produced for a similar query and continues that pattern. This is not instruction-following; it is pattern continuation — and pattern continuation is what transformers do most reliably.

Constructing Effective Few-Shot Examples

Few-shot examples for local models work best when they are: representative of the actual query type, minimal (the shortest example that demonstrates the pattern), and consistent in format with each other. Inconsistent few-shot examples teach the model that multiple formats are acceptable — defeating the purpose.

❌ Inconsistent Examples

Q: Name a fruit.
A: {"fruit": "apple"}
Q: Name a vegetable.
A: Here's the JSON: {"vegetable":"carrot"}
Q: Name a grain.
A:

✓ Consistent Examples

Q: Name a fruit.
A: {"item": "apple", "category": "fruit"}
Q: Name a vegetable.
A: {"item": "carrot", "category": "vegetable"}
Q: Name a grain.
A:
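Consistency is easy to check mechanically before the examples ever reach the model. The following Python sketch builds a Q:/A: prompt like the one above and rejects example sets whose answers are not pure JSON or do not share the same keys. The helper name build_few_shot is an assumption for illustration.

```python
# Sketch: build a few-shot prompt and reject inconsistent examples.
# The Q:/A: layout follows the examples above; helper names are illustrative.
import json


def build_few_shot(examples, query):
    """examples is a list of (question, answer_json_string) pairs."""
    key_sets = set()
    for _, answer in examples:
        parsed = json.loads(answer)          # raises if the answer is not pure JSON
        key_sets.add(tuple(sorted(parsed)))
    if len(key_sets) > 1:
        raise ValueError(f"examples use different keys: {sorted(key_sets)}")
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")
    return "\n".join(lines)


examples = [
    ("Name a fruit.", '{"item": "apple", "category": "fruit"}'),
    ("Name a vegetable.", '{"item": "carrot", "category": "vegetable"}'),
]
print(build_few_shot(examples, "Name a grain."))
```

Run against the inconsistent set, this check fails twice over: the second answer carries prose before the JSON, and the two answers use different keys.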

Grammar-Constrained Output (llama.cpp / Ollama)

When few-shot prompting alone is insufficient, particularly for complex JSON schemas, both llama.cpp and Ollama support constrained generation. In llama.cpp you provide a GBNF (GGML BNF) grammar that restricts every token the model can emit, so the sampler cannot produce output that violates the grammar.

Ollama exposes this via the format parameter. Passing "format": "json" enables a built-in JSON mode. For custom schemas, llama.cpp's --grammar-file flag accepts a GBNF file.

# Ollama JSON mode: constrains the reply to valid JSON
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Extract the name and age from: John Smith is 34 years old. Respond in JSON.",
  "format": "json",
  "stream": false
}'

# The "response" field of the reply should then hold a JSON string such as:
{"name": "John Smith", "age": 34}
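JSON mode constrains syntax, not which keys appear, so it is worth validating the result before using it. The sketch below assumes the non-streaming reply is a JSON object whose "response" field holds the model's text; parse_extraction and the required key names are hypothetical, and the sample reply is hand-written rather than captured from a real run.

```python
# Sketch: validate a non-streaming /api/generate reply. JSON mode constrains
# syntax, not which keys appear, so the caller still checks the fields.
import json


def parse_extraction(reply_body, required_keys=("name", "age")):
    """reply_body is the raw JSON text returned by the endpoint."""
    reply = json.loads(reply_body)
    data = json.loads(reply["response"])     # the model's text is itself JSON
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model output is missing keys: {missing}")
    return data


# Hand-written sample reply, not captured from a real run.
sample = json.dumps({
    "model": "llama3",
    "response": '{"name": "John Smith", "age": 34}',
    "done": True,
})
print(parse_extraction(sample))
```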

Chain-of-Thought for Smaller Models

Smaller local models (7B–13B parameters) benefit significantly from chain-of-thought prompting on reasoning tasks — but the technique must be applied differently than with frontier models. Appending "Think step by step" at the end of a prompt can cause smaller models to ramble; they do not have the same ability to self-terminate reasoning chains.

The more reliable approach is a structured reasoning template: explicitly ask the model to produce a reasoning section followed by a final answer section, with clear delimiters. This gives the model a token pattern to match rather than an open-ended instruction to interpret.

❌ Open-Ended CoT

Is 127 a prime number? Think step by step.

✓ Structured CoT

Is 127 a prime number?
REASONING: [work through divisibility]
ANSWER: [yes or no with one sentence]
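A side benefit of fixed delimiters is that the output becomes easy to parse. The Python sketch below splits a reply into its two sections and returns None when the model ignored the template. The function name and the sample reply are illustrative assumptions, not output from a real model.

```python
# Sketch: split a structured chain-of-thought reply into its sections.
# Returns None when the delimiters are missing, which signals a retry.
import re


def split_reasoning(output):
    match = re.search(r"REASONING:\s*(.*?)\s*ANSWER:\s*(.*)", output, re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


# Hand-written sample reply for illustration.
sample = "REASONING: 127 is not divisible by 2, 3, 5, 7, or 11.\nANSWER: Yes, 127 is prime."
print(split_reasoning(sample))
```

Downstream code can then show only the answer to the user while logging the reasoning for debugging.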

Token Cost Warning

Few-shot examples consume context tokens permanently: they sit in your context window for the entire session. On a 4K window, three examples averaging 80 tokens each cost 240 tokens before any conversation begins. Budget accordingly.

Few-shot prompting: Providing input-output demonstration examples within the prompt context, exploiting pattern continuation rather than instruction-following.

Grammar-constrained generation: Using a formal grammar (GBNF in llama.cpp) to restrict token sampling to only valid output shapes, making format violations physically impossible.

Structured CoT: Chain-of-thought prompting with explicit section delimiters (REASONING: / ANSWER:) that give smaller models a pattern to complete rather than an open instruction to interpret.

Module 7 · Lesson 3 Quiz

Few-Shot and Structured Output Prompting

Three questions — select the best answer for each.

1. According to the llama.cpp issue #5323 case, why did few-shot examples succeed where prose format instructions failed for JSON output?

Correct. The key insight from the llama.cpp thread: prose format instructions activate a fuzzy learned association, while examples leverage pattern continuation — the fundamental mechanism transformers excel at.

Incorrect. The mechanism is pattern continuation. Transformers predict the next token based on what came before — a worked example in context is far more direct than an instruction the model must interpret, which may activate only weak associations.

2. What is the primary risk of using inconsistent few-shot examples (e.g., one example adds commentary before the JSON, another does not)?

Correct. Inconsistent examples teach the model that variation in format is acceptable. If your examples themselves vary in how they present JSON, the model learns that prose commentary is sometimes acceptable — defeating the purpose.

Incorrect. The problem with inconsistent examples is that they teach inconsistency. The model learns that multiple format patterns are valid outputs for similar queries, which is exactly what you're trying to prevent.

3. In Ollama, what parameter enables grammar-constrained JSON generation, preventing the model from producing any non-JSON output?

Correct. Ollama's "format": "json" parameter enables built-in grammar-constrained JSON mode. For custom schemas, llama.cpp's --grammar-file flag accepts GBNF grammar files.

Incorrect. The correct Ollama API parameter is "format": "json". This enables the built-in JSON grammar constraint that makes it physically impossible for the model to produce non-JSON output.

Module 7 · Lab 3

Few-Shot Format Designer

Build consistent few-shot examples for a structured output task — minimum 3 exchanges.

Your Task

You need to extract structured data from customer support tickets using a local Llama 3 8B model. The output must be valid JSON with fields: {"issue_type": "...", "priority": "low|medium|high", "product": "..."}.

Work with the lab assistant to construct 2-3 consistent few-shot examples that will reliably produce this format. Consider token cost, example consistency, and whether to combine few-shot with grammar-constrained mode.

Suggested opener: "I need few-shot examples for extracting JSON from support tickets. The schema is issue_type, priority, and product. Help me write 2 consistent examples and tell me if I should also use Ollama's format:json together with the examples."


Module 7 · Lesson 4

Iterative Prompt Debugging for Local Models

Cloud APIs hide most failure modes behind retry logic and silent correction. Local models expose them raw. Systematic debugging is a core skill.

When a local model consistently produces wrong output, how do you isolate whether the problem is the template, the system prompt, the few-shot examples, or the query itself?

In December 2023, the LM Studio team published a debugging guide after their Discord saw a surge of reports about Mistral 7B "ignoring" system prompts. Investigation traced the failures to three distinct causes in roughly equal proportions: wrong template applied (users treating Mistral like ChatML), system prompt token overflow (prompts exceeding 500 tokens on 4K window models), and instruction conflict (system prompt saying "be concise" while few-shot examples showed verbose responses). The guide established a three-step isolation protocol that became standard practice in local LLM communities.

The Isolation Protocol

Effective prompt debugging for local models follows a strict isolation principle: change one variable at a time and establish a baseline with the minimal possible prompt before adding complexity. This sounds obvious but is consistently violated by developers who add system prompts, few-shot examples, and format instructions simultaneously, then cannot attribute failures.

Step	Action	If This Works	If This Fails
1. Baseline	Send query with no system prompt, minimal phrasing	Model understands query — prompt is the issue	Template or model loading problem
2. Template	Add only the chat template wrappers, no system prompt	Template is correct	Wrong template — check model's tokenizer_config.json
3. System prompt	Add system prompt with identity only (no format/boundary/uncertainty)	Identity works — format instructions may conflict	System prompt syntax error or placement issue
4. Full prompt	Add format instructions, then few-shot examples one at a time	Isolates which addition causes regression	Identifies conflicting or inconsistent element
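The protocol in the table can be sketched as a generator of prompt variants, each adding exactly one element to the previous one. This example assumes you are using a chat API with role-labeled messages, so the runtime applies the template and step 2 is covered implicitly. isolation_variants is a hypothetical helper, not part of any tool mentioned here.

```python
# Sketch of the isolation protocol as a list of chat-message variants.
# isolation_variants is a hypothetical helper; each variant adds one element.

def isolation_variants(query, identity, format_rules, examples):
    """Return (label, messages) pairs ordered from minimal to full prompt."""
    user = {"role": "user", "content": query}
    sys_identity = {"role": "system", "content": identity}
    sys_full = {"role": "system", "content": f"{identity} {format_rules}"}
    variants = [
        ("baseline", [user]),
        ("identity only", [sys_identity, user]),
        ("identity + format", [sys_full, user]),
    ]
    shots = []
    for i, (question, answer) in enumerate(examples, start=1):
        shots += [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        variants.append((f"+ example {i}", [sys_full, *shots, user]))
    return variants


variants = isolation_variants(
    "Name a grain.",
    "You are a data labeler.",
    "Reply with one JSON object only.",
    [("Name a fruit.", '{"item": "apple", "category": "fruit"}')],
)
for label, messages in variants:
    print(label, len(messages))
```

Sending the variants in order and noting the first one that regresses points to the element that needs attention.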

Reading Model Outputs as Diagnostic Signals

Specific failure patterns in local model outputs reliably indicate specific prompt problems. Learning to read these signals accelerates debugging significantly:

Symptom	Most Likely Cause	Fix
Model echoes the system prompt back in its response	Wrong template — system instruction placed in wrong position	Check template placement; use Ollama chat API instead of raw completion
Response cuts off mid-sentence	Context window exceeded or max_tokens too low	Shorten system prompt; reduce few-shot examples; increase max_tokens (num_predict in Ollama)
Random language switching	No language specified in system prompt; training data bleed	Add explicit "Respond in English only" to system prompt
Ignores format instructions after 3-4 turns	System prompt pushed out of effective attention range	Re-inject key format instruction in every user turn; shorten conversation history
Confident wrong answers	No uncertainty instruction; model filling gaps with training priors	Add explicit "If you are unsure, say so" to system prompt

Temperature and Sampling as Debugging Tools

Temperature is often treated as a creativity dial — in debugging it is a signal amplifier. Running the same prompt at temperature 0 (greedy decoding) removes sampling randomness and makes output deterministic. If a format problem persists at temperature 0, it is a structural prompt issue. If it only appears at higher temperatures, it is a sampling issue that can be addressed with lower temperature or top-p/top-k adjustments.

Ollama exposes temperature via the options parameter: "options": {"temperature": 0}. Always debug at temperature 0 first.

# Debugging run at temperature 0 (deterministic)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "system": "You are a concise assistant. Respond in plain text only.",
  "prompt": "What is the capital of France?",
  "options": {"temperature": 0},
  "stream": false
}'
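To confirm that a run really is repeatable, you can send the same prompt several times and compare the results. In the Python sketch below, generate is any callable you supply that sends the prompt to your model at temperature 0 and returns the text. It is an assumption of this example, not a library function, and a stub stands in for it so the sketch runs offline.

```python
# Sketch: check whether a prompt yields identical output across runs.
# generate is any callable you supply that sends the prompt to your model
# at temperature 0 and returns the text; it is not a library function.

def distinct_outputs(generate, prompt, runs=3):
    """Call generate(prompt) several times and return the distinct outputs."""
    seen = []
    for _ in range(runs):
        text = generate(prompt)
        if text not in seen:
            seen.append(text)
    return seen


def fake_generate(prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return "Paris"


outputs = distinct_outputs(fake_generate, "What is the capital of France?")
print(len(outputs))
```

A single distinct output that is still wrong points at the prompt; several distinct outputs at temperature 0 suggest the setting is not reaching the model.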

The Instruction Conflict Problem

Instruction conflicts are the most subtle failure mode and the one most often missed. A system prompt that says "be concise" while few-shot examples each contain three-paragraph responses teaches the model that long responses are acceptable. A format instruction saying "no markdown" while the model name in the persona section is surrounded by asterisks creates ambiguity.

The fix is an explicit audit: for every constraint in your system prompt, ask whether any other element of your prompt context contradicts or undermines it. Few-shot examples are the most common source of conflicts because they are written after the system prompt and developers do not always re-read both together.
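Part of that audit can be automated. The sketch below flags few-shot answers that contain markdown when the system prompt asks for plain text, and answers that run long when it asks for concision. The keyword matching, the regular expression, and the 60-word limit are all illustrative assumptions, and a check like this supplements a manual read-through rather than replacing it.

```python
# Sketch: flag few-shot answers that contradict the system prompt.
# Keyword checks and the word limit are illustrative assumptions.
import re

MARKDOWN = re.compile(r"(^#{1,6} |\*\*|^[-*] )", re.MULTILINE)


def audit_examples(system_prompt, example_answers, concise_word_limit=60):
    """Return a list of human-readable conflicts, empty if none are found."""
    problems = []
    lowered = system_prompt.lower()
    bans_markdown = "no markdown" in lowered or "plain text" in lowered
    for i, answer in enumerate(example_answers, start=1):
        if bans_markdown and MARKDOWN.search(answer):
            problems.append(f"example {i}: contains markdown")
        words = len(answer.split())
        if "concise" in lowered and words > concise_word_limit:
            problems.append(f"example {i}: {words} words, limit {concise_word_limit}")
    return problems


print(audit_examples(
    "Be concise. Respond in plain text only.",
    ["GGUF is a file format.", "**GGUF** is a format."],
))
```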

Debugging Workflow Summary

Start minimal. Add one element at a time. Debug at temperature 0. Read failure symptoms as diagnostic signals. Audit for instruction conflicts between every prompt component. Document what you change — local prompt debugging without version tracking produces confusion, not insight.

Isolation protocol: A stepwise debugging method (baseline → template → system prompt → full prompt) that changes one variable per step to attribute failures precisely.

Temperature 0: Greedy decoding with no sampling randomness, making output deterministic and isolating structural prompt failures from sampling variance.

Instruction conflict: A contradiction between two prompt components, typically between the system prompt's stated constraints and the behavior demonstrated in few-shot examples.

Module 7 · Lesson 4 Quiz

Iterative Prompt Debugging for Local Models

Three questions — select the best answer for each.

1. A local Llama 3 model starts correctly following format instructions in turn 1 but ignores them by turn 5. According to the diagnostic table, what is the most likely cause?

Correct. Format instructions degrading after several turns is a classic symptom of system prompt context dilution — the system prompt is being pushed far from the current query in the attention window. Fix: re-inject key format instructions in each user turn.

Incorrect. Model weights are never modified during inference. The documented cause of this pattern is system prompt attention dilution — as conversation history grows, the system prompt becomes distant in the context window and less influential.

2. You run the same prompt at temperature 0 and the format problem persists. What does this tell you?

Correct. Temperature 0 produces deterministic greedy decoding. If the format problem persists at temperature 0, it is definitively a structural prompt issue — not sampling variance that top-p or temperature adjustments could fix.

Incorrect. Temperature 0 removes all sampling randomness. A problem that persists at temperature 0 cannot be attributed to sampling variance — it is structural, meaning the prompt itself is the cause and must be fixed directly.

3. What is an instruction conflict, and what component most commonly introduces it?

Correct. Instruction conflicts arise when prompt components contradict each other — a "be concise" system prompt paired with verbose few-shot examples teaches the model that verbosity is acceptable. Few-shot examples are the most common source because they're often written after the system prompt without cross-checking.

Incorrect. An instruction conflict is a contradiction within your own prompt — specifically when a system prompt constraint (e.g., "no markdown") is undermined by behavior shown in your few-shot examples. Few-shot examples are the most common source of this subtle failure mode.

Module 7 · Lab 4

Prompt Debug Clinic

Diagnose and fix a broken local model prompt using the isolation protocol — minimum 3 exchanges.

Your Task

You have a Mistral 7B deployment that is failing in the following ways: (1) it echoes parts of the system prompt in its response, (2) responses after turn 4 ignore the "plain text only" format instruction, (3) one of your few-shot examples shows a markdown-formatted response despite the system prompt banning markdown.

Use the isolation protocol to diagnose each failure and propose fixes. Work through each symptom with the lab assistant systematically.

Suggested opener: "My Mistral 7B deployment has three symptoms: it echoes the system prompt, ignores format rules after turn 4, and one few-shot example contradicts my system prompt. Let's start with step 1 of the isolation protocol — what baseline test should I run first?"


Module 7 · Module Test

Prompt Formatting for Local Models

15 questions — score 80% or above to pass. Select the best answer for each.

1. What did Hugging Face benchmark data reveal about switching Mistral 7B from its expected instruction format to raw text input?

Correct. The documented Hugging Face benchmarks showed a 30+ point accuracy drop — demonstrating that chat template format is a structural performance variable, not cosmetic.

Incorrect. Hugging Face benchmarks documented a 30+ percentage-point accuracy drop when Mistral 7B's expected instruction format was replaced with raw text. Format is a structural variable.

2. When using Ollama's /api/chat endpoint with role-labeled messages, what does Ollama do automatically that you must do manually in raw completion mode?

Correct. Ollama's chat endpoint automatically applies the model's chat template from its tokenizer config. In raw completion mode you must apply the template yourself — or responses degrade significantly.

Incorrect. Ollama's chat API automatically applies the correct chat template. In raw completion mode, template application is your responsibility — the tokenizer_config.json in the model's Hugging Face repo defines which template to use.
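To make the contrast in question 2 concrete, here is a minimal sketch of the two request bodies, built as plain dictionaries so nothing is sent over the network. The model name `mistral` and the helper names `chat_request` and `raw_request` are illustrative assumptions. The `/api/chat` and `/api/generate` endpoints and the `raw` flag follow Ollama's API reference as I understand it, so verify them against the version you run.

```python
import json

def chat_request(model, system, user):
    """Body for POST /api/chat: the runtime applies the chat template."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }

def raw_request(model, templated_prompt):
    """Body for POST /api/generate with raw=True: no template is applied,
    so the prompt must already contain the model's own markers."""
    return {
        "model": model,
        "prompt": templated_prompt,
        "raw": True,
        "stream": False,
    }

chat_body = chat_request("mistral", "Respond in plain text only.", "Name three planets.")
raw_body = raw_request(
    "mistral",
    "[INST] Respond in plain text only.\n\nName three planets. [/INST]",
)

print(json.dumps(chat_body, indent=2))
print(json.dumps(raw_body, indent=2))
```

The chat body carries only roles and content. The raw body carries the finished template text, which is the work that shifts to you in raw completion mode.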

3. In Llama 3's chat template, what token marks the beginning of a role header?

Incorrect. <|start_header_id|> is Llama 3's role header token. Each model family uses distinct special tokens — mixing them produces format mismatch failures.
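As an illustration of the role header tokens in question 3, the sketch below renders role-labeled messages into Llama 3's instruct layout by hand. The function name `llama3_prompt` is hypothetical, and the token layout reflects the publicly documented Llama 3 format as I understand it. Treat the model's own tokenizer config as the authority before relying on it.

```python
def llama3_prompt(messages):
    """Render role-labeled messages into Llama 3's instruct layout.

    The result ends with an open assistant header so the model
    continues by writing the assistant reply.
    """
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    # Open header with no content: this is where generation starts.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

llama3_text = llama3_prompt([
    {"role": "system", "content": "Respond in plain text only."},
    {"role": "user", "content": "Name three planets."},
])
print(llama3_text)
```

Sending this string to a Mistral model, or a Mistral-style string to Llama 3, is the format mismatch the feedback above warns about.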

4. A local model produces literal asterisks and pound signs in its terminal output. What is the most direct fix?

Correct. Local models default to markdown because their training data included it. An explicit system prompt instruction — "plain text only, no markdown" — is the direct fix. Post-processing is a workaround, not a solution.

Incorrect. The direct fix is an explicit system prompt instruction: "Respond in plain text only. Do not use markdown, headers, or bullet points." Lowering temperature reduces randomness, but it does not change the model's learned preference for markdown formatting.

5. What is the practical recommended maximum system prompt length for a quantized 7B model running on 8GB VRAM with a ~4K context window?

Correct. Under 200 tokens preserves adequate budget for conversation history and output on a 4K context window. A 500-token system prompt uses 12.5% of the entire context before any user input.

Incorrect. Under 200 tokens is the practical guideline for 4K context models. Every system prompt token is permanently occupied context that cannot be used for conversation history or output.
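The arithmetic behind question 5 can be sketched as follows. The 4,000-token window matches the "~4K" figure used above (an actual 4,096-token window gives about 12.2% for the 500-token case), and the 512 tokens reserved for output is an illustrative assumption, not a recommendation from the lesson.

```python
def context_budget(context_window, system_tokens, reserved_output):
    """Return (share of the window used by the system prompt,
    tokens left for conversation history)."""
    share = system_tokens / context_window
    remaining = context_window - system_tokens - reserved_output
    return share, remaining

for system_tokens in (200, 500):
    share, remaining = context_budget(4000, system_tokens, reserved_output=512)
    print(
        f"{system_tokens}-token system prompt: {share:.1%} of the window, "
        f"{remaining} tokens left for history"
    )
```

Because the system prompt is resent on every turn, that share is paid for the whole conversation, not once.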

6. Which of the four system prompt pillars is most directly responsible for preventing confident hallucination on unknown facts?

Correct. The Uncertainty pillar — e.g., "If you are unsure, say so explicitly" — directly addresses hallucination by giving the model an alternative to filling knowledge gaps with confident fabrication.

Incorrect. The Uncertainty pillar handles unknown information. Without explicit uncertainty instructions, local models default to their training priors — producing confident, plausible-sounding but wrong answers.

7. For Mistral 7B in raw completion mode, where should system instructions be placed?

Correct. Mistral 7B Instruct has no dedicated system role token, so the working convention is to place system instructions inside the first [INST] block, before the user's query.

Incorrect. Mistral's template has no <system> tag. System instructions must be placed inside the first [INST] block. Using non-existent template tokens is one of the most common causes of Mistral prompt failures.
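A minimal sketch of the placement described in question 7, assuming a single-turn prompt. The helper name `mistral_prompt` and the blank line separating the instructions from the query are my assumptions. Exact whitespace differs between Mistral releases, and many runtimes add the `<s>` beginning-of-sequence token themselves, which is why it is optional here.

```python
def mistral_prompt(system, user, add_bos=False):
    """Fold system instructions into the first [INST] block.

    add_bos: prepend "<s>" only if your runtime does not add the
    beginning-of-sequence token on its own.
    """
    body = f"[INST] {system.strip()}\n\n{user.strip()} [/INST]"
    return ("<s>" + body) if add_bos else body

mistral_text = mistral_prompt(
    "Respond in plain text only. Do not use markdown.",
    "Summarize the water cycle in two sentences.",
)
print(mistral_text)
```

Note that no `<system>` tag or ChatML marker appears anywhere in the string.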

8. Why does few-shot prompting produce more reliable structured output than prose format instructions for smaller local models?

Correct. Transformers excel at continuing patterns they observe in context. A worked example is immediate, concrete pattern context — far more reliable than an instruction that must activate a learned association with variable strength.

Incorrect. Few-shot works through pattern continuation — the core transformer mechanism. The model has just "seen" JSON being produced and continues that pattern. This is more direct and reliable than instruction-following, which activates fuzzier learned associations.

9. What is the primary risk of providing few-shot examples with inconsistent formatting between them?

Correct. Inconsistent few-shot examples teach inconsistency. If your examples show varying formats, the model learns that variation is acceptable — the opposite of what structured output prompting is trying to achieve.

Incorrect. The model conditions on every example in its context. If the examples show different formats, it treats multiple formats as valid, which is exactly the pattern structured output prompting is meant to prevent.
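One way to guard against the inconsistency described in questions 8 and 9 is to check the examples before they reach the model. The sketch below assumes the target format is a JSON object. The example texts, the key names, and both helper names are made up for illustration.

```python
import json

EXAMPLES = [
    ("The battery died after two days.",
     '{"sentiment": "negative", "topic": "battery"}'),
    ("Setup took thirty seconds. Love it.",
     '{"sentiment": "positive", "topic": "setup"}'),
]

def few_shot_messages(system, examples, query):
    """Interleave examples as user/assistant turns ahead of the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

def examples_share_schema(examples):
    """True if every example output is a JSON object with the same keys."""
    key_sets = set()
    for _, assistant_text in examples:
        try:
            parsed = json.loads(assistant_text)
        except json.JSONDecodeError:
            return False
        if not isinstance(parsed, dict):
            return False
        key_sets.add(frozenset(parsed))
    return len(key_sets) == 1

few_shot = few_shot_messages(
    "Classify the review. Reply with JSON only.", EXAMPLES, "Screen cracked on day one."
)
print(examples_share_schema(EXAMPLES), len(few_shot))
```

Running the check in a test suite catches a stray prose-formatted example before it teaches the model that either format is acceptable.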

10. What Ollama API parameter enables built-in grammar-constrained JSON generation?

Correct. "format": "json" in Ollama's API enables the built-in JSON mode, which constrains generation to syntactically valid JSON. Still name the expected keys in the prompt and parse the reply, because a response cut off by the token limit can be incomplete.

Incorrect. The correct Ollama API parameter is "format": "json". This is documented in Ollama's official API reference and activates grammar-constrained generation using a built-in JSON grammar.
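The parameter from question 10 fits into a chat request as sketched below. The helper names and the model name are assumptions. The request is only constructed, not sent, and `parse_reply` shows the defensive parse worth keeping even with JSON mode enabled.

```python
import json

def json_mode_request(model, user):
    """Chat request body with Ollama's JSON mode switched on."""
    return {
        "model": model,
        # Asking for JSON in the prompt as well keeps the model on task.
        "messages": [{"role": "user", "content": user + " Respond using JSON."}],
        "format": "json",
        "stream": False,
    }

def parse_reply(content):
    """Return the parsed object, or None if the reply is not valid JSON
    (for example, because generation was cut off)."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None

json_body = json_mode_request("mistral", "List two primary colors.")
print(json_body["format"])
print(parse_reply('{"colors": ["red", "blue"]}'))
print(parse_reply('{"colors": ["red", '))
```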

11. In the isolation protocol's Step 1 (Baseline), you send a query with no system prompt and minimal phrasing. The model produces complete nonsense. What does this most likely indicate?

Correct. If the baseline test (minimal prompt, no system context) fails, the problem is not in your prompt — it's in template application, model loading, or the inference setup itself.

Incorrect. A failing baseline test means the problem exists below the prompt layer — it's a template application or model loading issue. There's no point debugging system prompts if the model can't handle a bare query.

12. A model correctly follows format instructions in turn 1 but ignores them by turn 5. According to documented failure patterns, what is the most likely cause and fix?

Correct. This is documented context dilution — the system prompt becomes distant in the attention window as conversation history grows. Fix: re-inject critical format instructions in every user turn or use a shorter conversation history.

Incorrect. Model weights never change during inference. This is context dilution — a documented failure mode where system prompt instructions lose influence as conversation history grows. The fix is re-injection of key instructions per turn.
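Both fixes named in question 12, re-injection and a shorter history, can be combined in one helper. This is a sketch under stated assumptions: the reminder wording, the eight-message window, and the function name `prepare_turn` are illustrative choices rather than values from the lesson.

```python
FORMAT_REMINDER = "(Reminder: respond in plain text only, no markdown.)"

def prepare_turn(history, user_text, reminder=FORMAT_REMINDER, max_messages=8):
    """Build the message list for the next request.

    Keeps the system prompt, appends the format reminder to the new
    user turn, and trims older dialogue so the window stays short.
    """
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    dialogue.append({"role": "user", "content": f"{user_text}\n\n{reminder}"})

    window = dialogue[-max_messages:]
    # Templates such as Mistral's expect the dialogue to open with a user turn.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return system + window

history = [{"role": "system", "content": "Respond in plain text only."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

next_request = prepare_turn(history, "question 5")
print(len(next_request), next_request[1]["role"], next_request[-1]["content"])
```

The reminder costs a few tokens per turn, which is usually cheaper than a long system prompt that the model stops attending to.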

13. You run a problematic prompt at temperature 0 and the format error disappears. What does this tell you about the root cause?

Correct. If a problem disappears at temperature 0, it is caused by sampling randomness — meaning lower temperature or adjusted top-p/top-k can fix it without prompt changes. This is the diagnostic value of the temperature 0 test.

Incorrect. If the problem disappears at temperature 0, it's a sampling variance issue — not a structural prompt problem. You can address it by reducing temperature or adjusting sampling parameters in production, without prompt redesign.
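The temperature 0 test in question 13 amounts to comparing failure rates under two sampling settings. In the sketch below, `generate` stands for any callable that wraps your runtime and takes a temperature. It is hypothetical, as are the stub `fake_generate`, the check `is_plain`, and the wording of the diagnoses.

```python
def failure_rate(generate, check, temperature, runs=5):
    """Call generate(temperature) `runs` times and return the share of
    outputs that fail `check`."""
    failures = sum(0 if check(generate(temperature)) else 1 for _ in range(runs))
    return failures / runs

def diagnose(rate_default, rate_zero):
    """Map the two failure rates to a likely root cause."""
    if rate_default > 0 and rate_zero == 0:
        return "sampling variance: lower temperature or tighten top-p/top-k"
    if rate_zero > 0:
        return "structural: inspect the prompt, examples, or template"
    return "not reproduced"

# Stub standing in for a real model call, for demonstration only.
def fake_generate(temperature):
    return "plain answer" if temperature == 0 else "**bold answer**"

def is_plain(text):
    return "**" not in text

rate_default = failure_rate(fake_generate, is_plain, temperature=0.8)
rate_zero = failure_rate(fake_generate, is_plain, temperature=0)
print(diagnose(rate_default, rate_zero))
```

With a real model, several runs per setting matter, since a single clean output at the default temperature proves little.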

14. What is an instruction conflict in local prompt design, and which component most commonly introduces it?

Correct. Instruction conflicts arise when system prompt rules contradict few-shot example behavior — e.g., "no markdown" + a few-shot example showing markdown. The examples are the usual culprit because they're often written after the system prompt without cross-checking both together.

Incorrect. Instruction conflicts are internal prompt contradictions — a system prompt rule (e.g., "be concise") undermined by few-shot examples that demonstrate verbose behavior. The model learns from both and treats the conflict as permission to use either format.
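The cross-check that question 14 says is often skipped can be automated in a few lines. This sketch covers only the markdown case. The regular expression is a rough heuristic, and the phrases used to detect a markdown ban are assumptions about how the system prompt is worded.

```python
import re

# Heuristic: headings, bullet markers, bold spans, and inline code.
MARKDOWN_PATTERN = re.compile(
    r"(^\s{0,3}#{1,6}\s)|(^\s*[-*+]\s)|(\*\*[^*\n]+\*\*)|(`[^`\n]+`)",
    re.MULTILINE,
)

def find_markdown_conflicts(system_prompt, examples):
    """Return indices of few-shot outputs that use markdown even though
    the system prompt bans it."""
    lowered = system_prompt.lower()
    bans_markdown = "no markdown" in lowered or "plain text only" in lowered
    if not bans_markdown:
        return []
    return [
        index
        for index, (_, output) in enumerate(examples)
        if MARKDOWN_PATTERN.search(output)
    ]

lint_examples = [
    ("Capital of France?", "Paris is the capital of France."),
    ("Capital of Germany?", "**Answer:** Berlin"),
]
conflicts = find_markdown_conflicts("Plain text only. No markdown.", lint_examples)
print(conflicts)
```

Any index it reports points to an example to rewrite so that it demonstrates the format the system prompt asks for.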

15. The LM Studio debugging guide identified three roughly equal causes of Mistral 7B "ignoring system prompts." Which of the following correctly lists all three?

Correct. The December 2023 LM Studio guide traced the surge of "ignored system prompt" reports to exactly these three causes in roughly equal proportions — making each one a common, documented failure mode worth knowing.

Incorrect. The LM Studio guide identified: (1) wrong template applied — users treating Mistral like ChatML, (2) system prompt token overflow on 4K context windows, and (3) instruction conflict between system prompt rules and few-shot example behavior.