In early 2024, researchers at Hugging Face published benchmark results showing that switching a prompt from ChatML format to raw text caused Mistral 7B's accuracy on instruction-following tasks to drop by over 30 percentage points, without changing a single word of the actual instruction. The format was the variable. The model's training data had burned in a specific template, and deviating from it broke the response distribution entirely.
Cloud models like GPT-4 and Claude expose a messages API: you pass JSON with role labels, and the provider's inference stack handles all token formatting, special tokens, and system prompt injection invisibly. You never see the raw text that actually reaches the model weights.
Local models run through tools like Ollama, llama.cpp, or LM Studio, which do let you use a similar messages API, but under the hood they apply a chat template defined by the model's tokenizer config. If you bypass that API and talk to the model in raw-completion mode, you must apply the template yourself or the model produces incoherent output.
Even when using the structured API, local models differ from cloud endpoints in three critical ways: they have no hidden system prompt injected by the provider, they have no output filtering layer, and their context windows are hard limits set by the model and by the memory (RAM or VRAM) you can allocate, with no graceful degradation.
Cloud providers can inject thousands of tokens of hidden system context before your first message. A local model starts with an empty context. Behavior you assumed was model capability may actually have been provider scaffolding.
Every instruction-tuned local model was fine-tuned with a specific text template wrapping each turn. The model learned to recognize these markers as signal boundaries. When you use Ollama's /api/chat endpoint with role-labeled messages, Ollama applies the correct template automatically. But understanding what that template looks like explains why your prompts behave differently.
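As a rough illustration of the role-labeled request described above, the sketch below builds a body for Ollama's /api/chat endpoint. The model name llama3, the default address http://localhost:11434, and the helper names build_chat_payload and send_chat are assumptions for this example. send_chat is defined but not called, since it needs a running Ollama server.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model, system_prompt, user_message):
    """Build a role-labeled request body; Ollama applies the model's template."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "stream": False,  # ask for a single JSON object instead of a stream
    }


def send_chat(payload, url=OLLAMA_CHAT_URL):
    """POST the payload and return the assistant text (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


payload = build_chat_payload(
    "llama3",  # hypothetical model tag
    "You are a concise support assistant.",
    "How do I reset my password?",
)
print(json.dumps(payload, indent=2))
```

Nothing in this request contains template markers; the runtime adds them when it renders the messages for the model.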
| Model Family | System Token | User Token | Assistant Token |
|---|---|---|---|
| Llama 3 | <\|start_header_id\|>system<\|end_header_id\|> | <\|start_header_id\|>user<\|end_header_id\|> | <\|start_header_id\|>assistant<\|end_header_id\|> |
| Mistral / Mixtral | No dedicated system role | [INST] | None; the reply follows [/INST] |
| ChatML (Qwen) | <\|im_start\|>system | <\|im_start\|>user | <\|im_start\|>assistant |
| Gemma 2 | Injected into first user turn | <start_of_turn>user | <start_of_turn>model |
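To make the table concrete, here is a minimal sketch of how a single system-plus-user turn might be rendered for each family. The function name render_prompt and the family keys are hypothetical. Beginning-of-sequence tokens are left out on the assumption that the runtime adds them, and exact whitespace can differ between model versions, so the model's tokenizer_config.json remains the authority.

```python
def render_prompt(family, system, user):
    """Wrap one system + user turn in the markers a model family was trained on."""
    if family == "llama3":
        return (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
            f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    if family == "mistral":
        # No system role: the system text is folded into the first [INST] block.
        return f"[INST] {system}\n\n{user} [/INST]"
    if family == "chatml":
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
    if family == "gemma2":
        # No system role: the system text is folded into the first user turn.
        return (
            f"<start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n"
            "<start_of_turn>model\n"
        )
    raise ValueError(f"unknown template family: {family}")


for name in ("llama3", "mistral", "chatml", "gemma2"):
    print(f"--- {name} ---")
    print(render_prompt(name, "Answer in one sentence.", "What is a chat template?"))
```

Each rendered string ends where the assistant's reply is expected to begin, which is what lets a raw-completion endpoint continue the conversation.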
With cloud APIs, you write a short system prompt and the provider's hidden instructions handle persona stability, refusal behavior, and safety filtering. Running locally, your system prompt is the only context shaping the model's behavior. This means local prompts must be more explicit about constraints, output format, tone, and what to do when the model is uncertain.
A system prompt that reads "You are a helpful assistant" works on GPT-4 because thousands of hidden tokens fill in the gaps. On a bare Llama 3 8B instance, that same prompt leaves the model with almost no behavioral anchoring, and responses become inconsistent across similar queries.
Local prompting is explicit-by-default. Everything you want the model to do must be stated. Everything you want it to avoid must also be stated. The model has no ambient instructions filling the gaps.
Cloud services often handle context overflow for you: older messages get silently truncated or summarized server-side. Local inference has no such luxury. When your prompt plus conversation history exceeds the model's context window (commonly 4K-128K tokens depending on the model and your VRAM), inference either fails with an error or the framework silently truncates or shifts the context, producing incoherent output.
This makes token budgeting a first-class concern for local prompt design. You must account for the system prompt, conversation history, the current user message, and the expected output, all within the hard limit set by the model and your hardware.
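The budgeting described above can be sketched as a small helper that keeps the newest history turns that still fit. The names estimate_tokens and trim_history are hypothetical, and the estimate of roughly four characters per token is an assumption that only loosely holds for English text; a real deployment would count with the model's own tokenizer.

```python
def estimate_tokens(text):
    """Very rough estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(system_prompt, history, user_message,
                 context_window=4096, reserve_for_output=512):
    """Keep the newest turns that fit next to the fixed parts of the prompt."""
    fixed = (
        estimate_tokens(system_prompt)
        + estimate_tokens(user_message)
        + reserve_for_output
    )
    if fixed > context_window:
        raise ValueError("system prompt, user message, and output reserve exceed the window")

    kept = []
    used = fixed
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > context_window:
            break  # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used


history = ["a" * 400] * 50  # fifty turns of about 100 tokens each
kept, used = trim_history("s" * 800, history, "u" * 400)
print(f"kept {len(kept)} of {len(history)} turns, about {used} tokens used")
```

Reserving space for the reply up front is the design choice that matters here: a prompt that fits exactly leaves the model no room to answer.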
You're working with a Mistral 7B model via Ollama's raw completion endpoint (bypassing the chat API). You need to decide how to format your prompts to match the model's training template.
Ask the lab assistant about chat templates, how to detect which template a model expects, what happens when you use the wrong format, and how to write a system prompt for a model with no dedicated system role (like Mistral).
When Ollama released Llama 3.1 support in July 2024, developers on the Ollama Discord immediately noticed that the default "helpful assistant" system prompt produced wildly inconsistent behavior: sometimes the model would answer in bullet points, sometimes in paragraphs, and sometimes it switched languages mid-response. The fix documented in the Ollama GitHub issues thread was a detailed system prompt specifying output language, format, response length, and persona, which reduced inconsistency dramatically.
Testing across community deployments of Llama 3, Mistral, and Phi-3 models has converged on four categories of instruction that, when present, produce consistent behavior. Omitting any one typically manifests as a specific failure mode.
| Pillar | What to Specify | Failure Mode If Omitted |
|---|---|---|
| Identity | Role, expertise domain, persona name if needed | Model adopts inconsistent personas across turns |
| Format | Markdown on/off, list vs. prose, response length range | Formatting oscillates unpredictably per query |
| Boundary | What the model will/won't do, how to handle out-of-scope queries | Model either over-refuses or over-answers |
| Uncertainty | What to say when unsure: "say you don't know" vs. "say based on training data" | Confident hallucination on unknown facts |
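One way to keep all four pillars present is to assemble the system prompt from named parts and refuse to build it when one is empty. This is only a sketch: build_system_prompt and the sample wording (including the product name ExampleApp) are hypothetical, not a recommended prompt.

```python
def build_system_prompt(identity, fmt, boundary, uncertainty):
    """Join the four pillars into one prompt; fail loudly if any is missing."""
    parts = {
        "identity": identity,
        "format": fmt,
        "boundary": boundary,
        "uncertainty": uncertainty,
    }
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"missing pillars: {', '.join(missing)}")
    return " ".join(text.strip() for text in parts.values())


prompt = build_system_prompt(
    identity="You are a support assistant for ExampleApp.",  # hypothetical product
    fmt="Respond in plain text, at most three sentences.",
    boundary="Only answer questions about ExampleApp; otherwise say it is out of scope.",
    uncertainty="If you are unsure, say you don't know.",
)
print(prompt)
```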
Mistral 7B's instruction template, [INST] ... [/INST], has no dedicated system role. The model was fine-tuned with system instructions prepended inside the first user turn. This is documented in Mistral AI's own model cards and confirmed by the tokenizer configuration on Hugging Face.
The correct pattern is to place your system instruction at the start of the first [INST] block, separated by a double newline or clearly delimited from the user's actual query. Using Ollama's chat API, this happens automatically, but if you're crafting raw completions or building API wrappers, you must implement this manually.
On a 4K context window model (still common with quantized 7B models on 8GB VRAM), a 500-token system prompt consumes about 12% of your entire budget before the user types a word. Add a 10-turn conversation averaging 150 tokens per turn (1,500 tokens) and roughly half the window is gone; if each turn also carries a model reply of similar length, you are close to the limit. This is not theoretical: it is a documented failure pattern in llama.cpp GitHub issues.
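For raw completions, the placement rule above might be implemented along these lines. The helper names are hypothetical, the beginning-of-sequence token is assumed to be added by the runtime, and the mistral model tag is an assumption. The raw flag on Ollama's /api/generate endpoint asks the server to skip its own templating.

```python
def render_mistral_conversation(system, turns):
    """Render (user, assistant) pairs; the final assistant entry may be None."""
    pieces = []
    for index, (user, assistant) in enumerate(turns):
        # The system text goes only at the start of the first [INST] block.
        content = f"{system}\n\n{user}" if index == 0 and system else user
        pieces.append(f"[INST] {content} [/INST]")
        if assistant is not None:
            pieces.append(f" {assistant}</s>")  # close each completed reply
    return "".join(pieces)


def build_raw_payload(model, prompt):
    """Body for Ollama's /api/generate with server-side templating disabled."""
    return {"model": model, "prompt": prompt, "raw": True, "stream": False}


rendered = render_mistral_conversation(
    "Answer in plain text.",
    [("What is a token?", "A token is a unit of text."), ("And a context window?", None)],
)
print(rendered)
print(build_raw_payload("mistral", rendered))  # hypothetical model tag
```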
The practical target: keep system prompts under 200 tokens for 4K context models, under 500 tokens for 8K models. Use concrete, terse language. Every adverb in your system prompt is a token that could have been context.
Write the system prompt you want. Then cut it by 30%. Then cut it again. Local models respond well to terse, direct instruction; verbose system prompts do not improve behavior in proportion to their token cost.
Local models often produce markdown by default because their training data included markdown. If you're sending output to a terminal, a database field, or an API consumer that doesn't render markdown, you get literal asterisks and pound signs in your output. The fix is explicit: "Respond in plain text only. Do not use markdown, headers, or bullet points."
Conversely, if you want structured output, be specific: "Format your response as a numbered list. Each item: one sentence only." Vague instructions like "be structured" produce inconsistent results across models and even across runs of the same model.
You need to write a system prompt for a local Llama 3 8B deployment serving as a customer support assistant for a software product. The model will run on 8GB VRAM (4K context window).
Work with the lab assistant to draft a system prompt that covers all four pillars (identity, format, boundary, uncertainty) while staying under the 200-token budget. Iterate based on feedback.
In the llama.cpp GitHub repository, issue #5323 (opened October 2023) documented a widely reproducible problem: Llama 2 13B consistently ignored JSON format instructions given in prose ("respond only with valid JSON") but produced correct JSON output when the system prompt included a single worked example of the expected JSON structure. The thread became one of the most-referenced guides for structured output prompting in local deployments.
An instruction like "respond in JSON" activates the model's learned association between that phrase and JSON-formatted outputs, but the association is fuzzy. The model may produce JSON with inconsistent key names, extra commentary before or after the JSON block, or occasional plain-text fallbacks when the query seems conversational.
A few-shot example provides a direct input-output demonstration. The model's next-token prediction machinery treats the example as immediate context: it has just "seen" JSON being produced for a similar query and continues that pattern. This is not instruction-following; it is pattern continuation, and pattern continuation is what transformers do most reliably.
Few-shot examples for local models work best when they are representative of the actual query type, minimal (the shortest example that demonstrates the pattern), and consistent in format with each other. Inconsistent few-shot examples teach the model that multiple formats are acceptable, defeating the purpose.
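A minimal sketch of the idea: few-shot examples become alternating user and assistant messages placed before the real query, and a small check confirms that every example output uses the same JSON keys. The function names and the sample examples are hypothetical.

```python
import json


def build_few_shot_messages(system_prompt, examples, query):
    """Place (input, output) examples as prior turns, then append the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


def examples_are_consistent(examples):
    """True if every example output is a JSON object with the same set of keys."""
    key_sets = set()
    for _, output in examples:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        if not isinstance(parsed, dict):
            return False
        key_sets.add(frozenset(parsed))
    return len(key_sets) <= 1


examples = [  # hypothetical, deliberately short
    ("I was charged twice.", '{"category": "billing", "urgent": true}'),
    ("How do I change my avatar?", '{"category": "account", "urgent": false}'),
]
messages = build_few_shot_messages("Reply with JSON only.", examples, "The app crashes on login.")
print(len(messages), examples_are_consistent(examples))
```

Running the consistency check before deployment catches the case where one example drifts to a different key name or adds commentary around the JSON.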
When few-shot prompting alone is insufficient, particularly for complex JSON schemas, constrained generation is the next step. llama.cpp supports grammar-constrained generation: you provide a GBNF (GGML BNF) grammar that restricts which tokens the sampler may emit at each step, so the model cannot produce output that violates the grammar.
Ollama exposes a constrained mode via the format parameter: passing "format": "json" enables a built-in JSON mode. For custom grammars, llama.cpp's --grammar-file flag accepts a GBNF file.
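The sketch below shows how a request might opt in to Ollama's JSON mode and how the reply could be validated afterwards. The assumption here is that JSON mode ensures syntactically valid JSON but does not enforce particular keys or values, so a schema check is still worthwhile. The field names, helper names, and model tag are hypothetical.

```python
import json

REQUIRED_KEYS = {"issue_type", "priority", "product"}  # hypothetical schema
ALLOWED_PRIORITIES = {"low", "medium", "high"}


def build_json_mode_payload(model, messages):
    """Chat request body with Ollama's built-in JSON mode switched on."""
    return {"model": model, "messages": messages, "format": "json", "stream": False}


def parse_ticket(raw_text):
    """Parse model output and check it against the expected keys and values."""
    data = json.loads(raw_text)  # raises ValueError on invalid JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"expected keys {sorted(REQUIRED_KEYS)}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"unexpected priority: {data['priority']!r}")
    return data


payload = build_json_mode_payload(
    "llama3",  # hypothetical model tag
    [{"role": "user", "content": "Classify: login page times out on ExampleApp."}],
)
print(payload["format"])
print(parse_ticket('{"issue_type": "login", "priority": "high", "product": "ExampleApp"}'))
```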
Smaller local models (7B-13B parameters) benefit significantly from chain-of-thought prompting on reasoning tasks, but the technique must be applied differently than with frontier models. Appending "Think step by step" at the end of a prompt can cause smaller models to ramble; they do not have the same ability to self-terminate reasoning chains.
The more reliable approach is a structured reasoning template: explicitly ask the model to produce a reasoning section followed by a final answer section, with clear delimiters. This gives the model a token pattern to match rather than an open-ended instruction to interpret.
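A structured reasoning template of this kind might look like the following sketch. The delimiter words REASONING: and ANSWER: and the helper names are arbitrary choices for illustration, not markers any particular model was trained on.

```python
REASONING_TEMPLATE = (
    "Answer the question below.\n"
    "First write your reasoning after the line 'REASONING:'.\n"
    "Then write the final answer after the line 'ANSWER:' and stop.\n\n"
    "Question: {question}\n\n"
    "REASONING:\n"
)


def build_reasoning_prompt(question):
    """Fill the template; the prompt ends where the reasoning should start."""
    return REASONING_TEMPLATE.format(question=question)


def split_reasoning(output):
    """Split model output into (reasoning, answer); answer is None if missing."""
    reasoning, separator, answer = output.partition("ANSWER:")
    reasoning = reasoning.replace("REASONING:", "", 1).strip()
    if not separator:
        return reasoning, None  # the model never reached the answer section
    return reasoning, answer.strip()


print(build_reasoning_prompt("What is 17 + 25?"))
print(split_reasoning("REASONING:\n17 + 25 = 42\nANSWER: 42"))
```

A missing ANSWER: section is itself a useful signal: it usually means the model ran out of output tokens or never left the reasoning section.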
Few-shot examples consume context tokens permanently: they sit in your context window for the entire session. On a 4K window, three examples averaging 80 tokens each cost 240 tokens before any conversation begins. Budget accordingly.
You need to extract structured data from customer support tickets using a local Llama 3 8B model. The output must be valid JSON with fields: {"issue_type": "...", "priority": "low|medium|high", "product": "..."}.
Work with the lab assistant to construct 2-3 consistent few-shot examples that will reliably produce this format. Consider token cost, example consistency, and whether to combine few-shot with grammar-constrained mode.
In December 2023, the LM Studio team published a debugging guide after their Discord saw a surge of reports about Mistral 7B "ignoring" system prompts. Investigation traced the failures to three distinct causes in roughly equal proportions: wrong template applied (users treating Mistral like ChatML), system prompt token overflow (prompts exceeding 500 tokens on 4K window models), and instruction conflict (system prompt saying "be concise" while few-shot examples showed verbose responses). The guide established a stepwise isolation protocol that became standard practice in local LLM communities.
Effective prompt debugging for local models follows a strict isolation principle: change one variable at a time and establish a baseline with the minimal possible prompt before adding complexity. This sounds obvious but is consistently violated by developers who add system prompts, few-shot examples, and format instructions simultaneously, then cannot attribute failures.
| Step | Action | If This Works | If This Fails |
|---|---|---|---|
| 1. Baseline | Send query with no system prompt, minimal phrasing | Model understands query; prompt is the issue | Template or model loading problem |
| 2. Template | Add only the chat template wrappers, no system prompt | Template is correct | Wrong template; check model's tokenizer_config.json |
| 3. System prompt | Add system prompt with identity only (no format/boundary/uncertainty) | Identity works; format instructions may conflict | System prompt syntax error or placement issue |
| 4. Full prompt | Add format instructions, then few-shot examples one at a time | Isolates which addition causes regression | Identifies conflicting or inconsistent element |
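The protocol in the table can be turned into a list of prompt configurations that differ by exactly one element, so that a regression points at the step that introduced it. This is a sketch only: the stage dictionaries and the function name isolation_stages are hypothetical, and sending each stage to the model is left out.

```python
def isolation_stages(identity, format_instructions, examples):
    """Return prompt configurations that add one element per stage."""
    stages = [
        {"label": "1. baseline", "use_template": False, "system": "", "examples": []},
        {"label": "2. template", "use_template": True, "system": "", "examples": []},
        {"label": "3. identity", "use_template": True, "system": identity, "examples": []},
        {
            "label": "4a. format",
            "use_template": True,
            "system": f"{identity} {format_instructions}",
            "examples": [],
        },
    ]
    # Step 4 continues by adding few-shot examples one at a time.
    for count in range(1, len(examples) + 1):
        stages.append(
            {
                "label": f"4b. +{count} example(s)",
                "use_template": True,
                "system": f"{identity} {format_instructions}",
                "examples": examples[:count],
            }
        )
    return stages


stages = isolation_stages(
    "You are a support assistant.",
    "Respond in plain text only.",
    [("Q1", "A1"), ("Q2", "A2")],  # hypothetical few-shot pairs
)
for stage in stages:
    print(stage["label"], "| examples:", len(stage["examples"]))
```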
Specific failure patterns in local model outputs reliably indicate specific prompt problems. Learning to read these signals accelerates debugging significantly:
| Symptom | Most Likely Cause | Fix |
|---|---|---|
| Model echoes the system prompt back in its response | Wrong template; system instruction placed in wrong position | Check template placement; use Ollama chat API instead of raw completion |
| Response cuts off mid-sentence | Context window exceeded or max_tokens too low | Shorten system prompt; reduce few-shot examples; increase max_tokens |
| Random language switching | No language specified in system prompt; training data bleed | Add explicit "Respond in English only" to system prompt |
| Ignores format instructions after 3-4 turns | System prompt pushed out of effective attention range | Re-inject key format instruction in every user turn; shorten conversation history |
| Confident wrong answers | No uncertainty instruction; model filling gaps with training priors | Add explicit "If you are unsure, say so" to system prompt |
Temperature is often treated as a creativity dial, but in debugging it is a signal amplifier. Running the same prompt at temperature 0 (greedy decoding) removes sampling randomness and makes output deterministic. If a format problem persists at temperature 0, it is a structural prompt issue. If it only appears at higher temperatures, it is a sampling issue that can be addressed with lower temperature or top-p/top-k adjustments.
Ollama exposes temperature via the options parameter: "options": {"temperature": 0}. Always debug at temperature 0 first.
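As a small sketch of the temperature-0 check, the helper below copies a request body and sets the temperature in its options, and a second function encodes the diagnostic rule from the paragraph above. Both function names are hypothetical, and the observations passed to classify_format_failure are assumed to come from your own test runs.

```python
def with_temperature(payload, temperature):
    """Return a copy of the request body with options.temperature set."""
    updated = dict(payload)
    updated["options"] = {**payload.get("options", {}), "temperature": temperature}
    return updated


def classify_format_failure(fails_at_zero, fails_at_higher):
    """Apply the rule: failure at temperature 0 points at the prompt itself."""
    if fails_at_zero:
        return "structural prompt issue"
    if fails_at_higher:
        return "sampling issue"
    return "no failure observed"


base = {"model": "llama3", "messages": [], "options": {"top_p": 0.9}}  # hypothetical request
debug_payload = with_temperature(base, 0)
print(debug_payload["options"])
print(classify_format_failure(fails_at_zero=False, fails_at_higher=True))
```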
Instruction conflicts are the most subtle failure mode and the one most often missed. A system prompt that says "be concise" while few-shot examples each contain three-paragraph responses teaches the model that long responses are acceptable. A format instruction saying "no markdown" while the model name in the persona section is surrounded by asterisks creates ambiguity.
The fix is an explicit audit: for every constraint in your system prompt, ask whether any other element of your prompt context contradicts or undermines it. Few-shot examples are the most common source of conflicts because they are written after the system prompt and developers do not always re-read both together.
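Part of this audit can be automated. The sketch below checks two of the conflicts mentioned above: markdown in few-shot outputs when the system prompt bans it, and long example outputs when the system prompt asks for concision. The keyword matching, the markdown pattern, and the 60-word threshold are crude assumptions chosen for illustration, so treat findings as hints rather than verdicts.

```python
import re

# Heuristic (assumption): headers, bold spans, or list bullets count as markdown.
MARKDOWN_PATTERN = re.compile(
    r"(^\s{0,3}#{1,6}\s)|(\*\*[^*]+\*\*)|(^\s*[-*]\s)", re.MULTILINE
)


def audit_conflicts(system_prompt, example_outputs, concise_word_limit=60):
    """Return human-readable findings where examples contradict the system prompt."""
    findings = []
    lowered = system_prompt.lower()
    bans_markdown = "no markdown" in lowered or "do not use markdown" in lowered
    wants_concise = "concise" in lowered

    if bans_markdown and MARKDOWN_PATTERN.search(system_prompt):
        findings.append("system prompt bans markdown but contains markdown itself")
    for index, output in enumerate(example_outputs):
        if bans_markdown and MARKDOWN_PATTERN.search(output):
            findings.append(f"example {index}: contains markdown but the system prompt bans it")
        if wants_concise and len(output.split()) > concise_word_limit:
            findings.append(f"example {index}: too long for a prompt that asks for concision")
    return findings


findings = audit_conflicts(
    "Be concise. Do not use markdown.",
    ["**Restart** the app.", "Restart the app."],  # hypothetical few-shot outputs
)
print(findings)
```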
Start minimal. Add one element at a time. Debug at temperature 0. Read failure symptoms as diagnostic signals. Audit for instruction conflicts between every prompt component. Document what you change: local prompt debugging without version tracking produces confusion, not insight.
You have a Mistral 7B deployment that is failing in the following ways: (1) it echoes parts of the system prompt in its response, (2) responses after turn 4 ignore the "plain text only" format instruction, (3) one of your few-shot examples shows a markdown-formatted response despite the system prompt banning markdown.
Use the isolation protocol to diagnose each failure and propose fixes. Work through each symptom with the lab assistant systematically.