In March 2023, Klarna publicly disclosed that its AI assistant had handled 2.3 million customer service conversations in its first month — but engineering teams at similar fintechs quietly spent weeks debugging downstream failures where LLM responses that were supposed to be JSON objects arrived with leading prose like "Here is the JSON you requested:" before the actual bracket. Every parser threw. Every ticket escalated.
LLMs are trained to produce helpful, conversational text. Left to their own devices they will wrap JSON in Markdown code fences, add explanatory paragraphs after a closing brace, vary key names across requests, and silently omit optional fields. For a human reader this is fine. For JSON.parse() it is a fatal error.
The cost of an unparseable response is not just a thrown exception. In agentic pipelines it can mean a retry loop that burns tokens, a corrupted database write if your code catches the error badly, or — in the Klarna-class case — a customer-facing failure that triggers a support ticket costing more than the entire inference call.
RLHF training rewards responses that feel helpful to human raters. Human raters prefer explanatory context around code or data. So models learn to add it — even when your system prompt says not to. Structured output prompting fights this learned tendency directly.
JSON is the dominant choice for API integrations. It maps directly to objects in every major language, is compact, and most LLMs have seen billions of examples during pre-training. The risk is that models will wrap it in Markdown or add trailing commas (invalid JSON) when under token pressure.
XML is preferred in enterprise contexts — SAP integrations, FHIR healthcare data, and Claude's own internal chain-of-thought tags use XML because it supports schemas (XSD), namespaces, and mixed-content documents that JSON cannot represent cleanly. Anthropic's system card for Claude 3 notes that Claude was specifically trained on XML-delimited reasoning blocks.
YAML sees use in configuration generation and is more human-readable, but its whitespace-sensitivity makes it fragile for LLM output — a single indentation error breaks the parse. Prefer JSON unless you have a specific reason for YAML.
REST APIs, JavaScript frontends, Python dict operations, most ML pipelines. Widest ecosystem support.
Enterprise integrations, healthcare (FHIR/HL7), document trees with mixed content. Schema validation built-in.
Human-edited configs, Kubernetes manifests. Fragile for LLM generation; validate rigorously.
Single scalar values — a number, a label, a Yes/No decision. Simpler to extract reliably than any structured format.
Structured prompting tasks split into two families. Extraction tasks ask the model to find and format information that already exists in the input — pulling entities from a contract, classifying a support ticket. Generation tasks ask the model to create structured data from scratch — inventing a product catalog entry, drafting an API response schema.
Extraction is generally more reliable because the model can anchor its output on real input text. Generation tasks require the model to hallucinate-proof its own output, which demands stronger schema constraints and more explicit examples in the prompt.
The reliability of structured output is a function of three things: how precisely you specify the schema, how explicitly you forbid prose wrapping, and how robustly your code handles the inevitable edge-case failure. All three matter. Prompting alone is not enough.
You have unstructured customer support tickets arriving as plain text. Your task is to craft prompts that instruct an LLM to extract structured JSON — with specific required fields — without any prose wrapping. Practice writing the prompt, then discuss what makes it reliable or brittle with the AI coach.
When OpenAI introduced function calling in June 2023 with GPT-3.5 and GPT-4, the core mechanism was schema injection: you pass a JSON Schema object in the API request, and the model is constrained to produce output that matches it. The feature shipped because OpenAI's internal evals showed that even GPT-4 without schema injection had a ~15% rate of field name hallucination on extraction tasks — inventing key names that sounded plausible but did not match the spec. Schema injection dropped that to under 2%.
Schema injection is the practice of including a formal or semi-formal description of the expected output structure inside the prompt itself. This is distinct from relying on API-level features like OpenAI's response_format parameter or Anthropic's tool use — though those are strongly preferred when available. Schema injection in the prompt is the fallback when you are using a model or API that does not support native structured outputs.
A schema-injected prompt does three things: it names every field the output must contain, it specifies the type of each field (string, integer, boolean, array, object), and it constrains the value space where possible — listing valid enum values, specifying max array lengths, marking optional fields.
Not all models respond equally well to formal JSON Schema syntax inside a system prompt. A hybrid approach — English prose describing types, followed by a concrete example — often outperforms either pure schema or pure prose alone. This is sometimes called schema-by-example.
When available, always prefer API-level schema constraints. OpenAI's Structured Outputs (released August 2024) uses a constrained decoder that guarantees schema adherence at the token level — something prompt instructions cannot achieve. Anthropic's tool_use parameter provides similar guarantees on Claude. Prompt-level schema injection is your fallback for models without these features.
One of the highest-impact structured prompting techniques is enum anchoring: explicitly listing every valid value for categorical fields. Without this, models will invent synonyms — "SHIPPED" vs "shipped" vs "in transit" — that break string comparisons downstream. Always enumerate valid values even if it feels redundant.
List every valid enum value explicitly. "status must be exactly one of: pending, shipped, delivered, cancelled" — not "status like pending or shipped."
State integer vs number vs string explicitly. Models will guess — and sometimes return a price as a string like "$12.99" instead of the float 12.99.
If a field can be absent, say string|null and specify whether to omit the key or include it with null. Silence on this point causes inconsistent behavior.
Add: "Do not add any keys not listed in the schema." Models helpfully include additional context fields — which break strict parsers.
Design a schema-injected system prompt for a product catalog generator. The output must include: product_id (string, format "PRD-XXXXX"), name (string), category (one of: electronics, apparel, home, beauty), price_usd (number, two decimal places), in_stock (boolean), tags (array of strings, max 4).
Anthropic's 2022 Constitutional AI paper documented that few-shot examples function as implicit format specifications. When Claude was shown examples of a specific output format — even without explicit instructions about that format — it replicated the structure with high fidelity. The researchers found this held for JSON, XML, and table formats. The implication for developers: a single well-constructed example often constrains output more reliably than a paragraph of format instructions.
Natural language instructions are interpreted. "Return compact JSON" can mean different things to different model weights. An example is concrete: the model sees exactly what key names, indentation level, value types, and boundary conditions you want. It pattern-matches rather than interprets, which is more reliable at the token level.
The 2024 paper Many-Shot In-Context Learning from Google DeepMind showed that for structured extraction tasks, accuracy continued improving up to hundreds of examples — well beyond the 3–5 traditionally recommended. For production pipelines where consistency is critical, investing in 10–20 high-quality examples in a system prompt is often worth the token cost.
The three rules for few-shot examples in structured prompting:
1. Cover the edge cases, not just happy paths. Include at least one example where an optional field is null, one where an array is empty, and one where an enum field takes its least-common value. Models generalize from what they see — if all your examples have non-null fields, they will resist returning null even when appropriate.
2. Make examples diverse in input, identical in output structure. Vary the content — different products, different sentiments, different languages — but keep the JSON structure pixel-perfect across all examples. Any structural variation in examples gives the model permission to vary the structure.
3. Use a consistent delimiter to separate input from output. XML tags like <input> / <output> or clear section headers prevent the model from treating the example output as context it should continue.
Examples can hurt if they are inconsistent with each other, contradict the schema instructions, or — a subtle failure mode — include examples that are too short for the task complexity. In the latter case, the model learns to produce brief, truncated outputs even when the input warrants a larger response. Always test your example set on diverse real inputs before deploying.
Another real failure: including Markdown-formatted examples when you want raw JSON. The model will replicate the formatting. If your example shows JSON inside a code fence, you will get code fences at inference time even if your instruction says "no markdown."
Each few-shot example costs tokens. For a system prompt with 10 examples at 150 tokens each, you are spending 1,500 tokens per request before the actual input arrives. Calculate the cost-per-call at your projected volume. For high-volume, low-complexity tasks, 2–3 examples may be the practical ceiling.
Include at least one example where an optional field is explicitly null. Without it, models resist returning null even when the data warrants it.
Every example must have identical key names, order, and nesting. Any variation gives the model structural permission to deviate.
If you want raw JSON, show raw JSON in examples — not JSON in a code fence. Models mirror example formatting exactly.
Use the same input/output delimiters (<input>/<output> or Human:/Assistant:) in every example without variation.
You need to build a few-shot example set for a job posting classifier. The output JSON must include: title (string), department (one of: engineering, marketing, sales, operations, hr), seniority (junior|mid|senior|lead|executive), remote_eligible (boolean), salary_range_usd (object with min and max integers, or null if not disclosed).
In February 2024, a Canadian court ruled against Air Canada after its AI chatbot gave a passenger incorrect information about bereavement fare policies. The system had no structured output validation layer — the chatbot's response was free text that contradicted the airline's own policy documents, with no mechanism to catch the discrepancy. The court held Air Canada liable. The case became a canonical example of what happens when LLM outputs reach users without a validation and grounding layer.
Every production pipeline that uses LLM-generated structured output needs three components between the model response and downstream consumption: a parse attempt, a schema validation step, and a retry or fallback handler. None of these is optional in production.
The parse attempt converts the raw string response to a native data structure. If this throws, you immediately know the output is structurally invalid — do not proceed. The schema validation step checks that required fields are present, types match, and enum values are within bounds. A response can parse successfully but still fail schema validation (e.g., a string where an integer is required). The retry handler decides what to do on failure: retry with an error message injected into the conversation, degrade to a simpler extraction, or surface a controlled error to the user.
Your extract_json_block() function should handle the most common contamination patterns before attempting JSON.parse: stripping Markdown code fences (```json ... ```), extracting the first { ... } or [ ... ] block via regex, and removing trailing commas before closing braces (which models produce under token pressure). These heuristics catch the majority of real-world contamination without a full retry.
A naive retry — call the model again with the identical prompt — fails more than 30% of the time on the same error, because the model is not told what went wrong. A targeted retry injects the validation error into the conversation: "Your previous response failed JSON schema validation with error: 'required field status is missing.' Return only the corrected JSON object." This targeted approach converges in 1–2 retries for most structural errors.
Set a hard maximum of 2–3 retries. Beyond that, you are burning tokens on a structurally confused response that the prompt is not equipped to fix. At the retry ceiling, degrade gracefully: return a null result, queue the input for human review, or return the last parseable partial result with a confidence flag.
Log every parse failure and schema validation error to your observability stack. A baseline parse failure rate above 2% on a stable prompt indicates model update drift — the underlying model has changed in ways that affect your format. This is how teams first detected the GPT-3.5 "lazy" behavior change in January 2024, when format adherence dropped and users reported widespread issues within 48 hours.
Always inject the specific validation error into the retry prompt. Generic retries on the same prompt fail repeatedly.
Cap retries at 2–3. Beyond that, degrade — return null, queue for human review, or return a confidence-flagged partial result.
Apply regex heuristics before JSON.parse. Strip code fences and extract the first { } block. This handles the majority of real failures without a retry.
Log all parse and schema failures. A rate above 2% signals model drift — a silent model update has changed format adherence behavior.
You are building a medical appointment scheduling bot that extracts structured data from patient messages. A parse or schema failure cannot silently pass through — incorrect data could affect patient care. Design a validation and retry strategy that is both robust and cost-efficient.