New benchmark shows multi-turn attacks break LLM agents in a simulated nuclear control room

NRT-Bench, posted to arXiv on June 18, measures jailbreaks by simulator-derived safety-function loss rather than text-only judgment.

A new arXiv preprint posted June 18, NRT-Bench, evaluates how large language model agents hold up as supervisory operators inside a simulated nuclear power plant control room when adversaries are allowed to apply sustained, adaptive pressure across many conversational turns. A five-role operator team — each role backed by a configurable LLM — runs a plant governed by six critical safety functions, while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is measured directly from the simulator when a safety function is lost, not from another LLM's judgment about whether a response was bad.

The methodological shift matters. Most published jailbreak benchmarks score attacks using text-only judges, which means a model that produces a polite refusal can be marked safe even if the operational decision it recommended would have caused a real-world incident. NRT-Bench instead defines primary harm as simulator-derived critical safety function loss — a concrete, objective signal. Across four frontier operator models evaluated under a fixed-attack paired-replay protocol, the authors report that adaptive multi-turn attacks reliably push the operator team past at least one safety limit.

The result generalizes beyond nuclear plants. The same multi-turn pressure pattern — patient probing, role-played authority, escalation through plausible-looking context — is what already breaks LLM agents in production SOC copilots, IT helpdesk bots, code-review agents, and customer-support systems. The benchmark joins a small but growing wave of 2026 work, including Anthropic's 'fix this code' jailbreak research and the adversarial poetry paper, that has moved the field's understanding of jailbreaks from single-prompt evasion to sustained adversarial conversation.

Takeaway for learners: when you read that a model 'passed' a safety evaluation, ask what exactly was being measured. A model that refuses a harmful request in one turn is not the same as a model that holds up after twenty turns of pressure from a competent attacker — and the second test is the one that matches how agents are actually being deployed. NRT-Bench is open to read; even a skim of the methodology section will sharpen your sense of how the AI safety field is rethinking its own benchmarks.