OpenAI published "Improving health intelligence in ChatGPT" on June 18, 2026, announcing that GPT-5.5 Instant — the default model for free and paid ChatGPT users — now reaches health performance similar to OpenAI's frontier models on an aggregate of health evaluations, including HealthBench Professional. The company reported a 71% drop in factuality issues on health prompts compared with GPT-5.3 Instant. Behind the upgrade sits a new Global Physician Network of 262 doctors who write rubrics and grade model responses on accuracy, safety, communication, context awareness, completeness, and when the model should escalate to a clinician. OpenAI says more than 230 million people use ChatGPT for health and wellness questions every week.
The interesting move is the evaluation method, not the model. A 71% factuality reduction reported by the model maker is the kind of number that needs an independent re-run before anyone trusts it, but pairing each rubric to a real physician — and publishing the rubrics — is closer to how regulated medical software gets validated than how consumer LLMs usually ship. It also lets OpenAI argue that the Instant tier, which most users actually touch, is no longer the weak link on safety-critical topics.
Health has quietly become the highest-stakes consumer use case for LLMs. OpenAI launched ChatGPT Health in January 2026, the Trump administration's National Policy Framework explicitly called out health AI in March, and on June 18 Midjourney announced its own pivot into medical hardware. Two of the largest consumer AI brands are now staking out clinical-adjacent positioning in the same week — and Anthropic's own enterprise pitch leans heavily on regulated industries through partners like NEC and AWS Bedrock. The rubric-and-physician approach OpenAI shipped today is the playbook competitors will be measured against.
Takeaway for learners: when an LLM vendor reports a benchmark improvement, look for the evaluation method first, the number second. A 71% reduction on a private benchmark means nothing without the rubric, the grader pool, and the prompt distribution. OpenAI publishing all three is the part to copy — both as a user assessing a tool and as a builder shipping one.