DeepSeek Ships V4-Pro and V4-Flash With a New Hybrid Attention Architecture

The Chinese lab's preview release pairs a 1.6T-parameter MoE with a 1M-token context and a sparse-attention scheme aimed squarely at long-horizon agent tasks.

DeepSeek released preview versions of V4-Pro and V4-Flash on April 24, with weights published on Hugging Face. V4-Pro is a 1.6 trillion-parameter Mixture-of-Experts model in the lab's 'Expert Mode' tier, while V4-Flash is a 284-billion-parameter variant for faster inference. Both share a 1-million-token context window and a new design DeepSeek calls Hybrid Attention Architecture, which combines Compressed Sparse Attention with Heavily Compressed Attention to keep memory and latency manageable as conversations grow. DeepSeek says V4-Pro tops every other open model on math and coding benchmarks and trails only Google's closed Gemini 3.1 Pro on world-knowledge evaluations.

What is new here is not the parameter count but the attention mechanism. Standard transformer attention compares every token with every other token, so its cost scales quadratically with context length. That is why long-context models tend either to degrade in quality past a few hundred thousand tokens or to become prohibitively expensive to serve. The hybrid scheme is a structured way to drop most of those pairwise comparisons while keeping the ones that matter most. It is the kind of architectural trick that, if it generalizes, lowers the barrier to running useful long-context agents. Independent reporting puts V4 inference at roughly one-sixth the cost of GPT-5.5.
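The quadratic growth is easy to see with back-of-the-envelope arithmetic. The sketch below counts query-key comparisons for full causal attention and for a hypothetical sparse scheme in which each token attends to at most a fixed budget of positions. The budget of 4,096 and the function names are illustrative assumptions, not details of DeepSeek's design, which the release coverage summarized here does not spell out.

```python
def dense_pairs(n_tokens: int) -> int:
    """Comparisons for full causal attention: token i attends to i positions."""
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens: int, budget: int) -> int:
    """Comparisons if each token attends to at most `budget` positions.

    `budget` is a hypothetical fixed cap used only for illustration.
    """
    if n_tokens <= budget:
        # Every token has fewer predecessors than the cap, so nothing is dropped.
        return dense_pairs(n_tokens)
    # The first `budget` tokens behave densely; every later token uses the full cap.
    return dense_pairs(budget) + (n_tokens - budget) * budget


if __name__ == "__main__":
    BUDGET = 4096  # illustrative assumption, not a published DeepSeek figure
    for n in (200_000, 500_000, 1_000_000):
        dense = dense_pairs(n)
        sparse = sparse_pairs(n, BUDGET)
        print(f"{n:>9} tokens: dense={dense:.3e}  sparse={sparse:.3e}  "
              f"ratio={dense / sparse:.0f}x")
```

Under a fixed budget the sparse count grows linearly with context length while the dense count grows quadratically, so the ratio between them widens as the context gets longer. Real schemes also pay for selecting which positions to keep, which this count ignores.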

The release sharpens a pattern that has been building all year. Open-weight Chinese models — DeepSeek, Moonshot's Kimi K2.6, and Qwen — are no longer a step behind closed US frontier models on the benchmarks that enterprises actually run. The gap is collapsing on coding, math, and tool use, and what is left is mostly differentiation on safety tuning, ecosystem integration, and which jurisdiction's privacy rules apply to your data. US export controls on chips clearly slowed China's progress, but they did not stop it.

For learners: long-context windows sound impressive in marketing copy, but quality tends to fall off a cliff somewhere short of the advertised limit. If you plan to use a model for tasks that genuinely need million-token reasoning (entire codebases, long depositions, multi-document synthesis), build a small evaluation that checks whether the model still recovers facts and reasons correctly at 200k, 500k, and 1M tokens. The numbers in the model card and the numbers on your data are rarely the same.
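One minimal way to start such an evaluation is a "needle in a haystack" check: bury a single fact at different depths in filler text of increasing length and see whether the model can retrieve it. The sketch below is an assumption-laden starting point, not a complete benchmark. The `ask_model` callable is a hypothetical hook you would wire to whatever model API you use, `stub_model` is a stand-in so the script runs on its own, and length is measured in whitespace-separated words as a rough proxy for tokens (a real evaluation should count with the model's own tokenizer). It tests fact recovery only; checking reasoning requires task-specific questions drawn from your own data.

```python
from typing import Callable, Dict, Sequence

FILLER = "The quarterly report was filed on time and contained no surprises."
NEEDLE = "The access code for the archive room is 7412."
QUESTION = "What is the access code for the archive room?"
EXPECTED = "7412"


def build_haystack(n_words: int, depth: float) -> str:
    """Return roughly n_words of filler with NEEDLE placed at a relative depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    n_sentences = max(1, n_words // len(FILLER.split()))
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def run_eval(
    ask_model: Callable[[str], str],  # hypothetical hook: prompt in, answer out
    lengths: Sequence[int],
    depths: Sequence[float],
) -> Dict[int, float]:
    """Return the fraction of depths at which the fact was recovered, per length."""
    results = {}
    for n in lengths:
        hits = 0
        for d in depths:
            prompt = build_haystack(n, d) + "\n\n" + QUESTION
            if EXPECTED in ask_model(prompt):
                hits += 1
        results[n] = hits / len(depths)
    return results


def stub_model(prompt: str) -> str:
    """Stand-in model that only 'reads' the last 2,000 words of the prompt.

    It mimics a model that silently loses early context; replace with a real call.
    """
    window = " ".join(prompt.split()[-2000:])
    return EXPECTED if NEEDLE in window else "I don't know."


if __name__ == "__main__":
    scores = run_eval(stub_model, [200_000, 500_000, 1_000_000], [0.1, 0.5, 0.9])
    for length, score in scores.items():
        print(f"{length:>9} words: recovered at {score:.0%} of depths")
```

Sweeping depth as well as length matters because a model can retrieve facts near the end of a long prompt while missing the same fact placed near the beginning.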