DeepSeek released preview versions of V4-Pro and V4-Flash on April 24. V4-Pro is a 1.6-trillion-parameter Mixture-of-Experts model with 49 billion active parameters; V4-Flash is a smaller 284-billion-parameter model with 13 billion active. Both ship with a one-million-token context window and a new Hybrid Attention Architecture that combines compressed sparse attention with heavily compressed attention to cut long-context cost. DeepSeek says that at a one-million-token context, V4-Pro needs roughly a quarter of the per-token inference FLOPs and a tenth of the KV cache that V3.2 does.
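To make those ratios concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted above; the function names are hypothetical, and the V3.2 baseline is normalized to 1.0 purely for illustration, since no absolute V3.2 numbers are given here.

```python
# Parameter counts quoted in the announcement, in billions.
MODELS = {
    "V4-Pro": {"total_b": 1600, "active_b": 49},
    "V4-Flash": {"total_b": 284, "active_b": 13},
}

# DeepSeek's stated ratios for V4-Pro relative to V3.2 at a 1M-token context.
FLOPS_RATIO = 0.25     # "roughly a quarter" of per-token inference FLOPs
KV_CACHE_RATIO = 0.10  # "a tenth" of the KV cache


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used per token."""
    return active_b / total_b


def reduction_factor(ratio: float) -> float:
    """Convert 'needs ratio x the baseline' into 'baseline is N times larger'."""
    return 1.0 / ratio


for name, params in MODELS.items():
    share = active_fraction(**params)
    print(f"{name}: {share:.1%} of parameters active per token")

# Assumption: the V3.2 baseline is normalized to 1.0, not a measured value.
print(f"Per-token FLOPs: about {reduction_factor(FLOPS_RATIO):.0f}x lower than V3.2")
print(f"KV cache: about {reduction_factor(KV_CACHE_RATIO):.0f}x smaller than V3.2")
```

The active-parameter share is what makes a Mixture-of-Experts model cheaper to run than its total size suggests: only a few percent of the weights take part in computing any one token, although all of them still have to be stored.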
The headline outside the model card is the chip story. Huawei announced the same day that its full Ascend supernode lineup (A2, A3, and the new 950 series) is compatible with both V4-Pro and V4-Flash, and that the integration was co-designed rather than ported after the fact. That makes V4 DeepSeek's first model for which Huawei silicon is a first-class deployment target.
The point is geopolitical as much as technical. US export controls have steadily tightened the supply of Nvidia accelerators to Chinese labs, and Beijing has been pushing for a domestic stack that can train and serve frontier models without them. A frontier-tier open-weights model that runs natively on Huawei hardware, and is priced aggressively for inference, is exactly the artifact that push needs. It does not prove that independence has been achieved, but it suggests the gap is closing fast.
For learners: the interesting question is no longer just "which model is best." It's "which stack are you building on, and what changes if your supplier is cut off?" If you work in or around AI, knowing how a model is trained and served — what hardware, what attention mechanism, what tradeoffs — is becoming as important as knowing how to prompt it.