MiMo-V2.5-Pro-UltraSpeed: 1T-parameter model at 1000 tokens per second
Xiaomi releases MiMo-V2.5-Pro-UltraSpeed, a 1-trillion-parameter model that generates 1000 tokens per second.
Xiaomi and TileRT jointly announced MiMo-V2.5-Pro-UltraSpeed, claiming it is the first 1-trillion-parameter model to exceed 1000 tokens per second on commodity GPUs. The system combines three techniques: FP4 quantization, applied selectively to the Mixture of Experts (MoE) layers; DFlash block-level speculative decoding, with acceptance lengths of 4 to 8 tokens; and TileRT's persistent-kernel execution model, which eliminates traditional operator boundaries. The model runs on a single commodity node with 8 GPUs. A limited, application-based API trial, priced at 3x the cost of standard MiMo-V2.5-Pro, runs June 9-23, 2026, and chat access is free during the trial. The FP4-quantized weights and DFlash parameters are open-sourced on Hugging Face.
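To make the speculative-decoding numbers concrete, here is a minimal sketch of a generic greedy verification step. It is not DFlash's actual drafter or acceptance rule, which the announcement does not describe. A cheap drafter proposes a block of tokens, the large target model keeps the longest prefix it agrees with, and the target contributes one token of its own. The names `verify_block` and `verifier_passes_per_second` are hypothetical, and the arithmetic assumes "acceptance length" means tokens committed per target forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: maps a token prefix to the model's greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_block(target: NextToken, prefix: List[int], draft: List[int]) -> List[int]:
    """Greedy block verification: keep the longest draft prefix the target agrees
    with, then append one token chosen by the target itself.

    Real systems score the whole draft block in one batched forward pass of the
    target model; this sketch loops token by token for clarity."""
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target(context)
        if expected != tok:
            # First mismatch: substitute the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Entire draft accepted: the same target pass also yields one bonus token.
    accepted.append(target(context))
    return accepted


def verifier_passes_per_second(tokens_per_second: float, tokens_per_pass: float) -> float:
    """Target forward passes needed per second to sustain a given output rate."""
    return tokens_per_second / tokens_per_pass


if __name__ == "__main__":
    # Toy target model that always predicts "previous token + 1".
    counter = lambda ctx: ctx[-1] + 1
    print(verify_block(counter, [0], [1, 2, 3, 9]))  # draft diverges at the 4th token
    for k in (4, 8):
        print(k, verifier_passes_per_second(1000, k))
```

Under that assumption, acceptance lengths of 4 to 8 mean the 1T-parameter target only needs 250 to 125 forward passes per second to reach 1000 tokens per second, a budget of roughly 4 to 8 ms per pass rather than 1 ms per token.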
What commenters are saying
Commenters focused on the performance claims and on model safety. Several tested MiMo-V2.5-Pro with sensitive historical prompts (Tiananmen Square 1989, the status of Taiwan) and reported that it answered factually where other Chinese models refused, sparking debate over censorship in Chinese versus US models. One commenter noted that most frontier US models are closed-source, which limits comparable testing. Others were skeptical of the TileRT open-source repo (17 commits, mostly a Python wrapper around a closed binary) and questioned whether the claimed quality retention under heavy FP4 quantization holds for general use beyond benchmarks. The throughput figures drew interest as a possible enabler of agent workflows.
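The doubt about quality retention under FP4 comes down to how coarse 4-bit floats are. Below is a minimal sketch that assumes the common E2M1 element format (1 sign, 2 exponent, 1 mantissa bit) with one absmax scale per block of 16 values. The announcement does not say which FP4 variant, block size, or scale format MiMo uses, and the function names are hypothetical.

```python
from typing import List, Tuple

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values: List[float]) -> Tuple[float, List[float]]:
    """Scale a block so its largest magnitude maps to 6.0 (the E2M1 maximum),
    then round each scaled value to the nearest E2M1 magnitude, keeping the sign.
    Ties go to the smaller magnitude here (min picks the first match), not
    round-to-nearest-even as hardware formats typically do."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(mag if v >= 0 else -mag)
    return scale, codes


def dequantize_block(scale: float, codes: List[float]) -> List[float]:
    """Reconstruct approximate values from the shared scale and 4-bit codes."""
    return [scale * c for c in codes]


def max_abs_error(values: List[float], block_size: int = 16) -> float:
    """Worst-case round-trip error when quantizing values block by block."""
    worst = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, codes = quantize_block(block)
        restored = dequantize_block(scale, codes)
        worst = max(worst, max(abs(a - b) for a, b in zip(block, restored)))
    return worst


if __name__ == "__main__":
    print(quantize_block([6.0, 2.5, -1.2, 0.1]))
    # One outlier stretches the shared scale and flushes small weights to zero.
    print(max_abs_error([60.0, 1.0, 1.0, 1.0], block_size=4))
```

The outlier case shows why benchmark-level quality does not automatically generalize: a single large weight sets the block scale to 10, so the three 1.0 weights in the same block round to 0. Production FP4 schemes mitigate this with smaller blocks, finer-grained scales, or keeping sensitive layers at higher precision, which is consistent with the announcement applying FP4 only selectively.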