{"work":{"id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","openalex_id":null,"doi":null,"arxiv_id":"2601.02780","raw_key":null,"title":"MiMo-V2-Flash Technical Report","authors":null,"authors_text":"Xiaomi LLM-Core Team: Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang","year":2026,"venue":"cs.CL","abstract":"We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.","external_url":"https://arxiv.org/abs/2601.02780","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T07:13:16.238907+00:00","pith_arxiv_id":"2601.02780","created_at":"2026-05-09T05:45:22.415306+00:00","updated_at":"2026-06-29T07:13:16.238907+00:00","title_quality_ok":true,"display_title":"MiMo-V2-Flash Technical Report","render_title":"MiMo-V2-Flash Technical Report"},"hub":{"state":{"work_id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":51,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2026-01-26T17:56:50+00:00","last_pith_cited_at":"2026-06-16T15:33:49+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T12:38:46.752996+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":10},{"context_role":"baseline","n":5},{"context_role":"dataset","n":2},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":10},{"context_polarity":"baseline","n":5},{"context_polarity":"use_dataset","n":2},{"context_polarity":"unclear","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:30:12.776504+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":18},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":15},{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":12},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":10},{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","work_id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","shared_citers":10},{"title":"GLM-5: from Vibe Coding to Agentic Engineering","work_id":"ad29b1a2-bf77-46b3-9ead-fb62b1d2c6fe","shared_citers":9},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":9},{"title":"Kimi K2.5: Visual Agentic Intelligence","work_id":"d690be8f-5d53-49b0-b1e7-79668eb8fcdb","shared_citers":8},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":8},{"title":"On-policy distillation","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":8},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":8},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":8},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":7},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":7},{"title":"MiniLLM: On-Policy Distillation of Large Language Models","work_id":"16edb291-dd18-41c5-8486-c6c715ec5311","shared_citers":7},{"title":"Self-Distillation Enables Continual Learning","work_id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","shared_citers":7},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, et al","work_id":"5e961f0b-b20e-4580-965d-15fb63ec8965","shared_citers":6},{"title":"Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes","work_id":"8e059321-d6e4-4359-89eb-83ecbb93b657","shared_citers":6},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":6},{"title":"A Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":5},{"title":"Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643","work_id":"1e58a1be-9294-4bd3-8040-3516dcebcd4a","shared_citers":5}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:29:20.286846+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:29:51.022671+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"MiMo-V2-Flash Technical Report","claims":[{"claim_text":"We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teach","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks MiMo-V2-Flash Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:29:36.976598+00:00"}},"summary":{"title":"MiMo-V2-Flash Technical Report","claims":[{"claim_text":"We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teach","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks MiMo-V2-Flash Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":18},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":15},{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":12},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":10},{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","work_id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","shared_citers":10},{"title":"GLM-5: from Vibe Coding to Agentic Engineering","work_id":"ad29b1a2-bf77-46b3-9ead-fb62b1d2c6fe","shared_citers":9},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":9},{"title":"Kimi K2.5: Visual Agentic Intelligence","work_id":"d690be8f-5d53-49b0-b1e7-79668eb8fcdb","shared_citers":8},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":8},{"title":"On-policy distillation","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":8},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":8},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":8},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":7},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":7},{"title":"MiniLLM: On-Policy Distillation of Large Language Models","work_id":"16edb291-dd18-41c5-8486-c6c715ec5311","shared_citers":7},{"title":"Self-Distillation Enables Continual Learning","work_id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","shared_citers":7},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, et al","work_id":"5e961f0b-b20e-4580-965d-15fb63ec8965","shared_citers":6},{"title":"Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes","work_id":"8e059321-d6e4-4359-89eb-83ecbb93b657","shared_citers":6},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":6},{"title":"A Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":5},{"title":"Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643","work_id":"1e58a1be-9294-4bd3-8040-3516dcebcd4a","shared_citers":5}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"authors":[]}}