MiMo-V2-Flash Technical Report

Xiaomi LLM-Core Team: Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang · 2026 · cs.CL · arXiv 2601.02780

Mixed citation behavior. Most common role is background (56%).

53 Pith papers citing it

Background 56% of classified citations

abstract

We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
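The hybrid attention layout described in the abstract can be illustrated with a minimal sketch. Only the 128-token window and the 5:1 SWA-to-global ratio come from the abstract. The function names, the ordering of layers within each group (five SWA layers followed by one global layer), and the convention that the window includes the current token are assumptions made for illustration, not details confirmed by the report.

```python
def layer_pattern(num_layers, swa_per_global=5):
    """Hypothetical layout: groups of `swa_per_global` SWA layers, each
    followed by one global-attention layer. The real ordering is not
    given in the abstract and is assumed here."""
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def sliding_window_mask(seq_len, window=128):
    """Causal sliding-window mask: query i may attend to key j
    iff j <= i and i - j < window (window counts the current token,
    which is an assumption)."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


# Small illustrative sizes; the abstract's window is 128 tokens.
pattern = layer_pattern(12)
mask = sliding_window_mask(6, window=3)
```

With a window of 3, the last query position in `mask` can see only the three most recent keys, while a global layer would see the full causal prefix.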

citation-role summary

background: 10 · baseline: 5 · dataset: 2 · method: 1
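The 56% background share quoted near the top of the page is consistent with these counts, assuming the percentage is taken over the 18 classified citations rather than all 53 citing papers. A small check, with variable names chosen for illustration:

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 10, "baseline": 5, "dataset": 2, "method": 1}

# Assumption: shares are computed over classified citations only.
total = sum(role_counts.values())
shares = {role: round(100 * n / total) for role, n in role_counts.items()}
```

Here `total` is 18 and `shares["background"]` rounds to 56.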

citation-polarity summary

background: 10 · baseline: 5 · use dataset: 2 · unclear: 1

representative citing papers

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

GRASP: Deterministic argument ranking in interaction graphs

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

Self-Distilled RLVR

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

cs.LG · 2026-01-26 · unverdicted · novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

On the Position Bias of On-Policy Distillation

cs.LG · 2026-06-21 · unverdicted · novelty 6.0

Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.

Consolidating Rewarded Perturbations for LLM Post-Training

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

DeepSurvey introduces an agentic system for automated survey generation that improves depth through full-text keynotes, cross-paper clustering, and code analysis, while boosting citation reliability via graph expansion, hybrid filtering, and evidence-constrained assignment, with reported gains over

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

citing papers explorer

Showing 27 of 27 citing papers after filters.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
cs.LG · 2026-05-31 · unverdicted · none · ref 31 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

GRASP: Deterministic argument ranking in interaction graphs
cs.LG · 2026-05-18 · unverdicted · none · ref 66 · internal anchor
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG · 2026-05-16 · unverdicted · none · ref 26 · internal anchor
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
cs.LG · 2026-05-08 · unverdicted · none · ref 43 · internal anchor
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

Rubric-based On-policy Distillation
cs.LG · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG · 2026-04-11 · unverdicted · none · ref 54 · internal anchor
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Self-Distilled RLVR
cs.LG · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG · 2026-01-26 · unverdicted · none · ref 20 · internal anchor
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

On the Position Bias of On-Policy Distillation
cs.LG · 2026-06-21 · unverdicted · none · ref 26 · internal anchor
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
cs.LG · 2026-05-20 · unverdicted · none · ref 28 · internal anchor
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG · 2026-05-17 · unverdicted · none · ref 112 · internal anchor
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
cs.LG · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

An Information-Theoretic Criterion for Efficient Data Synthesis
cs.LG · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG · 2026-05-05 · unverdicted · none · ref 41 · internal anchor
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Multilingual Safety Alignment via Self-Distillation
cs.LG · 2026-05-03 · unverdicted · none · ref 11 · 2 links · internal anchor
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

Co-Evolving Policy Distillation
cs.LG · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG · 2026-04-14 · unverdicted · none · ref 24 · internal anchor
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG · 2026-04-14 · unverdicted · none · ref 5 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

How Transformers Learn to Plan via Multi-Token Prediction
cs.LG · 2026-04-13 · conditional · none · ref 12 · internal anchor
Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
cs.LG · 2026-02-12 · conditional · none · ref 23 · internal anchor
Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.

One-Way Policy Optimization for Self-Evolving LLMs
cs.LG · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.

$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control
cs.LG · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
cs.LG · 2026-05-12 · unverdicted · none · ref 20 · 4 links · internal anchor
Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.

A Survey of On-Policy Distillation for Large Language Models
cs.LG · 2026-04-01 · unreviewed · ref 6 · 2 links · internal anchor
