pith. machine review for the scientific record.

arxiv: 2506.13585 · v1 · submitted 2025-06-16 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Pith reviewed 2026-05-12 09:22 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords hybrid attention · lightning attention · Mixture of Experts · reinforcement learning · CISPO · test-time compute · long context · reasoning model

The pith

MiniMax-M1 combines hybrid attention with a new RL algorithm to scale test-time compute efficiently in a 456 billion parameter model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiniMax-M1 as an open-weight model built on a hybrid Mixture-of-Experts architecture that incorporates lightning attention. This design supports native one-million-token contexts and reduces the cost of extending computation during inference for reasoning tasks. Training relies on large-scale reinforcement learning on diverse problems, including sandbox-based software engineering environments, accelerated by a new algorithm called CISPO that clips importance sampling weights instead of token updates. The full RL training run finishes in three weeks on 512 H800 GPUs at a rental cost of about 535,000 dollars, and the resulting models match or exceed leading open-weight models on benchmarks for complex software tasks, tool use, and long inputs. If the approach holds, it points to a practical route for building reasoning systems that can process extended contexts and allocate more thinking steps without proportional increases in hardware demands.
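The headline figures can be turned into a rough back-of-envelope check. The sketch below uses only the numbers quoted from the abstract (456B total parameters, 45.9B active, 512 GPUs, $534,700) plus one labeled assumption: that "three weeks" means exactly 21 days of continuous use. The function names are illustrative, and the implied hourly rate is a derived estimate, not a figure the paper reports.

```python
# Figures quoted from the abstract.
TOTAL_PARAMS_B = 456.0     # total parameters, in billions
ACTIVE_PARAMS_B = 45.9     # parameters activated per token, in billions
NUM_GPUS = 512
RENTAL_COST_USD = 534_700

# Assumption (not stated in the source): "three weeks" = 21 full days.
ASSUMED_DAYS = 21


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU runs around the clock."""
    return num_gpus * days * 24


def implied_hourly_rate(cost_usd: float, hours: int) -> float:
    """Rental price per GPU-hour implied by the quoted total cost."""
    return cost_usd / hours


hours = gpu_hours(NUM_GPUS, ASSUMED_DAYS)
print(f"active fraction: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"GPU-hours:       {hours:,}")
print(f"implied rate:    ${implied_hourly_rate(RENTAL_COST_USD, hours):.2f} per GPU-hour")
```

Under the 21-day assumption this works out to 258,048 GPU-hours and roughly $2.07 per GPU-hour, with about 10.1% of parameters active per token. A shorter or longer actual run would shift the implied rate accordingly.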

Core claim

MiniMax-M1 is powered by a hybrid MoE architecture with lightning attention that natively handles 1 million token contexts and enables efficient scaling of test-time compute. The model, derived from a 456 billion parameter base with 45.9 billion activated per token, is trained via reinforcement learning on diverse problems including real-world software engineering. CISPO, the proposed RL algorithm, improves efficiency by clipping importance sampling weights rather than token updates, allowing the entire RL phase to complete on 512 H800 GPUs in three weeks for a rental cost of 534,700 dollars. Released versions with 40K and 80K thinking budgets perform comparably or better than DeepSeek-R1 on

What carries the argument

The lightning attention mechanism, which is combined with standard attention in the hybrid-attention MoE model so that test-time compute can scale efficiently.
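The summary does not describe how lightning attention works internally. As a hedged illustration, the sketch below assumes it belongs to the linear-attention family and shows the generic property such methods rely on: causal attention without a softmax can be computed either as a sum over all earlier tokens or as a running state of fixed size, and the two give the same result. This is a toy in plain Python, not MiniMax's kernel; tiling, decay terms, and normalization are all omitted, and both function names are hypothetical.

```python
def quadratic_form(Q, K, V):
    """Causal unnormalized attention: each token sums over every earlier token."""
    n, dv = len(Q), len(V[0])
    out = []
    for t in range(n):
        row = [0.0] * dv
        for s in range(t + 1):  # work per token grows with its position t
            score = sum(q * k for q, k in zip(Q[t], K[s]))
            for j in range(dv):
                row[j] += score * V[s][j]
        out.append(row)
    return out


def recurrent_form(Q, K, V):
    """Same result using a running (dk x dv) state: constant work per token."""
    dk, dv = len(K[0]), len(V[0])
    state = [[0.0] * dv for _ in range(dk)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(dk):
            for j in range(dv):
                state[i][j] += k[i] * v[j]  # accumulate the outer product k^T v
        out.append([sum(q[i] * state[i][j] for i in range(dk)) for j in range(dv)])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(quadratic_form(Q, K, V) == recurrent_form(Q, K, V))
```

The recurrent form is what makes long generations cheap in this family: each new token updates a fixed-size state instead of revisiting the whole history.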

If this is right

  • The model supports eight times the context length of DeepSeek R1 while keeping full RL training feasible on a 512-GPU cluster.
  • CISPO allows reinforcement learning for reasoning models to run faster and cheaper by changing how importance sampling weights are handled.
  • Released models show particular strength in complex software engineering and tool utilization, suggesting the architecture suits agent-like tasks.
  • Two thinking-budget variants let users trade off inference cost against depth of reasoning on the same base model.
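The CISPO bullet above contrasts clipping importance sampling weights with clipping token updates. The abstract gives no equations, so the sketch below is only one plain reading of that contrast, written per token. It compares the gradient coefficient on a token's log-probability under the standard PPO clipped surrogate with a variant in which the importance weight is clipped and then treated as a constant. The function names and the epsilon values are hypothetical, not taken from the paper.

```python
import math


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def ppo_clip_grad_coeff(logp_new, logp_old, advantage, eps=0.2):
    """d/d(logp_new) of min(r * A, clip(r) * A), the PPO clipped surrogate."""
    ratio = math.exp(logp_new - logp_old)
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps)
    if ratio * advantage <= clipped * advantage:
        return ratio * advantage  # unclipped branch active: d(r)/d(logp_new) = r
    return 0.0                    # clipped branch active: the token gets no update


def clipped_weight_grad_coeff(logp_new, logp_old, advantage,
                              eps_low=0.2, eps_high=0.2):
    """d/d(logp_new) of stop_grad(clip(r)) * A * logp_new (weight-clipping reading)."""
    ratio = math.exp(logp_new - logp_old)
    weight = clip(ratio, 1.0 - eps_low, 1.0 + eps_high)  # held constant
    return weight * advantage     # bounded, but nonzero whenever A != 0


# (logp_new, logp_old, advantage) for three illustrative tokens.
tokens = [(0.0, 0.0, 1.0), (-1.0, -1.5, 1.0), (-1.5, -1.0, -1.0)]
for lp_new, lp_old, adv in tokens:
    print(ppo_clip_grad_coeff(lp_new, lp_old, adv),
          clipped_weight_grad_coeff(lp_new, lp_old, adv))
```

In this toy, the second and third tokens fall outside the clip range, so the PPO surrogate gives them a zero gradient, while the weight-clipping variant still updates them with a bounded coefficient. Whether this matches the paper's exact objective would need to be checked against its equations.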

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency gains could make large-scale reinforcement learning for reasoning models accessible to more research groups with limited hardware budgets.
  • If lightning attention generalizes, it may allow other hybrid architectures to handle million-token inputs at lower latency than full attention alternatives.
  • The focus on sandbox-based software environments during training may produce models that transfer more readily to real-world coding agents than purely text-trained systems.

Load-bearing premise

The lightning attention mechanism delivers large efficiency gains in test-time compute and training without reducing the model's reasoning accuracy or long-context capability.

What would settle it

A side-by-side measurement of tokens processed per second and benchmark scores on long-context software engineering tasks that shows MiniMax-M1 no faster or no more accurate than a standard-attention model of similar size.
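Before any such measurement is run, the expected shape of the comparison can be sketched with textbook operation counts. The code below is a rough model, assuming lightning attention behaves like linear attention: it counts only the two large matrix products per head and ignores projections, softmax, normalization, and memory traffic. The head dimension of 128 is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def softmax_attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs per head: Q @ K^T (2*n*n*d) plus weights @ V (2*n*n*d)."""
    return 4 * n * n * d


def linear_attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs per head: K^T @ V state (2*n*d*d) plus Q @ state (2*n*d*d)."""
    return 4 * n * d * d


HEAD_DIM = 128  # hypothetical head dimension

for n in (1_000, 32_000, 1_000_000):
    ratio = softmax_attention_flops(n, HEAD_DIM) / linear_attention_flops(n, HEAD_DIM)
    print(f"n={n:>9,}  softmax/linear FLOP ratio ~ {ratio:,.1f}")
```

In this model the ratio is simply n / d, so the advantage grows linearly with sequence length and vanishes when the sequence is about as long as the head dimension. Real throughput depends on kernel and memory behavior, which is exactly what the proposed side-by-side measurement would capture.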

Original abstract

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces MiniMax-M1, described as the first open-weight, large-scale hybrid-attention reasoning model. It combines a 456B-parameter MoE architecture (45.9B active per token) with a lightning attention mechanism to support a native 1M-token context (8x DeepSeek-R1) and efficient test-time compute scaling. The model is trained via large-scale RL on diverse problems, including sandbox-based, real-world software engineering environments, using a proposed CISPO algorithm that clips importance sampling weights; full RL training completes in three weeks on 512 H800 GPUs at a rental cost of $534,700. Two variants (40K and 80K thinking budgets) are released and reported to match or exceed strong open-weight models such as DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool use, and long-context tasks.

Significance. If the efficiency and performance claims hold, the work would provide a practical open-weight demonstration of hybrid attention for long-context reasoning and cost-effective RL scaling of test-time compute. The public model release and the CISPO algorithm could serve as useful baselines for future research on efficient inference-time scaling and RL for agentic tasks.

major comments (3)
  1. [Abstract] Abstract: the central performance claim states that the models are 'comparable or superior' to DeepSeek-R1 and Qwen3-235B 'with particular strengths in complex software engineering, tool utilization, and long-context tasks,' yet no benchmark scores, tables, error bars, or specific task results are supplied. This absence is load-bearing for the superiority assertion and prevents assessment of effect sizes.
  2. [Abstract / Model Architecture] Lightning attention description: the manuscript asserts that the mechanism 'enables efficient scaling of test-time compute' for 1M-context reasoning, but provides neither scaling curves, FLOPs-vs-length measurements, nor ablations comparing it to standard attention on long software-engineering traces. Without these data the efficiency advantage remains unverified.
  3. [RL Training / CISPO] CISPO algorithm: the claim that CISPO 'outperforms other competitive RL variants' by clipping importance sampling weights rather than token updates is central to the training-efficiency narrative, yet no equations, pseudocode, or ablation tables versus PPO/GRPO are referenced, nor is any analysis of gradient bias on extended thinking chains supplied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive summary of our work and the constructive major comments. We appreciate the opportunity to clarify and strengthen the manuscript. We address each point below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claim states that the models are 'comparable or superior' to DeepSeek-R1 and Qwen3-235B 'with particular strengths in complex software engineering, tool utilization, and long-context tasks,' yet no benchmark scores, tables, error bars, or specific task results are supplied. This absence is load-bearing for the superiority assertion and prevents assessment of effect sizes.

    Authors: We agree that the abstract would be strengthened by including concrete benchmark numbers. While the full manuscript already contains detailed tables, error bars, and per-task results in the experiments section, we will revise the abstract to incorporate a concise summary of key scores (e.g., software-engineering and long-context benchmarks) with direct comparisons to DeepSeek-R1 and Qwen3-235B. This change will make the performance claims immediately verifiable without lengthening the abstract substantially. revision: yes

  2. Referee: [Abstract / Model Architecture] Lightning attention description: the manuscript asserts that the mechanism 'enables efficient scaling of test-time compute' for 1M-context reasoning, but provides neither scaling curves, FLOPs-vs-length measurements, nor ablations comparing it to standard attention on long software-engineering traces. Without these data the efficiency advantage remains unverified.

    Authors: We acknowledge that additional empirical support is needed to substantiate the efficiency claims. In the revised manuscript we will add (i) scaling curves relating context length to test-time FLOPs and latency, (ii) direct FLOPs-vs-length measurements, and (iii) ablations of lightning attention versus standard attention on long software-engineering traces. These figures and tables will be placed in the model-architecture and experiments sections. revision: yes

  3. Referee: [RL Training / CISPO] CISPO algorithm: the claim that CISPO 'outperforms other competitive RL variants' by clipping importance sampling weights rather than token updates is central to the training-efficiency narrative, yet no equations, pseudocode, or ablation tables versus PPO/GRPO are referenced, nor is any analysis of gradient bias on extended thinking chains supplied.

    Authors: We agree that the algorithmic details and supporting evidence must be explicit. We will add the full set of equations, pseudocode, and implementation details for CISPO to the RL-training section. We will also include ablation tables comparing CISPO against PPO and GRPO on both training efficiency and final performance, plus a short analysis of gradient bias on long thinking chains based on our training logs. These additions will improve reproducibility and directly address the referee's concern. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to prior model; central claims rest on independent empirical benchmarks

full rationale

The paper introduces MiniMax-M1 as a new hybrid MoE + lightning attention model trained with the CISPO RL algorithm and validates performance via standard benchmarks against external models like DeepSeek-R1. The only self-reference is the statement that M1 is 'developed based on our previous MiniMax-Text-01 model,' which supplies the base parameters but does not define or force the new architecture, attention mechanism, or RL gains. No equations reduce predictions to fitted inputs by construction, no uniqueness theorems are imported from the same authors, and no ansatz is smuggled via citation. All load-bearing claims (test-time scaling efficiency, CISPO superiority, benchmark results) are presented as outcomes of training and evaluation rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claims rest on the effectiveness of lightning attention and CISPO, both of which are introduced in this paper; few details are available beyond the abstract on how they were derived or validated.

free parameters (1)
  • thinking budget (40K and 80K)
    The two versions with different thinking budgets are presented as intermediate and final phases, but the specific values may be chosen based on performance.
axioms (2)
  • domain assumption: Hybrid attention mechanism improves efficiency for long contexts
    Assumed in the design of the model for 1M token support.
  • domain assumption: CISPO clips importance sampling weights to enhance RL efficiency
    The novel algorithm is based on this clipping approach.
invented entities (2)
  • Lightning attention mechanism (no independent evidence)
    purpose: To enable efficient scaling of test-time compute
    New mechanism introduced in the paper without external validation mentioned.
  • CISPO RL algorithm (no independent evidence)
    purpose: To further enhance RL efficiency by clipping importance sampling weights
    Novel algorithm proposed in the paper.

pith-pipeline@v0.9.0 · 6142 in / 1679 out tokens · 44671 ms · 2026-05-12T09:22:33.467728+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  2. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  3. F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

    cs.LG 2026-05 unverdicted novelty 7.0

    F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked perfo...

  4. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  5. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  6. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  7. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  8. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  9. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  10. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  11. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  12. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  13. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  14. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  15. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  16. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  17. Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...

  18. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  19. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  20. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  21. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  22. Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

    cs.LG 2026-05 unverdicted novelty 6.0

    S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.

  23. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  24. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  25. Building a Precise Video Language with Human-AI Oversight

    cs.CV 2026-04 unverdicted novelty 6.0

    CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

  26. Scaling Self-Play with Self-Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

  27. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  28. Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

    cs.LG 2026-04 unverdicted novelty 6.0

    Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...

  29. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

  30. Policy Improvement Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.

  31. Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

  32. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  33. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    cs.LG 2026-05 unverdicted novelty 5.0

    RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

  34. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  35. Beyond Distribution Sharpening: The Importance of Task Rewards

    cs.LG 2026-04 unverdicted novelty 5.0

    Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.

  36. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  37. Group Sequence Policy Optimization

    cs.LG 2025-07 unverdicted novelty 5.0

    GSPO is a sequence-level policy optimization algorithm that outperforms GRPO in efficiency and stability for LLM reinforcement learning, especially MoE models.

  38. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  39. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 35 Pith papers · 23 internal anchors

  1. [1]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668,

  2. [2]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench.arXiv preprint arXiv:2412.15204,

  3. [3]

    Titans: Learning to Memorize at Test Time

    15 MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

  4. [4]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150,

  5. [5]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung and Ç. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555,

  6. [6]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617,

  7. [7]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    Mom: Linear sequence modeling with mixture-of-memories

    Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. Mom: Linear sequence modeling with mixture-of-memories. arXiv preprint arXiv:2502.13685,

  10. [10]

    Zamba: A compact 7b SSM.arXiv preprint arXiv:2405.16712,

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b SSM.arXiv preprint arXiv:2405.16712,

  11. [11]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  12. [12]

    Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, and Weiyao Lin

    URLhttp://papers.nips.cc/paper_files/paper/2022/ hash/9156b0f6dfa9bbd18c79cc459ef5d61c-Abstract-Conference.html. Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, and Weiyao Lin. Rodimus*: Breaking the accuracy-efficiency trade-off with efficient attentions.arXiv preprint arXiv:2410.06577,

  13. [13]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874,

  14. [14]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

  15. [15]

    Jamba-1.5: Hybrid T.

    Jamba Team. Jamba-1.5: Hybrid T. arXiv preprint arXiv:2408.12570,

  16. [16]

    Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,

  17. [17]

    Zebralogic: On the scaling limits of llms for logical reasoning

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning. arXiv preprint arXiv:2502.01100,

  18. [18]

    Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

    Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641, 2025a. Siyao Liu, He Zhu, Jerry Liu, ...

  19. [19]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025b. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInterna...

  20. [20]

    MoBA: Mixture of block attention for long-context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189,

  21. [21]

    Parallelizing linear recurrent neural nets over sequence length

    Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HyUNwulC-.

  22. [22]

    Minimax-01: Scaling foundation models with lightning attention

    MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025.

  23. [23]

    A theory on Adam instability in large-scale machine learning

    Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, and Susan Zhang. A theory on Adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023.

  24. [24]

    OpenAI MRCR dataset

    OpenAI. OpenAI MRCR dataset. https://huggingface.co/datasets/openai/mrcr, 2024b. Accessed: 2025-06-15. OpenAI. Introducing deep research,

  25. [25]

    Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence

    Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, and Przemysław Kazienko. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024a. Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eug...

  26. [26]

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249,

  27. [27]

    The devil in linear transformer

    Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041, 2022a. Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosF...

  28. [28]

    You only scan once: Efficient multi-dimension sequential modeling with lightnet

    Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, and Yiran Zhong. You only scan once: Efficient multi-dimension sequential modeling with lightnet. arXiv preprint arXiv:2405.21022, 2024a. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in ...

  29. [29]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

    Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522,

  30. [30]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  31. [31]

    ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914,

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  33. [33]

    Scaling laws for linear complexity language models

    Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, and Yiran Zhong. Scaling laws for linear complexity language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16377–16426,

  34. [34]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256,

  35. [35]

    Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers. arXiv preprint arXiv:2409.04109,

  36. [36]

    Deltaproduct: Improving state-tracking in linear rnns via householder products

    Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear rnns via householder products. arXiv preprint arXiv:2502.10297,

  37. [37]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs

    Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. arXiv preprint arXiv:2501.17399,

  38. [38]

    Linear-moe: Linear sequence modeling meets mixture-of-experts

    Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-moe: Linear sequence modeling meets mixture-of-experts. arXiv preprint arXiv:2503.05447,

  39. [39]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620,

  40. [40]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621,

  41. [41]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,

  42. [42]

    Mesanet: Sequence modeling by locally optimal test-time training

    Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233,

  43. [43]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939,

  44. [44]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368,

  45. [45]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489,

  46. [46]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. Theagentcompany: Benchmarking llm agents on consequential real world tasks. arXiv ...

  47. [47]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2024a. Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024b....

  48. [48]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  49. [49]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089,

  50. [50]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,