Recognition: 2 theorem links
Lean Theorem · MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Pith reviewed 2026-05-12 09:22 UTC · model grok-4.3
The pith
MiniMax-M1 combines hybrid attention with a new RL algorithm to scale test-time compute efficiently in a 456-billion-parameter model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiniMax-M1 is powered by a hybrid MoE architecture with lightning attention that natively handles 1-million-token contexts and enables efficient scaling of test-time compute. The model, derived from a 456-billion-parameter base with 45.9 billion parameters activated per token, is trained via reinforcement learning on diverse problems, including real-world software engineering. CISPO, the proposed RL algorithm, improves efficiency by clipping importance sampling weights rather than token updates, allowing the entire RL phase to complete on 512 H800 GPUs in three weeks at a rental cost of $534,700. Released versions with 40K and 80K thinking budgets perform comparably to or better than DeepSeek-R1 and Qwen3-235B on standard benchmarks, with particular strengths in complex software engineering, tool utilization, and long-context tasks.
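To make the contrast concrete, here is a minimal PyTorch sketch of the two clipping styles the claim refers to. It assumes a GRPO-style token-level loss; the function names, epsilon values, and mean reduction are illustrative assumptions, not the paper's implementation.

```python
import torch

def ppo_token_loss(logp_new, logp_old, adv, eps=0.2):
    # PPO/GRPO-style clipping: tokens whose probability ratio leaves
    # [1 - eps, 1 + eps] are dropped from the gradient entirely.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()

def cispo_style_loss(logp_new, logp_old, adv, eps_low=1.0, eps_high=0.2):
    # CISPO-style clipping, as described in the abstract: bound the
    # importance-sampling weight itself and detach it, so every token
    # keeps a (bounded) policy-gradient contribution through logp_new.
    weight = torch.clamp(torch.exp(logp_new - logp_old),
                         1 - eps_low, 1 + eps_high).detach()
    return -(weight * adv * logp_new).mean()
```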
What carries the argument
The lightning attention mechanism, which replaces standard attention to allow efficient scaling of test-time compute in the hybrid-attention MoE model.
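The excerpt does not spell out the lightning attention kernel, but it belongs to the linear-attention family, whose defining property is a running state that replaces the quadratic score matrix. A minimal sketch of that generic recurrence follows; the feature map, state shape, and omitted normalization are assumptions of this sketch, not the paper's kernel.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Generic linear-attention recurrence: O(n * d^2) in sequence length n,
    versus O(n^2 * d) for softmax attention. Lightning attention is an
    I/O-aware kernel in this family; this sketch shows only the math.
    q, k, v: (seq_len, d). The softmax denominator is omitted; some
    linear-attention variants replace it with a separate norm layer."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1       # positive feature map
    state = q.new_zeros(q.shape[-1], v.shape[-1])   # running sum of k^T v
    out = []
    for t in range(q.shape[0]):
        state = state + torch.outer(phi_k[t], v[t])  # constant-size state
        out.append(phi_q[t] @ state)                 # O(d^2) per token
    return torch.stack(out)
```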
If this is right
- The model supports eight times the context length of DeepSeek-R1 (1 million tokens) while keeping training feasible on a modest GPU cluster.
- CISPO allows reinforcement learning for reasoning models to run faster and cheaper by changing how importance sampling weights are handled.
- Released models show particular strength in complex software engineering and tool utilization, suggesting the architecture suits agent-like tasks.
- Two thinking-budget variants let users trade off inference cost against depth of reasoning on the same base model.
Where Pith is reading between the lines
- The efficiency gains could make large-scale reinforcement learning for reasoning models accessible to more research groups with limited hardware budgets.
- If lightning attention generalizes, it may allow other hybrid architectures to handle million-token inputs at lower latency than full attention alternatives.
- The focus on sandbox-based software environments during training may produce models that transfer more readily to real-world coding agents than purely text-trained systems.
Load-bearing premise
The lightning attention mechanism delivers large efficiency gains in test-time compute and training without reducing the model's reasoning accuracy or long-context capability.
What would settle it
A side-by-side measurement of tokens processed per second and benchmark scores on long-context software-engineering tasks: if MiniMax-M1 proved no faster and no more accurate than a standard-attention model of similar size, the load-bearing premise would fail.
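A hedged sketch of how the throughput half of that test could be run, assuming a Hugging Face-style generate API; the checkpoints and token counts are placeholders, not measured results.

```python
import time
import torch

@torch.no_grad()
def decode_tokens_per_second(model, input_ids, new_tokens=512):
    # Time greedy decoding of a fixed number of new tokens on one long
    # prompt; run the same prompt through both models and compare.
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    return new_tokens / (time.perf_counter() - start)

# Usage (placeholder checkpoints): load MiniMax-M1 and a standard-attention
# baseline of similar size, feed each the same long-context prompt, and
# compare decode_tokens_per_second alongside benchmark accuracy.
```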
original abstract
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MiniMax-M1, the first open-weight large-scale hybrid-attention reasoning model. It combines a 456B-parameter MoE architecture (45.9B active) with a lightning attention mechanism to support native 1M-token context (8x DeepSeek-R1) and efficient test-time compute scaling. The model is trained via large-scale RL on sandbox and real-world software engineering tasks using a proposed CISPO algorithm that clips importance sampling weights; full RL training completes in three weeks on 512 H800 GPUs at $534k cost. Two variants (40K and 80K thinking budgets) are released and reported to match or exceed strong open models such as DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool use, and long-context reasoning.
Significance. If the efficiency and performance claims hold, the work would provide a practical open-weight demonstration of hybrid attention for long-context reasoning and cost-effective RL scaling of test-time compute. The public model release and the CISPO algorithm could serve as useful baselines for future research on efficient inference-time scaling and RL for agentic tasks.
major comments (3)
- [Abstract] The central performance claim states that the models are 'comparable or superior' to DeepSeek-R1 and Qwen3-235B 'with particular strengths in complex software engineering, tool utilization, and long-context tasks,' yet no benchmark scores, tables, error bars, or specific task results are supplied. This absence is load-bearing for the superiority assertion and prevents assessment of effect sizes.
- [Abstract / Model Architecture] Lightning attention description: the manuscript asserts that the mechanism 'enables efficient scaling of test-time compute' for 1M-context reasoning, but provides neither scaling curves, FLOPs-vs-length measurements, nor ablations comparing it to standard attention on long software-engineering traces. Without these data the efficiency advantage remains unverified.
- [RL Training / CISPO] CISPO algorithm: the claim that CISPO 'outperforms other competitive RL variants' by clipping importance sampling weights rather than token updates is central to the training-efficiency narrative, yet no equations, pseudocode, or ablation tables versus PPO/GRPO are referenced, nor is any analysis of gradient bias on extended thinking chains supplied.
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work and the constructive major comments. We appreciate the opportunity to clarify and strengthen the manuscript. We address each point below and will incorporate the suggested revisions.
point-by-point responses
- Referee: [Abstract] The central performance claim states that the models are 'comparable or superior' to DeepSeek-R1 and Qwen3-235B 'with particular strengths in complex software engineering, tool utilization, and long-context tasks,' yet no benchmark scores, tables, error bars, or specific task results are supplied. This absence is load-bearing for the superiority assertion and prevents assessment of effect sizes.
  Authors: We agree that the abstract would be strengthened by including concrete benchmark numbers. While the full manuscript already contains detailed tables, error bars, and per-task results in the experiments section, we will revise the abstract to incorporate a concise summary of key scores (e.g., software-engineering and long-context benchmarks) with direct comparisons to DeepSeek-R1 and Qwen3-235B. This change will make the performance claims immediately verifiable without lengthening the abstract substantially. Revision: yes.
- Referee: [Abstract / Model Architecture] Lightning attention description: the manuscript asserts that the mechanism 'enables efficient scaling of test-time compute' for 1M-context reasoning, but provides neither scaling curves, FLOPs-vs-length measurements, nor ablations comparing it to standard attention on long software-engineering traces. Without these data the efficiency advantage remains unverified.
  Authors: We acknowledge that additional empirical support is needed to substantiate the efficiency claims. In the revised manuscript we will add (i) scaling curves relating context length to test-time FLOPs and latency, (ii) direct FLOPs-vs-length measurements, and (iii) ablations of lightning attention versus standard attention on long software-engineering traces. These figures and tables will be placed in the model-architecture and experiments sections. Revision: yes.
- Referee: [RL Training / CISPO] CISPO algorithm: the claim that CISPO 'outperforms other competitive RL variants' by clipping importance sampling weights rather than token updates is central to the training-efficiency narrative, yet no equations, pseudocode, or ablation tables versus PPO/GRPO are referenced, nor is any analysis of gradient bias on extended thinking chains supplied.
  Authors: We agree that the algorithmic details and supporting evidence must be explicit. We will add the full set of equations, pseudocode, and implementation details for CISPO to the RL-training section. We will also include ablation tables comparing CISPO against PPO and GRPO on both training efficiency and final performance, plus a short analysis of gradient bias on long thinking chains based on our training logs. These additions will improve reproducibility and directly address the referee's concern. Revision: yes.
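For readers who want the shape of such an addition before the revision lands, here is one plausible form of a clipped-IS objective consistent with the abstract's one-line description. The symbols (token ratio $r$, group advantage $\hat A$, stop-gradient $\operatorname{sg}$, and the $\varepsilon$ bounds) are this sketch's assumptions, not the manuscript's published equations.

$$
J(\theta) = \mathbb{E}\left[\frac{1}{\sum_i |o_i|}\sum_i \sum_t \operatorname{sg}\!\big(\hat r_{i,t}(\theta)\big)\, \hat A_i \, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t})\right],
\qquad
\hat r_{i,t}(\theta) = \operatorname{clip}\!\big(r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big),
$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance-sampling weight. Clipping and detaching $\hat r$ keeps every token's $\log \pi_\theta$ term in the gradient, unlike PPO-style objectives that zero out clipped tokens.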
Circularity Check
Minor self-citation to prior model; central claims rest on independent empirical benchmarks
full rationale
The paper introduces MiniMax-M1 as a new hybrid MoE + lightning attention model trained with the CISPO RL algorithm and validates performance via standard benchmarks against external models like DeepSeek-R1. The only self-reference is the statement that M1 is 'developed based on our previous MiniMax-Text-01 model,' which supplies the base parameters but does not define or force the new architecture, attention mechanism, or RL gains. No equations reduce predictions to fitted inputs by construction, no uniqueness theorems are imported from the same authors, and no ansatz is smuggled via citation. All load-bearing claims (test-time scaling efficiency, CISPO superiority, benchmark results) are presented as outcomes of training and evaluation rather than tautological redefinitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- thinking budget (40K and 80K)
axioms (2)
- domain assumption: Hybrid attention mechanism improves efficiency for long contexts
- domain assumption: CISPO clips importance sampling weights to enhance RL efficiency
invented entities (2)
- Lightning attention mechanism (no independent evidence)
- CISPO RL algorithm (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism... the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute"
- IndisputableMonolith.Foundation.LedgerForcing.conservation_from_balance · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 40 Pith papers
- ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
  ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
  F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked perfo...
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
  ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
  CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
  CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
- When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
  Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
- When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
  Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
- SAGE: A Service Agent Graph-guided Evaluation Benchmark
  SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
- D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
  D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.
- D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
  D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
- MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
  MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
  Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- Priming: Hybrid State Space Models From Pre-trained Transformers
  Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
- Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
  LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
  S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- Cost-Aware Learning
  Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
- Building a Precise Video Language with Human-AI Oversight
  CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...
- Scaling Self-Play with Self-Guidance
  SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
- Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
  Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
- MEMENTO: Teaching LLMs to Manage Their Own Context
  MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
- Policy Improvement Reinforcement Learning
  PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
- MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
  MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization f...
- Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
  FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
  RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
  Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- Beyond Distribution Sharpening: The Importance of Task Rewards
  Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
  AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
- Group Sequence Policy Optimization
  GSPO is a sequence-level policy optimization algorithm that outperforms GRPO in efficiency and stability for LLM reinforcement learning, especially MoE models.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
Reference graph
Works this paper leans on
- [1] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv:2402.18668.
- [2] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204.
- [3] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv:2501.00663.
- [4] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150.
- [5] Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
- [6] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. arXiv:2505.22617.
- [7] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv:2405.21060.
- [8] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- [9] Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. MoM: Linear sequence modeling with mixture-of-memories. arXiv:2502.13685.
- [10] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM. arXiv:2405.16712.
- [11] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In ICLR 2022.
- [12] Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, and Weiyao Lin. Rodimus*: Breaking the accuracy-efficiency trade-off with efficient attentions. arXiv:2410.06577.
- [13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874.
- [14] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv:2503.24290.
- [15] Jamba Team. Jamba-1.5: Hybrid Transformer-Mamba models at scale. arXiv:2408.12570.
- [16] Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv:2501.12599.
- [17] Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. arXiv:2502.01100.
- [18] Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv:2505.19641.
- [19] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv:2503.20783. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR 2019.
- [20] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. MoBA: Mixture of block attention for long-context LLMs. arXiv:2502.13189.
- [21] Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In ICLR 2018.
- [22] MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. MiniMax-01: Scaling foundation models with lightning attention. arXiv:2501.08313.
- [23] Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, and Susan Zhang. A theory on Adam instability in large-scale machine learning. arXiv:2304.09871.
- [24] OpenAI. OpenAI MRCR dataset. https://huggingface.co/datasets/openai/mrcr, 2024. Accessed 2025-06-15. OpenAI. Introducing deep research.
- [25] Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, and Przemysław Kazienko. Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. arXiv:2404.05892.
- [26] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's Last Exam. arXiv:2501.14249.
- [27] Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. In EMNLP 2022, pages 7025–7041. Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking softmax in attention. In ICLR 2022.
- [28] Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, and Yiran Zhong. You only scan once: Efficient multi-dimension sequential modeling with LightNet. arXiv:2405.21022. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning Attention-2: A free lunch for handling unlimited sequence lengths in large language models.
- [29] Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv:2406.07522.
- [30] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347.
- [31] ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1.5-Thinking: Advancing superb reasoning models with reinforcement learning. arXiv:2504.13914.
- [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
- [33] Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, and Yiran Zhong. Scaling laws for linear complexity language models. In EMNLP 2024, pages 16377–16426.
- [34] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv:2409.19256.
- [35] Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. arXiv:2409.04109.
- [36] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. arXiv:2502.10297.
- [37] Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. arXiv:2501.17399.
- [38] Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-MoE: Linear sequence modeling meets mixture-of-experts. arXiv:2503.05447.
- [39] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv:2407.04620.
- [40] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. arXiv:2307.08621.
- [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
- [42] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. MesaNet: Sequence modeling by locally optimal test-time training. arXiv:2506.05233.
- [43] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv:2506.01939.
- [44] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv:2411.04368.
- [45] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv:2407.01489.
- [46] Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint.
- [47] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention Transformers with hardware-efficient training. arXiv:2312.06635. Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv:2406.06484.
- [48] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint.
- [49] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv:2502.11089.
- [50] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv:2503.18892.