Recognition: 3 theorem links
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Pith reviewed 2026-05-11 17:42 UTC · model grok-4.3
The pith
GLM-4.5 reaches 70.1% on TAU-Bench and 91.0% on AIME 24 with an open-source 355B-parameter MoE model that activates only 32B parameters per token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.
What carries the argument
The hybrid reasoning method that supports both thinking and direct response modes, built inside a Mixture-of-Experts architecture with 355 billion total parameters but only 32 billion activated per token.
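The parameter split (355B total, 32B active) follows from standard MoE routing: a router scores all experts per token but only the top-k are evaluated. The sketch below is a toy illustration of that mechanism; the shapes, router, and top-k value are invented for clarity and are not GLM-4.5's actual configuration.

```python
import numpy as np

# Toy dimensions, NOT GLM-4.5's real configuration.
d_model = 64        # hidden size
n_experts = 16      # total experts in the layer
top_k = 2           # experts activated per token

rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))  # one FFN matrix per expert

def moe_layer(x):
    """x: (d_model,) single token. Mix the outputs of the top-k experts."""
    logits = x @ router_w                      # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over the selected k only
    # Only top_k of n_experts weight matrices are touched for this token,
    # which is why "activated" parameters are far fewer than total parameters.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d_model))
active_fraction = top_k / n_experts            # 2/16 here; ~32B/355B in the paper
```

The fraction of weights touched per token is `top_k / n_experts` at the layer level, which is the source of the efficiency claim.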
Load-bearing premise
That the reported benchmark scores reflect genuine capabilities measured through fair, standardized, and uncontaminated evaluations that allow direct comparison to other models.
What would settle it
Independent re-evaluation of the model on the same benchmark problems using fresh, publicly documented prompts and code, or testing on a new suite of problems created after the training cutoff, would confirm or refute the claimed scores.
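One concrete form of that test: re-run the model on fresh problems and check whether the claimed score falls inside a confidence interval around the observed accuracy. The 87/100 re-run result below is invented purely for illustration; only the 91.0% figure comes from the paper.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for an observed accuracy correct/n."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: an independent re-run answers 87 of 100 fresh
# AIME-style problems correctly (made-up numbers).
lo, hi = wilson_interval(87, 100)
claimed = 0.91   # the paper's AIME 24 figure
consistent = lo <= claimed <= hi  # claim survives if it sits in the interval
```

A claimed score outside the interval would be evidence against the headline number; one inside it, as here, is merely compatible with it.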
read the original abstract
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
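The abstract's "hybrid reasoning method" means one checkpoint serving two inference modes. A minimal sketch of that dispatch, with control markers invented for this illustration (they are not GLM-4.5's actual chat-template tokens):

```python
# Hypothetical request wrapper illustrating a hybrid-reasoning switch:
# one model, two inference modes. The <think_on>/<think_off> markers
# are invented for this sketch.
def build_prompt(question, thinking=True):
    if thinking:
        # Thinking mode: the model is steered to emit an explicit
        # reasoning trace before its final answer.
        return f"<think_on>\n{question}"
    # Direct mode: skip the trace for latency-sensitive queries.
    return f"<think_off>\n{question}"

slow = build_prompt("Prove that 17 is prime.", thinking=True)
fast = build_prompt("What is 2 + 2?", thinking=False)
```

The design point is that mode selection happens at request time, not by swapping models.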
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. It features a hybrid reasoning method that supports both thinking and direct response modes. The model undergoes multi-stage training on 23T tokens and post-training with expert model iteration and reinforcement learning. GLM-4.5 reports strong results across agentic, reasoning, and coding (ARC) tasks, including 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. It ranks 3rd overall among evaluated models and 2nd on agentic benchmarks despite having fewer parameters than several competitors. A compact variant, GLM-4.5-Air (106B parameters), is also released, with code and models made available at a GitHub repository.
Significance. If the benchmark results hold under verifiable and standardized conditions, the work advances open-source models for agentic and reasoning tasks by demonstrating competitive performance with an efficient MoE architecture and hybrid reasoning. The public release of both the full and compact models, along with code, is a clear strength that enables reproducibility and community follow-up research on ARC capabilities.
major comments (1)
- [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd-overall and 2nd-agentic rankings, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.
minor comments (2)
- [Abstract] The phrase 'expert model iteration' in the abstract is used without definition or reference to a methods section; a brief clarification would improve readability.
- The efficiency claim ('much fewer parameters than several competitors') would be strengthened by explicitly listing the parameter counts of the referenced competing models in a comparison table.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The feedback highlights an important point about ensuring transparency in the abstract for benchmark results. We address this directly below.
read point-by-point responses
- Referee: [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd-overall and 2nd-agentic rankings, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.
Authors: We agree that the abstract, constrained by length, omits explicit methodology details, which can affect an immediate assessment of comparability. The full manuscript contains sections on evaluation protocols covering agent scaffolding (standard setups for TAU-Bench and SWE-bench), tool-use protocols, attempt limits, prompting strategies, selective use of the hybrid thinking mode, and data-contamination controls via held-out test sets and decontamination procedures. In the revision, we will expand the abstract with a concise clause summarizing these elements and add cross-references to the detailed methodology sections. This change improves clarity while preserving the abstract's brevity. We do not believe the core results or rankings require alteration, only better contextualization. Revision: yes.
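The rebuttal's "decontamination procedures" usually mean n-gram overlap filtering between the training corpus and benchmark items. A minimal sketch of that check, with illustrative n and threshold settings (the paper does not disclose its actual ones):

```python
def ngrams(text, n=8):
    """Set of word-level n-grams from lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc, bench_item, n=8, threshold=1):
    """Flag a benchmark item that shares >= threshold n-grams with a
    training document. n and threshold are illustrative knobs only."""
    overlap = ngrams(train_doc, n) & ngrams(bench_item, n)
    return len(overlap) >= threshold

bench = "let x be the smallest positive integer such that x squared plus one is prime"
clean_doc = "unrelated prose about mixture of experts routing and load balancing"
leaky_doc = "forum post: let x be the smallest positive integer such that x squared plus one is prime, anyone?"
```

Items flagged this way are dropped from either the training set or the test set before scores are reported; real pipelines also normalize punctuation and casing more aggressively than this sketch.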
Circularity Check
No circularity: purely empirical benchmark reporting
full rationale
The paper describes training GLM-4.5 (355B MoE) on 23T tokens with post-training and RL, then reports measured benchmark scores (70.1% TAU-Bench, 91.0% AIME 24, 64.2% SWE-bench Verified). No mathematical derivations, equations, fitted predictions, or first-principles results exist. Claims rest on independent empirical evaluations with no self-definitional loops, fitted-input predictions, or load-bearing self-citations that reduce the central results to inputs by construction. Standard model-release structure; derivation chain is absent.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
unclear: Relation between the paper passage and the cited Recognition theorem.
GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation
unclear: Relation between the paper passage and the cited Recognition theorem.
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes.
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance
unclear: Relation between the paper passage and the cited Recognition theorem.
Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 55 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench is the first benchmark for multimodal models on naturally occurring table images, with only one of 21 tested models exceeding 50% accuracy and most ranging from 4.1% to 49.9%.
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
StoryAlign: Evaluating and Training Reward Models for Story Generation
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
-
Dr.Sai: An agentic AI for real-world physics analysis at BESIII
Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.
-
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
-
E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
-
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
ImplicitMemBench shows no LLM exceeds 66% on implicit memory tasks, with top models at 65%, far below humans and pointing to architectural limits beyond scaling.
-
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
-
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
-
WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation
WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
MAIC-UI: Making Interactive Courseware with Generative UI
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...
-
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
-
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
Learning to Retrieve from Agent Trajectories
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
-
OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent
OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Apriel-1.5-OpenReasoner uses RL post-training with adaptive sampling and difficulty-aware penalties to boost reasoning accuracy on AIME, GPQA, MMLU-Pro and LiveCodeBench while producing shorter traces and generalizing...
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
-
Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities
Frontier LLMs homogenize toward systematic and analytical personalities, suppressing emotional traits like remorseful or sycophantic, indicating an implicit consensus on optimal assistant behavior.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.
Reference graph
Works this paper leans on
- [1]
-
[2]
C. An, Z. Xie, X. Li, L. Li, J. Zhang, S. Gong, M. Zhong, J. Xu, X. Qiu, M. Wang, and L. Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025
work page 2025
-
[3]
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024
work page 2024
-
[4]
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, Vienna, Austria, July 202...
work page 2025
-
[5]
M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle, 2022
work page 2022
- [6]
-
[7]
A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025
work page 2025
-
[8]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page 2021
- [9]
-
[10]
K. Deshpande, V. Sirdeshmukh, J. B. Mols, L. Jin, E.-Y. Hernandez-Cardona, D. Lee, J. Kritz, W. E. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, 2025
work page 2025
-
[11]
H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations improve language modeling. In Proceedings of the 41st International Conference on Machine Learning, pages 11030–11048, 2024
work page 2024
-
[12]
F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024
-
[13]
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page 2025
-
[14]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)
- [15]
- [16]
-
[17]
S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling
-
[18]
A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page 2024
-
[19]
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations
-
[20]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
work page 2023
- [21]
-
[22]
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017
work page 2017
-
[23]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024
work page 2024
-
[24]
J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025
work page 2025
-
[25]
M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog
work page 2025
-
[26]
S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025
work page 2025
-
[27]
G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, N. Foroutan, A. H. Kargaran, C. Raffel, M. Jaggi, L. von Werra, and T. Wolf. Fineweb2: One pipeline to scale them all–adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025
-
[28]
L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025
work page 2025
-
[29]
Y. Qin, T. Zhang, Y. Shen, W. Luo, Y. Zhang, Y. Qiao, Z. Zhou, W. Zhang, B. Cui, et al. Sysbench: Can llms follow system message? In The Thirteenth International Conference on Learning Representations, 2024
work page 2024
-
[30]
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024
work page 2024
-
[31]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page 2024
- [32]
-
[33]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page 2023
-
[34]
K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025
work page 2025
-
[35]
T. T.-B. Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025
work page 2025
-
[36]
M. Tian, L. Gao, S. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, et al. Scicode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024
work page 2024
-
[37]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[38]
K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, R. Anil, E. Dyer, S. Shakeri, R. Vij, H. Mehta, V. Ramasesh, Q. Le, E. Chi, Y. Lu, O. Firat, A. Lazaridou, J.-B. Lespiau, N. Attaluri, and K. Olszewska. Michelangelo: Long context evaluations beyond haystacks via latent structure queries, 2024
work page 2024
- [39]
- [40]
-
[41]
S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025
work page 2025
-
[42]
X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...
work page 2025
-
[43]
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024
work page 2024
-
[44]
J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024
work page 2024
-
[45]
J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025
work page 2025
- [46]
-
[47]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page 2025
-
[48]
S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024
work page 2024
-
[49]
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
work page 2025
-
[50]
A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations
-
[51]
Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045, 2023
-
[52]
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023
work page 2023
discussion (0)