Recognition: 2 theorem links · Lean Theorem
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Pith reviewed 2026-05-10 13:01 UTC · model grok-4.3
The pith
DeepSeek-V3.2 combines sparse attention, scaled reinforcement learning, and large-scale agentic data synthesis to match or exceed closed models on advanced reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-V3.2 shows that an open model can reach reasoning proficiency on par with or better than GPT-5 through three components: DeepSeek Sparse Attention, a scalable reinforcement learning protocol, and a large-scale agentic task synthesis pipeline. Its high-compute variant surpasses GPT-5, equals Gemini-3.0-Pro, and secures gold medals in both the 2025 IMO and IOI.
What carries the argument
DeepSeek Sparse Attention (DSA), an attention mechanism that cuts computational complexity while preserving long-context performance, paired with the scalable reinforcement learning framework and the agentic task synthesis pipeline.
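The abstract does not publish DSA's equations, but the claimed shape of the mechanism, score all past keys cheaply and attend only over a small selected subset, can be sketched generically. Everything below (the indexer scoring rule, the value of k, the function name) is a hypothetical stand-in, not the paper's actual design:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """One query step of a generic top-k sparse attention.

    A lightweight "indexer" pass scores all L past keys, but softmax
    attention is computed only over the k highest-scoring positions,
    so the expensive part of the per-query cost drops from O(L) to O(k).
    Illustrative sketch only; DSA's published claim is the complexity
    reduction, not this specific formulation.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)      # cheap indexer pass over all L keys
    idx = np.argsort(scores)[-k:]    # keep the top-k positions
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()                     # softmax over the selected keys only
    return w @ V[idx]                # weighted sum of k values

rng = np.random.default_rng(0)
L, d = 32, 8
q, K, V = rng.normal(size=d), rng.normal(size=(L, d)), rng.normal(size=(L, d))
out = topk_sparse_attention(q, K, V, k=4)
print(out.shape)  # (8,)
```

The design question the referee raises below is exactly what this sketch cannot answer: whether such a selection step preserves long-context quality in practice.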
If this is right
- Scaled post-training reinforcement learning plus synthetic agent data can lift open models to frontier-level performance on complex interactive tasks.
- Sparse attention mechanisms allow high-performance models to handle longer contexts without proportional increases in compute cost.
- Systematic generation of agentic training scenarios improves generalization and robustness in tool-use environments.
- Open models can reach gold-medal results on international olympiad benchmarks in mathematics and informatics.
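The second bullet's "without proportional increases in compute cost" admits a back-of-envelope check. With hypothetical numbers (the paper reports neither the selection budget nor the context lengths used), the attended-position count under top-k selection scales as L·k rather than L²:

```python
# Back-of-envelope attention cost: dense O(L^2) vs top-k sparse O(L*k)
# attended positions. Both numbers below are illustrative assumptions,
# not figures from the paper.
L = 128_000       # context length in tokens (assumed)
k = 2_048         # selected keys per query (assumed)
dense = L * L     # pairwise positions attended under dense attention
sparse = L * k    # positions attended after top-k indexing
print(dense // sparse)  # 62 -> roughly 62x fewer attended positions
```

The cheap indexer pass still touches every key, so the real saving depends on how light that pass is relative to full attention.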
Where Pith is reading between the lines
- If the results hold, post-training innovations become a primary route for open-source efforts to match closed-model capabilities without matching pretraining scale.
- The synthesis pipeline could be adapted to generate training data for domains beyond coding and math, such as scientific hypothesis testing.
- Smaller variants incorporating DSA might deliver usable performance on consumer hardware while retaining core reasoning strengths.
- Verification on entirely new olympiad-style problems would clarify whether the gains transfer beyond the specific 2025 test sets.
Load-bearing premise
The reported benchmark scores, especially the 2025 olympiad gold medals, reflect genuine model generalization rather than data contamination, overfitting, or non-standard evaluation protocols.
What would settle it
An independent run of the model on a fresh, previously unpublished collection of IMO and IOI problems, with blinded scoring and no access to any synthetic or prior training data derived from similar problems.
read the original abstract
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
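The abstract names a "robust reinforcement learning protocol" without specifics. One common shape for RL post-training on verifiable tasks, group-relative advantage normalization as in DeepSeekMath's GRPO (which this paper's lineage cites), can be sketched as follows; this is an illustrative guess at the family of method, not the paper's actual protocol:

```python
import math

def group_advantages(rewards):
    """Group-relative advantages, GRPO-style.

    For a group of sampled responses to one prompt, each response's
    advantage is its reward standardized against the group mean and
    standard deviation. Sketch only; the paper does not disclose its
    reward modeling or update rule.
    """
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sd for r in rewards]

# Verifiable reward: 1.0 if a checker accepts the sampled answer, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])  # [1.0, -1.0, -1.0, 1.0]
```

Correct samples are pushed up and incorrect ones down, with no learned value network, which is part of what makes such protocols cheap to scale.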
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepSeek-V3.2, an open large language model that combines DeepSeek Sparse Attention (DSA) for reduced computational complexity in long contexts, a scalable reinforcement learning framework for post-training, and a large-scale agentic task synthesis pipeline for tool-use scenarios. It claims that the high-compute variant DeepSeek-V3.2-Speciale surpasses GPT-5, matches Gemini-3.0-Pro in reasoning, and achieves gold-medal performance on the 2025 IMO and IOI.
Significance. If the reported results hold under standard evaluation conditions, the work would be significant for demonstrating that open models can reach frontier reasoning levels through efficient attention and scaled post-training, while providing a synthesis pipeline for agentic capabilities. The emphasis on open release could accelerate community progress, but the absence of verifiable details currently limits its contribution.
major comments (3)
- [Abstract] The gold-medal claims for DeepSeek-V3.2-Speciale on the 2025 IMO and IOI are presented without any description of problem sourcing, contamination audits, adherence to official judging rubrics, single-attempt constraints, or tool-access rules during evaluation. This is load-bearing for the central comparison to GPT-5 and Gemini-3.0-Pro, as deviations from standard protocols would undermine the generalization argument.
- [Abstract] DeepSeek Sparse Attention (DSA) is described as substantially reducing complexity while preserving performance, but no equations, complexity analysis, ablation results, or quantitative long-context benchmarks are supplied to support this.
- [Abstract] The scalable RL framework and agentic synthesis pipeline are outlined at a high level, with no specifics on reward modeling, training protocol, data generation details, or ablation studies showing their contribution to the reported gains.
Simulated Author's Rebuttal
Thank you for your constructive review and recommendation for major revision. We agree that the abstract requires additional supporting details to substantiate the key claims. We will revise the manuscript to incorporate the requested information on evaluation protocols, DSA technical specifics, and RL/pipeline details. Point-by-point responses to the major comments follow.
read point-by-point responses
- Referee: [Abstract] The gold-medal claims for DeepSeek-V3.2-Speciale on the 2025 IMO and IOI are presented without any description of problem sourcing, contamination audits, adherence to official judging rubrics, single-attempt constraints, or tool-access rules during evaluation. This is load-bearing for the central comparison to GPT-5 and Gemini-3.0-Pro, as deviations from standard protocols would undermine the generalization argument.
Authors: We agree these details are essential and their omission from the abstract is a limitation. In the revised manuscript we will add a dedicated evaluation protocol subsection (cross-referenced from the abstract) covering: sourcing of official 2025 IMO/IOI problems, contamination audits via n-gram and semantic overlap checks against training data, adherence to official rubrics with scoring methodology, enforcement of single-attempt constraints, and tool-access rules (none for IMO; standard permitted for IOI). This will directly support the comparisons to GPT-5 and Gemini-3.0-Pro. revision: yes
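The "n-gram overlap checks" the rebuttal promises can be made concrete with a minimal sketch. The n-gram size, threshold semantics, and tokenization below are illustrative assumptions, not the authors' actual audit:

```python
def ngram_overlap(candidate, corpus_docs, n=8):
    """Fraction of the candidate's n-grams that appear in any corpus document.

    Minimal sketch of an n-gram contamination audit: a high overlap
    fraction flags a benchmark problem as likely present in training data.
    Tokenization (whitespace, lowercased) and n=8 are assumptions.
    """
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    cand = grams(candidate)
    if not cand:
        return 0.0
    seen = set()
    for doc in corpus_docs:
        seen |= grams(doc)
    return len(cand & seen) / len(cand)

problem = "let n be a positive integer such that n squared plus one divides n plus one"
train = ["let n be a positive integer such that n squared plus one divides n plus one exactly"]
print(ngram_overlap(problem, train, n=8))  # 1.0 -> flagged as likely contaminated
```

A real audit would pair this lexical check with the semantic-overlap pass the rebuttal also mentions, since paraphrased problems defeat exact n-gram matching.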
- Referee: [Abstract] DeepSeek Sparse Attention (DSA) is described as substantially reducing complexity while preserving performance, but no equations, complexity analysis, ablation results, or quantitative long-context benchmarks are supplied to support this.
Authors: The referee is correct that the abstract lacks these elements. We will revise the manuscript to include (or prominently reference) the DSA equations, asymptotic complexity analysis demonstrating the reduction for long sequences, ablation studies on sparsity patterns, and quantitative results on long-context benchmarks showing performance retention. A brief mention of the complexity benefit will also be added to the abstract where space permits. revision: yes
- Referee: [Abstract] The scalable RL framework and agentic synthesis pipeline are outlined at a high level, with no specifics on reward modeling, training protocol, data generation details, or ablation studies showing their contribution to the reported gains.
Authors: We acknowledge the high-level presentation in the abstract. As part of the major revision we will expand the Methods and Experiments sections with specifics on reward modeling, the RL training protocol and scaling procedure, details of the agentic task synthesis pipeline (including generation scale and diversity), and ablation studies quantifying each component's contribution to reasoning and tool-use gains. revision: yes
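The "agentic task synthesis pipeline" whose details the referee requests could plausibly take the shape of a generator that crosses tool inventories with goal templates and attaches a verifiable checker to each task. Every name in this sketch (the tool list, the templates, the checker naming convention) is a hypothetical stand-in; the paper discloses no pipeline internals:

```python
import random

def synthesize_tasks(tools, templates, n_tasks=3, seed=0):
    """Sketch of a synthesis loop for agentic post-training tasks.

    Crosses a tool inventory with goal templates and attaches a
    named checker so each generated task is automatically gradable.
    Purely illustrative; not the paper's pipeline.
    """
    rng = random.Random(seed)  # seeded for reproducible generation
    tasks = []
    for _ in range(n_tasks):
        tool = rng.choice(tools)
        goal = rng.choice(templates).format(tool=tool)
        tasks.append({"goal": goal, "tools": [tool], "checker": f"verify_{tool}"})
    return tasks

tools = ["search", "python", "browser"]
templates = ["Use {tool} to answer a multi-step question",
             "Recover from a failing {tool} call and finish the task"]
for t in synthesize_tasks(tools, templates):
    print(t["goal"])
```

The point of such a design is the one the rebuttal gestures at: generation scale and diversity are controlled by the template and tool inventories, while the per-task checker makes RL reward assignment automatic.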
Circularity Check
No significant circularity detected in claimed results.
full rationale
The paper reports empirical performance outcomes from its DSA mechanism, RL post-training, and agentic synthesis pipeline, including comparisons to GPT-5 and gold-medal results on 2025 IMO/IOI. No equations, derivations, or first-principles predictions are presented that reduce by construction to the paper's own inputs, fitted parameters, or self-citations. The abstract and context describe architectural and training innovations without self-definitional loops, renamed known results, or load-bearing self-citations that would force the central claims. The evaluation details are asserted rather than derived, leaving the results as independent empirical statements rather than circular reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation.PhiForcing.phi_equation (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Linked passage: "our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models
CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.
-
RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems
RecRM-Bench is a new large-scale benchmark dataset and framework for multi-dimensional reward modeling in agentic recommender systems, spanning instruction following, factual consistency, query-item relevance, and use...
-
Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
Budget-Efficient Automatic Algorithm Design via Code Graph
A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
-
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
VibeProteinBench is a three-stage language-interfaced benchmark revealing that no current LLM performs strongly across recognition, engineering, and generation of proteins.
-
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.
-
FactoryBench: Evaluating Industrial Machine Understanding
FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
-
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
-
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
A relay-buffer-free MoE communication scheme on Ascend uses pooled HBM for direct expert-window placement and reading, cutting dispatch and combine latency in prefill and decode phases.
-
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
-
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
-
Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS
LaF-MCTS uses LLM-assisted flexible MCTS with a three-tier hierarchy, semantic pruning, and branch regrowth to automatically compose decomposition-enhanced CVRP solvers that outperform state-of-the-art methods on CVRP...
-
MolViBench: Evaluating LLMs on Molecular Vibe Coding
MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.
-
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or t...
-
Matlas: A Semantic Search Engine for Mathematics
Matlas introduces a semantic retrieval system over 8.07 million mathematical statements from papers and textbooks, using dependency graphs and topological unfolding for self-contained search via natural language queries.
-
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
-
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
-
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
-
Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method
ConflictQA benchmark shows LLMs fail to resolve conflicts between text and KG evidence and often default to one source, motivating the XoT explanation-based reasoning method.
-
Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation
ResistClient creates more realistic challenging client simulators by combining resistance theory with supervised fine-tuning on a new dataset followed by process-supervised reinforcement learning for motivation reasoning.
-
AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics
A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.
-
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses
SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.
-
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...
-
WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection
WhaVax is a new expert-annotated dataset of WhatsApp vaccine messages with benchmarks showing competitive performance from embeddings and LLMs for misinformation detection under data scarcity.
-
FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing
FuzzAgent deploys specialized agents that collaborate on harness generation, execution, and crash triage to evolve fuzzing campaigns, delivering 45-191% more branch coverage than four baselines on 20 C/C++ libraries a...
-
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
-
Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.
-
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
-
Learning Agent Routing From Early Experience
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
Reference graph
Works this paper leans on
(Reference entries omitted: the extracted bibliography is garbled beyond recovery, with multiple citations fused per entry, and nothing in this review cites the numbered references.)