pith. machine review for the scientific record.

arxiv: 2505.09388 · v1 · submitted 2025-05-14 · 💻 cs.CL

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu

Pith reviewed 2026-05-09 00:14 UTC · model claude-opus-4-7

classification 💻 cs.CL
keywords large language models · mixture of experts · chain-of-thought reasoning · knowledge distillation · reinforcement learning from rewards · multilingual pretraining · thinking budget · instruction tuning

The pith

Qwen3 packs explicit chain-of-thought reasoning and direct answering into one model, with a user-controlled token budget that smoothly trades inference cost for accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Qwen3 is a family of dense and Mixture-of-Experts language models, from 0.6B up to a 235B-parameter MoE with 22B activated, that the authors train to switch between two response styles inside one set of weights: an explicit chain-of-thought "thinking" mode and a direct "non-thinking" mode, selected by /think and /no_think flags in the chat template. On top of that switch, the authors expose a thinking budget (a cap on reasoning tokens) that they show produces smooth, monotonic accuracy gains across math, coding, and STEM benchmarks as the cap is raised. They build the flagship via a four-stage post-training pipeline (long-CoT cold start, reasoning RL, thinking-mode fusion, general RL) and then transfer both modes to smaller variants by distilling teacher logits, claiming this is roughly ten times cheaper in GPU hours than rerunning RL on each size and yields better pass@64 than direct RL. Pretraining covers 36 trillion tokens across 119 languages, with synthetic math, code, and PDF-extracted text generated by earlier Qwen models. The reported headline is that the 235B-A22B model matches or beats DeepSeek-R1, DeepSeek-V3, and Llama-4-Maverick on most of the 23 benchmarks shown, with about a third of DeepSeek-V3's total parameters (235B versus 671B).
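
A minimal sketch of how a serving layer might resolve the mode from the /think and /no_think flags. The function name `resolve_mode`, the default of thinking mode, and the rule that the last flag seen in the conversation wins are assumptions made for illustration; this review does not confirm them.

```python
import re
from typing import Iterable

# Matches "/think" or "/no_think" as whole flags (so "/thinking" is ignored).
FLAG_PATTERN = re.compile(r"/(no_think|think)\b")


def resolve_mode(user_turns: Iterable[str], default: str = "thinking") -> str:
    """Return "thinking" or "non-thinking" for a conversation.

    Assumption: the most recent flag in any user turn decides the mode,
    and the default applies when no flag is present.
    """
    mode = default
    for turn in user_turns:
        for match in FLAG_PATTERN.finditer(turn):
            mode = "non-thinking" if match.group(1) == "no_think" else "thinking"
    return mode


if __name__ == "__main__":
    print(resolve_mode(["Prove the lemma /think", "Now just give the answer /no_think"]))
```

Keeping the decision in one small function makes it easy to log which mode each request actually used.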

Core claim

The paper argues that a single language model can be trained to operate in two regimes — a deliberate "thinking" mode that produces explicit chain-of-thought, and a fast "non-thinking" mode that answers directly — with the choice controlled by a flag in the chat template and the depth of reasoning controlled by a user-set token budget. The authors claim this fusion does not require two separate models, that the budget knob produces smooth, monotonic accuracy gains as more thinking tokens are allowed, and that strong-to-weak distillation from the flagship models lets small dense and Mixture-of-Experts variants inherit both modes at roughly one tenth the GPU cost of running the full four-stage

What carries the argument

A four-stage post-training pipeline applied to flagship models — long-CoT cold start, reasoning RL with GRPO on ~4k verifier-checked queries, thinking-mode fusion via SFT on mixed /think and /no_think data with a chat-template flag, and general RL with rule-based, reference-based, and preference-based rewards — combined with a budget-control trick that injects a stop-thinking instruction once a user-set token cap is hit. Smaller models (0.6B to 14B dense, plus a 30B-A3B MoE) skip stages 1–4 and instead receive off-policy then on-policy logit distillation from the 32B and 235B-A22B teachers.
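
The budget-control step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: tokens are modelled as plain strings, `apply_thinking_budget` is a hypothetical helper, and `STOP_THINKING_NOTE` is a placeholder because the exact wording of the stop-thinking instruction is not reproduced in full here.

```python
from typing import List, Tuple

# Placeholder only: the real instruction text is not given in full in this review.
STOP_THINKING_NOTE = "<hypothetical stop-thinking instruction>"
END_OF_THINKING = "</think>"


def apply_thinking_budget(thinking_tokens: List[str], budget: int) -> Tuple[List[str], bool]:
    """Cap a reasoning trace at `budget` tokens.

    Returns the (possibly truncated) trace closed with the end-of-thinking
    marker, and a flag saying whether the cap was hit.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if len(thinking_tokens) <= budget:
        # The model finished reasoning within the cap: close the block normally.
        return list(thinking_tokens) + [END_OF_THINKING], False
    # Cap reached: keep the first `budget` tokens, inject the stop note,
    # then close the block so the model moves on to its final answer.
    truncated = thinking_tokens[:budget]
    return truncated + [STOP_THINKING_NOTE, END_OF_THINKING], True


if __name__ == "__main__":
    trace = ["step1", "step2", "step3", "step4"]
    print(apply_thinking_budget(trace, budget=2))
```

In a real decoder the check would run during generation rather than on a finished trace; the list-based version only shows the truncate-then-inject logic.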

If this is right

  • Deployments no longer need to route between a chat model and a separate reasoning model; a single checkpoint with a /think or /no_think flag covers both regimes.
  • Inference cost becomes a tunable dial: operators can cap thinking tokens per query and trade latency for accuracy without retraining.
  • Small models (0.6B–14B) can acquire reasoning behaviour from a larger teacher via off-policy then on-policy logit distillation at roughly a tenth of the GPU cost of running full RL post-training on each size.
  • Pretraining at 36T tokens across 119 languages, with synthetic data from domain-specialist Qwen2.5 variants, is sufficient to let a 22B-activated MoE match or beat a 671B-parameter MoE on most reported benchmarks.
  • Fusing non-thinking data into a reasoning-RL'd model degrades peak AIME and LiveCodeBench scores slightly (Table 22), suggesting an explicit accuracy–versatility trade that future work will have to price.
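
To make the logit-distillation bullet concrete, here is a small sketch of a per-token distillation loss. It assumes the loss is the KL divergence from the teacher's next-token distribution to the student's, KL(teacher || student); the review only says that teacher logits are distilled, so the direction and the helper names are illustrative choices.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def distillation_loss(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position (assumed direction)."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary")
    student_logp = log_softmax(student_logits)
    teacher_logp = log_softmax(teacher_logits)
    return sum(
        math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp)
    )


if __name__ == "__main__":
    # Toy three-token vocabulary; the numbers are made up.
    print(distillation_loss([0.1, 0.2, 0.3], [2.0, 0.5, -1.0]))
```

In the on-policy variant the positions being scored come from the student's own rollouts, which is why the teacher must run a forward pass on every sampled sequence.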

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editorial: the budget mechanism works because the model has been trained on truncated thinking traces during fusion, so 'stop thinking now' is in-distribution rather than a brittle prompt hack — which predicts the trick will not transfer cleanly to models that were not trained with the same template.
  • Editorial: the reported drop in AIME and LiveCodeBench after stages 3–4 (Table 22) suggests the thinking and non-thinking objectives are partially adversarial, and that frontier reasoning scores will increasingly come from specialist checkpoints even as user-facing products ship the fused model.
  • Editorial: on-policy distillation beating direct RL at a tenth of the cost (Table 21), including on pass@64, hints that for current open models the bottleneck is exploration quality rather than reward signal — a stronger teacher's logits encode useful exploration that GRPO from scratch fails to discover.
  • Editorial: the long-context evaluation on RULER shows thinking mode underperforming non-thinking mode at 128k, indicating that chain-of-thought currently hurts pure retrieval and that future work will need to gate when to think rather than how long to think.

Load-bearing premise

That a single model can be trained to do both careful step-by-step reasoning and fast direct answering without one mode quietly degrading the other — the paper's own ablation already shows AIME and LiveCodeBench scores dropping after the fusion and general-RL stages, so the claim rests on the bet that the lost peak reasoning is a worthwhile price for unified deployment.

What would settle it

Run the released Qwen3-235B-A22B with a sweep of thinking-budget caps on AIME'24/'25, LiveCodeBench, and GPQA; if accuracy does not rise smoothly with budget, or if forcing /no_think collapses scores below the matched non-thinking baselines reported in Tables 11–12, the unified-mode and budget-control claims fail. Equivalently, distill an 8B student from the same off-policy checkpoint and check whether on-policy logit distillation reproduces the reported ~10x GPU-hour saving and the pass@64 gain on AIME over direct RL (Table 21).
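
The first check can be stated as a simple predicate over a budget sweep. The sketch below is hypothetical: the function name, the tolerance parameter, and the example numbers are placeholders, not measured results.

```python
from typing import Dict


def is_monotone_non_decreasing(sweep: Dict[int, float], tolerance: float = 0.0) -> bool:
    """True if accuracy never drops by more than `tolerance` as the budget grows.

    `sweep` maps a thinking-budget cap (in tokens) to an accuracy score.
    """
    budgets = sorted(sweep)
    return all(
        sweep[larger] >= sweep[smaller] - tolerance
        for smaller, larger in zip(budgets, budgets[1:])
    )


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the paper.
    example_sweep = {1024: 0.41, 4096: 0.55, 16384: 0.62}
    print(is_monotone_non_decreasing(example_sweep))
```

A non-zero tolerance lets the check absorb sampling noise, which matters if the scores come from a single seed.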

Original abstract

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

5 major / 9 minor

Summary. The manuscript presents Qwen3, a family of dense (0.6B–32B) and MoE (30B-A3B, 235B-A22B) open-weight LLMs. The principal contributions are: (i) a unified architecture that integrates "thinking" and "non-thinking" modes within a single model with a user-controllable thinking budget; (ii) a 36T-token, 119-language pretraining corpus assembled with PDF OCR via Qwen2.5-VL and synthetic data from Qwen2.5-Math/Coder, plus instance-level data-mixture optimization; (iii) a four-stage post-training pipeline (long-CoT cold start, reasoning RL with GRPO, thinking-mode fusion via SFT with /think and /no_think flags, and general RL with a multi-reward system); and (iv) a Strong-to-Weak Distillation pipeline (off-policy + on-policy logit distillation) that the authors argue obviates the four-stage pipeline for smaller models. Extensive benchmark tables compare Qwen3 against Qwen2.5, DeepSeek-V3/R1, Llama-4, Gemma-3, GPT-4o, o1/o3-mini, and Gemini-2.5-Pro across general, alignment, reasoning, coding, agent, multilingual (12 detailed languages plus Belebele), and long-context (RULER) benchmarks. Models are released under Apache 2.0.

Significance. This is a substantive open-weight model release accompanied by a well-organized technical report. If the released checkpoints reproduce the reported numbers, the contribution is significant: the flagship Qwen3-235B-A22B is competitive with proprietary frontier reasoning models on AIME/LiveCodeBench/CodeForces while activating only 22B parameters, and the unified thinking/non-thinking design with a budget mechanism is a clean engineering answer to the proliferation of separate "chat" and "reasoning" models. The thinking-budget scaling curves (Fig. 2) and the on-policy vs. RL comparison (Table 21) are concrete, falsifiable claims rather than narrative ones. Apache-2.0 release of all eight scales — including 0.6B and 1.7B variants that outperform similarly-sized Gemma-3 baselines — directly supports reproducibility and downstream research. The multilingual expansion (29→119 languages) and the Belebele evaluation across 80 languages provide genuine breadth beyond English/Chinese-centric reporting. The report stops short of frontier scientific novelty (most components — GRPO, GQA, QK-Norm, YARN, DCA, on-policy distillation — are adapted from prior work), but its value as an artifact-pap

major comments (5)
  1. [§4.7 / Table 21] The headline '1/10 GPU hours' efficiency claim for on-policy distillation vs. reinforcement learning is load-bearing for the strong-to-weak pipeline (which is applied to six of the eight released models) and is not adequately decomposed. On-policy distillation requires forward passes through Qwen3-32B or Qwen3-235B-A22B for every student rollout to obtain teacher logits (§4.5), and at the 235B-A22B scale this teacher inference is comparable to or larger than the student's own forward/backward. The reported 1,800 vs. 17,920 GPU-hours does not state whether teacher-inference cost is included, what hardware was used, what the prompt budget and rollout counts were on each arm, or whether the RL arm used the same 3,995 query-verifier set as §4.2. Without this decomposition the comparison conflates 'cheaper because distilled' with 'cheaper because the teacher's compute is externalized.' Please
  2. [§4.2, Reasoning RL] The reasoning-RL stage is reported to use only 3,995 query-verifier pairs and to lift Qwen3-235B-A22B AIME'24 from 70.1 to 85.1 over 170 GRPO steps. Given that this is a central efficiency claim and that contamination of small held-out math sets is a known risk, the report should (a) describe the decontamination procedure against AIME'24/'25, MATH-500, LiveCodeBench v5, and the validation queries reserved in §4.1; (b) report variance across seeds for at least one model; and (c) clarify whether the '170 steps' figure is comparable to the 17,920 GPU-hours in Table 21 or refers to a different run. Without these, the RL-vs-distillation comparison and the AIME headline scores are difficult to evaluate.
  3. [§3, Pre-training data] The 36T-token corpus is constructed in part from Qwen2.5-VL OCR of PDFs and from Qwen2.5-Math/Coder synthetic data, and instance-level data-mixture optimization is performed on 'small proxy models with fine-grained data labels.' No quantitative ablation is given for either the synthetic-data fraction or the instance-level mixture optimization, so the reader cannot assess how much of the Qwen3-vs-Qwen2.5 gap is attributable to architecture/training-strategy changes vs. data scale and synthetic data. A small-scale ablation (e.g., S1 with vs. without synthetic component, or with uniform vs. optimized mixture) would substantiate the claim that 'instance-level' mixture optimization is the right granularity, which §3.1 asserts but does not test.
  4. [§4.7, Table 22] Table 22 shows that Stage 3 (Thinking Mode Fusion) and Stage 4 (General RL) degrade performance in thinking mode on AIME'24 (83.8→81.4) and LiveCodeBench v5 (68.4→65.7) for Qwen3-32B, while improving general/alignment/agent metrics. The text frames this as an accepted trade-off 'to enhance the model's overall versatility.' This is a legitimate design choice, but it is in direct tension with the framing in §1 and §4.6 that the unified model matches dedicated reasoning models. Please quantify the gap relative to a Stage-2-only checkpoint (i.e., a 'reasoning-only' Qwen3-32B) on the full reasoning suite, so that users who do not need non-thinking mode can judge whether to deploy the Stage-2 artifact.
  5. [Appendix A.1.1, Table 23] Long-context results show that thinking mode degrades RULER performance relative to non-thinking mode at every scale (e.g., Qwen3-235B-A22B 95.0 → 92.2 average), and the authors hypothesize that thinking 'may interfere with the retrieval process.' Given that the abstract emphasizes 128K context and thinking mode as joint selling points, this regression deserves more than a one-paragraph hypothesis. Please report whether the thinking-budget setting (8192) was tuned, whether the degradation persists with budget=0 (which should reduce to non-thinking) or increases with longer budgets, and whether the YARN scaling factor=4 is identical across the two modes.
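
For readers unfamiliar with GRPO, referenced in major comment 2, the sketch below shows the group-relative advantage as it is commonly formulated: each sampled response's reward is normalized by the mean and standard deviation of its own group of rollouts. This is the standard textbook form, offered as an assumption; the review does not spell out the exact variant or hyperparameters the report uses.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same query.

    `eps` (an illustrative choice) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("a group needs at least one rollout")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one query, scored 1 (verifier pass) or 0 (fail).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the baseline is the group mean, a query on which every rollout passes or every rollout fails contributes no gradient signal, which is one reason query difficulty matters for a small query-verifier set.
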
minor comments (9)
  1. [Abstract / §2] Tokenizer vocabulary size is stated as 151,669 in §2 but the abstract and introduction never mention tokenizer changes relative to Qwen2.5 (which used 151,643). A one-line note on whether tokens were added (and for which languages) would help readers interpret the multilingual claims.
  2. [§2, Tables 1–2] The dense-model table omits hidden size, FFN size, and total non-embedding parameter counts; the MoE table omits expert FFN size and routing details (top-k, auxiliary-loss coefficient for the global-batch load-balancing loss of Qiu et al. 2025). These should be included for reproducibility.
  3. [§4.6] Sampling settings are given (temperature 0.6/0.7, top-p, top-k, presence penalty), but the number of samples per query is specified only for GPQA (10) and AIME (64). Please state n for the other benchmarks, especially LiveCodeBench and CodeForces, where pass@1 vs. pass@k materially affects rankings.
  4. [Figure 1] The figure caption is illegible due to encoded glyphs ('4UBHF...3-'). Replace with a clean rendering.
  5. [§4.3, Thinking budget] The stop-thinking instruction is hard-coded in English ('Considering the limited time by the user, ...'). For a 119-language model, please clarify whether non-English queries receive a localized stop instruction and whether this affects the multilingual thinking-budget curves.
  6. [§4.4] The reward-model-without-reference is trained on 'human preference data' but no description is given of dataset size, annotator pool, or evaluation of reward-model accuracy. Even a brief paragraph would aid reproducibility.
  7. [Tables 11–20] Several baselines have missing entries (e.g., Grok-3-Beta MMLU-Redux, Gemini-2.5-Pro MATH-500). Indicate whether these are unreported by the source or were not run, and the source for each cell.
  8. [§3.3] The sentence 'Even with 1/10 of the activated parameters of the Qwen2.5 dense base model' (Qwen3-MoE conclusion 2c) refers to Qwen3-30B-A3B vs. Qwen2.5-32B; '1/10 of activated' is approximately 3B/32B, which is 1/10.7 — fine, but elsewhere similar ratios are quoted as '1/5' and '1/3,' so it would help to standardize.
  9. [References] Several citations point to blog posts (xAI 2025, Qwen Team 2024/2025, Anthropic 2025, DeepMind 2025, Meta-AI 2025, OpenAI 2024/2025) without dates of access. Add accessed-on dates given that these URLs are mutable.

Simulated Author's Rebuttal

5 responses · 2 unresolved

We thank the referee for a careful and constructive report. The five major comments target real gaps in the manuscript — particularly around the cost decomposition of on-policy distillation, decontamination and seed variance for the small-scale reasoning-RL run, missing data-mixture ablations, the reasoning-vs-versatility trade-off introduced by Stages 3–4, and the under-explained RULER regression in thinking mode. We accept all five and will revise the manuscript accordingly. Below we respond point by point, indicating in each case what additional measurements or text we will add. We note honestly that some requested numbers (e.g., multi-seed GRPO at 235B scale) are not feasible to produce at flagship scale within a revision cycle; where that is the case we will be explicit about the limitation rather than paper over it. We also agree with the referee's overall framing that this report's contribution is primarily as an artifact paper documenting a substantive open-weight release, and we will tighten the language around novelty claims accordingly.

point-by-point responses
  1. Referee: §4.7 / Table 21: The '1/10 GPU hours' on-policy distillation vs. RL claim is not decomposed: teacher inference cost, hardware, prompt/rollout budget, and whether the RL arm shares the §4.2 query set are unspecified, conflating distillation efficiency with externalized teacher compute.

    Authors: We agree this comparison needs a clearer accounting and will revise §4.7 accordingly. Concretely, in the next version we will (i) state the hardware (H800 nodes) and the per-arm wall-clock and accelerator-hour breakdown including teacher forward passes, (ii) report that on-policy distillation of Qwen3-8B used Qwen3-32B as the teacher (not 235B-A22B; the 235B teacher is used selectively for larger students), so per-step teacher inference is roughly 4× the student forward and is included in the 1,800 GPU-hour figure, (iii) clarify that the RL arm used the same §4.2 query-verifier pool augmented with code queries, the same rollout count, and the same context length as the distillation arm, and (iv) note that even when the teacher's pre-training cost is amortized in, the marginal cost to obtain the distilled 8B is well below the RL run. We acknowledge that 'cheaper because the teacher exists' is a fair characterization; the claim we wish to defend is the marginal-cost claim, and we will reframe the text to that effect. revision: yes

  2. Referee: §4.2 Reasoning RL: please describe decontamination against AIME'24/'25, MATH-500, LiveCodeBench v5 and the §4.1 validation set; report seed variance; and clarify whether '170 steps' corresponds to the 17,920 GPU-hours in Table 21.

    Authors: (a) Decontamination: our pretraining and post-training pipelines apply n-gram (13-gram) and embedding-based near-duplicate filtering against the public test sets named by the referee, plus the validation queries reserved in §4.1; we will add an explicit subsection describing the filter, the match thresholds, and the number of removed candidates. (b) Seed variance: we did not run multiple full-scale GRPO seeds for the 235B model due to cost, but we have 3-seed runs at the 32B scale on AIME'24/'25 and MATH-500 and will add these as a variance table; we will be transparent that 235B numbers are single-seed. (c) The '170 steps' figure refers to the Qwen3-235B-A22B reasoning-RL run and is a different run from Table 21, which compares 8B distillation against 8B RL; we will rewrite the sentence to remove the implied linkage and report the GPU-hours of the 235B RL run separately. revision: yes

  3. Referee: §3 Pre-training data: no quantitative ablation is given for the synthetic-data fraction or the instance-level mixture optimization, so the Qwen3-vs-Qwen2.5 gap cannot be attributed.

    Authors: This is a fair criticism. We did run small-proxy ablations during data development — specifically (i) S1 training of ~1B and ~3B proxies with vs. without the Qwen2.5-Math/Coder synthetic component, and (ii) uniform-domain vs. instance-level optimized mixtures at the same proxy scales — and these informed the final recipe. We will add a table summarizing these proxy-scale ablations on MMLU, GSM8K, HumanEval, and MMLU-Pro in the revised §3.1, with the caveat that they were conducted at proxy scale and we did not re-run them at flagship scale. We will also temper the language so that 'instance-level granularity is the right choice' is presented as supported at proxy scale rather than at 235B scale. revision: yes

  4. Referee: §4.7 Table 22: Stages 3 and 4 degrade thinking-mode AIME'24 (83.8→81.4) and LiveCodeBench v5 (68.4→65.7) on Qwen3-32B; please quantify the gap to a Stage-2-only checkpoint on the full reasoning suite so users can decide whether to deploy that artifact.

    Authors: We accept the framing concern. In the revision we will add a table reporting the Stage-2-only Qwen3-32B checkpoint across the full Math & Text Reasoning and Agent & Coding suites used in Table 13 (AIME'24/'25, MATH-500, LiveCodeBench v5, CodeForces, ZebraLogic, AutoLogi, BFCL v3), alongside the released Stage-4 model in thinking mode. This will let downstream users quantify the 'versatility tax' on each benchmark. We will also revise §1 and §4.6 to state explicitly that the unified Stage-4 model trades a small amount of pure-reasoning headroom for instruction-following, agent, and non-thinking competence, rather than implying parity with reasoning-specialized checkpoints. We do not, however, plan to release the Stage-2-only artifact as a separate model in this release cycle. revision: yes

  5. Referee: Appendix A.1.1 / Table 23: thinking mode degrades RULER at every scale; the one-paragraph hypothesis is insufficient. Was the 8192 thinking budget tuned? Does the gap close at budget=0 or widen with longer budgets? Is the YARN factor identical across modes?

    Authors: Thank you — these are the right diagnostic questions. To address them: (i) the YARN scaling factor of 4 is identical across thinking and non-thinking modes; we will state this explicitly in the caption. (ii) The 8192 thinking budget was a single chosen operating point, not a tuned optimum; we did spot-check 4096 and 16384 on RULER 64K/128K and observed that performance is approximately monotone-decreasing with larger thinking budgets on retrieval-heavy slices, consistent with our 'thinking interferes with retrieval' hypothesis. We will add a small sweep (budget ∈ {0, 4096, 8192, 16384}) for Qwen3-32B and Qwen3-235B-A22B at 64K and 128K. (iii) Budget=0 is operationally close to non-thinking mode (empty <think></think> block) and we will report it explicitly so the reader can see the regression collapses. We will soften the abstract's joint framing of 128K and thinking mode to acknowledge that on pure needle-style retrieval the non-thinking path is currently preferable. revision: yes
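
The 13-gram filter mentioned in response 2 could look roughly like the sketch below. It is an illustration only: whitespace tokenization, lowercasing, and the rule that a single shared n-gram removes a candidate are assumptions, and the embedding-based near-duplicate filter is not sketched.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """All word-level n-grams of a text (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(candidates: Iterable[str], test_items: Iterable[str], n: int = 13) -> List[str]:
    """Drop every candidate that shares at least one n-gram with any test item."""
    blocked: Set[Tuple[str, ...]] = set()
    for item in test_items:
        blocked |= ngrams(item, n)
    return [c for c in candidates if not (ngrams(c, n) & blocked)]


if __name__ == "__main__":
    # n=3 keeps the toy example short; the rebuttal mentions 13-grams.
    kept = decontaminate(
        ["please find the value now", "compute the area"],
        ["find the value of x"],
        n=3,
    )
    print(kept)
```

Reporting the number of removed candidates, as the authors promise, only requires comparing the lengths of the input and output lists.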

standing simulated objections not resolved
  • Multi-seed variance for the full-scale Qwen3-235B-A22B GRPO run is not feasible to produce within a revision cycle; we will report multi-seed variance only at 32B scale and flag the 235B numbers as single-seed.
  • We will not release a separate Stage-2-only ('reasoning-only') Qwen3-32B checkpoint in this release cycle, though we will report its evaluation numbers so users can assess the trade-off.

Circularity Check

1 steps flagged

No significant circularity: Qwen3's claims are evaluated against external public benchmarks and third-party baselines, not against author-defined targets.

specific steps
  1. self citation load bearing [§4.1 Long-CoT Cold Start; §4.4 General RL (rule/judge model)]
    "we use Qwen2.5-72B-Instruct to identify and remove queries that are not easily verifiable... we provide a reference answer for each query and prompt Qwen2.5-72B-Instruct to score the model's response based on this reference."

    Data curation and one of three RL reward signals depend on a prior author model (Qwen2.5-72B-Instruct) as judge/filter. This is a mild self-referential dependency in the training pipeline, but the final claims are still evaluated on external benchmarks against external baselines, so it does not make any reported headline number circular by construction. Noted as minor, not load-bearing for the externally-scored claims.

full rationale

This is an engineering/system technical report, not a derivation paper. The paper's central claims, namely the performance of Qwen3 models in thinking and non-thinking modes, are evaluated against widely used external benchmarks (MMLU, MMLU-Pro, GPQA, MATH-500, AIME'24/'25, LiveCodeBench, BFCL v3, CodeForces Elo, Arena-Hard, IFEval, RULER, Belebele, INCLUDE, MMMLU, etc.) and compared to non-author baselines (DeepSeek-R1/V3, Llama-4, Gemma-3, GPT-4o, OpenAI-o1/o3-mini, Gemini 2.5-Pro, Grok-3). These are externally falsifiable and not defined in terms of the model itself, so the headline claims are not circular by construction. The skeptical reader's concern about the 1,800 vs 17,920 GPU-hours comparison in Table 21 (on-policy distillation vs RL) is a legitimate methodology/correctness concern (under-specified accounting, possibly excluding teacher-inference cost, unequal prompt budgets), but it is not a circularity issue in the technical sense defined here: the GPU-hour numbers are not defined in terms of each other, and the AIME/MATH/LiveCodeBench scores being compared are external benchmarks. The risk is that the accounting is not apples-to-apples, not that there is a self-definitional or fitted-input-renamed-as-prediction loop. Minor self-citation does occur: (i) data filtering and judging uses Qwen2.5-72B-Instruct (§4.1, §4.4) including as a model-based reward judge for RL, which creates a teacher/judge dependency on a sibling model but does not make any specific external benchmark claim circular; (ii) the Strong-to-Weak Distillation efficacy argument relies on Qwen3-32B / Qwen3-235B-A22B as teachers for the smaller Qwen3 models (§4.5), but the resulting students are still scored on external benchmarks against external baselines. These are normal pipeline self-references, not load-bearing circular derivations. Overall score: 1.
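
The GPU-hour comparison is simple arithmetic, sketched below. The two totals come from Table 21 as quoted above; the `extra_distill_hours` parameter is hypothetical and only shows how the ratio would move if some cost (for example teacher inference) turned out not to be counted in the distillation figure.

```python
def cost_ratio(rl_gpu_hours: float, distill_gpu_hours: float, extra_distill_hours: float = 0.0) -> float:
    """Ratio of RL cost to distillation cost.

    `extra_distill_hours` is a hypothetical adjustment for any cost not
    already included in the reported distillation total.
    """
    total_distill = distill_gpu_hours + extra_distill_hours
    if total_distill <= 0:
        raise ValueError("distillation cost must be positive")
    return rl_gpu_hours / total_distill


if __name__ == "__main__":
    # Reported totals: 17,920 GPU-hours (RL) vs 1,800 GPU-hours (distillation).
    print(f"reported ratio: {cost_ratio(17_920, 1_800):.2f}x")
    # Hypothetical: if an uncounted cost equal to the reported total were added.
    print(f"adjusted ratio: {cost_ratio(17_920, 1_800, extra_distill_hours=1_800):.2f}x")
```

The reported totals give a ratio just under 10, consistent with the "roughly a tenth" phrasing; the adjusted case illustrates why the decomposition the referee asks for matters.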

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Model omitted the axiom ledger; defaulted for pipeline continuity.

pith-pipeline@v0.9.0 · 10251 in / 7356 out tokens · 114735 ms · 2026-05-09T00:14:54.655382+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

    cs.AI 2026-05 unverdicted novelty 8.0

    RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

  3. Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

    cs.GR 2026-05 unverdicted novelty 8.0

    Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...

  4. Large Language Models Lack Temporal Awareness of Medical Knowledge

    cs.LG 2026-05 unverdicted novelty 8.0

    LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

  5. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  6. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  7. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  8. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  9. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  10. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  11. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

    cs.CL 2026-05 unverdicted novelty 8.0

    Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

  12. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  13. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  14. The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

    cs.LG 2026-05 unverdicted novelty 8.0

    Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...

  15. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  16. LLM Translation of Compiler Intermediate Representation

    cs.PL 2026-05 unverdicted novelty 8.0

    IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

  17. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  18. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  19. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  20. SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

    cs.CR 2026-04 unverdicted novelty 8.0

    The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...

  21. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  22. MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

    cs.CE 2026-04 unverdicted novelty 8.0

    MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.

  23. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  24. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  25. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 8.0

    VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

  26. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  27. MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....

  28. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  29. InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

    cs.CL 2026-04 unverdicted novelty 8.0

    InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

  30. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  31. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

    cs.CL 2026-04 conditional novelty 8.0

    Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

  32. NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

    cs.CL 2026-04 unverdicted novelty 8.0

    NovBench is the first large-scale benchmark with 1,684 expert-annotated pairs to evaluate LLMs on assessing academic paper novelty via a four-dimensional framework of Relevance, Correctness, Coverage, and Clarity.

  33. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...

  34. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...

  35. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  36. ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

    cs.CV 2026-04 unverdicted novelty 8.0

    ReconPhys is the first feedforward neural network that jointly reconstructs 3D geometry and appearance via Gaussian Splatting while estimating physical attributes from a single monocular video using self-supervised training.

  37. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  38. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  39. SEVerA: Verified Synthesis of Self-Evolving Agents

    cs.LG 2026-03 unverdicted novelty 8.0

    SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

  40. MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

  41. RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

    cs.LG 2026-05 unverdicted novelty 7.0

    RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

  42. Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization

    cs.IR 2026-05 unverdicted novelty 7.0

    AsymRec decouples input and output representations in generative recommendation via multi-expert semantic projection and multi-faceted hierarchical quantization, outperforming prior models by 15.8% on average.

  43. From Table to Cell: Attention for Better Reasoning with TABALIGN

    cs.AI 2026-05 unverdicted novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...

  44. GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

    cs.CY 2026-05 unverdicted novelty 7.0

    A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.

  45. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  46. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  47. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  48. GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

    cs.AI 2026-05 conditional novelty 7.0

    GenCircuit-RL uses hierarchical verification rewards and curriculum learning in RL to generate correct genetic circuit code in SBOL, improving functional task success by 14-16 points and generalizing to novel biologic...

  49. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  50. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

    cs.AI 2026-05 conditional novelty 7.0

    SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.

  51. WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

    cs.CL 2026-05 conditional novelty 7.0

    WARDEN achieves better transcription and translation for Wardaman than larger models by separating the tasks and using Sundanese initialization plus a domain dictionary with just 6 hours of data.

  52. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  53. Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

    cs.AR 2026-05 unverdicted novelty 7.0

    Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.

  54. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

    cs.RO 2026-05 unverdicted novelty 7.0

    BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

  55. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  56. What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

    cs.CL 2026-05 accept novelty 7.0

    Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.

  57. HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

    cs.RO 2026-05 unverdicted novelty 7.0

    HCSG combines geometric forecasting of human pose and trajectory with VLM-generated semantic descriptions of intentions, fused into a topological map with a social distance loss, yielding 14% higher success rate and 3...

  58. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  59. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.

  60. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% via embedding-reranker plus iterative sketch-retrieve-reflect retrieval.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1179 Pith papers · 18 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,

  2. [2]

    Training-free long-context scaling of large language models

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. CoRR, abs/2402.17463,

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732,

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,

  6. [6]

    The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884,

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

  9. [9]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066,

  10. [10]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Ri...

  11. [11]

    SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines

    Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739,

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Betha...

  13. [13]

    DoGE: Domain reweighting with generalization estimation

    Simin Fan, Matteo Pagliardini, and Martin Jaggi. DoGE: Domain reweighting with generalization estimation. arXiv preprint arXiv:2310.15393,

  14. [14]

    Are we done with MMLU?

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with MMLU? CoRR, abs/2406.04127,

  15. [15]

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065,

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,

  17. [17]

    Multi-IF: Benchmarking LLMs on multi-turn and multilingual instructions following

    Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, et al. Multi-IF: Benchmarking LLMs on multi-turn and multilingual instructions following. arXiv preprint arXiv:2410.15553,

  18. [18]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH ...

  19. [19]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186,

  20. [20]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974,

  21. [21]

    Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and efficient pre-LN Transformers. CoRR, abs/2305.14858,

  22. [22]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

  23. [23]

    From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939,

  24. [24]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. CoRR, abs/2305.20050,

  25. [25]

    ZebraLogic: On the scaling limits of LLMs for logical reasoning

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. CoRR, abs/2502.01100,

  26. [26]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language mode...

  27. [27]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071,

  28. [28]

    Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models

    Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models. CoRR, abs/2501.11873,

  29. [29]

    CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings

    Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings. CoRR, abs/2501.01257,

  30. [30]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

  31. [31]

    Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fe...

  32. [32]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300,

  33. [33]

    Linguistic generalizability of test-time scaling in mathematical reasoning

    Guijin Son, Jiwoo Hong, Hyunwoo Ko, and James Thorne. Linguistic generalizability of test-time scaling in mathematical reasoning. CoRR, abs/2502.17407,

  34. [34]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,

  35. [35]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574,

  36. [36]

    LiveBench: A challenging, contamination-free LLM benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark. CoRR, abs/2406.19314,

  37. [37]

    WritingBench: A comprehensive benchmark for generative writing

    Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang. WritingBench: A comprehensive benchmark for generative writing. CoRR, abs/2503.05244,

  38. [38]

    Effective Long-Context Scaling of Foundation Models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoR...

  39. [39]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

  40. [40]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

  41. [41]

    AutoLogi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models

    Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, and Junyang Lin. AutoLogi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models. CoRR, abs/2502.16906,