Recognition: 3 theorem links
· Lean Theorem
Qwen3 Technical Report
Pith reviewed 2026-05-09 00:14 UTC · model claude-opus-4-7
The pith
Qwen3 packs explicit chain-of-thought reasoning and direct answering into one model, with a user-controlled token budget that smoothly trades inference cost for accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that a single language model can be trained to operate in two regimes — a deliberate "thinking" mode that produces explicit chain-of-thought, and a fast "non-thinking" mode that answers directly — with the choice controlled by a flag in the chat template and the depth of reasoning controlled by a user-set token budget. The authors claim this fusion does not require two separate models, that the budget knob produces smooth, monotonic accuracy gains as more thinking tokens are allowed, and that strong-to-weak distillation from the flagship models lets small dense and Mixture-of-Experts variants inherit both modes at roughly one tenth the GPU cost of running the full four-stage post-training pipeline at every scale.
What carries the argument
A four-stage post-training pipeline applied to flagship models — long-CoT cold start, reasoning RL with GRPO on ~4k verifier-checked queries, thinking-mode fusion via SFT on mixed /think and /no_think data with a chat-template flag, and general RL with rule-based, reference-based, and preference-based rewards — combined with a budget-control trick that injects a stop-thinking instruction once a user-set token cap is hit. Smaller models (0.6B to 14B dense, plus a 30B-A3B MoE) skip stages 1–4 and instead receive off-policy then on-policy logit distillation from the 32B and 235B-A22B teachers.
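The budget-control trick above can be sketched as a wrapper around the decoding loop. Everything here is illustrative: `step_fn` stands in for one decoder step, and the stop text echoes the English instruction the report quotes, but this is a sketch of the mechanism, not Qwen3's implementation (which would keep decoding the answer after the injection rather than stopping).

```python
STOP_INSTRUCTION = (
    "Considering the limited time by the user, I have to give the "
    "solution based on the thinking directly now.\n</think>\n\n"
)

def generate_with_budget(step_fn, prompt_tokens, thinking_budget, max_new_tokens=2048):
    """Decode token by token; once `thinking_budget` thinking tokens are
    spent, inject a stop-thinking instruction so the model exits its
    <think> block and answers directly.

    `step_fn(tokens) -> next_token` stands in for one decoder step.
    """
    tokens = list(prompt_tokens)
    thinking_tokens = 0
    injected = False
    for _ in range(max_new_tokens):
        tok = step_fn(tokens)
        tokens.append(tok)
        if tok == "</think>":  # model closed its own reasoning early
            break
        thinking_tokens += 1
        if thinking_tokens >= thinking_budget and not injected:
            # Because fusion SFT trained on truncated traces, this early
            # stop is in-distribution rather than a prompt hack.
            tokens.append(STOP_INSTRUCTION)
            injected = True
            break  # a real loop would continue decoding the answer here
    return tokens, injected
```

The key design point is that the cap is enforced by the serving loop, not by the model, which is what makes inference cost a per-query dial.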
If this is right
- Deployments no longer need to route between a chat model and a separate reasoning model; a single checkpoint with a /think or /no_think flag covers both regimes.
- Inference cost becomes a tunable dial: operators can cap thinking tokens per query and trade latency for accuracy without retraining.
- Small models (0.6B–14B) can acquire reasoning behaviour from a larger teacher via off-policy then on-policy logit distillation at roughly a tenth of the GPU cost of running full RL post-training on each size.
- Pretraining at 36T tokens across 119 languages, with synthetic data from domain-specialist Qwen2.5 variants, is sufficient to let a 22B-activated MoE match or beat a 671B-parameter MoE on most reported benchmarks.
- Fusing non-thinking data into a reasoning-RL'd model degrades peak AIME and LiveCodeBench scores slightly (Table 22), suggesting an explicit accuracy–versatility trade-off that future work will have to price.
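The flag-in-template mechanism from the first bullet above can be sketched as a small prompt assembler. The `<|im_start|>` delimiters and the empty `<think>` pre-fill for non-thinking mode are assumptions consistent with the report's description, not Qwen3's verbatim template.

```python
def build_prompt(messages, enable_thinking=True):
    """Render a chat into a single prompt string. A /think or /no_think
    flag in the final user turn overrides the `enable_thinking` default,
    mirroring the report's description of mode switching."""
    last_user = messages[-1]["content"]
    if "/no_think" in last_user:
        enable_thinking = False
    elif "/think" in last_user:
        enable_thinking = True
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant")
    if not enable_thinking:
        # Non-thinking mode: pre-fill an empty think block so the model
        # answers directly instead of producing chain-of-thought.
        parts.append("<think>\n\n</think>\n")
    return "\n".join(parts)
```

This is why a single checkpoint suffices: routing between regimes is a template decision made at request time, not a model-selection decision.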
Where Pith is reading between the lines
- Editorial: the budget mechanism works because the model has been trained on truncated thinking traces during fusion, so 'stop thinking now' is in-distribution rather than a brittle prompt hack — which predicts the trick will not transfer cleanly to models that were not trained with the same template.
- Editorial: the reported drop in AIME and LiveCodeBench after stages 3–4 (Table 22) suggests the thinking and non-thinking objectives are partially adversarial, and that frontier reasoning scores will increasingly come from specialist checkpoints even as user-facing products ship the fused model.
- Editorial: on-policy distillation beating direct RL at a tenth of the cost (Table 21), including on pass@64, hints that for current open models the bottleneck is exploration quality rather than reward signal — a stronger teacher's logits encode useful exploration that GRPO from scratch fails to discover.
- Editorial: the long-context evaluation on RULER shows thinking mode underperforming non-thinking mode at 128k, indicating that chain-of-thought currently hurts pure retrieval and that future work will need to gate when to think rather than how long to think.
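The first editorial point (that truncated thinking traces seen during fusion SFT make early stopping in-distribution) can be sketched as a target-construction helper. Whitespace tokenization and the generic `stop_instruction` argument are simplifying assumptions; the report does not publish its exact trace-construction recipe.

```python
def truncate_for_fusion(thinking_trace, answer, budget, stop_instruction):
    """Build a fusion-SFT target whose thinking is cut at `budget` tokens,
    followed by the stop instruction and the final answer, so that an
    injected early stop at inference time looks like training data."""
    toks = thinking_trace.split()
    if len(toks) <= budget:
        body = " ".join(toks)
    else:
        body = " ".join(toks[:budget]) + " " + stop_instruction
    return f"<think>\n{body}\n</think>\n\n{answer}"
```

If this reading is right, bolting the same stop instruction onto a model never trained on such targets should work noticeably worse, which is a testable prediction.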
Load-bearing premise
That a single model can be trained to do both careful step-by-step reasoning and fast direct answering without one mode quietly degrading the other — the paper's own ablation already shows AIME and LiveCodeBench scores dropping after the fusion and general-RL stages, so the claim rests on the bet that the lost peak reasoning is a worthwhile price for unified deployment.
What would settle it
Run the released Qwen3-235B-A22B with a sweep of thinking-budget caps on AIME'24/'25, LiveCodeBench, and GPQA; if accuracy does not rise smoothly with budget, or if forcing /no_think collapses scores below the matched non-thinking baselines reported in Tables 11–12, the unified-mode and budget-control claims fail. Equivalently, distill an 8B student from the same off-policy checkpoint and check whether on-policy logit distillation reproduces the reported ~10x GPU-hour saving and the pass@64 gain on AIME over direct RL (Table 21).
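The proposed budget sweep reduces to a monotonicity check over accuracies. In this sketch, `evaluate` is a hypothetical stand-in for a full benchmark run at one budget setting; the tolerance parameter acknowledges run-to-run noise.

```python
def is_monotone_in_budget(evaluate, budgets, tolerance=0.0):
    """Return (ok, scores): ok is False as soon as accuracy drops by more
    than `tolerance` when the thinking budget increases, which would
    falsify the smooth-scaling claim for this benchmark.

    `evaluate(budget) -> accuracy in [0, 1]` is assumed to wrap a full
    benchmark run (e.g. AIME'24 at a fixed sampling configuration).
    """
    scores = [evaluate(b) for b in sorted(budgets)]
    ok = all(b >= a - tolerance for a, b in zip(scores, scores[1:]))
    return ok, scores
```

A single non-monotone benchmark would not sink the claim on its own, but systematic drops at higher budgets would.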
read the original abstract
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Qwen3, a family of dense (0.6B–32B) and MoE (30B-A3B, 235B-A22B) open-weight LLMs. The principal contributions are: (i) a unified architecture that integrates "thinking" and "non-thinking" modes within a single model with a user-controllable thinking budget; (ii) a 36T-token, 119-language pretraining corpus assembled with PDF OCR via Qwen2.5-VL and synthetic data from Qwen2.5-Math/Coder, plus instance-level data-mixture optimization; (iii) a four-stage post-training pipeline (long-CoT cold start, reasoning RL with GRPO, thinking-mode fusion via SFT with /think and /no_think flags, and general RL with a multi-reward system); and (iv) a Strong-to-Weak Distillation pipeline (off-policy + on-policy logit distillation) that the authors argue obviates the four-stage pipeline for smaller models. Extensive benchmark tables compare Qwen3 against Qwen2.5, DeepSeek-V3/R1, Llama-4, Gemma-3, GPT-4o, o1/o3-mini, and Gemini-2.5-Pro across general, alignment, reasoning, coding, agent, multilingual (12 detailed languages plus Belebele), and long-context (RULER) benchmarks. Models are released under Apache 2.0.
Significance. This is a substantive open-weight model release accompanied by a well-organized technical report. If the released checkpoints reproduce the reported numbers, the contribution is significant: the flagship Qwen3-235B-A22B is competitive with proprietary frontier reasoning models on AIME/LiveCodeBench/CodeForces while activating only 22B parameters, and the unified thinking/non-thinking design with a budget mechanism is a clean engineering answer to the proliferation of separate "chat" and "reasoning" models. The thinking-budget scaling curves (Fig. 2) and the on-policy vs. RL comparison (Table 21) are concrete, falsifiable claims rather than narrative ones. Apache-2.0 release of all eight scales — including 0.6B and 1.7B variants that outperform similarly-sized Gemma-3 baselines — directly supports reproducibility and downstream research. The multilingual expansion (29→119 languages) and the Belebele evaluation across 80 languages provide genuine breadth beyond English/Chinese-centric reporting. The report stops short of frontier scientific novelty (most components — GRPO, GQA, QK-Norm, YARN, DCA, on-policy distillation — are adapted from prior work), but its value as an artifact paper documenting a substantive open-weight release stands on its own.
major comments (5)
- [§4.7 / Table 21] The headline '1/10 GPU hours' efficiency claim for on-policy distillation vs. reinforcement learning is load-bearing for the strong-to-weak pipeline (which is applied to six of the eight released models) and is not adequately decomposed. On-policy distillation requires forward passes through Qwen3-32B or Qwen3-235B-A22B for every student rollout to obtain teacher logits (§4.5), and at the 235B-A22B scale this teacher inference is comparable to or larger than the student's own forward/backward. The reported 1,800 vs. 17,920 GPU-hours does not state whether teacher-inference cost is included, what hardware was used, what the prompt budget and rollout counts were on each arm, or whether the RL arm used the same 3,995 query-verifier set as §4.2. Without this decomposition the comparison conflates 'cheaper because distilled' with 'cheaper because the teacher's compute is externalized.' Please provide the full per-arm accounting, including teacher-side inference.
- [§4.2, Reasoning RL] The reasoning-RL stage is reported to use only 3,995 query-verifier pairs and to lift Qwen3-235B-A22B AIME'24 from 70.1 to 85.1 over 170 GRPO steps. Given that this is a central efficiency claim and that contamination of small held-out math sets is a known risk, the report should (a) describe the decontamination procedure against AIME'24/'25, MATH-500, LiveCodeBench v5, and the validation queries reserved in §4.1; (b) report variance across seeds for at least one model; and (c) clarify whether the '170 steps' figure is comparable to the 17,920 GPU-hours in Table 21 or refers to a different run. Without these, the RL-vs-distillation comparison and the AIME headline scores are difficult to evaluate.
- [§3, Pre-training data] The 36T-token corpus is constructed in part from Qwen2.5-VL OCR of PDFs and from Qwen2.5-Math/Coder synthetic data, and instance-level data-mixture optimization is performed on 'small proxy models with fine-grained data labels.' No quantitative ablation is given for either the synthetic-data fraction or the instance-level mixture optimization, so the reader cannot assess how much of the Qwen3-vs-Qwen2.5 gap is attributable to architecture/training-strategy changes vs. data scale and synthetic data. A small-scale ablation (e.g., S1 with vs. without synthetic component, or with uniform vs. optimized mixture) would substantiate the claim that 'instance-level' mixture optimization is the right granularity, which §3.1 asserts but does not test.
- [§4.7, Table 22] Table 22 shows that Stage 3 (Thinking Mode Fusion) and Stage 4 (General RL) degrade performance in thinking mode on AIME'24 (83.8→81.4) and LiveCodeBench v5 (68.4→65.7) for Qwen3-32B, while improving general/alignment/agent metrics. The text frames this as an accepted trade-off 'to enhance the model's overall versatility.' This is a legitimate design choice, but it directly tensions the framing in §1 and §4.6 that the unified model matches dedicated reasoning models. Please quantify the gap relative to a Stage-2-only checkpoint (i.e., a 'reasoning-only' Qwen3-32B) on the full reasoning suite, so that users who do not need non-thinking mode can judge whether to deploy the Stage-2 artifact.
- [Appendix A.1.1, Table 23] Long-context results show that thinking mode degrades RULER performance relative to non-thinking mode at every scale (e.g., Qwen3-235B-A22B 95.0 → 92.2 average), and the authors hypothesize that thinking 'may interfere with the retrieval process.' Given that the abstract emphasizes 128K context and thinking mode as joint selling points, this regression deserves more than a one-paragraph hypothesis. Please report whether the thinking-budget setting (8192) was tuned, whether the degradation persists with budget=0 (which should reduce to non-thinking) or increases with longer budgets, and whether the YARN scaling factor=4 is identical across the two modes.
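To make the decomposition concern in the first major comment concrete, a toy accounting shows how much the headline ratio depends on whether teacher inference is counted. The 4x teacher-to-student cost ratio is a placeholder assumption; only the 1,800 and 17,920 GPU-hour totals come from Table 21.

```python
def distillation_cost(student_hours, teacher_forward_ratio, include_teacher=True):
    """Total GPU-hours for on-policy distillation.

    `teacher_forward_ratio` is the cost of one teacher forward pass
    relative to one student forward/backward step. If teacher inference
    is excluded from a reported figure, the comparison silently
    externalizes that compute onto the teacher.
    """
    teacher_hours = student_hours * teacher_forward_ratio if include_teacher else 0.0
    return student_hours + teacher_hours

# Reported totals from Table 21 (RL arm vs. distillation arm).
RL_HOURS, DISTILL_HOURS = 17_920, 1_800

ratio_reported = RL_HOURS / DISTILL_HOURS                          # ~10x headline
ratio_if_excluded = RL_HOURS / distillation_cost(DISTILL_HOURS, 4.0)  # ~2x if 1,800 were student-only
```

Under these illustrative numbers the claim survives either way (distillation stays cheaper), but the headline shrinks from roughly 10x to roughly 2x, which is exactly why the referee asks for the breakdown.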
minor comments (9)
- [Abstract / §2] Tokenizer vocabulary size is stated as 151,669 in §2 but the abstract and introduction never mention tokenizer changes relative to Qwen2.5 (which used 151,643). A one-line note on whether tokens were added (and for which languages) would help readers interpret the multilingual claims.
- [§2, Tables 1–2] The dense-model table omits hidden size, FFN size, and total non-embedding parameter counts; the MoE table omits expert FFN size and routing details (top-k, auxiliary-loss coefficient for the global-batch load-balancing loss of Qiu et al. 2025). These should be included for reproducibility.
- [§4.6] Sampling settings are given (temperature 0.6/0.7, top-p, top-k, presence penalty), but the number of samples per query is specified only for GPQA (10) and AIME (64). Please state n for the other benchmarks, especially LiveCodeBench and CodeForces, where pass@1 vs. pass@k materially affects rankings.
- [Figure 1] The figure caption is illegible due to encoded glyphs ('4UBHF...3-'). Replace with a clean rendering.
- [§4.3, Thinking budget] The stop-thinking instruction is hard-coded in English ('Considering the limited time by the user, ...'). For a 119-language model, please clarify whether non-English queries receive a localized stop instruction and whether this affects the multilingual thinking-budget curves.
- [§4.4] The reward-model-without-reference is trained on 'human preference data' but no description is given of dataset size, annotator pool, or evaluation of reward-model accuracy. Even a brief paragraph would aid reproducibility.
- [Tables 11–20] Several baselines have missing entries (e.g., Grok-3-Beta MMLU-Redux, Gemini-2.5-Pro MATH-500). Indicate whether these are unreported by the source or were not run, and the source for each cell.
- [§3.3] The sentence 'Even with 1/10 of the activated parameters of the Qwen2.5 dense base model' (Qwen3-MoE conclusion 2c) refers to Qwen3-30B-A3B vs. Qwen2.5-32B; '1/10 of activated' is approximately 3B/32B, which is 1/10.7 — fine, but elsewhere similar ratios are quoted as '1/5' and '1/3,' so it would help to standardize.
- [References] Several citations point to blog posts (xAI 2025, Qwen Team 2024/2025, Anthropic 2025, DeepMind 2025, Meta-AI 2025, OpenAI 2024/2025) without dates of access. Add accessed-on dates given that these URLs are mutable.
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The five major comments target real gaps in the manuscript — particularly around the cost decomposition of on-policy distillation, decontamination and seed variance for the small-scale reasoning-RL run, missing data-mixture ablations, the reasoning-vs-versatility trade-off introduced by Stages 3–4, and the under-explained RULER regression in thinking mode. We accept all five and will revise the manuscript accordingly. Below we respond point by point, indicating in each case what additional measurements or text we will add. We note honestly that some requested numbers (e.g., multi-seed GRPO at 235B scale) are not feasible to produce at flagship scale within a revision cycle; where that is the case we will be explicit about the limitation rather than paper over it. We also agree with the referee's overall framing that this report's contribution is primarily as an artifact paper documenting a substantive open-weight release, and we will tighten the language around novelty claims accordingly.
read point-by-point responses
-
Referee: §4.7 / Table 21: The '1/10 GPU hours' on-policy distillation vs. RL claim is not decomposed: teacher inference cost, hardware, prompt/rollout budget, and whether the RL arm shares the §4.2 query set are unspecified, conflating distillation efficiency with externalized teacher compute.
Authors: We agree this comparison needs a clearer accounting and will revise §4.7 accordingly. Concretely, in the next version we will (i) state the hardware (H800 nodes) and the per-arm wall-clock and accelerator-hour breakdown including teacher forward passes, (ii) report that on-policy distillation of Qwen3-8B used Qwen3-32B as the teacher (not 235B-A22B; the 235B teacher is used selectively for larger students), so per-step teacher inference is roughly 4× the student forward and is included in the 1,800 GPU-hour figure, (iii) clarify that the RL arm used the same §4.2 query-verifier pool augmented with code queries, the same rollout count, and the same context length as the distillation arm, and (iv) note that even when the teacher's pre-training cost is amortized in, the marginal cost to obtain the distilled 8B is well below the RL run. We acknowledge that 'cheaper because the teacher exists' is a fair characterization; the claim we wish to defend is the marginal-cost claim, and we will reframe the text to that effect. revision: yes
-
Referee: §4.2 Reasoning RL: please describe decontamination against AIME'24/'25, MATH-500, LiveCodeBench v5 and the §4.1 validation set; report seed variance; and clarify whether '170 steps' corresponds to the 17,920 GPU-hours in Table 21.
Authors: (a) Decontamination: our pretraining and post-training pipelines apply n-gram (13-gram) and embedding-based near-duplicate filtering against the public test sets named by the referee, plus the validation queries reserved in §4.1; we will add an explicit subsection describing the filter, the match thresholds, and the number of removed candidates. (b) Seed variance: we did not run multiple full-scale GRPO seeds for the 235B model due to cost, but we have 3-seed runs at the 32B scale on AIME'24/'25 and MATH-500 and will add these as a variance table; we will be transparent that 235B numbers are single-seed. (c) The '170 steps' figure refers to the Qwen3-235B-A22B reasoning-RL run and is a different run from Table 21, which compares 8B distillation against 8B RL; we will rewrite the sentence to remove the implied linkage and report the GPU-hours of the 235B RL run separately. revision: yes
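The 13-gram filter described in (a) can be sketched as a set-intersection check. Whitespace tokenization and the drop-on-any-match rule are simplifying assumptions; a production pipeline would normalize text and, as the response notes, add embedding-based near-duplicate filtering.

```python
def ngrams(text, n=13):
    """All word n-grams of a text, lowercased, as a set of tuples."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_queries, test_sets, n=13):
    """Drop any training query that shares an n-gram with any test item.

    Returns (kept_queries, number_removed) so the removal count can be
    reported, as the proposed revision promises.
    """
    banned = set()
    for test_items in test_sets:
        for item in test_items:
            banned |= ngrams(item, n)
    kept, removed = [], 0
    for q in train_queries:
        if ngrams(q, n) & banned:
            removed += 1
        else:
            kept.append(q)
    return kept, removed
```

The choice of n trades recall against false positives: 13 words is long enough that accidental overlap in ordinary prose is rare, while verbatim benchmark leakage is almost always longer.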
-
Referee: §3 Pre-training data: no quantitative ablation is given for the synthetic-data fraction or the instance-level mixture optimization, so the Qwen3-vs-Qwen2.5 gap cannot be attributed.
Authors: This is a fair criticism. We did run small-proxy ablations during data development — specifically (i) S1 training of ~1B and ~3B proxies with vs. without the Qwen2.5-Math/Coder synthetic component, and (ii) uniform-domain vs. instance-level optimized mixtures at the same proxy scales — and these informed the final recipe. We will add a table summarizing these proxy-scale ablations on MMLU, GSM8K, HumanEval, and MMLU-Pro in the revised §3.1, with the caveat that they were conducted at proxy scale and we did not re-run them at flagship scale. We will also temper the language so that 'instance-level granularity is the right choice' is presented as supported at proxy scale rather than at 235B scale. revision: yes
-
Referee: §4.7 Table 22: Stages 3 and 4 degrade thinking-mode AIME'24 (83.8→81.4) and LiveCodeBench v5 (68.4→65.7) on Qwen3-32B; please quantify the gap to a Stage-2-only checkpoint on the full reasoning suite so users can decide whether to deploy that artifact.
Authors: We accept the framing concern. In the revision we will add a table reporting the Stage-2-only Qwen3-32B checkpoint across the full Math & Text Reasoning and Agent & Coding suites used in Table 13 (AIME'24/'25, MATH-500, LiveCodeBench v5, CodeForces, ZebraLogic, AutoLogi, BFCL v3), alongside the released Stage-4 model in thinking mode. This will let downstream users quantify the 'versatility tax' on each benchmark. We will also revise §1 and §4.6 to state explicitly that the unified Stage-4 model trades a small amount of pure-reasoning headroom for instruction-following, agent, and non-thinking competence, rather than implying parity with reasoning-specialized checkpoints. We do not, however, plan to release the Stage-2-only artifact as a separate model in this release cycle. revision: yes
-
Referee: Appendix A.1.1 / Table 23: thinking mode degrades RULER at every scale; the one-paragraph hypothesis is insufficient. Was the 8192 thinking budget tuned? Does the gap close at budget=0 or widen with longer budgets? Is the YARN factor identical across modes?
Authors: Thank you — these are the right diagnostic questions. To address them: (i) the YARN scaling factor of 4 is identical across thinking and non-thinking modes; we will state this explicitly in the caption. (ii) The 8192 thinking budget was a single chosen operating point, not a tuned optimum; we did spot-check 4096 and 16384 on RULER 64K/128K and observed that performance is approximately monotone-decreasing with larger thinking budgets on retrieval-heavy slices, consistent with our 'thinking interferes with retrieval' hypothesis. We will add a small sweep (budget ∈ {0, 4096, 8192, 16384}) for Qwen3-32B and Qwen3-235B-A22B at 64K and 128K. (iii) Budget=0 is operationally close to non-thinking mode (empty <think></think> block) and we will report it explicitly so the reader can see the regression collapses. We will soften the abstract's joint framing of 128K and thinking mode to acknowledge that on pure needle-style retrieval the non-thinking path is currently preferable. revision: yes
- Multi-seed variance for the full-scale Qwen3-235B-A22B GRPO run is not feasible to produce within a revision cycle; we will report multi-seed variance only at 32B scale and flag the 235B numbers as single-seed.
- We will not release a separate Stage-2-only ('reasoning-only') Qwen3-32B checkpoint in this release cycle, though we will report its evaluation numbers so users can assess the trade-off.
Circularity Check
No significant circularity: Qwen3's claims are evaluated against external public benchmarks and third-party baselines, not against author-defined targets.
specific steps
-
self-citation: load-bearing?
[§4.1 Long-CoT Cold Start; §4.4 General RL (rule/judge model)]
"we use Qwen2.5-72B-Instruct to identify and remove queries that are not easily verifiable... we provide a reference answer for each query and prompt Qwen2.5-72B-Instruct to score the model's response based on this reference."
Data curation and one of three RL reward signals depend on a prior author model (Qwen2.5-72B-Instruct) as judge/filter. This is a mild self-referential dependency in the training pipeline, but the final claims are still evaluated on external benchmarks against external baselines, so it does not make any reported headline number circular by construction. Noted as minor, not load-bearing for the externally-scored claims.
full rationale
This is an engineering/system technical report, not a derivation paper. The paper's central claims—performance of Qwen3 models in thinking and non-thinking modes—are evaluated against widely used external benchmarks (MMLU, MMLU-Pro, GPQA, MATH-500, AIME'24/'25, LiveCodeBench, BFCL v3, CodeForces Elo, Arena-Hard, IFEval, RULER, Belebele, INCLUDE, MMMLU, etc.) and compared to non-author baselines (DeepSeek-R1/V3, Llama-4, Gemma-3, GPT-4o, OpenAI-o1/o3-mini, Gemini 2.5-Pro, Grok-3). These are externally falsifiable and not defined in terms of the model itself, so the headline claims are not circular by construction. The reader's skeptic concern—about the 1,800 vs 17,920 GPU-hours comparison in Table 21 (on-policy distillation vs RL)—is a legitimate methodology/correctness concern (under-specified accounting, possibly excluding teacher-inference cost, unequal prompt budgets), but it is not a circularity issue in the technical sense defined here: the GPU-hour numbers are not defined in terms of each other, and the AIME/MATH/LiveCodeBench scores being compared are external benchmarks. It is an apples-to-apples accounting risk, not a self-definitional or fitted-input-renamed-as-prediction loop. Minor self-citation does occur: (i) data filtering and judging uses Qwen2.5-72B-Instruct (§4.1, §4.4) including as a model-based reward judge for RL, which creates a teacher/judge dependency on a sibling model but does not make any specific external benchmark claim circular; (ii) the Strong-to-Weak Distillation efficacy argument relies on Qwen3-32B / Qwen3-235B-A22B as teachers for the smaller Qwen3 models (§4.5), but the resulting students are still scored on external benchmarks against external baselines. These are normal pipeline self-references, not load-bearing circular derivations. Overall score: 1.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation (entire forcing chain is parameter-free; RS posture is zero adjustable parameters) · reality_from_one_distinction · unclear?
unclear: relation between the paper passage and the cited Recognition theorem.
The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion.
-
IndisputableMonolith.Foundation.DimensionForcing (8-tick period 2^D=8 from D=3 via Alexander duality) · eight_tick_forces_D3 · unclear?
unclear: relation between the paper passage and the cited Recognition theorem.
The Qwen3 MoE models have 128 total experts with 8 activated experts per token.
-
IndisputableMonolith.Cost.FunctionalEquation (J-cost uniqueness via Aczél) · washburn_uniqueness_aczel · unclear?
unclear: relation between the paper passage and the cited Recognition theorem.
We employ GRPO ... to update the model parameters. ... AIME'24 score of the Qwen3-235B-A22B model increases from 70.1 to 85.1 over a total of 170 RL training steps.
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
ViMU: Benchmarking Video Metaphorical Understanding
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
-
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...
-
Large Language Models Lack Temporal Awareness of Medical Knowledge
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents
The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6,500 questions, 13,000 chains, 113,910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
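The caching trick in the entry above admits a compact statement: once the teacher's per-token log-probabilities over the SFT rollouts are stored offline, the distillation loss at each position is a KL divergence against frozen numbers, so training needs only student forward passes. A minimal sketch of that per-token term (generic distillation math over a toy 4-token vocabulary; the function name and shapes are illustrative, not the paper's implementation, whose objective and gradient-discrepancy bound are its own):

```python
import math

def kl_to_cached_teacher(student_logits, teacher_logprobs):
    """KL(teacher || student) at one token position, where the teacher's
    full-vocabulary log-probs were computed once, offline; no teacher
    forward pass happens at training time."""
    # Log-softmax of the student logits (numerically stable).
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in student_logits))
    student_logprobs = [l - log_z for l in student_logits]
    return sum(math.exp(t) * (t - s)
               for t, s in zip(teacher_logprobs, student_logprobs))

# Matching distributions give zero loss; a mismatch gives a positive KL.
uniform = [math.log(0.25)] * 4
assert abs(kl_to_cached_teacher([1.0, 1.0, 1.0, 1.0], uniform)) < 1e-9
assert kl_to_cached_teacher([3.0, 0.0, 0.0, 0.0], uniform) > 0.0
```

The cost asymmetry is the whole point: the expensive teacher pass amortizes over every epoch, which is where a speedup like the reported 4x can come from.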
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
NovBench is the first large-scale benchmark with 1,684 expert-annotated pairs to evaluate LLMs on assessing academic paper novelty via a four-dimensional framework of Relevance, Correctness, Coverage, and Clarity.
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
ReconPhys is the first feedforward neural network that jointly reconstructs 3D geometry and appearance via Gaussian Splatting while estimating physical attributes from a single monocular video using self-supervised training.
-
Do Audio-Visual Large Language Models Really See and Hear?
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
SEVerA: Verified Synthesis of Self-Evolving Agents
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
-
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
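The exact-equivalence claim above rests on a basic property of causal attention: a prefix's keys and values depend only on the prefix itself, so they can be computed once and injected at inference with no change to the suffix's outputs. A toy single-head sketch of that property (plain NumPy, hypothetical shapes and function names; not the paper's code):

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal self-attention over a full sequence (toy, unbatched)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def attention_with_injected_cache(q_suf, k_cache, v_cache, k_suf, v_suf):
    """Suffix queries attend over an injected, precomputed prefix KV cache
    plus their own keys/values; causal masking applies only among suffix
    positions (every suffix token may see the whole cached prefix)."""
    d = q_suf.shape[-1]
    n_cache, n_suf = len(k_cache), len(q_suf)
    k_all = np.concatenate([k_cache, k_suf])
    v_all = np.concatenate([v_cache, v_suf])
    scores = q_suf @ k_all.T / np.sqrt(d)
    scores[:, n_cache:][np.triu(np.ones((n_suf, n_suf), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v_all

rng = np.random.default_rng(0)
T, P, d = 8, 5, 16          # sequence length, "knowledge" prefix length, head dim
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

full = causal_attention(q, k, v)
injected = attention_with_injected_cache(q[P:], k[:P], v[:P], k[P:], v[P:])
assert np.allclose(full[P:], injected)   # exact equivalence under causal masking
```

Because the injected path reproduces the full computation bit-for-bit up to float precision, the knowledge prefix itself costs zero prompt tokens at query time.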
-
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
-
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
-
Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization
AsymRec decouples input and output representations in generative recommendation via multi-expert semantic projection and multi-faceted hierarchical quantization, outperforming prior models by 15.8% on average.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design
GenCircuit-RL uses hierarchical verification rewards and curriculum learning in RL to generate correct genetic circuit code in SBOL, improving functional task success by 14-16 points and generalizing to novel biologic...
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.
-
BOOKMARKS: Efficient Active Storyline Memory for Role-playing
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
-
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
-
Do Language Models Align with Brains? Prediction Scores Are Not Enough
Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
-
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.
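Why shrinking the average accepted length τ collapses throughput follows from the standard speculative-decoding acceptance analysis (generic math, not the paper's attack mechanism, which the summary does not detail): with per-token acceptance rate a and draft length k, each target-model verification step yields (1 - a^(k+1)) / (1 - a) tokens in expectation. A sketch:

```python
def expected_accepted_length(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target-model verification step under the
    standard i.i.d. per-token acceptance model of speculative decoding:
    1 + a + a^2 + ... + a^k  =  (1 - a^(k+1)) / (1 - a)."""
    a, k = accept_rate, draft_len
    return k + 1 if a == 1.0 else (1 - a ** (k + 1)) / (1 - a)

# Dragging per-token acceptance from 0.8 down to 0.3 collapses the expected
# accepted length tau, and with it the speedup, while outputs and perplexity
# stay untouched -- which is what makes such an attack hard to notice.
healthy  = expected_accepted_length(0.8, 4)   # ~3.36 tokens per step
attacked = expected_accepted_length(0.3, 4)   # ~1.43 tokens per step
assert healthy > 2 * attacked
```

Since each verification step costs one full target-model pass regardless, tokens-per-step is a direct proxy for the speedup being destroyed.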
-
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
WARDEN achieves better transcription and translation for Wardaman than larger models by separating the tasks and using Sundanese initialization plus a domain dictionary with just 6 hours of data.
-
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.