super hub Baseline reference

Measuring Mathematical Problem Solving With the MATH Dataset

Akul Arora, Collin Burns, Dan Hendrycks, Eric Tang, Saurav Kadavath, Steven Basart · 2021 · cs.LG · arXiv 2103.03874

Baseline reference. 54% of citing Pith papers use this work as a benchmark or comparison.

447 Pith papers citing it

Baseline 54% of classified citations

open full Pith review browse 447 citing papers more from Akul Arora arXiv PDF

abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 48 background 32 method 1

citation-polarity summary

use dataset 44 background 32 unclear 4 use method 1

claims ledger

abstract Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are

authors

Akul Arora Collin Burns Dan Hendrycks Eric Tang Saurav Kadavath Steven Basart

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

cs.AI · 2021-08-31 · accept · novelty 8.0

MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

Coordination control in LLM teams adds accuracy only where round-0 majority is unreliable, task recoverable, and free interaction fails to repair it, matching contingency theory predictions across models and regimes.

GraphPO: Graph-based Policy Optimization for Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

TAPO constructs learnable micro-reflective trajectories from contrastive model rollouts during RL training to provide explicit error diagnoses and corrections, reporting consistent gains over GRPO on AIME and HMMT math benchmarks.

See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

cs.MA · 2026-06-11 · unverdicted · novelty 7.0

Heterogeneous agents achieve dense latent KV-cache communication via lightweight cross-model transformation and two-phase training, outperforming text at lower compute in context-aware settings and enabling context-unaware transfer.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.

citing papers explorer

Showing 50 of 144 citing papers after filters.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 24 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions cs.CL · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 119 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 22 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 20 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Will Scaling Improve Social Simulation with LLMs? cs.CL · 2026-07-02 · conditional · none · ref 14 · internal anchor
Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.
Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams cs.CL · 2026-06-17 · unverdicted · none · ref 6 · internal anchor
Coordination control in LLM teams adds accuracy only where round-0 majority is unreliable, task recoverable, and free interaction fails to repair it, matching contingency theory predictions across models and regimes.
GraphPO: Graph-based Policy Optimization for Reasoning Models cs.CL · 2026-06-17 · unverdicted · none · ref 36 · internal anchor
GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.
Agreement in Representation Space for Open-Ended Self-Consistency cs.CL · 2026-06-10 · unverdicted · none · ref 6 · internal anchor
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
Unified Energy for Invariant and Independent Decoding in Diffusion Language Models cs.CL · 2026-06-08 · unverdicted · none · ref 18 · internal anchor
The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.
From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning cs.CL · 2026-06-05 · unverdicted · none · ref 27 · internal anchor
Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training cs.CL · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights cs.CL · 2026-05-13 · unverdicted · none · ref 49 · internal anchor
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 7 · 2 links · internal anchor
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 17 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting cs.CL · 2026-05-08 · unverdicted · none · ref 43 · 2 links · internal anchor
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast cs.CL · 2026-05-02 · unverdicted · none · ref 13 · internal anchor
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice cs.CL · 2026-04-10 · unverdicted · none · ref 25 · internal anchor
TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models cs.CL · 2026-04-01 · conditional · none · ref 8 · internal anchor
S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese cs.CL · 2026-04-01 · conditional · none · ref 1 · internal anchor
Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.
A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs cs.CL · 2026-03-08 · unverdicted · none · ref 5 · internal anchor
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective cs.CL · 2026-02-03 · unverdicted · none · ref 28 · internal anchor
A learned transformation matrix minimizes CMI in teacher logits to degrade distillation performance while preserving task accuracy.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate cs.CL · 2026-01-29 · unverdicted · none · ref 8 · internal anchor
SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning cs.CL · 2025-10-10 · conditional · none · ref 12 · internal anchor
DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.
Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework cs.CL · 2025-09-25 · conditional · none · ref 9 · internal anchor
Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 68 · internal anchor
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models cs.CL · 2024-10-10 · conditional · none · ref 59 · internal anchor
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 119 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 101 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks cs.CL · 2022-11-22 · unverdicted · none · ref 13 · internal anchor
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing cs.CL · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
SLIM-RL matches or exceeds TraceRL performance on MATH500, GSM8K, MBPP and HumanEval for diffusion LLMs by risk-budgeted random-masking RL without trajectory slicing.
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs cs.CL · 2026-06-30 · unverdicted · none · ref 39 · internal anchor
RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.
VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination cs.CL · 2026-06-16 · unverdicted · none · ref 26 · internal anchor
VoidPadding decouples padding from termination in MDLMs via a new [VOID] token, delivering +17.84 average benchmark points and 55.7% fewer decoding steps on Dream-7B-Instruct.
Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge cs.CL · 2026-06-06 · unverdicted · none · ref 7 · internal anchor
PoE-Bridge uses a product-of-experts bridge between diffusion and autoregressive distributions, with DLM drafting plus rejection and importance sampling, to deliver 5x speedup over standard DLM decoding while recovering at least 95% of AR performance on math and coding tasks.
Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation cs.CL · 2026-06-04 · unverdicted · none · ref 22 · internal anchor
On-policy distillation from a frozen autoregressive teacher to a bidirectional student eliminates train-inference mismatch and enables data-efficient ARLM-to-DLM conversion.
Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair cs.CL · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
TRI trains LLMs on goal-conditioned fill-in-the-middle tasks via PSM token rearrangement and symbolic verification to surgically repair erroneous CoT segments.
DLLG: Dynamic Logit-Level Gating of LLM Experts cs.CL · 2026-06-03 · unverdicted · none · ref 3 · internal anchor
DLLG learns token-level fusion weights for LLM experts from sparse response supervision and outperforms routing, ensembling, and merging baselines on reasoning and code tasks.
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling cs.CL · 2026-06-02 · unverdicted · none · ref 40 · internal anchor
RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions cs.CL · 2026-06-01 · unverdicted · none · ref 16 · internal anchor
An economy of agents using auctions and wealth accumulation produces emergent multi-step reasoning that outperforms monolithic baselines on five agentic tasks.
Geometric Latent Reasoning Induces Shorter Generations in LLMs cs.CL · 2026-06-01 · unverdicted · none · ref 25 · internal anchor
GLR formulates latent reasoning as geometric path approximation in pretrained embedding space and reports shorter LLM generations on math tasks without an explicit length penalty.
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs cs.CL · 2026-05-31 · unverdicted · none · ref 12 · internal anchor
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence cs.CL · 2026-05-30 · conditional · none · ref 32 · internal anchor
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents cs.CL · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation cs.CL · 2026-05-29 · unverdicted · none · ref 6 · internal anchor
Introduces TSPD with a trajectory-feature controller and training-free CE to reduce denoising steps in dLLMs while aiming to preserve quality.

Measuring Mathematical Problem Solving With the MATH Dataset

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer