Title resolution pending

Training Verifiers to Solve Math Word Problems · 2021

50 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2 · dataset 1

citation-polarity summary

background 2 · use dataset 1

representative citing papers

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

cs.CL · 2026-05-22 · conditional · novelty 7.0

Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.

BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

cs.CL · 2024-10-10 · conditional · novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

RouterBench: A Benchmark for Multi-LLM Routing System

cs.LG · 2024-03-18 · unverdicted · novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

cs.CL · 2023-11-28 · unverdicted · novelty 7.0

LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

cs.CR · 2026-05-17 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

GradShield: Alignment Preserving Finetuning

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

GradShield removes data points likely to cause safety misalignment during LLM finetuning by computing a Finetuning Implicit Harmfulness Score and applying adaptive thresholding, keeping attack success rates below 6% while preserving utility.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

VIGOR assigns higher rewards to LLM completions that produce smaller l2 norms of teacher-forced negative log-likelihood gradients, with sqrt(T) length correction and group ranking, yielding +3.31% math and +1.91% code gains over RLIF on Qwen2.5-7B.

Geometry Guided Self-Consistency for Physical AI

cs.RO · 2026-05-09 · unverdicted · novelty 6.0

KeyStone improves task success rates in diffusion-based physical AI models by up to 13.3% by sampling K trajectories in parallel, clustering them in action space, and returning the medoid of the largest cluster.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Decision theory shows that LLM cascades are structurally limited by always incurring the cheap model's cost before deciding to escalate, with the best performance given by the envelope of pairwise cascades rather than fixed chains or many stages.

Parallel Prefix Verification for Speculative Generation

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.

citing papers explorer

Showing 50 of 50 citing papers.

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks cs.CL · 2026-05-22 · conditional · none · ref 43
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts cs.AI · 2026-05-18 · unverdicted · none · ref 33
ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 24 · 2 links
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 31
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration cs.AI · 2026-05-07 · unverdicted · none · ref 28
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 31
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity cs.LG · 2026-05-01 · unverdicted · none · ref 9
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning cs.AI · 2026-04-19 · unverdicted · none · ref 25
Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 127
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models cs.CL · 2024-10-10 · conditional · none · ref 1
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
RouterBench: A Benchmark for Multi-LLM Routing System cs.LG · 2024-03-18 · unverdicted · none · ref 45
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA cs.CL · 2023-11-28 · unverdicted · none · ref 69
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 36
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Design and Report Benchmarks for Knowledge Work cs.AI · 2026-05-22 · unverdicted · none · ref 40
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 30
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents cs.CR · 2026-05-17 · conditional · none · ref 45
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
GradShield: Alignment Preserving Finetuning cs.CL · 2026-05-13 · unverdicted · none · ref 51
GradShield removes data points likely to cause safety misalignment during LLM finetuning by computing a Finetuning Implicit Harmfulness Score and applying adaptive thresholding, keeping attack success rates below 6% while preserving utility.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 99
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward cs.LG · 2026-05-11 · unverdicted · none · ref 20
VIGOR assigns higher rewards to LLM completions that produce smaller l2 norms of teacher-forced negative log-likelihood gradients, with sqrt(T) length correction and group ranking, yielding +3.31% math and +1.91% code gains over RLIF on Qwen2.5-7B.
Geometry Guided Self-Consistency for Physical AI cs.RO · 2026-05-09 · unverdicted · none · ref 64
KeyStone improves task success rates in diffusion-based physical AI models by up to 13.3% by sampling K trajectories in parallel, clustering them in action space, and returning the medoid of the largest cluster.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 160
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades cs.LG · 2026-05-07 · unverdicted · none · ref 110
Decision theory shows that LLM cascades are structurally limited by always incurring the cheap model's cost before deciding to escalate, with the best performance given by the envelope of pairwise cascades rather than fixed chains or many stages.
Parallel Prefix Verification for Speculative Generation cs.AI · 2026-05-05 · unverdicted · none · ref 15
PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling cs.AI · 2026-05-04 · unverdicted · none · ref 6 · 2 links
APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting cs.LG · 2026-05-04 · unverdicted · none · ref 63
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 37
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation cs.LG · 2026-04-21 · unverdicted · none · ref 73
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models cs.AI · 2026-04-21 · unverdicted · none · ref 62
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? cs.CL · 2026-04-21 · unverdicted · none · ref 23
Self-consistency improves LLM performance on encyclopedic knowledge recall as well as symbolic reasoning, setting a new 89% accuracy record on MMLU with GPT-4o.
Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought cs.LG · 2026-04-20 · unverdicted · none · ref 28
CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models cs.AI · 2026-04-19 · unverdicted · none · ref 12
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
Probabilistic Programs of Thought cs.CL · 2026-04-19 · unverdicted · none · ref 16
Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 42
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 107
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · accept · none · ref 134
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive cs.CL · 2024-02-20 · conditional · none · ref 93
DPOP is a new loss function that prevents DPO from lowering preferred-response likelihoods, outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 46
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 212
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 43
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 13
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability cs.LG · 2026-05-14 · unverdicted · none · ref 55
Task-aware pruning improves OOD performance by removing layers that distort task-adapted representation profiles, realigning OOD inputs with the geometry observed on ID data.
Minimal-Intervention KV Retention via Set-Conditioned Diversity cs.LG · 2026-05-14 · conditional · none · ref 24
A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory held-out evaluation.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 3 · 2 links
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 15
DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 18
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain? cs.CL · 2026-04-21 · unverdicted · none · ref 80
Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges cs.CL · 2026-05-19 · unverdicted · none · ref 14
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding cs.DC · 2026-05-13 · unreviewed · ref 39 · 2 links
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations cs.CL · 2026-05-08 · unreviewed · ref 45
Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning cs.LG · 2026-05-01 · unreviewed · ref 39
