P., Kawaguchi, K., and Shieh, M

Xie, Y · 2024 · arXiv 2405.00451

Canonical reference. 78% of citing Pith papers cite this work as background.

21 Pith papers citing it

citation-role summary

background: 6 · method: 2 · other: 1

citation-polarity summary

background: 7 · unclear: 1 · use method: 1

representative citing papers

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

StoryAlign: Evaluating and Training Reward Models for Story Generation

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

Scalable Token-Level Hallucination Detection in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.

A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain (IG) within same-depth turn groups, rescaling cumulative advantages by the square root of the term count, and modulating clipping ranges according to each turn's normalized IG.

Online Self-Calibration Against Hallucination in Vision-Language Models

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

cs.CL · 2025-12-08 · unverdicted · novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

cs.AI · 2025-09-26 · unverdicted · novelty 6.0

Retrieval-of-Thought organizes prior reasoning into a thought graph for retrieval and reward-guided recombination, reducing output tokens by up to 40% and latency by 82% while preserving accuracy on reasoning benchmarks.

ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

cs.AI · 2025-07-07 · unverdicted · novelty 6.0

ChipSeek is a hierarchical-reward reinforcement learning framework with Curriculum-Guided Dynamic Policy Optimization that integrates EDA simulator feedback to improve LLM-generated RTL code on both functional correctness and PPA metrics.

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

cs.CL · 2025-04-15 · unverdicted · novelty 6.0

ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

cs.CL · 2025-03-10 · unverdicted · novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.

Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search

cs.CL · 2025-02-02 · unverdicted · novelty 6.0

DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

cs.CV · 2025-01-16 · conditional · novelty 6.0

Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.

APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

cs.CL · 2026-05-10 · unverdicted · novelty 5.0

APCD adaptively branches LLM decoding paths based on token entropy and contrasts divergent paths to improve factual accuracy while preserving efficiency.

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

cs.LG · 2025-04-18 · unverdicted · novelty 5.0

PODS applies max-variance down-sampling to GRPO rollouts in LLM RLVR, delivering at least 1.7x faster training to peak test accuracy on reasoning benchmarks.

A Survey of Scaling in Large Language Model Reasoning

cs.AI · 2025-04-02 · unverdicted · novelty 3.0

A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

cs.AI · 2025-01-16 · unverdicted · novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

cs.AI · 2026-05-11

citing papers explorer

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation cs.CL · 2026-05-08 · unverdicted · none · ref 28
StoryAlign: Evaluating and Training Reward Models for Story Generation cs.CL · 2026-05-06 · unverdicted · none · ref 30
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 21
Scalable Token-Level Hallucination Detection in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 17
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration cs.LG · 2026-05-11 · unverdicted · none · ref 61 · 2 links
A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping cs.CL · 2026-05-07 · unverdicted · none · ref 27
Online Self-Calibration Against Hallucination in Vision-Language Models cs.CV · 2026-05-01 · unverdicted · none · ref 36
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning cs.CL · 2025-12-08 · unverdicted · none · ref 16
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models cs.CL · 2025-10-04 · unverdicted · none · ref 12
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts cs.AI · 2025-09-26 · unverdicted · none · ref 27
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning cs.AI · 2025-07-07 · unverdicted · none · ref 33
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs cs.CL · 2025-04-15 · unverdicted · none · ref 33
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL cs.CL · 2025-03-10 · unverdicted · none · ref 84
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search cs.CL · 2025-02-02 · unverdicted · none · ref 21
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps cs.CV · 2025-01-16 · conditional · none · ref 91
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation cs.CL · 2026-05-10 · unverdicted · none · ref 81
Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning cs.LG · 2025-04-18 · unverdicted · none · ref 22
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 231
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 150
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 167
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace cs.AI · 2026-05-11 · unreviewed · ref 48
