Title resolution pending

Let's Verify Step by Step · 2023

9 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

VIGOR assigns higher rewards to LLM completions that produce smaller ℓ2 norms of teacher-forced negative log-likelihood gradients, with sqrt(T) length correction and group ranking, yielding +3.31% math and +1.91% code gains over RLIF on Qwen2.5-7B.

Geometry Guided Self-Consistency for Physical AI

cs.RO · 2026-05-09 · unverdicted · novelty 6.0

KeyStone improves task success rates in diffusion-based physical AI models by up to 13.3% by sampling K trajectories in parallel, clustering them in action space, and returning the medoid of the largest cluster.

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

citing papers explorer

Showing 9 of 9 citing papers.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows · cs.CL · 2026-05-13 · unverdicted · none · ref 9
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment · cs.AI · 2026-05-07 · unverdicted · none · ref 27
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero · cs.LG · 2026-05-14 · unverdicted · none · ref 20
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward · cs.LG · 2026-05-11 · unverdicted · none · ref 4
Geometry Guided Self-Consistency for Physical AI · cs.RO · 2026-05-09 · unverdicted · none · ref 65
Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought · cs.LG · 2026-04-20 · unverdicted · none · ref 29
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents · cs.AI · 2024-08-13 · unverdicted · none · ref 268
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification · cs.CV · 2026-05-19 · unverdicted · none · ref 38
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025-07-15 · unverdicted · none · ref 100
