hub Canonical reference

Critique-out-loud reward models

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, Prithviraj Ammanabrolu · 2024 · arXiv 2408.11791

Canonical reference. 83% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 1

citation-polarity summary

background 5 use method 1

representative citing papers

Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web

cs.MM · 2026-05-18 · unverdicted · novelty 7.0

WEBSHORTS dataset and SHORTS-CAST framework ground micro-video popularity prediction in structured open-web context collected at upload time and enable selective online adaptation using delayed labels.

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

cs.LG · 2025-10-28 · unverdicted · novelty 6.0

PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

cs.AI · 2025-05-26 · unverdicted · novelty 6.0

Slower multimodal reasoning models exhibit inverse scaling in truthfulness by fabricating details under ambiguous visual inputs, while faster models remain more cautious via broader inference.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unverdicted · novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

citing papers explorer

Showing 14 of 14 citing papers.

Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web cs.MM · 2026-05-18 · unverdicted · none · ref 59
WEBSHORTS dataset and SHORTS-CAST framework ground micro-video popularity prediction in structured open-web context collected at upload time and enable selective online adaptation using delayed labels.
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents cs.AI · 2026-05-12 · unverdicted · none · ref 2
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference cs.SE · 2026-05-05 · unverdicted · none · ref 34
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context cs.CL · 2026-04-22 · unverdicted · none · ref 40
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 25
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 1
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
Building a Precise Video Language with Human-AI Oversight cs.CV · 2026-04-22 · unverdicted · none · ref 1
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling cs.LG · 2025-10-28 · unverdicted · none · ref 2
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
RewardBench 2: Advancing Reward Model Evaluation cs.CL · 2025-06-02 · unverdicted · none · ref 63
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning cs.AI · 2025-05-26 · unverdicted · none · ref 1
Slower multimodal reasoning models exhibit inverse scaling in truthfulness by fabricating details under ambiguous visual inputs, while faster models remain more cautious via broader inference.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 132
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 5
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 17
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 84
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

Critique-out-loud reward models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer