arXiv preprint arXiv:2505.10320

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning · 2025 · arXiv 2505.10320

7 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

cs.CL · 2025-09-28 · unverdicted · novelty 6.0

Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

cs.LG · 2025-08-27 · conditional · novelty 6.0

GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

cs.LG · 2025-07-23 · unverdicted · novelty 6.0

RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.

VRPRM: Process Reward Modeling via Visual Reasoning

cs.LG · 2025-08-05 · unverdicted · novelty 5.0

VRPRM combines visual reasoning with a two-stage SFT-plus-RL strategy to deliver higher-quality process reward modeling using far less annotated data than prior non-thinking PRMs.

citing papers explorer

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
cs.AI · 2026-05-11 · unverdicted · none · ref 26
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
cs.CL · 2026-01-29 · unverdicted · none · ref 25
CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
cs.CL · 2025-09-28 · unverdicted · none · ref 45
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
cs.LG · 2025-08-27 · conditional · none · ref 35
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
cs.LG · 2025-07-23 · unverdicted · none · ref 35
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
cs.AI · 2026-04-13 · unverdicted · none · ref 2
A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.

VRPRM: Process Reward Modeling via Visual Reasoning
cs.LG · 2025-08-05 · unverdicted · none · ref 9
VRPRM combines visual reasoning with a two-stage SFT-plus-RL strategy to deliver higher-quality process reward modeling using far less annotated data than prior non-thinking PRMs.
