arXiv preprint arXiv:2505.10320
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
citing papers explorer
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
  RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
  CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
- On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
  Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
  GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
  RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
  A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.
- VRPRM: Process Reward Modeling via Visual Reasoning
  VRPRM combines visual reasoning with a two-stage SFT-plus-RL strategy to deliver higher-quality process reward modeling using far less annotated data than prior non-thinking PRMs.