hub

arXiv preprint arXiv:2405.01535 (2024)

Kim, S · 2024 · arXiv 2405.01535

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

cs.LG · 2026-03-13 · unverdicted · novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

cs.CL · 2025-09-28 · unverdicted · novelty 6.0

Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

cs.LG · 2025-07-23 · unverdicted · novelty 6.0

RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

cs.AI · 2026-04-25 · unverdicted · novelty 5.0

Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

cs.CL · 2025-12-29 · unverdicted · novelty 5.0

LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.

KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks

cs.SE · 2026-04-19 · unverdicted · novelty 4.0

KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

cs.AI · 2026-04-16

citing papers explorer

Showing 12 of 12 citing papers.

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents cs.LG · 2026-03-13 · unverdicted · none · ref 13
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 7
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization cs.CL · 2025-09-28 · unverdicted · none · ref 16
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards cs.CL · 2025-09-25 · unverdicted · none · ref 15
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains cs.LG · 2025-07-23 · unverdicted · none · ref 17
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use cs.AI · 2026-05-06 · unverdicted · none · ref 13
AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines cs.AI · 2026-04-25 · unverdicted · none · ref 9
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process cs.CL · 2025-12-29 · unverdicted · none · ref 29
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs cs.CL · 2026-05-04 · unverdicted · none · ref 29
HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.
KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks cs.SE · 2026-04-19 · unverdicted · none · ref 8
KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 114
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems cs.AI · 2026-04-16 · unreviewed · ref 6

arXiv preprint arXiv:2405.01535 (2024)

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer