hub Canonical reference

Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real- world multimodal agents

Alexander Spangher, Michael Lu, Sriya Kalyan, Hyundong Justin Cho, Tenghao Huang, Weiyan Shi, Jonathan May · 2025 · DOI 10.18653/v1/2025.acl-

Canonical reference. 90% of citing Pith papers cite this work as background.

24 Pith papers citing it

Background 90% of classified citations

open at publisher browse 24 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 9 dataset 1

citation-polarity summary

background 9 support 1

representative citing papers

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Code Generation by Differential Test Time Scaling

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.

Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

cs.CR · 2026-05-03 · unverdicted · novelty 7.0

RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

cs.CR · 2026-04-29 · conditional · novelty 7.0

SPECA derives categorized security properties from specifications to enable cross-implementation auditing of distributed protocols, recovering all 15 expert-augmented vulnerabilities on an Ethereum contest and achieving 88.9% precision at 100% recall on a C/C++ benchmark.

ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.

Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review

cs.SE · 2026-03-19 · accept · novelty 7.0

LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

cs.GR · 2026-01-29 · unverdicted · novelty 7.0

JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.

Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework

cs.HC · 2026-06-29 · unverdicted · novelty 6.0

Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.

Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.

Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models

cs.IR · 2026-05-13 · unverdicted · novelty 6.0

APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

SpecPylot: Python Specification Generation using Large Language Models

cs.SE · 2026-04-17 · unverdicted · novelty 6.0

SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.

Evaluation of Agents under Simulated AI Marketplace Dynamics

cs.IR · 2026-04-15 · unverdicted · novelty 6.0

Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.

ReflectCAP: Detailed Image Captioning with Reflective Memory

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.

Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

cs.MM · 2026-04-10 · unverdicted · novelty 6.0

Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.

Evaluating Privilege Usage of Agents with Real-World Tools

cs.CR · 2026-03-30 · unverdicted · novelty 6.0

GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.

Beyond Community Notes: A Framework for Understanding and Building Crowdsourced Context Systems for Social Media

cs.HC · 2025-09-18 · conditional · novelty 6.0

The authors conduct a systematic literature review and real-world analysis to define Crowdsourced Context Systems and map a six-aspect design space with normative implications.

Ethics and Social Responsibility in AI-Assisted Interviewing: An LLM-in-the-Loop Study of AI-Generated Follow-Up Questions

cs.HC · 2026-06-29 · unverdicted · novelty 5.0

An LLM-in-the-loop study with 17 interviewers identifies five ethical concerns with AI-generated follow-up questions and translates them into design and governance implications.

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

cs.CL · 2026-05-29 · unverdicted · novelty 5.0

IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

cs.HC · 2026-04-20 · unverdicted · novelty 5.0

AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback

cs.HC · 2026-01-21 · unverdicted · novelty 5.0

LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16

citing papers explorer

Showing 24 of 24 citing papers.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents cs.CR · 2026-01-26 · unverdicted · none · ref 72
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Code Generation by Differential Test Time Scaling cs.SE · 2026-05-19 · unverdicted · none · ref 10
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence cs.CR · 2026-05-03 · unverdicted · none · ref 9
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols cs.CR · 2026-04-29 · conditional · none · ref 8
SPECA derives categorized security properties from specifications to enable cross-implementation auditing of distributed protocols, recovering all 15 expert-augmented vulnerabilities on an Ethereum contest and achieving 88.9% precision at 100% recall on a C/C++ benchmark.
ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos cs.CV · 2026-04-12 · unverdicted · none · ref 63
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review cs.SE · 2026-03-19 · accept · none · ref 88
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion cs.GR · 2026-01-29 · unverdicted · none · ref 2
JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework cs.HC · 2026-06-29 · unverdicted · none · ref 84
Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.
Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering cs.AI · 2026-06-26 · unverdicted · none · ref 26
OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.
Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation cs.CV · 2026-05-25 · unverdicted · none · ref 32
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models cs.IR · 2026-05-13 · unverdicted · none · ref 4
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 80
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
SpecPylot: Python Specification Generation using Large Language Models cs.SE · 2026-04-17 · unverdicted · none · ref 11
SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.
Evaluation of Agents under Simulated AI Marketplace Dynamics cs.IR · 2026-04-15 · unverdicted · none · ref 31
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
ReflectCAP: Detailed Image Captioning with Reflective Memory cs.AI · 2026-04-14 · unverdicted · none · ref 32
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation cs.MM · 2026-04-10 · unverdicted · none · ref 4
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
Evaluating Privilege Usage of Agents with Real-World Tools cs.CR · 2026-03-30 · unverdicted · none · ref 12
GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
Beyond Community Notes: A Framework for Understanding and Building Crowdsourced Context Systems for Social Media cs.HC · 2025-09-18 · conditional · none · ref 7
The authors conduct a systematic literature review and real-world analysis to define Crowdsourced Context Systems and map a six-aspect design space with normative implications.
Ethics and Social Responsibility in AI-Assisted Interviewing: An LLM-in-the-Loop Study of AI-Generated Follow-Up Questions cs.HC · 2026-06-29 · unverdicted · none · ref 15
An LLM-in-the-loop study with 17 interviewers identifies five ethical concerns with AI-generated follow-up questions and translates them into design and governance implications.
Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral cs.CL · 2026-05-29 · unverdicted · none · ref 45
IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research cs.HC · 2026-04-20 · unverdicted · none · ref 1
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 15
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback cs.HC · 2026-01-21 · unverdicted · none · ref 43
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unreviewed · ref 101

Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real- world multimodal agents

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer