GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
hub Canonical reference
Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real- world multimodal agents
Canonical reference. 90% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
SPECA derives categorized security properties from specifications to enable cross-implementation auditing of distributed protocols, recovering all 15 expert-augmented vulnerabilities on an Ethereum contest and achieving 88.9% precision at 100% recall on a C/C++ benchmark.
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.
OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
The authors conduct a systematic literature review and real-world analysis to define Crowdsourced Context Systems and map a six-aspect design space with normative implications.
An LLM-in-the-loop study with 17 interviewers identifies five ethical concerns with AI-generated follow-up questions and translates them into design and governance implications.
IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
citing papers explorer
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
Code Generation by Differential Test Time Scaling
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
-
Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
-
Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols
SPECA derives categorized security properties from specifications to enable cross-implementation auditing of distributed protocols, recovering all 15 expert-augmented vulnerabilities on an Ethereum contest and achieving 88.9% precision at 100% recall on a C/C++ benchmark.
-
ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
-
Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
-
JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
-
Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework
Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.
-
Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering
OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.
-
Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
-
Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
SpecPylot: Python Specification Generation using Large Language Models
SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.
-
Evaluation of Agents under Simulated AI Marketplace Dynamics
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
-
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
-
Evaluating Privilege Usage of Agents with Real-World Tools
GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
-
Beyond Community Notes: A Framework for Understanding and Building Crowdsourced Context Systems for Social Media
The authors conduct a systematic literature review and real-world analysis to define Crowdsourced Context Systems and map a six-aspect design space with normative implications.
-
Ethics and Social Responsibility in AI-Assisted Interviewing: An LLM-in-the-Loop Study of AI-Generated Follow-Up Questions
An LLM-in-the-loop study with 17 interviewers identifies five ethical concerns with AI-generated follow-up questions and translates them into design and governance implications.
-
Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral
IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.
-
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
-
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
-
LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
- Reinforcement Learning from Human Feedback