GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real-world multimodal agents
Canonical reference. 90% of citing Pith papers cite this work as background.
representative citing papers
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
SPECA derives categorized security properties from specifications to enable cross-implementation auditing of distributed protocols, recovering all 15 expert-augmented vulnerabilities on an Ethereum contest and achieving 88.9% precision at 100% recall on a C/C++ benchmark.
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.
OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
SpecPylot generates and validates icontract specifications for Python programs by combining LLM proposals with Crosshair symbolic execution feedback.
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
The authors conduct a systematic literature review and real-world analysis to define Crowdsourced Context Systems and map a six-aspect design space with normative implications.
An LLM-in-the-loop study with 17 interviewers identifies five ethical concerns with AI-generated follow-up questions and translates them into design and governance implications.
IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
citing papers explorer
Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral