Title resolution pending

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. · 2023

15 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows

cs.CR · 2026-05-07 · unverdicted · novelty 8.0

Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving over 0.91 F1 on threat detection in evaluation.

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

cs.CV · 2026-04-14 · unverdicted · novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.

PlayCoder: Making LLM-Generated GUI Code Playable

cs.SE · 2026-04-21 · conditional · novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

Efficient Personalization of Generative User Interfaces

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming baselines in both modeling and user preference.

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

Creates MediLongChat, a collection of synthetic longitudinal medical dialogues with benchmarks showing that state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.

AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs

cs.SE · 2026-05-04 · unverdicted · novelty 6.0

AOCI creates an incremental symbolic-semantic index per code unit that gives LLMs a complete, consistent repository view, outperforming baselines with zero defects on 19 industrial tasks while using far fewer tokens.

MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

cs.HC · 2026-04-29 · unverdicted · novelty 6.0

MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human judgments become automated rules.

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

cs.SE · 2026-04-20 · conditional · novelty 6.0 · 2 refs

Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

cs.DB · 2026-04-17 · unverdicted · novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

cs.SE · 2026-04-27 · unverdicted · novelty 4.0

LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.

From Binary Groundedness to Support Relations: Towards a Reader-Centred Taxonomy for Comprehension of AI Output

cs.HC · 2026-04-09 · unverdicted · novelty 4.0

Binary groundedness judgments in AI evaluations should be replaced by a reader-centered taxonomy of support relations that distinguishes syntactic and interpretive moves between generated statements and source documents.

citing papers explorer

Showing 15 of 15 citing papers.

Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows
cs.CR · 2026-05-07 · unverdicted · none · ref 43

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
cs.CV · 2026-04-14 · unverdicted · none · ref 49

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
cs.CR · 2026-04-03 · unverdicted · none · ref 52

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
cs.CL · 2026-04-28 · unverdicted · none · ref 90

PlayCoder: Making LLM-Generated GUI Code Playable
cs.SE · 2026-04-21 · conditional · none · ref 87

Efficient Personalization of Generative User Interfaces
cs.LG · 2026-04-10 · unverdicted · none · ref 110

SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI · 2026-04-10 · unverdicted · none · ref 68

GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
cs.AI · 2026-04-06 · unverdicted · none · ref 28

Synthesis and Evaluation of Long-term History-aware Medical Dialogue
cs.CL · 2026-05-19 · unverdicted · none · ref 32

AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs
cs.SE · 2026-05-04 · unverdicted · none · ref 60

MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
cs.HC · 2026-04-29 · unverdicted · none · ref 42

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
cs.SE · 2026-04-20 · conditional · none · ref 34 · 2 links

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB · 2026-04-17 · unverdicted · none · ref 111

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
cs.SE · 2026-04-27 · unverdicted · none · ref 46

From Binary Groundedness to Support Relations: Towards a Reader-Centred Taxonomy for Comprehension of AI Output
cs.HC · 2026-04-09 · unverdicted · none · ref 47
