Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
R-judge: Benchmarking safety risk awareness for llm agents
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
citing papers explorer
-
Do Coding Agents Understand Least-Privilege Authorization?
Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
-
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
-
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.