LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
4 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (4)
verdicts: UNVERDICTED (4)
citing papers explorer
- AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
  AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
  RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
- A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search
  The paper defines a bounded reference architecture for LLM-orchestrated hybrid retrieval in dataset search using BM25, dense embeddings, reciprocal rank fusion, and metadata augmentation with pseudo-queries.
- Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents
  Formal architecture descriptors reduce AI coding agent navigation steps by 33-44% and behavioral variance by 52% in controlled and observational studies.