LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

2025 · arXiv 2511.13998

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · support: 1

representative citing papers

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

cs.AI · 2026-04-03 · unverdicted · novelty 7.0

The AgentHazard benchmark shows that computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% when models such as Qwen3-Coder power Claude Code.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

cs.IR · 2026-03-28 · unverdicted · novelty 6.0

The paper defines a bounded reference architecture for LLM-orchestrated hybrid retrieval in dataset search using BM25, dense embeddings, reciprocal rank fusion, and metadata augmentation with pseudo-queries.

Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents

cs.SE · 2026-04-11 · unverdicted · novelty 5.0

Formal architecture descriptors reduce AI coding agent navigation steps by 33-44% and behavioral variance by 52% in controlled and observational studies.

citing papers explorer

Showing 4 of 4 citing papers.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

cs.AI · 2026-04-03 · unverdicted · none · ref 19

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · none · ref 26

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

cs.IR · 2026-03-28 · unverdicted · none · ref 23

Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents

cs.SE · 2026-04-11 · unverdicted · none · ref 5
