Measuring agents in production. arXiv preprint arXiv:2512.04123, 2025

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

cs.HC · 2026-04-20 · unverdicted · novelty 6.0

A new benchmark shows that LLM smartphone agents succeed about as often with screen text alone as with screenshots, but both fail often because of UI accessibility and reasoning gaps.

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy

cs.OS · 2026-04-15 · unverdicted · novelty 6.0

YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.

Security Considerations for Multi-agent Systems

cs.CR · 2026-03-09 · unverdicted · novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

Echo: Learning from Experience Data via User-Driven Refinement

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

cs.AI · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.

Riemann-Bench: A Benchmark for Moonshot Mathematics

cs.AI · 2026-04-08 · conditional · novelty 5.0

Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

cs.SE · 2026-04-06 · unverdicted · novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

citing papers explorer

Showing 10 of 10 citing papers.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · none · ref 63

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · none · ref 26

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

cs.HC · 2026-04-20 · unverdicted · none · ref 44

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy

cs.OS · 2026-04-15 · unverdicted · none · ref 70

Security Considerations for Multi-agent Systems

cs.CR · 2026-03-09 · unverdicted · none · ref 129

Echo: Learning from Experience Data via User-Driven Refinement

cs.AI · 2026-05-21 · unverdicted · none · ref 17

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

cs.AI · 2026-05-05 · unverdicted · none · ref 32 · 2 links

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

cs.CV · 2026-04-10 · unverdicted · none · ref 1

Riemann-Bench: A Benchmark for Moonshot Mathematics

cs.AI · 2026-04-08 · conditional · none · ref 12

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

cs.SE · 2026-04-06 · unverdicted · none · ref 13
