Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
Measuring agents in production.arXiv preprint arXiv:2512.04123, 2025
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10roles
background 1polarities
background 1representative citing papers
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.
RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.
Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
citing papers explorer
-
What Do Evolutionary Coding Agents Evolve?
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
-
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
-
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
-
Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy
YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
-
Echo: Learning from Experience Data via User-Driven Refinement
Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.
-
Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.
-
Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.
-
Riemann-Bench: A Benchmark for Moonshot Mathematics
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.