4 Pith papers cite this work.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- Fidelity Probes for Specification–Code Alignment
  Fidelity probes from code raise specification fidelity from 0.63 to 0.94 on a 12k-line COBOL benchmark over eight iterations, with convergence predicted by a two-state Markov fixed point from four iterations of rate data.
- CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
  CoCoDA co-evolves a typed compositional DAG of primitive and composite tools with the agent planner, using signature-based retrieval and a size-based reward to scale libraries efficiently and let an 8B model match or beat a 32B model on math and code benchmarks.
- Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
  Meta-prompt optimization enables LLM agents to discover stable, generalizable tacit collusion strategies in market simulations that outperform hand-crafted prompt baselines.
- Runtime-Structured Task Decomposition for Agentic Coding Systems
  Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.