Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
Pith reviewed 2026-05-09 21:32 UTC · model grok-4.3
The pith
Self-refinement with execution feedback improves 1-3B code generation more than complex pipeline topologies do.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The mechanism is narrow: refinement fixes many runtime errors such as NameError and SyntaxError but rarely fixes logic errors such as AssertionError. Within the general-purpose model pool, refiner capability matters more than generator identity, so a 1.5B generator paired with a 3B refiner matches a 3B model performing both roles. Early stopping is required because every iteration without it is net-negative. In the constrained search space, evolutionary search rediscovers the simple refinement loop with no significant gain from added topology, while single-evaluation runs inflate fitness by 5-7 percent, selecting lucky genomes over good ones.
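The loop this claim rests on is small. The sketch below is a minimal reading of the generate-execute-refine pattern, with `generate` and `refine` passed in as stand-ins for local model calls; the paper's actual prompts, iteration cap, and exact early-stopping rule are not given here, and the AssertionError cutoff is an assumption drawn from the error analysis, not the paper's stated rule.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
from typing import Callable


def run_candidate(code: str, tests: str, timeout: int = 10) -> str | None:
    """Execute candidate code plus its tests; return the last stderr line on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "TimeoutExpired"
    if proc.returncode == 0:
        return None
    lines = proc.stderr.strip().splitlines()
    return lines[-1] if lines else "UnknownError"


def generate_execute_refine(
    problem: str,
    tests: str,
    generate: Callable[[str], str],          # stand-in for the generator model (e.g., 1.5B)
    refine: Callable[[str, str, str], str],  # stand-in for the refiner model (e.g., 3B)
    max_iters: int = 3,
) -> str:
    code = generate(problem)
    for _ in range(max_iters):
        error = run_candidate(code, tests)
        if error is None:
            return code                      # all tests pass: stop immediately
        # Assumed early-stopping heuristic, consistent with the error analysis:
        # refinement mostly repairs runtime errors (NameError, SyntaxError) and
        # rarely fixes logic errors, so an AssertionError is a point to give up.
        if error.startswith("AssertionError"):
            return code
        code = refine(problem, code, error)  # feed the traceback tail back in
    return code
```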
What carries the argument
NEAT-inspired evolutionary search over pipeline topologies that incorporate execution feedback for iterative refinement
Load-bearing premise
The constrained evolutionary search space is broad enough that any topology superior to the simple refinement loop would be discovered, and single-run fitness evaluations still rank pipelines reliably enough for selection.
What would settle it
Running the evolutionary search and finding a pipeline topology that scores significantly higher than the simple refinement loop on both benchmarks would show that added topology can matter.
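For concreteness, one way to operationalize "scores significantly higher" is a paired bootstrap over per-problem pass/fail outcomes on the same benchmark. This is an assumed test, not the paper's own significance procedure (which reports gains in standard deviations):

```python
import random


def bootstrap_diff_ci(
    loop: list[bool], candidate: list[bool],
    n_resamples: int = 10_000, seed: int = 0,
) -> tuple[float, float]:
    """95% bootstrap CI for the candidate-minus-loop pass-rate difference.

    `loop` and `candidate` are per-problem pass/fail outcomes for the two
    pipelines on the same benchmark (e.g., HumanEval's 164 problems).
    """
    assert len(loop) == len(candidate)
    rng = random.Random(seed)
    n = len(loop)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample problems with replacement
        diffs.append(sum(candidate[i] - loop[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]
```

In this reading, a topology would settle the question if the interval sits entirely above zero on both benchmarks.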
Original abstract
Small language models (1-3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1-3B models with execution feedback, and use a NEAT-inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general-purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net-negative. The code-specialized models outperform every general-purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text-only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate-execute-refine loop we found manually, with no clearly significant gain from added topology. Single-evaluation fitness inflates results by 5-7 percent, selecting lucky genomes over good ones. On these benchmarks at 1-3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for 1-3B parameter code generation models, execution feedback in pipelines improves performance substantially (>4 standard deviations on HumanEval and sanitized MBPP) by fixing runtime errors such as NameError and SyntaxError, while added topological complexity beyond a simple generate-execute-refine loop provides no clear benefit. A NEAT-inspired evolutionary search over a constrained space mostly rediscovers the simple refinement loop; generator identity matters less than refiner capability; code-specialized models outperform general-purpose pipelines; single-run fitness evaluations inflate results by 5-7%; and early stopping is required to avoid net-negative iterations.
Significance. If the results hold, the work provides practical guidance for local deployment of small models by showing that simple execution-feedback loops suffice and that model specialization outweighs pipeline architecture at this scale. Strengths include quantitative reporting with standard deviations, detailed error-type breakdowns, explicit flagging of single-evaluation inflation, and the demonstrated necessity of early stopping; together with the direct with/without-feedback comparisons, these elements make the positive feedback result robust.
Major comments (2)
- [Evolutionary Search section] Evolutionary search methodology: Single-run fitness evaluations (acknowledged to inflate pass rates by 5-7%) directly impair genome ranking and selection, weakening the negative result that no superior topologies exist beyond the simple loop; this noise, combined with the explicitly constrained search space, means failure to discover better structures could stem from evaluation error or limited exploration rather than true absence of benefit.
- [Results and Discussion] Results on pipeline comparisons: While direct with/without feedback ablations support the primacy of execution feedback, the topology conclusion relies on rediscovery within a constrained NEAT-inspired space; without a broader or less noisy search (e.g., multi-run fitness, sketched after this list, or expanded operators), the claim that 'added pipeline complexity' does not help remains tentative and load-bearing for the central 'feedback over form' thesis.
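To make the suggested remedy concrete, a minimal sketch of multi-run fitness follows, assuming an `evaluate(genome)` function that returns one noisy pass rate per call (the function name and the value of k are illustrative, not from the paper). Averaging k independent evaluations shrinks exactly the noise that lets selection promote lucky genomes:

```python
from statistics import mean
from typing import Callable


def multi_run_fitness(genome, evaluate: Callable[[object], float], k: int = 3) -> float:
    """Mean pass rate over k independent evaluations of the same genome."""
    return mean(evaluate(genome) for _ in range(k))


def select_parent(population: list, evaluate: Callable[[object], float], k: int = 3):
    # Ranking by averaged fitness favors genuinely good genomes over genomes
    # that drew one lucky single-run evaluation (the reported 5-7% inflation).
    return max(population, key=lambda g: multi_run_fitness(g, evaluate, k))
```

The cost is k times more inference per genome, which is exactly the trade-off the authors cite against it.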
Minor comments (2)
- [Methods] Methods: Additional specifics on exact model checkpoints (e.g., precise 1.5B/3B variants), inference hyperparameters (temperature, max tokens, top-p), and the exact early-stopping rule would strengthen reproducibility of the local-inference experiments.
- [Error Analysis] Error analysis: The breakdown of fixed vs. unfixed errors (e.g., AssertionError) would benefit from reporting sample counts per category and statistical tests to support the claim that refinement 'rarely fixes logic errors'; a per-category tally like the sketch after this list would suffice.
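The bookkeeping this comment asks for is cheap to add. A hedged sketch, assuming each attempt is summarized by its exception class name (or None on pass) before and after refinement:

```python
from __future__ import annotations

from collections import Counter


def error_transition_counts(
    before: list[str | None], after: list[str | None]
) -> tuple[Counter, Counter]:
    """Tally which error categories refinement fixed vs. left unfixed.

    Entries are exception class names ('NameError', 'SyntaxError',
    'AssertionError', ...) or None when the problem's tests pass.
    """
    fixed: Counter = Counter()
    unfixed: Counter = Counter()
    for b, a in zip(before, after):
        if b is None:
            continue                 # already passing; refinement never ran
        (fixed if a is None else unfixed)[b] += 1
    return fixed, unfixed
```

With these counts in hand, a per-category binomial test on fixed / (fixed + unfixed) would back the 'rarely fixes logic errors' claim quantitatively.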
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the acknowledgment of our quantitative reporting, error breakdowns, and explicit discussion of single-evaluation inflation. Below we respond point-by-point to the major comments, agreeing that the evolutionary search results warrant additional qualification while maintaining that the direct feedback ablations robustly support the primacy of execution feedback.
Point-by-point responses
Referee: [Evolutionary Search section] Evolutionary search methodology: Single-run fitness evaluations (acknowledged to inflate pass rates by 5-7%) directly impair genome ranking and selection, weakening the negative result that no superior topologies exist beyond the simple loop; this noise, combined with the explicitly constrained search space, means failure to discover better structures could stem from evaluation error or limited exploration rather than true absence of benefit.
Authors: We thank the referee for this observation. The manuscript already states that single-run fitness inflates pass rates by 5-7% and that early stopping is required to avoid net-negative iterations. Despite this acknowledged noise, repeated independent evolutionary runs consistently converged on the simple generate-execute-refine loop and did not surface any topology with statistically significant improvement. Noise in fitness would be expected to occasionally promote suboptimal genomes, so the repeated rediscovery of the same simple structure under noisy conditions provides some evidence against the existence of markedly superior topologies within the space. Nevertheless, we agree that multi-run fitness would yield more reliable genome rankings. Given the computational expense of local 1-3B inference across the required number of evaluations, multi-run fitness was not performed. In revision we will expand the Evolutionary Search section to explicitly discuss how evaluation noise affects the strength of the negative topology result and to list multi-run fitness and expanded operators as valuable future directions. revision: partial
Referee: [Results and Discussion] Results on pipeline comparisons: While direct with/without feedback ablations support the primacy of execution feedback, the topology conclusion relies on rediscovery within a constrained NEAT-inspired space; without a broader or less noisy search (e.g., multi-run fitness or expanded operators), the claim that 'added pipeline complexity' does not help remains tentative and load-bearing for the central 'feedback over form' thesis.
Authors: We agree that the topology-related finding is necessarily qualified by the constraints of the search space and the single-run evaluation protocol, and the manuscript already qualifies the claim with the phrase 'in our constrained search space.' The central thesis—that execution feedback matters more than pipeline topology at 1-3B scale—rests primarily on the direct with/without-feedback ablations, which are independent of the evolutionary search and demonstrate gains exceeding four standard deviations on both benchmarks. The evolutionary search was an exploratory complement intended to test whether more elaborate topologies could deliver additional benefit; its failure to identify such topologies, even under noisy conditions, supplies supporting rather than conclusive evidence. In the revised manuscript we will strengthen the Results and Discussion section by (1) reiterating that the topology conclusion is scoped to the explored space and (2) explicitly noting that broader or less noisy searches remain an open direction for future work. This will reduce any impression that the topology result is load-bearing for the overall 'feedback over form' argument. revision: partial
Circularity Check
No circularity: purely empirical benchmark comparisons with no derivations or self-referential reductions
Full rationale
The paper reports experimental outcomes from running 1-3B models on HumanEval and MBPP, using a NEAT-inspired evolutionary search over pipeline topologies and direct with/without execution-feedback ablations. All claims (e.g., >4 SD improvement from refinement, rediscovery of the simple loop, 5-7% fitness inflation) are grounded in measured pass rates and search results rather than any equation, fitted parameter renamed as prediction, or theorem whose justification loops back to the present work. No mathematical derivation chain exists; the central contrast between feedback and topology is established by independent experimental controls. The acknowledged limitations (single-run noise, constrained search space) are empirical caveats, not circular reductions.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: HumanEval and sanitized MBPP are representative proxies for code generation capability at 1-3B scale.
- Domain assumption: The constrained search space and fitness function adequately test whether complex topologies can outperform simple refinement.
Reference graph
Works this paper leans on
- [1] CYCLE: Learning to self-refine the code generation. In Proceedings of the ACM on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024.
- [2] Anonymous. AFlow: Automating agentic workflow generation. International Conference on Learning Representations (ICLR), 2025.
- [3] Anonymous. EvoFlow: Evolving diverse agentic workflows on the fly. arXiv preprint arXiv:2502.07373, 2025.
- [4] Anonymous. Controlled self-evolution for algorithmic code optimization. arXiv preprint arXiv:2601.07348, 2026.
- [5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [6] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
- [7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [8] Omar Khattab et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [9] Wenzhe Li, Zhibin Zhang, et al. Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025. Introduces Self-MoA: outperforms MoA by 6.6% on AlpacaEval 2.0 and 3.8% on average across MMLU/CRUX/MATH.
- [10] Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, and Bhaskar Ramasubramanian. Small models struggle to learn from strong reasoners. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.
- [11] Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, and Charith Peris. ARES: Adaptive red-teaming and end-to-end repair of policy-reward system, 2026.
- [12] WenTao Liu, Siyu Song, Hao Hao, and Aimin Zhou. EA4LLM: A gradient-free approach to large language model optimization via evolutionary algorithms. arXiv preprint arXiv:2510.10603, 2025.
- [13] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
- [14] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. 2024.
- [15] Siyu Wang et al. AgentConductor: Topology evolution for multi-agent competition-level code generation. arXiv preprint arXiv:2602.17100, 2026.
- [16] Weixiao Zhong, Liang Cui, Shu Liang, Shu Zhang, Chenyang Li, Yue Liu, Xing Miao, Shuming Wang, and Qiang Liu. EvoPrompt: Connecting LLMs with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.