pith. machine review for the scientific record.

arxiv: 2604.21950 · v1 · submitted 2026-04-23 · 💻 cs.SE · cs.AI · cs.LG

Recognition: unknown

Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:32 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords code generation · execution feedback · pipeline topology · small language models · self-refinement · evolutionary search · HumanEval · MBPP

The pith

Self-refinement with execution feedback improves 1-3B code generation more than complex pipeline topologies do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pipelines of 1-3B language models can recover capability on hard code generation tasks when they incorporate execution feedback. It applies a NEAT-inspired evolutionary search to explore pipeline topologies and compares the results against a basic generate-execute-refine loop on HumanEval and sanitized MBPP. Feedback-driven refinement produces large gains by correcting runtime errors, yet more elaborate structures add no clear benefit and code-specialized models still outperform every general-purpose pipeline. A reader should care because the findings suggest that, for models small enough to run locally, simple feedback mechanisms deliver more practical improvement than architectural complexity.

Core claim

Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The mechanism is narrow: refinement fixes many runtime errors such as NameError and SyntaxError but rarely fixes logic errors such as AssertionError. Within the general-purpose model pool, refiner capability matters more than generator identity, so a 1.5B generator paired with a 3B refiner matches a 3B model performing both roles. Early stopping is required: without it, every iteration is net-negative. In the constrained search space, evolutionary search rediscovers the simple refinement loop with no significant gain from added topology, while single-evaluation fitness inflates results by 5-7 percent, selecting lucky genomes over good ones.
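The loop at the center of the claim can be sketched in a few lines. This is a minimal illustration, not the paper's code: `generate` and `refine` are hypothetical callables standing in for the small local models, and the execution harness is an unsandboxed stand-in.

```python
# Sketch of a generate-execute-refine loop with early stopping.
# `generate` and `refine` are placeholders for 1-3B model calls.
import traceback

def run_candidate(code: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code plus its tests; return (passed, feedback)."""
    scope: dict = {}
    try:
        exec(code + "\n" + tests, scope)  # a real harness would sandbox this
        return True, ""
    except Exception:
        return False, traceback.format_exc(limit=1)

def generate_with_refinement(problem, tests, generate, refine, max_iters=3):
    code = generate(problem)
    passed, feedback = run_candidate(code, tests)
    for _ in range(max_iters):
        if passed:  # early stopping: never touch a passing solution
            break
        code = refine(problem, code, feedback)
        passed, feedback = run_candidate(code, tests)
    return code, passed
```

The early-stopping check comes first in the loop body, matching the finding that iterating past a passing solution is net-negative.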

What carries the argument

NEAT-inspired evolutionary search over pipeline topologies that incorporate execution feedback for iterative refinement

Load-bearing premise

The constrained evolutionary search space is broad enough to discover any superior topologies beyond the simple refinement loop, and single-run fitness evaluations still allow reliable ranking of pipelines.
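The kind of search this premise describes can be sketched as follows. The genome encoding (a flat list of stage names) and the mutation operators are illustrative assumptions for the example, not the paper's exact NEAT implementation.

```python
# Illustrative NEAT-style structural search over pipeline topologies.
import random

def mutate(genome: list) -> list:
    """Add, drop, or swap a pipeline stage (structural mutation)."""
    g = list(genome)
    op = random.choice(["add", "drop", "swap"])
    if op == "add":
        g.insert(random.randrange(len(g) + 1),
                 random.choice(["refine", "execute", "vote"]))
    elif op == "drop" and len(g) > 1:
        g.pop(random.randrange(len(g)))
    elif op == "swap" and len(g) > 1:
        i, j = random.sample(range(len(g)), 2)
        g[i], g[j] = g[j], g[i]
    return g

def evolve(fitness, seed=("generate", "execute", "refine"),
           pop_size=8, generations=10):
    """Elitist evolutionary loop: keep the top half, mutate it back up."""
    population = [list(seed)] + [mutate(list(seed)) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]
        population = elite + [mutate(random.choice(elite)) for _ in elite]
    return max(population, key=fitness)
```

If the fitness landscape genuinely peaks at the simple loop, this search converges back to it, which is the paper's reported outcome; the premise above is that the space and fitness signal are rich enough for that outcome to be informative.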

What would settle it

Running the evolutionary search and finding a pipeline topology that scores significantly higher than the simple refinement loop on both benchmarks would show that added topology can matter.

Figures

Figures reproduced from arXiv: 2604.21950 by Charles Junichi McAndrews.

Figure 1. Pipeline architecture showing a single refinement stage.
Figure 2. Fix rates by error type across both benchmarks.
Figure 3. The early stopping paradox.
read the original abstract

Small language models (1-3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1-3B models with execution feedback, and use a NEAT-inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general-purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net-negative. The code-specialized models outperform every general-purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text-only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate-execute-refine loop we found manually, with no clearly significant gain from added topology. Single-evaluation fitness inflates results by 5-7 percent, selecting lucky genomes over good ones. On these benchmarks at 1-3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that for 1-3B parameter code generation models, execution feedback in pipelines improves performance substantially (>4 standard deviations on HumanEval and sanitized MBPP) by fixing runtime errors such as NameError and SyntaxError, while added pipeline topological complexity beyond a simple generate-execute-refine loop provides no clear benefit. Using a NEAT-inspired evolutionary search over a constrained space, the search mostly rediscovers the simple refinement loop; generator identity matters less than refiner capability, code-specialized models outperform general-purpose pipelines, and single-run fitness evaluations inflate results by 5-7% while early stopping is required to avoid net-negative iterations.

Significance. If the results hold, the work provides practical guidance for local deployment of small models by showing that simple execution-feedback loops suffice and that model specialization outweighs pipeline architecture at this scale. Strengths include quantitative reporting with standard deviations, detailed error-type breakdowns, explicit flagging of single-evaluation inflation, and the necessity of early stopping; these elements make the positive feedback result robust via direct comparisons.

major comments (2)
  1. [Evolutionary Search section] Evolutionary search methodology: Single-run fitness evaluations (acknowledged to inflate pass rates by 5-7%) directly impair genome ranking and selection, weakening the negative result that no superior topologies exist beyond the simple loop; this noise, combined with the explicitly constrained search space, means failure to discover better structures could stem from evaluation error or limited exploration rather than true absence of benefit.
  2. [Results and Discussion] Results on pipeline comparisons: While direct with/without feedback ablations support the primacy of execution feedback, the topology conclusion relies on rediscovery within a constrained NEAT-inspired space; without a broader or less noisy search (e.g., multi-run fitness or expanded operators), the claim that 'added pipeline complexity' does not help remains tentative and load-bearing for the central 'feedback over form' thesis.
minor comments (2)
  1. [Methods] Methods: Additional specifics on exact model checkpoints (e.g., precise 1.5B/3B variants), inference hyperparameters (temperature, max tokens, top-p), and the exact early-stopping rule would strengthen reproducibility of the local-inference experiments.
  2. [Error Analysis] Error analysis: The breakdown of fixed vs. unfixed errors (e.g., AssertionError) would benefit from reporting sample counts per category and any statistical tests to support the claim that refinement 'rarely fixes logic errors'.
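The error-type split at issue here can be made concrete with a harness that tags failures by exception class: AssertionError as a logic error, NameError and SyntaxError as runtime errors. This is a simplified stand-in for illustration, not the paper's harness.

```python
# Classify a candidate solution's failure mode by exception type.
def classify_failure(code: str, tests: str):
    """Return None on pass, 'logic' or 'runtime' on known failure modes,
    or the exception class name otherwise."""
    scope: dict = {}
    try:
        exec(compile(code + "\n" + tests, "<candidate>", "exec"), scope)
        return None  # all tests passed
    except AssertionError:
        return "logic"    # wrong output: rarely fixed by refinement
    except (SyntaxError, NameError):
        return "runtime"  # surface error: often fixed by refinement
    except Exception as e:
        return type(e).__name__
```

Aggregating these tags per benchmark, with counts and a proportion test per category, would address the referee's request directly.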

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the acknowledgment of our quantitative reporting, error breakdowns, and explicit discussion of single-evaluation inflation. Below we respond point-by-point to the major comments, agreeing that the evolutionary search results warrant additional qualification while maintaining that the direct feedback ablations robustly support the primacy of execution feedback.

read point-by-point responses
  1. Referee: [Evolutionary Search section] Evolutionary search methodology: Single-run fitness evaluations (acknowledged to inflate pass rates by 5-7%) directly impair genome ranking and selection, weakening the negative result that no superior topologies exist beyond the simple loop; this noise, combined with the explicitly constrained search space, means failure to discover better structures could stem from evaluation error or limited exploration rather than true absence of benefit.

    Authors: We thank the referee for this observation. The manuscript already states that single-run fitness inflates pass rates by 5-7% and that early stopping is required to avoid net-negative iterations. Despite this acknowledged noise, repeated independent evolutionary runs consistently converged on the simple generate-execute-refine loop and did not surface any topology with statistically significant improvement. Noise in fitness would be expected to occasionally promote suboptimal genomes, so the repeated rediscovery of the same simple structure under noisy conditions provides some evidence against the existence of markedly superior topologies within the space. Nevertheless, we agree that multi-run fitness would yield more reliable genome rankings. Given the computational expense of local 1-3B inference across the required number of evaluations, multi-run fitness was not performed. In revision we will expand the Evolutionary Search section to explicitly discuss how evaluation noise affects the strength of the negative topology result and to list multi-run fitness and expanded operators as valuable future directions. revision: partial

  2. Referee: [Results and Discussion] Results on pipeline comparisons: While direct with/without feedback ablations support the primacy of execution feedback, the topology conclusion relies on rediscovery within a constrained NEAT-inspired space; without a broader or less noisy search (e.g., multi-run fitness or expanded operators), the claim that 'added pipeline complexity' does not help remains tentative and load-bearing for the central 'feedback over form' thesis.

    Authors: We agree that the topology-related finding is necessarily qualified by the constraints of the search space and the single-run evaluation protocol, and the manuscript already qualifies the claim with the phrase 'in our constrained search space.' The central thesis—that execution feedback matters more than pipeline topology at 1-3B scale—rests primarily on the direct with/without-feedback ablations, which are independent of the evolutionary search and demonstrate gains exceeding four standard deviations on both benchmarks. The evolutionary search was an exploratory complement intended to test whether more elaborate topologies could deliver additional benefit; its failure to identify such topologies, even under noisy conditions, supplies supporting rather than conclusive evidence. In the revised manuscript we will strengthen the Results and Discussion section by (1) reiterating that the topology conclusion is scoped to the explored space and (2) explicitly noting that broader or less noisy searches remain an open direction for future work. This will reduce any impression that the topology result is load-bearing for the overall 'feedback over form' argument. revision: partial
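The single-run inflation both responses concede is easy to demonstrate numerically: with stochastic evaluation, one run often ranks a weaker pipeline above a stronger one, and averaging a few runs shrinks that flip rate. The pass probabilities below are invented for illustration and are not the paper's measurements.

```python
# Simulate noisy fitness evaluation of two pipelines with known true pass rates.
import random

def noisy_pass_rate(true_rate: float, n_problems: int = 100) -> float:
    """One benchmark run: each problem passes independently with true_rate."""
    return sum(random.random() < true_rate for _ in range(n_problems)) / n_problems

def fitness(true_rate: float, runs: int = 1) -> float:
    """Mean pass rate over `runs` independent evaluations."""
    return sum(noisy_pass_rate(true_rate) for _ in range(runs)) / runs

random.seed(0)
strong, weak = 0.60, 0.55  # hypothetical true pass rates
# How often does each protocol rank the weak pipeline above the strong one?
flips_1 = sum(fitness(weak, 1) > fitness(strong, 1) for _ in range(1000))
flips_5 = sum(fitness(weak, 5) > fitness(strong, 5) for _ in range(1000))
```

Under these assumed rates the single-run protocol misranks the pair far more often than the five-run average, which is the mechanism behind "selecting lucky genomes over good ones."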

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparisons with no derivations or self-referential reductions

full rationale

The paper reports experimental outcomes from running 1-3B models on HumanEval and MBPP, using a NEAT-inspired evolutionary search over pipeline topologies and direct with/without execution-feedback ablations. All claims (e.g., >4 SD improvement from refinement, rediscovery of the simple loop, 5-7% fitness inflation) are grounded in measured pass rates and search results rather than any equation, fitted parameter renamed as prediction, or theorem whose justification loops back to the present work. No mathematical derivation chain exists; the central contrast between feedback and topology is established by independent experimental controls. The acknowledged limitations (single-run noise, constrained search space) are empirical caveats, not circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on empirical benchmark comparisons and evolutionary search outcomes; no free parameters are fitted to produce the main result, and no new entities are postulated.

axioms (2)
  • domain assumption HumanEval and sanitized MBPP are representative proxies for code generation capability at 1-3B scale
    Invoked by using them as the sole evaluation targets; if unrepresentative, the relative importance of feedback versus topology could shift.
  • domain assumption The constrained search space and fitness function adequately test whether complex topologies can outperform simple refinement
    Stated via the NEAT-inspired search description; the negative result depends on this coverage assumption.

pith-pipeline@v0.9.0 · 5603 in / 1369 out tokens · 40205 ms · 2026-05-09T21:32:39.057682+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1] Cycle: Learning to self-refine the code generation. In Proceedings of the ACM on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024.

  2. [2] Anonymous. AFlow: Automating agentic workflow generation. International Conference on Learning Representations (ICLR), 2025.

  3. [3] Anonymous. EvoFlow: Evolving diverse agentic workflows on the fly. arXiv preprint arXiv:2502.07373, 2025.

  4. [4] Anonymous. Controlled self-evolution for algorithmic code optimization. arXiv preprint arXiv:2601.07348, 2026.

  5. [5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

  6. [6] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.

  7. [7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  8. [8] Omar Khattab et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.

  9. [9] Wenzhe Li, Zhibin Zhang, et al. Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025. Introduces Self-MoA: outperforms MoA by 6.6% on AlpacaEval 2.0 and 3.8% avg. on MMLU/CRUX/MATH.

  10. [10] Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, and Bhaskar Ramasubramanian. Small models struggle to learn from strong reasoners. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.

  11. [11] Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, and Charith Peris. ARES: Adaptive red-teaming and end-to-end repair of policy-reward system, 2026.

  12. [12] WenTao Liu, Siyu Song, Hao Hao, and Aimin Zhou. EA4LLM: A gradient-free approach to large language model optimization via evolutionary algorithms. arXiv preprint arXiv:2510.10603, 2025.

  13. [13] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

  14. [14] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. 2024.

  15. [15] Siyu Wang et al. AgentConductor: Topology evolution for multi-agent competition-level code generation. arXiv preprint arXiv:2602.17100, 2026.

  16. [16] Weixiao Zhong, Liang Cui, Shu Liang, Shu Zhang, Chenyang Li, Yue Liu, Xing Miao, Shuming Wang, and Qiang Liu. EvoPrompt: Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.