AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Emad Barsoum; Pratik Prabhanjan Brahma; Pretam Ray; Zicheng Liu

arxiv: 2602.11931 · v2 · submitted 2026-02-12 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 02:48 UTC · model grok-4.3

keywords: adaptive LLM selection · evolutionary agents · model cascades · inference efficiency · generation confidence · Pareto frontier

The pith

Adaptive selection of language models via their own confidence scores reduces inference costs by about 38 percent in evolutionary agents while retaining 97.5 percent of the accuracy of a static large-model baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evolutionary AI agents refine solutions by repeatedly calling large language models, which drives up inference expense. The paper demonstrates that agents can switch to smaller models for refinement steps where the current model shows high generation confidence. This switch is guided by the model's intrinsic uncertainty signal rather than fixed rules or external routers. Across benchmarks the approach cuts total cost while the final answers stay close to what the largest model would achieve alone.

Core claim

AdaptEvolve performs adaptive LLM selection inside an evolutionary sequential refinement loop by treating intrinsic generation confidence as an estimate of whether the current step can be solved by a smaller model. When confidence is high the system routes to a cheaper model; when low it stays with the large model. This produces a Pareto frontier that lowers average inference cost by 37.9 percent while preserving 97.5 percent of the accuracy obtained from always using the static large-model baseline.
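
The routing rule described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the `refine` function, the `StepRecord` record, and the threshold `tau` (with its default of 0.8) are all hypothetical names and values introduced here, and the summary does not say how confidence is computed or where the threshold sits.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a model takes the current solution and returns
# (refined_solution, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class StepRecord:
    model_used: str
    confidence: float


def refine(solution: str, small: Model, large: Model,
           steps: int, tau: float = 0.8) -> Tuple[str, List[StepRecord]]:
    """Sequential refinement with confidence-gated model choice.

    tau is an assumed threshold; the source does not state its value.
    """
    history: List[StepRecord] = []
    use_small = False  # start with the large model
    for _ in range(steps):
        name, model = ("small", small) if use_small else ("large", large)
        solution, conf = model(solution)
        history.append(StepRecord(name, conf))
        # High confidence: the next step goes to the cheaper model.
        # Low confidence: stay with (or return to) the large model.
        use_small = conf >= tau
    return solution, history


# Stub models for demonstration only; real models would call an LLM.
def stub_large(s: str) -> Tuple[str, float]:
    return s + "L", 0.9


def stub_small(s: str) -> Tuple[str, float]:
    return s + "s", 0.5


final, hist = refine("x", stub_small, stub_large, steps=4)
```

With these stubs the loop alternates between the two models, because the large stub always reports high confidence and the small stub always reports low confidence. A real deployment would presumably show longer runs on one model or the other.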

What carries the argument

Intrinsic generation confidence used as a real-time solvability estimator to decide model size at each evolutionary refinement step.
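
One common way to obtain an intrinsic confidence score is to average the token log-probabilities of a generation. The summary does not say which confidence measure the paper uses, so the sketch below is an assumption for illustration only, and `mean_token_confidence` is a hypothetical helper name.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of one generation, in (0, 1].

    Assumed measure: the source does not define its confidence score.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

A score like this comes from the same forward pass that produced the text, which is what makes it an intrinsic signal rather than an external router.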

If this is right

Total inference cost across benchmarks falls by an average of 37.9 percent.
Final accuracy remains within 2.5 percent (relative) of the static large-model ceiling.
Model cascades improve when routing decisions incorporate the model's own uncertainty rather than static heuristics.
The evolutionary refinement process tolerates smaller models on confident steps without measurable degradation in solution quality.
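
The two headline figures can be read as multipliers on a static large-model baseline. The sketch below assumes the 97.5 percent retention is relative to baseline accuracy, as the abstract's wording suggests. The function name and the baseline values in the usage note are hypothetical and do not come from the paper.

```python
from typing import Tuple


def adaptive_operating_point(baseline_cost: float,
                             baseline_accuracy: float,
                             cost_reduction: float = 0.379,
                             accuracy_retention: float = 0.975
                             ) -> Tuple[float, float]:
    """Scale a large-model baseline by the reported average figures."""
    cost = baseline_cost * (1.0 - cost_reduction)
    accuracy = baseline_accuracy * accuracy_retention
    return cost, accuracy


# Hypothetical baseline: 100 cost units at 80 percent accuracy.
cost, acc = adaptive_operating_point(100.0, 0.80)
```

For this hypothetical baseline the adaptive method would cost about 62.1 units and reach about 78 percent accuracy. The reported figures are averages across benchmarks, so individual benchmarks may land on either side of them.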

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same confidence signal could be tested for routing decisions in non-evolutionary multi-step agent workflows such as tool-use chains.
If the signal holds across model families, mixed-model agent stacks could adopt the selection logic without retraining.
Lightweight auxiliary predictors of confidence might further lower overhead by avoiding full forward passes on every candidate model.

Load-bearing premise

An LLM's self-reported generation confidence accurately indicates whether a smaller model can correctly complete the current evolutionary refinement step.

What would settle it

A controlled run in which high-confidence steps are still forced to the large model and the final task accuracy rises by more than the 2.5 percent gap reported for the adaptive method.

Original abstract

Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AdaptEvolve, a framework for adaptive LLM selection within evolutionary agentic systems. It uses intrinsic generation confidence scores from LLMs to dynamically route between large and small models during sequential refinement steps, with the goal of improving computational efficiency. The central empirical claim is that this confidence-driven approach produces a favorable Pareto frontier, achieving an average 37.9% reduction in total inference cost across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines.

Significance. If the reported results hold under rigorous validation, the method offers a practical, uncertainty-aware routing mechanism that could meaningfully reduce inference costs in evolutionary AI agents without substantial accuracy loss. The open availability of code supports reproducibility. However, the contribution's impact is currently difficult to assess because the key assumption, that intrinsic confidence is a reliable proxy for per-step solvability, lacks supporting analysis, which limits claims of generalizability.

major comments (2)

Abstract: The headline result (37.9% cost reduction at 97.5% accuracy retention) is stated without any benchmark details, number of runs, error bars, ablation studies, or description of how confidence thresholds were determined, rendering the central claim unverifiable from the provided text.
Empirical results section: The selection rule relies on intrinsic generation confidence as a proxy for real-time solvability by smaller models, but no calibration plots, per-step success-vs-confidence scatter plots, or threshold-selection protocol are described; this correlation is load-bearing for the safety of down-selection decisions and the Pareto claim.

minor comments (1)

Abstract: The description of the method as 'Adaptive LLM Selection for Multi-LLM Evolutionary Refinement' could be more explicitly linked to the title 'AdaptEvolve' for terminological consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on verifiability and the need for explicit validation of the confidence proxy. We address each major comment below and will incorporate the suggested additions in the revised manuscript to strengthen the empirical claims.

Point-by-point responses

Referee: Abstract: The headline result (37.9% cost reduction at 97.5% accuracy retention) is stated without any benchmark details, number of runs, error bars, ablation studies, or description of how confidence thresholds were determined, rendering the central claim unverifiable from the provided text.

Authors: We agree that the abstract lacks sufficient detail to make the headline results verifiable on its own. In the revision we will expand the abstract to name the specific benchmarks, state the number of independent runs, reference the error bars shown in the main text, and briefly describe the validation-based procedure used to set confidence thresholds. Full ablation tables will continue to appear in the body with a forward reference from the abstract. This change directly addresses the verifiability concern while remaining within abstract length limits.
Revision: yes

Referee: Empirical results section: The selection rule relies on intrinsic generation confidence as a proxy for real-time solvability by smaller models, but no calibration plots, per-step success-vs-confidence scatter plots, or threshold-selection protocol are described; this correlation is load-bearing for the safety of down-selection decisions and the Pareto claim.

Authors: We acknowledge that the current manuscript does not include the requested supporting analyses. The revised version will add (i) calibration plots of per-step accuracy versus binned confidence scores, (ii) scatter plots of success rate versus confidence for each evolutionary refinement step, and (iii) an explicit threshold-selection protocol that uses a held-out validation set to choose operating points on the accuracy-cost curve. These additions will provide direct evidence for the reliability of the intrinsic-confidence proxy and will be placed in a new subsection of the empirical results.
Revision: yes
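
The binned calibration analysis and the held-out threshold selection promised in this response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' protocol: `calibration_bins`, `pick_threshold`, the bin count, the candidate thresholds, and the validation data are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def calibration_bins(conf: Sequence[float], correct: Sequence[bool],
                     n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (bin_lower_edge, empirical_accuracy, count) per non-empty bin."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(conf, correct):
        i = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        counts[i] += 1
        hits[i] += int(ok)
    return [(i / n_bins, hits[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i]]


def pick_threshold(conf: Sequence[float], correct: Sequence[bool],
                   candidates: Sequence[float],
                   min_accuracy: float) -> Optional[float]:
    """Lowest candidate threshold whose accepted validation steps reach
    min_accuracy; None if no candidate qualifies."""
    for tau in sorted(candidates):
        accepted = [ok for c, ok in zip(conf, correct) if c >= tau]
        if accepted and sum(accepted) / len(accepted) >= min_accuracy:
            return tau
    return None


# Hypothetical held-out validation data: per-step confidence and whether
# the smaller model solved that step.
val_conf = [0.1, 0.3, 0.55, 0.7, 0.9, 0.95]
val_ok = [False, False, True, False, True, True]

bins = calibration_bins(val_conf, val_ok)
tau = pick_threshold(val_conf, val_ok, candidates=[0.5, 0.8], min_accuracy=0.9)
```

Choosing the lowest qualifying threshold sends as many steps as possible to the cheaper model while still meeting the accuracy floor on validation data. Sweeping `min_accuracy` would trace out candidate operating points on the accuracy-cost curve.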

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces an empirical adaptive selection method based on intrinsic LLM generation confidence within an evolutionary refinement loop. No equations, derivations, or fitted parameters are presented that reduce the claimed Pareto improvement to a self-referential definition, a fitted input renamed as prediction, or a self-citation chain. The selection rule is stated directly from model confidence scores, and performance claims rest on benchmark measurements rather than any tautological reduction to inputs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that confidence scores can safely trigger model down-selection. No free parameters are named in the abstract. The key unstated premise is treated as a domain assumption rather than derived.

axioms (1)

Domain assumption: LLM generation confidence correlates sufficiently with step-wise solvability to permit safe model-size reduction without final accuracy loss.
The method uses this correlation to decide when to switch models; no independent verification is supplied in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Do Evolutionary Coding Agents Evolve?
cs.NE · 2026-05 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.