ATLAS: A Multi-LLM Training Framework for EvoDPO with Adaptive Reference Evolution

Caleb Eunho Lee; Guang Lin; Jiyong Kwon; Madison Ann Sullivan; Ujin Jeon

arxiv: 2602.02709 · v3 · pith:2P5ZLZG5new · submitted 2026-02-02 · 💻 cs.AI

Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

Pith reviewed 2026-05-22 11:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-LLM agents · EvoDPO · adaptive reference · self-improvement · preference optimization · multi-agent training · inspection agent

The pith

Multi-LLM agents sustain longer self-improvement when an inspection agent adaptively updates the reference policy using training telemetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ATLAS as a multi-agent system in which specialized meta-agents jointly refine a main policy through iterative preference learning. It replaces fixed reference models with EvoDPO, an approach that lets an inspection agent adjust the reference policy on the fly according to signals from ongoing training. This addresses the tendency of static references to produce either overly cautious updates or outright stagnation during extended runs. A sympathetic reader would care because many current agent pipelines stop improving after a few iterations, and a working solution could allow systems to keep refining themselves across changing tasks without repeated human resets. If the central mechanism holds, agent training could shift from short supervised bursts to extended evaluator-guided evolution on domains such as optimization and physics-informed problems.

Core claim

ATLAS shows that supporter-driven exploration combined with EvoDPO-driven stability improves long-horizon evaluator-driven self-improvement. The framework deploys an inspection agent that performs adaptive, proxy-KL gated reference policy updates drawn from continuous training telemetry, and this combination outperforms fixed-reference, adaptive-reference, and external automated-discovery baselines on non-stationary contextual bandits, PINN tasks, and combinatorial problems including TSP and bin packing.
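The alternation described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `run_rounds`, `explore`, `finetune`, and `gate` are hypothetical names, and the policy object is left abstract so that any representation can be plugged in.

```python
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")  # stands in for whatever object represents a policy (assumption)


def run_rounds(
    policy: P,
    reference: P,
    explore: Callable[[P], list],          # hypothetical: supporter builds preference data
    finetune: Callable[[P, P, list], P],   # hypothetical: preference update against the reference
    gate: Callable[[P, P], bool],          # hypothetical: inspector decides on promotion
    n_rounds: int,
) -> Tuple[P, P, List[int]]:
    """Alternate exploration and gated reference updates for n_rounds."""
    promotions: List[int] = []
    for k in range(n_rounds):
        prefs = explore(policy)                      # exploration phase
        policy = finetune(policy, reference, prefs)  # update anchored to current reference
        if gate(policy, reference):                  # promote only when the gate allows it
            reference = policy
            promotions.append(k)
    return policy, reference, promotions
```

With a toy integer "policy" that gains one unit per round and a gate that fires once the gap to the reference reaches three, the reference is promoted at rounds 2 and 5 of six, which shows the intended pattern of infrequent, gated reference changes.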

What carries the argument

EvoDPO, which lets an inspection agent execute adaptive, proxy-KL gated reference policy updates based on continuous training telemetry.
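A gate of this kind might look like the sketch below. The Figure 1 caption says promotion depends on score improvement and a KL budget; everything else here is an assumption for illustration, including the `Telemetry` fields, the function name `should_promote`, and the default thresholds, none of which come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Hypothetical per-round signals available to the inspection agent."""
    candidate_score: float  # evaluator score of the freshly fine-tuned policy
    reference_score: float  # evaluator score of the current reference policy
    proxy_kl: float         # cheap estimate of divergence between candidate and reference


def should_promote(t: Telemetry, min_gain: float = 0.01, kl_budget: float = 0.5) -> bool:
    """Promote the candidate to reference only if it improves by at least
    min_gain and its proxy KL stays within kl_budget (both values are placeholders)."""
    improved = (t.candidate_score - t.reference_score) >= min_gain
    within_budget = t.proxy_kl <= kl_budget
    return improved and within_budget
```

Requiring both conditions is one plausible reading of "gated": a candidate that scores better but has drifted far from the reference is held back, and so is one that stays close but does not improve.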

If this is right

The adaptive reference mechanism reduces stagnation that fixed references cause in iterative preference learning.
Supporter agents provide exploration while the inspection agent supplies stability, yielding measurable gains on non-stationary bandits and combinatorial tasks.
The same pipeline extends to physics-informed neural network training without requiring external automated discovery methods.
Longer training horizons become feasible because reference updates keep the policy from becoming either too conservative or too unstable.
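
For context on why the reference matters, the standard DPO objective is shown below. It is the textbook form, not an equation quoted from this paper, and the paper's own notation may differ.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
```

Every log-ratio is measured against $\pi_{\mathrm{ref}}$, so a reference frozen at the initial model keeps anchoring updates to that starting point across all rounds. An evolving-reference scheme would presumably substitute a round-indexed $\pi_{\mathrm{ref}}^{(k)}$; that symbol is an assumption made here for illustration.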

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The inspection-agent approach might be combined with other preference optimization losses that currently rely on static references.
If telemetry proves reliable across more domains, similar gating could reduce the need for manual hyper-parameter sweeps in multi-agent training.
Extending the framework to even longer sequences or higher-dimensional tasks would test whether the stability benefit persists when task complexity increases.

Load-bearing premise

Continuous training telemetry is stable and informative enough for an inspection agent to decide when and how much to evolve the reference policy without creating new instability or reward hacking.

What would settle it

A side-by-side run on a long sequence of the same tasks in which the adaptive-reference version stops improving or begins to degrade at the same iteration count as the fixed-reference baseline.
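
One way to make "stops improving at the same iteration count" operational is to record the last iteration at which each run beat its own running best. The sketch below assumes higher scores are better and uses a hypothetical tolerance `tol`; neither the function nor the threshold comes from the paper.

```python
def last_improvement_iter(scores, tol=1e-3):
    """Return the index of the last iteration whose score exceeded the
    running best by more than tol (0 if the run never improved)."""
    best = scores[0]
    last = 0
    for i, s in enumerate(scores[1:], start=1):
        if s > best + tol:
            best, last = s, i
    return last
```

Applied to both runs on the same task sequence, equal return values would count against the claim, while a clearly later value for the adaptive-reference run would support it.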

Figures

Figures reproduced from arXiv: 2602.02709 by Caleb Eunho Lee, Guang Lin, Jiyong Kwon, Madison Ann Sullivan, Ujin Jeon.

**Figure 1.** ATLAS workflow. ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution) alternates between (i) exploration with a supporter agent to generate diverse candidates and a preference dataset, and (ii) EvoDPO updates consisting of DPO fine-tuning (strategist-guided) and reference promotion via an inspector gate based on score improvement and a KL budget. The reference policy used at fine-tuning pha…

**Figure 2.** Experimental results across distinct domains. (a) Bandit Negative Mean Regret (NMR). (b) PINN validation loss (log scale). Shaded regions represent the standard error of the mean (SEM) across 5 independent seeds.
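
The shaded bands in Figure 2 are described as the SEM across seeds. A minimal sketch of that computation follows, assuming each seed yields a curve of equal length; `mean_and_sem` is a hypothetical helper, and the sample standard deviation (n - 1 denominator) is the usual convention rather than something the caption confirms.

```python
import math
import statistics


def mean_and_sem(curves):
    """curves: list of per-seed lists, all the same length.
    Returns per-step means and standard errors of the mean."""
    n_seeds = len(curves)
    means, sems = [], []
    for values in zip(*curves):  # iterate over time steps, gathering one value per seed
        means.append(statistics.mean(values))
        # statistics.stdev is the sample standard deviation (n - 1 denominator)
        sems.append(statistics.stdev(values) / math.sqrt(n_seeds))
    return means, sems
```

With only 5 seeds the SEM band is narrow relative to the spread of individual runs, which is worth keeping in mind when reading overlapping curves.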

Original abstract

Recent multi-LLM agent systems have shown promising capabilities for automated problem-solving, yet they predominantly rely on frozen agents or static fine-tuning pipelines. To address this limitation, our primary contribution is ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a multi-agent framework where specialized meta-agents collaboratively train and refine an active agent toward a domain-specific policy. A core challenge in iterative preference learning within these pipelines is the reliance on fixed reference models, which typically leads to overly conservative updates or training stagnation. To overcome this, the framework's algorithmic engine utilizes Evolving Direct Preference Optimization (EvoDPO). EvoDPO employs an inspection agent to perform adaptive, proxy-KL gated reference policy updates based on continuous training telemetry. We evaluate this full framework across a diverse set of challenging environments, including non-stationary contextual bandits, partial differential equations (PINNs), and combinatorial optimization tasks (TSP, Bin Packing). Through comparison against fixed-reference, adaptive-reference, and external automated-discovery baselines, our results suggest that ATLAS combines supporter-driven exploration with EvoDPO-driven stability to improve long-horizon evaluator-driven self-improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ATLAS adds an inspection agent for proxy-KL gated reference updates in multi-LLM EvoDPO training, but the stability gains rest on thin evidence so far.

The key point on ATLAS is that it introduces an inspection agent to manage adaptive, proxy-KL gated updates to the reference policy within an EvoDPO setup for multi-LLM agent training. This targets the issue of fixed references causing conservative updates or stagnation in long-running self-improvement loops.

The framework itself is described clearly. It uses specialized meta-agents, including supporters for exploration, to collaboratively refine an active agent. EvoDPO is the engine that allows the reference to evolve based on continuous telemetry from training. Testing on non-stationary contextual bandits, PINNs, TSP, and bin packing shows an attempt to cover different problem types, which is a plus for showing broader relevance.

What stands out as new is the combination of the inspection agent with the gating mechanism inside the multi-agent loop. It builds on existing DPO ideas but adds this dynamic reference evolution.

The soft spots are mainly around the results. The abstract claims that comparisons suggest ATLAS improves long-horizon performance, yet no specific metrics, standard deviations, or baseline implementation details are given. This makes it tough to evaluate whether the adaptive component truly provides stability gains or whether other factors are at play. The potential for the inspection agent to introduce instability through poor telemetry decisions or reward hacking isn't addressed with any analysis in the provided description, so that remains an open question.

This paper would interest researchers focused on building self-evolving multi-agent LLM systems and those experimenting with preference optimization in iterative settings. It could offer practical ideas for extending training pipelines beyond static models.

I would recommend sending it to peer review. The core idea is a legitimate incremental step, and referees could help by requesting the missing quantitative evidence and checks on the gating robustness.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces ATLAS, a multi-agent framework for agentic self-evolution in which specialized meta-agents collaboratively train an active agent. Its core algorithmic contribution is EvoDPO, which uses an inspection agent to perform adaptive, proxy-KL-gated reference-policy updates driven by continuous training telemetry. The framework is evaluated on non-stationary contextual bandits, PINNs, TSP, and Bin Packing, with comparisons to fixed-reference, adaptive-reference, and external automated-discovery baselines; the abstract claims that the combination of supporter-driven exploration and EvoDPO-driven stability yields improved long-horizon evaluator-driven self-improvement.

Significance. If the adaptive reference mechanism demonstrably improves stability and performance over static baselines without introducing oscillations or reward hacking, the work would address a recognized bottleneck in iterative preference optimization for multi-LLM systems and could influence automated discovery pipelines. The absence of any quantitative results, ablation tables, or sensitivity analysis in the provided abstract, however, prevents assessment of whether the claimed gains are robust or merely artifacts of unstated hyper-parameter choices.

major comments (3)

[Abstract] The central claim that ATLAS 'improves long-horizon evaluator-driven self-improvement' rests on comparisons against fixed-reference, adaptive-reference, and external baselines, yet the abstract supplies no numerical metrics, error bars, statistical tests, or implementation details for those baselines. This omission is load-bearing because the soundness of the improvement claim cannot be evaluated without them.
[Abstract / §3 (EvoDPO description)] The proxy-KL gating mechanism for reference-policy evolution is described only at a high level; no equations, threshold values, or ablation on gating sensitivity are provided. Without these, it is impossible to determine whether the adaptive component delivers net stability gains or merely trades one form of instability for another across the tested environments.
[Abstract] The evaluation environments (non-stationary bandits, PINNs, TSP, Bin Packing) are listed, but no details on task horizons, reward formulations, or how the inspection agent's telemetry decisions were validated against reward hacking are given. These omissions directly affect the weakest assumption identified in the stress-test note.

minor comments (1)

[Abstract] The acronym 'ATLAS' is expanded as 'Adaptive Task-distributed Learning for Agentic Self-evolution' in the abstract; ensure this expansion appears consistently in the introduction and method sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying aspects of the work and indicating revisions where they strengthen the presentation without altering the core contributions.

Referee: [Abstract] The central claim that ATLAS 'improves long-horizon evaluator-driven self-improvement' rests on comparisons against fixed-reference, adaptive-reference, and external baselines, yet the abstract supplies no numerical metrics, error bars, statistical tests, or implementation details for those baselines. This omission is load-bearing because the soundness of the improvement claim cannot be evaluated without them.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports these metrics in the experimental sections, but to make the central claim more self-contained, we have revised the abstract to include representative performance gains (with standard deviations across runs) and brief references to the statistical comparisons performed against the listed baselines.

Revision: yes

Referee: [Abstract / §3 (EvoDPO description)] The proxy-KL gating mechanism for reference-policy evolution is described only at a high level; no equations, threshold values, or ablation on gating sensitivity are provided. Without these, it is impossible to determine whether the adaptive component delivers net stability gains or merely trades one form of instability for another across the tested environments.

Authors: Section 3 of the manuscript presents the mathematical formulation of the proxy-KL gating mechanism, including the divergence-based update rule. To directly address the request for implementation specifics, we have added the exact threshold values used in all experiments and inserted a dedicated ablation subsection (with accompanying table) that varies the gating sensitivity parameter and reports resulting stability and performance metrics across environments. These results indicate net stability gains rather than traded instabilities.

Revision: partial

Referee: [Abstract] The evaluation environments (non-stationary bandits, PINNs, TSP, Bin Packing) are listed, but no details on task horizons, reward formulations, or how the inspection agent's telemetry decisions were validated against reward hacking are given. These omissions directly affect the weakest assumption identified in the stress-test note.

Authors: We have expanded the experimental setup and evaluation sections of the revised manuscript to include explicit task horizons, reward function definitions, and a new subsection detailing the inspection agent's telemetry validation. This includes controlled experiments that monitor for reward hacking indicators (e.g., policy divergence spikes without corresponding evaluator improvement) and demonstrate that the proxy-KL gate triggers reference updates only when telemetry confirms beneficial evolution.

Revision: yes
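
The indicator named in the last response (a divergence spike with no matching evaluator gain) can be sketched as a simple trace check. This is an illustration under stated assumptions: `flag_suspect_rounds`, the multiplicative spike factor `kl_spike`, and `min_gain` are hypothetical and do not describe the checks the authors actually ran.

```python
def flag_suspect_rounds(kl_trace, score_trace, kl_spike=2.0, min_gain=0.0):
    """Return indices of rounds where proxy KL grows by more than kl_spike
    times the previous round while the evaluator score does not improve
    by more than min_gain."""
    flagged = []
    for i in range(1, len(kl_trace)):
        spiked = kl_trace[i] > kl_spike * kl_trace[i - 1]
        gained = (score_trace[i] - score_trace[i - 1]) > min_gain
        if spiked and not gained:
            flagged.append(i)
    return flagged
```

A flagged round is only a prompt for inspection, since a divergence jump can also precede a delayed score gain; a production check would likely look at a window of rounds rather than adjacent pairs.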

Circularity Check

0 steps flagged

No circularity: framework description and empirical comparisons lack any derivation chain

The provided abstract and description contain no equations, no mathematical derivation, and no load-bearing self-citations that reduce a claimed result to a fitted parameter or prior ansatz by construction. ATLAS and EvoDPO are introduced as a descriptive framework with an inspection agent performing proxy-KL gated updates based on telemetry; results are presented as outcomes of comparisons against baselines across bandits, PINNs, and TSP. No step equates a prediction to its own input, renames a known pattern, or imports uniqueness via overlapping-author citation. The presentation is therefore self-contained as an empirical proposal rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The central claim appears to rest on the unstated premise that training telemetry is a reliable signal for reference evolution and that the proxy-KL gate prevents harmful drift. No free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5740 in / 1188 out tokens · 27365 ms · 2026-05-22T11:37:22.988285+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
cs.LG 2026-05 unverdicted novelty 6.0

GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new num...

