Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Dimitrios G. Giovanis; Eleni D. Koronaki; Geremy Loachamín-Suntaxi; Ioannis G. Kevrekidis; Robert Lazar

arxiv: 2605.30042 · v1 · pith:3QEIIA7Anew · submitted 2026-05-28 · 💻 cs.AI

Pith reviewed 2026-06-29 07:03 UTC · model grok-4.3

classification 💻 cs.AI

keywords: multi-agent systems, semantic checkpoints, contextual bandits, empowerment, semantic drift, scientific computing, method selection, LLM agents

The pith

Semantic checkpoints in multi-agent systems preserve action-outcome fidelity to improve policy convergence and adaptation in scientific computing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-agent framework for selecting and executing computational methods in scientific workflows. It combines contextual bandits for decision-making with semantic checkpoints that ensure the executed actions match the intended strategies. This prevents semantic drift, where inconsistencies between agents corrupt learning and evaluation. Demonstrations in sensitivity analysis and uncertainty quantification show better convergence, robustness, and handling of new problems compared to systems without such safeguards.
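
To make the checkpoint idea concrete, here is a minimal Python sketch of reward crediting gated by a fidelity check. It is an illustration under stated assumptions, not the paper's implementation: the names `ArmStats`, `semantic_checkpoint`, and `credit_reward` are hypothetical, the method labels are borrowed from the figure captions, the reward values are made up, and the exact string comparison is only a stand-in for whatever semantic check the authors' agents perform.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ArmStats:
    """Running reward statistics for one candidate method (one bandit arm)."""
    pulls: int = 0
    mean_reward: float = 0.0

    def update(self, reward: float) -> None:
        # Incremental mean update.
        self.pulls += 1
        self.mean_reward += (reward - self.mean_reward) / self.pulls


def semantic_checkpoint(selected: str, executed: str) -> bool:
    """Assumed stand-in check: the executed method must match the selected one."""
    return selected.strip().lower() == executed.strip().lower()


def credit_reward(stats: Dict[str, ArmStats], selected: str, executed: str,
                  reward: float, use_checkpoint: bool = True) -> bool:
    """Credit `reward` to the selected arm unless the checkpoint detects drift."""
    if use_checkpoint and not semantic_checkpoint(selected, executed):
        return False  # drift detected: the reward is not attributable, so discard it
    stats.setdefault(selected, ArmStats()).update(reward)
    return True


# Hypothetical rewards: one faithful Sobol run, then one run that drifted to Morris.
guarded: Dict[str, ArmStats] = {}
credit_reward(guarded, "sobol", "sobol", 0.9)
credit_reward(guarded, "sobol", "morris", 0.1)

unguarded: Dict[str, ArmStats] = {}
credit_reward(unguarded, "sobol", "sobol", 0.9, use_checkpoint=False)
credit_reward(unguarded, "sobol", "morris", 0.1, use_checkpoint=False)

print(guarded["sobol"].mean_reward, unguarded["sobol"].mean_reward)
```

With the check enabled, the drifted run is discarded and the Sobol arm keeps a mean of 0.9. With it disabled, the Morris outcome is averaged into the Sobol arm and drags its estimate to 0.5, which is the kind of misattributed credit the summary calls semantic drift.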

Core claim

Viewed through the lens of empowerment, reliable autonomous learning requires not only selecting high-quality actions but also preserving their integrity as they propagate across agents. Semantic checkpoints provide that safeguard, and the case studies report improved convergence, robustness, and adaptation as a result.
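
For readers unfamiliar with the term, empowerment in the sense of Klyubin et al. (reference [15]) is the channel capacity from an agent's actions to its subsequent observations. The block below states that standard definition in generic notation ($A$ for the action, $S'$ for the resulting state, $s$ for the current state). It is offered as background and may not match the notation or the exact variant used in the paper.

```latex
\mathcal{E}(s) = \max_{p(a)} I(A; S' \mid s)
= \max_{p(a)} \sum_{a,\, s'} p(a)\, p(s' \mid a, s)\,
  \log \frac{p(s' \mid a, s)}{\sum_{a'} p(a')\, p(s' \mid a', s)}
```

Under this reading, if drift makes the executed procedure independent of the selected action, the mutual information is zero and so is the empowerment, however good the selection policy is. That is the intuition behind coupling action selection with fidelity checks.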

What carries the argument

Semantic checkpoints combined with contextual bandits and structured inter-agent communication that maintain action-outcome fidelity in the multi-agent pipeline.

Load-bearing premise

The two case studies of sensitivity analysis and uncertainty quantification suffice to establish that semantic checkpoints preserve fidelity across general scientific computing pipelines.

What would settle it

Running the workflows with and without semantic checkpoints and measuring differences in policy convergence rates and adaptation performance to new contexts.
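
The comparison described here can be mocked up before touching a real pipeline. The following toy ablation is a sketch only: it swaps the paper's contextual bandit for a plain UCB1-style index policy so the run is deterministic, uses noise-free hypothetical rewards (`TRUE_REWARD`), and assumes a fixed drift schedule in which a Sobol request silently runs Morris code on four of every five pulls. None of these numbers come from the paper, and all function names are illustrative.

```python
import math
from typing import Dict

# Hypothetical, noise-free rewards per method (not values from the paper).
TRUE_REWARD = {"sobol": 0.9, "chatterjee": 0.5, "morris": 0.2}


def executed_method(selected: str, pull_index: int) -> str:
    """Toy drift model: a 'sobol' request runs 'morris' code on four of
    every five pulls (pull_index is 1-based); other requests run faithfully."""
    if selected == "sobol" and pull_index % 5 != 1:
        return "morris"
    return selected


def run_bandit(use_checkpoint: bool, steps: int = 500) -> Dict[str, float]:
    """Return the estimated mean reward per arm after `steps` iterations."""
    arms = list(TRUE_REWARD)
    requests = {a: 0 for a in arms}   # how often each arm was selected
    credited = {a: 0 for a in arms}   # how many rewards were credited to it
    mean = {a: 0.0 for a in arms}

    for t in range(1, steps + 1):
        untried = [a for a in arms if credited[a] == 0]
        if untried:
            arm = untried[0]
        else:
            # UCB1-style index: estimated mean plus an exploration bonus.
            arm = max(
                arms,
                key=lambda a: mean[a] + math.sqrt(2 * math.log(t) / credited[a]),
            )
        requests[arm] += 1
        ran = executed_method(arm, requests[arm])
        if use_checkpoint and ran != arm:
            continue  # checkpoint rejects the drifted run; nothing is credited
        reward = TRUE_REWARD[ran]  # reward reflects what actually executed
        credited[arm] += 1
        mean[arm] += (reward - mean[arm]) / credited[arm]
    return mean


without_cp = run_bandit(use_checkpoint=False)
with_cp = run_bandit(use_checkpoint=True)
print("preferred without checkpoint:", max(without_cp, key=without_cp.get))
print("preferred with checkpoint:", max(with_cp, key=with_cp.get))
```

In this toy setting the run without a checkpoint ends up preferring `chatterjee`, because the drifted rewards pull the `sobol` estimate below 0.5, while the run with the checkpoint keeps `sobol` at 0.9. A real test would replace the fixed schedule with the pipeline's actual drift behavior and repeat over many seeds.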

Figures

Figures reproduced from arXiv: 2605.30042 by Dimitrios G. Giovanis, Eleni D. Koronaki, Geremy Loachamín-Suntaxi, Ioannis G. Kevrekidis, Robert Lazar.

**Figure 2.** Contrasts the causal chains A_n → S_n → O_n → R_n for runs No-CP-1 and CP-1. In the no-checkpoint case the link between policy selection and code implementation is interrupted silently; the reward measures Chatterjee method quality but updates Sobol weights, injecting noise into the policy. In the checkpoint case, CP2 observed a Sobol proposal inconsistent with the moment-free constraint in the context vector, a…

**Figure 3.** Reward trajectory under the no_cp5 + method_swap ablation (d = 8 Sobol G-function, N = 15,000). Iteration 1 produces R = 7 as the drift causes MorrisSensitivity code to execute in place of the requested Sobol estimator. The bandit recovers over iterations 2-4 via exploration of CVM, reaching R = 93.61 at iteration 3. The dashed line marks the convergence threshold R = 85. Under the full pipeline (full + method…

**Figure 4.** Semantic alignment between the raw user request (User output) and the structured context vector (Coordinator output), …

**Figure 5.** Semantic alignment between the approved method (Critic) and the assembled implementation code (Refactor Agent…
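
As a small illustration of how the convergence threshold in Figure 3 might be applied, the helper below returns the first reported iteration whose reward reaches it. This is a sketch: the function name is hypothetical, treating R ≥ 85 (rather than R > 85) as convergence is an assumption, and only the two rewards quoted in the caption are used because the others are not given there.

```python
from typing import Mapping, Optional


def first_converged_iteration(rewards_by_iteration: Mapping[int, float],
                              threshold: float = 85.0) -> Optional[int]:
    """Return the earliest iteration whose reward is at or above `threshold`,
    or None if no recorded iteration reaches it."""
    for iteration in sorted(rewards_by_iteration):
        if rewards_by_iteration[iteration] >= threshold:
            return iteration
    return None


# Only the rewards stated in the Figure 3 caption; iterations 2 and 4 are not reported there.
reported = {1: 7.0, 3: 93.61}
print(first_converged_iteration(reported))
```

Because the reward at iteration 2 is missing from the caption, the result here is the first crossing among the reported points, not necessarily the first crossing in the full trajectory.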

Original abstract

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds semantic checkpoints to an empowerment-plus-bandits multi-agent setup for scientific method selection, but rests on two similar case studies without numbers.

The core idea here is that multi-agent LLM pipelines for science can suffer semantic drift between what an agent intends and what actually runs, and the fix is explicit checkpoints plus empowerment-style incentives to keep action-outcome links intact. That framing is new enough in the cited literature.

What the work does cleanly is name the drift problem and tie it to existing ATHENA and empowerment references. The architecture description—contextual bandits for method choice, grounded code generation, self-healing loops, and structured inter-agent messages—sounds like a practical assembly of pieces that already exist separately.

The soft spot is the evidence base. The abstract offers only sensitivity analysis and uncertainty quantification as representative cases. Those two workflows share the same basic structure of parameter sweeps plus post-processing, so they do not test drift modes that appear in symbolic manipulation, stateful solvers, or distributed dataflows. No quantitative metrics, error bars, or controls are mentioned, which leaves the claimed gains in convergence and robustness hard to judge.

This is for groups already building LLM agents for autonomous scientific computing who want a concrete way to add consistency checks. A reader already familiar with empowerment or ATHENA will see the integration quickly; others may need the full text to assess whether the checkpoints are doing real work or just adding overhead.

I would send it to referees. The central claim is testable and the motivation is honest, even if the current demonstrations are narrow.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes an empowerment-guided multi-agent framework for adaptive method selection in scientific computing that integrates contextual bandits, LLM agents, grounded code generation, and semantic checkpoints to prevent semantic drift and preserve action-outcome fidelity. Motivated by the ATHENA framework and empowerment concepts, it claims via sensitivity analysis and uncertainty quantification case studies that the approach improves convergence, robustness, and adaptation to novel contexts compared to pipelines without such checkpoints.

Significance. If the semantic checkpoints can be shown to deliver measurable, generalizable gains in fidelity and learning without introducing new inconsistencies, the work would offer a concrete design principle for reliable multi-agent automation in AI-for-science pipelines. The combination of adaptive decision-making with explicit consistency mechanisms addresses a recognized fragility in LLM-mediated workflows.

major comments (3)

[Case Studies] Case studies section: the abstract and description assert that the framework 'improves convergence, robustness, and adaptation' and that 'unchecked semantic drift degrades policy learning,' yet no quantitative metrics, baselines, error bars, or statistical comparisons are reported, leaving the central empirical claim without verifiable support.
[Framework Description] Framework and semantic checkpoints description: the claim that semantic checkpoints preserve action-outcome fidelity across pipelines rests on the assumption that LLM-mediated consistency checks are reliable and domain-independent, but the manuscript provides no formal definition, consistency proof, or ablation showing that the checkpoints themselves do not introduce context-dependent drift.
[Case Studies] Generality of representative cases: sensitivity analysis and uncertainty quantification share similar parameter-variation-plus-post-processing structure; the manuscript does not address whether the observed benefits extend to other scientific pipelines (e.g., symbolic rewriting or stateful iterative solvers) where drift modes may differ.
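
The error bars and statistical comparison asked for in the first comment need little machinery. As a hedged sketch using only the Python standard library, the helpers below compute a mean with its standard error over repeated trials and a Welch t statistic between two conditions. The function names and the sample values are hypothetical placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(samples: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (needs at least two samples)."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch t statistic for a difference in means with unequal variances."""
    mean_a, sem_a = mean_and_sem(a)
    mean_b, sem_b = mean_and_sem(b)
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


# Placeholder final rewards from repeated trials (hypothetical numbers).
with_checkpoints = [2.0, 4.0, 6.0]
without_checkpoints = [1.0, 2.0, 3.0]
print(mean_and_sem(with_checkpoints))
print(welch_t(with_checkpoints, without_checkpoints))
```

A p-value would additionally require the Welch-Satterthwaite degrees of freedom and a t distribution, which this sketch leaves out.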

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments, which identify key areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions that will be incorporated.

Referee: [Case Studies] Case studies section: the abstract and description assert that the framework 'improves convergence, robustness, and adaptation' and that 'unchecked semantic drift degrades policy learning,' yet no quantitative metrics, baselines, error bars, or statistical comparisons are reported, leaving the central empirical claim without verifiable support.

Authors: We agree that the central empirical claims require quantitative backing. The current manuscript presents the case studies in an illustrative manner to convey the framework's operation. In the revised version we will add quantitative metrics (convergence rates, robustness scores under perturbation), explicit baselines (multi-agent pipelines without semantic checkpoints), error bars from repeated trials, and statistical comparisons to make the reported improvements verifiable.

Revision: yes

Referee: [Framework Description] Framework and semantic checkpoints description: the claim that semantic checkpoints preserve action-outcome fidelity across pipelines rests on the assumption that LLM-mediated consistency checks are reliable and domain-independent, but the manuscript provides no formal definition, consistency proof, or ablation showing that the checkpoints themselves do not introduce context-dependent drift.

Authors: The manuscript currently offers a conceptual description of semantic checkpoints grounded in the empowerment framework. We accept that a formal definition and supporting evidence are needed. The revision will include a precise operational definition of the checkpoints together with an ablation study that isolates their effect on fidelity and learning. A general consistency proof independent of LLM stochasticity is beyond the scope of the present empirical work and will be noted as future theoretical research.

Revision: partial

Referee: [Case Studies] Generality of representative cases: sensitivity analysis and uncertainty quantification share similar parameter-variation-plus-post-processing structure; the manuscript does not address whether the observed benefits extend to other scientific pipelines (e.g., symbolic rewriting or stateful iterative solvers) where drift modes may differ.

Authors: The two case studies were selected because they instantiate the common parameter-variation and post-processing pattern found in many scientific workflows. The revision will add an explicit discussion section that considers applicability to other pipeline types (symbolic rewriting, stateful iterative solvers), notes likely differences in drift modes, and outlines the adaptations required. Comprehensive empirical evaluation on additional domains is acknowledged as future work.

Revision: yes

standing simulated objections not resolved

A general formal consistency proof for LLM-mediated semantic checkpoints that holds independently of specific model behaviors and contexts.

Circularity Check

0 steps flagged

No significant circularity; framework claims rest on empirical case studies rather than definitional reduction

The paper introduces a multi-agent system with semantic checkpoints, motivated by external ATHENA and empowerment references, and supports its convergence/robustness claims via sensitivity analysis and uncertainty quantification case studies. No equations, fitted parameters, or predictions are described that reduce by construction to inputs. The derivation chain consists of a new architecture plus empirical demonstration on representative workflows; citations function as motivation rather than load-bearing uniqueness theorems or ansatzes that collapse the result. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that semantic drift is the dominant failure mode in multi-agent pipelines and that the two chosen case studies generalize; no free parameters or invented entities are quantified in the abstract.

axioms (2)

domain assumption: LLM agents can be grounded sufficiently to produce faithful code and self-healing behavior. Invoked when describing integration of LLM agents with grounded code generation and self-healing loops.
ad hoc to paper: Semantic checkpoints can be defined and checked without introducing new inconsistencies. Central to the claim that checkpoints preserve action-outcome fidelity.

invented entities (1)

semantic checkpoints (no independent evidence)
purpose: Preserve action-outcome fidelity and prevent semantic drift across agents.
Introduced as the key mechanism in the framework; no independent evidence outside the paper is provided in the abstract.

Reference graph

Works this paper leans on

45 extracted references · 12 canonical work pages · 4 internal anchors

[1] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson Sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 99–107, 2013
[2] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012
[3] Sourav Chatterjee. A new coefficient of correlation. Journal of the American Statistical Association, 116(536):2009–2022, 2021
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374
[5] Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Is GPT-4 a good data analyst? arXiv preprint arXiv:2305.15038, 2023. URL https://arxiv.org/abs/2305.15038
[6] Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem. Handbook of Model Checking. Springer, 2018
[7] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006
[8] Ronald R. Coifman, Stéphane Lafon, Ann B. Lee, Mauro Maggioni, Boaz Nadler, Fred Warner, and Steven W. Zucker. Graph laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8:1325–1368, 2008
[9] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, volume 28, 2015
[10] Fabrice Gamboa, Alexandre Janon, Thierry Klein, and Agnès Lagnoux. Sensitivity analysis for multidimensional and functional outputs. Electronic Journal of Statistics, 8(1):575–603, 2014
[11] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023. URL https://arxiv.org/abs/2310.06770
[12] Ioannis G. Kevrekidis and Giovanni Samaey. Equation-Free Modeling at the Macroscale. Springer, 2009
[13] Ioannis G. Kevrekidis, C. William Gear, James M. Hyman, Panagiotis G. Kevrekidis, Olof Runborg, and Christos Theodoropoulos. Equation-free multiscale computation: enabling microscopic simulators to perform system-level tasks. Communications in Mathematical Sciences, 1(4):715–762, 2003
[14] Ioannis G. Kevrekidis, C. William Gear, and Gerhard Hummer. Equation-free, coarse-grained multiscale computation: enabling microscopic simulators to perform system-level analysis. AIChE Journal, 50(7):1346–1355, 2004
[15] Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. Empowerment: A universal agent-centric measure of control. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 128–135, 2005
[16] Eleni D Koronaki, Geremy Loachamin-Suntaxi, Paris Papavasileiou, Dimitrios G Giovanis, Martin Kathrein, Christoph Czettl, Andreas G Boudouvis, and Stéphane PA Bordas. Implementing nlp in industrial process modeling: Addressing categorical variables. Computers & Chemical Engineering, 199:109146, 2025
[17] Robin Langer. Communication topology and diversity in multi-agent systems, 2026. URL https://linkedin.com
[18] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, volume 20, 2007
[19] Roy R. Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018
[20] Roy R. Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018
[21] Thomas Ledoux, Ronen Talmon, and Hau-Tieng Wu. Alternating diffusion for common manifold learning with application to sleep stage assessment. Biomedical Signal Processing and Control, 40:489–497, 2018
[22] Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009
[23] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
[24] Zongyi Lyu, Songqiang Chen, Zhenlan Ji, Liwen Wang, Shuai Wang, Daoyuan Wu, Wenxuan Wang, and Shing-Chi Cheung. Understanding and bridging the planner-coder gap: A systematic study on the robustness of multi-agent systems for code generation, 2026. URL https://arxiv.org/abs/2510.10460
[25] Stefano Marelli and Bruno Sudret. UQLab: A Framework for Uncertainty Quantification in Matlab, pages 2554–2563. doi: 10.1061/9780784413609.257. URL https://ascelibrary.org/doi/abs/10.1061/9780784413609.257
[26] Max D. Morris. Factorial sampling plans for preliminary computational experiments. Technometrics, 33(2):161–174, 1991
[27] Audrey Olivier, Michael D. Shields, and Lori Graham-Brady. UQpy: A general purpose Python package and development environment for uncertainty quantification. Journal of Computational Science, 47:101195, 2020. doi: 10.1016/j.jocs.2020.101195
[28] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009
[29] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. MIT Press, 2017
[30] Christoph Salge, Cory Glackin, and Daniel Polani. Empowerment—an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014
[31] Andrea Saltelli, Paola Annoni, Ivano Azzini, Francesca Campolongo, Marco Ratto, and Stefano Tarantola. Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181(2):259–270, 2010. doi: 10.1016/j.cpc.2009.09.018
[32] Andrea Saltelli et al. Variance based sensitivity analysis of model output. Computer Physics Communications, 181(2):259–270, 2010
[33] Erik Schluntz and Barry Zhang. Building effective agents. Technical report, Anthropic, 2024. URL https://www.anthropic.com/research/building-effective-agents
[34] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2024. URL https://arxiv.org/abs/2501.04227
[35] Amit Singer and Ronald R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008
[36] Ilya M. Sobol. Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4):407–414, 1993
[37] David W Sroczynski, Felix Dietrich, Eleni D Koronaki, Ronen Talmon, Ronald R Coifman, Erik Bollt, and Ioannis G Kevrekidis. On learning what to learn: Heterogeneous observations of dynamics and establishing possibly causal relations among them. PNAS nexus, 3(12):pgae494, 2024
[38] Bruno Sudret. Global sensitivity analysis using polynomial chaos expansions. Reliability Engineering & System Safety, 93(7):964–979, 2008
[39] Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation. bioRxiv, 2024. doi: 10.1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/12/2024.11.11.623004
[40] Ronen Talmon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013
[41] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. doi: 10.2307/2332286
[42] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013. doi: 10.1145/2487575.2487629
[43] Juan Diego Toscano, Daniel T. Chen, and George Em Karniadakis. Athena: Agentic team for hierarchical evolutionary numerical algorithms, 2025. URL https://arxiv.org/abs/2512.03476
[44] Juan Diego Toscano, Zhaojie Chai, and George Em Karniadakis. Graft-athena: Self-improving agentic teams for autonomous discovery and evolutionary numerical algorithms, 2026. URL https://arxiv.org/abs/2605.11117
[45] Elaine Yiu, Kendra Allen, Shuran Ginosar, and Alison Gopnik. Empowerment gain and causal model construction: Children and adults are sensitive to controllability and variability in their causal interventions. Philosophical Transactions of the Royal Society B, 2025. accepted.

[1] [1]

Further optimal regret bounds for Thompson Sampling

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson Sampling. InProceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 99–107, 2013

2013

[2] [2]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems.Foundations and Trends in Machine Learning, 5(1):1–122, 2012

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.Foundations and Trends in Machine Learning, 5(1):1–122, 2012

2012

[3] [3]

A new coefficient of correlation.Journal of the American Statistical Association, 116(536): 2009–2022, 2021

Sourav Chatterjee. A new coefficient of correlation.Journal of the American Statistical Association, 116(536): 2009–2022, 2021

2009

[4] [4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

Smith, and Tao Yu

Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Is GPT-4 a good data analyst?arXiv preprint arXiv:2305.15038, 2023. URLhttps://arxiv.org/abs/2305.15038

work page arXiv 2023

[6] [6]

Clarke, Thomas A

Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem.Handbook of Model Checking. Springer, 2018

2018

[7] [7]

Coifman and Stéphane Lafon

Ronald R. Coifman and Stéphane Lafon. Diffusion maps.Applied and Computational Harmonic Analysis, 21(1): 5–30, 2006

2006

[8] [8]

Coifman, Stéphane Lafon, Ann B

Ronald R. Coifman, Stéphane Lafon, Ann B. Lee, Mauro Maggioni, Boaz Nadler, Fred Warner, and Steven W. Zucker. Graph laplacians and their convergence on random neighborhood graphs.Journal of Machine Learning Research, 8:1325–1368, 2008

2008

[9] [9]

Efficient and robust automated machine learning

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. InAdvances in Neural Information Processing Systems, volume 28, 2015

2015

[10] [10]

Sensitivity analysis for multidimensional and functional outputs.Electronic Journal of Statistics, 8(1):575–603, 2014

Fabrice Gamboa, Alexandre Janon, Thierry Klein, and Agnès Lagnoux. Sensitivity analysis for multidimensional and functional outputs.Electronic Journal of Statistics, 8(1):575–603, 2014

2014

[11] [11]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?arXiv preprint arXiv:2310.06770, 2023. URLhttps://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Kevrekidis and Giovanni Samaey.Equation-Free Modeling at the Macroscale

Ioannis G. Kevrekidis and Giovanni Samaey.Equation-Free Modeling at the Macroscale. Springer, 2009

2009

[13] [13]

Kevrekidis, C

Ioannis G. Kevrekidis, C. William Gear, James M. Hyman, Panagiotis G. Kevrekidis, Olof Runborg, and Christos Theodoropoulos. Equation-free multiscale computation: enabling microscopic simulators to perform system-level tasks.Communications in Mathematical Sciences, 1(4):715–762, 2003

2003

[14] [14]

Kevrekidis, C

Ioannis G. Kevrekidis, C. William Gear, and Gerhard Hummer. Equation-free, coarse-grained multiscale com- putation: enabling microscopic simulators to perform system-level analysis.AIChE Journal, 50(7):1346–1355, 2004

2004

[15] Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. Empowerment: A universal agent-centric measure of control. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 128–135, 2005

[16] Eleni D. Koronaki, Geremy Loachamin-Suntaxi, Paris Papavasileiou, Dimitrios G. Giovanis, Martin Kathrein, Christoph Czettl, Andreas G. Boudouvis, and Stéphane P. A. Bordas. Implementing NLP in industrial process modeling: Addressing categorical variables. Computers & Chemical Engineering, 199:109146, 2025

[17] Robin Langer. Communication topology and diversity in multi-agent systems, 2026. URL https://linkedin.com

[18] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, volume 20, 2007

[19] Roy R. Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018

[20] Roy R. Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018

[21] Thomas Ledoux, Ronen Talmon, and Hau-Tieng Wu. Alternating diffusion for common manifold learning with application to sleep stage assessment. Biomedical Signal Processing and Control, 40:489–497, 2018

[22] Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009

[23] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

[24] Zongyi Lyu, Songqiang Chen, Zhenlan Ji, Liwen Wang, Shuai Wang, Daoyuan Wu, Wenxuan Wang, and Shing-Chi Cheung. Understanding and bridging the planner-coder gap: A systematic study on the robustness of multi-agent systems for code generation, 2026. URL https://arxiv.org/abs/2510.10460

[25] Stefano Marelli and Bruno Sudret. UQLab: A Framework for Uncertainty Quantification in Matlab, pages 2554–2563. doi: 10.1061/9780784413609.257. URL https://ascelibrary.org/doi/abs/10.1061/9780784413609.257

[26] Max D. Morris. Factorial sampling plans for preliminary computational experiments. Technometrics, 33(2):161–174, 1991

[27] Audrey Olivier, Michael D. Shields, and Lori Graham-Brady. UQpy: A general purpose Python package and development environment for uncertainty quantification. Journal of Computational Science, 47:101195, 2020. doi: 10.1016/j.jocs.2020.101195

[28] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

[29] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. MIT Press, 2017

[30] Christoph Salge, Cory Glackin, and Daniel Polani. Empowerment—an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014

[31] Andrea Saltelli, Paola Annoni, Ivano Azzini, Francesca Campolongo, Marco Ratto, and Stefano Tarantola. Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181(2):259–270, 2010. doi: 10.1016/j.cpc.2009.09.018

[32] Andrea Saltelli et al. Variance based sensitivity analysis of model output. Computer Physics Communications, 181(2):259–270, 2010

[33] Erik Schluntz and Barry Zhang. Building effective agents. Technical report, Anthropic, 2024. URL https://www.anthropic.com/research/building-effective-agents

[34] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2024. URL https://arxiv.org/abs/2501.04227

[35] Amit Singer and Ronald R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008

[36] Ilya M. Sobol. Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4):407–414, 1993

[37] David W. Sroczynski, Felix Dietrich, Eleni D. Koronaki, Ronen Talmon, Ronald R. Coifman, Erik Bollt, and Ioannis G. Kevrekidis. On learning what to learn: Heterogeneous observations of dynamics and establishing possibly causal relations among them. PNAS Nexus, 3(12):pgae494, 2024

[38] Bruno Sudret. Global sensitivity analysis using polynomial chaos expansions. Reliability Engineering & System Safety, 93(7):964–979, 2008

[39] Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. bioRxiv, 2024. doi: 10.1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/12/2024.11.11.623004

[40] Ronen Talmon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013

[41] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. doi: 10.2307/2332286

[42] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013. doi: 10.1145/2487575.2487629

[43] Juan Diego Toscano, Daniel T. Chen, and George Em Karniadakis. ATHENA: Agentic team for hierarchical evolutionary numerical algorithms, 2025. URL https://arxiv.org/abs/2512.03476

[44] Juan Diego Toscano, Zhaojie Chai, and George Em Karniadakis. Graft-athena: Self-improving agentic teams for autonomous discovery and evolutionary numerical algorithms, 2026. URL https://arxiv.org/abs/2605.11117

[45] Elaine Yiu, Kendra Allen, Shuran Ginosar, and Alison Gopnik. Empowerment gain and causal model construction: Children and adults are sensitive to controllability and variability in their causal interventions. Philosophical Transactions of the Royal Society B, 2025. Accepted.

A Implementation Details

A.1 Action Space and Constraint Filtering

The action s...