MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Anthony K.H. Tung; Hao Ni; Kevin Zhang; Lei Jiang; Lukasz Szpruch; Raad Khraishi; Wen Ge; Xinyu Liu; Xinyu Xi; Yifan Bao

arxiv: 2606.00708 · v1 · pith:XETYMARNnew · submitted 2026-05-30 · 💻 cs.AI · cs.LG

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao , Xinyu Xi , Xinyu Liu , Wen Ge , Lei Jiang , Kevin Zhang , Raad Khraishi , Yihao Ang

show 3 more authors

Anthony K.H. Tung Lukasz Szpruch Hao Ni

This is my paper

Pith reviewed 2026-06-28 18:33 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords automated data scienceagentic systemsmodel selectionLLM agentsblueprintfinancial time seriesmodular compositionreinforcement learning

0 comments

The pith

MOSAIC builds semantic task profiles and modular blueprints to ground LLM model selection in retrieved evidence and execution feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that automated data science is best solved by turning model selection into a structured, memory-grounded process rather than open-ended search or pure synthesis. MOSAIC first creates a semantic task profile, pulls relevant prior cases and code modules, assembles an explicit blueprint of components and constraints, then validates and refines candidates through execution traces and a failure-aware reinforcement learning policy. Experiments on financial time-series forecasting and generation show gains in predictive accuracy, distributional fidelity, execution reliability, and downstream financial metrics over both AutoML systems and other agentic baselines. A sympathetic reader would care because the work suggests a concrete route to making agent decisions more verifiable and reusable instead of opaque.

Core claim

Given a task and dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint specifying selected modelling components, composition, interface constraints, and execution requirements. Candidate models are generated from this blueprint, validated by execution, and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy, producing improvements in task performance, execution success, and decision traceability on financial time-series tasks.

What carries the argument

The blueprint, an intermediate representation that specifies modelling components, composition, interface constraints, and execution requirements, which converts model selection into staged, context-grounded search.

If this is right

Task performance, execution success, and decision traceability all increase relative to AutoML and unconstrained agentic baselines.
Models better satisfy predictive accuracy, distributional fidelity, and downstream financial criteria such as risk and tail behaviour.
Model selection becomes a reusable, staged process instead of one-off synthesis.
Diagnostic feedback and failure-aware reinforcement learning enable iterative refinement grounded in execution traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same profile-plus-blueprint pattern could be applied to other structured modeling domains such as computer vision pipelines or scientific simulation calibration.
Over repeated tasks the growing library of retrieved modules could reduce the fraction of decisions that rely on fresh LLM synthesis.
The failure-aware policy opens a path to agents that improve their own retrieval and blueprint construction rules across deployments.

Load-bearing premise

Building a semantic task profile and retrieving prior cases plus source-code modules will reliably ground LLM code generation and produce superior modeling decisions compared to unconstrained synthesis or standard search.

What would settle it

Run MOSAIC on a new financial dataset engineered so that retrieval returns no relevant prior cases or modules, then measure whether the performance and success-rate advantages over baselines disappear.

Figures

Figures reproduced from arXiv: 2606.00708 by Anthony K.H. Tung, Hao Ni, Kevin Zhang, Lei Jiang, Lukasz Szpruch, Raad Khraishi, Wen Ge, Xinyu Liu, Xinyu Xi, Yifan Bao, Yihao Ang.

**Figure 1.** Figure 1: Overview of the MOSAIC pipeline. synthesize new architectures from reusable modules. MOSAIC differs by treating model construction as a repository-grounded composition problem: it retrieves existing code, extracts modules, constructs a blueprint, and synthesizes executable models beyond fixed model-bank selection. LLM-Based Agentic Systems. ReAct-style agents [4] interleave reasoning and action for interac… view at source ↗

**Figure 2.** Figure 2: Overview of the repositorygrounded model generation pipeline. which summarises the dataset schema, statistical meta-features, visual time-series patterns, and engineered features. This profile acts as the first layer of context for model search. It is used to query the case bank Ecase and retrieve prior tasks with similar data characteristics, objectives, and evaluation requirements. The retrieved cases… view at source ↗

**Figure 3.** Figure 3: Forecasting performance across datasets, metrics, and LLM backbones. Each metric is [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Radar charts for ranking. Time Series Forecasting. MOSAIC consistently achieves the best results on forecasting tasks across three datasets and four LLM backbones, as supported by [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Model generation module comparison. (a)–(c): per-dataset Win Rate (outermost ring) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: RL module comparison. (a)–(c): per-dataset metric performance broken down by LLM [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Diagram of failure-aware trajectory branching and invalid-action masking [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Generated model architecture for hourly cryptocurrency forecasting by GPT-4o. [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Generated model architecture for high-frequency LOB generation by Claude Opus 4. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

read the original abstract

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce \textsc{MOSAIC} (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, \textsc{MOSAIC} builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate \textsc{MOSAIC} on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that \textsc{MOSAIC} improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MOSAIC structures LLM agents for data science with blueprints and failure-aware RL, but the abstract provides no evidence to back the performance claims.

read the letter

The punchline on this paper is that MOSAIC introduces a modular framework to make LLM agents for automated data science more traceable and reusable by inserting a blueprint stage and using failure-aware reinforcement learning for model refinement. This seems like a sensible engineering response to the problems of unstructured agent behavior.

What stands out as new is the specific combination of building a semantic task profile, retrieving prior cases and source code modules to form the blueprint, and then grounding code generation in that rather than free-form synthesis. The blueprint specifies components, composition, interfaces, and execution needs, which turns the selection into a staged search. They apply it to financial time-series forecasting and generation, where additional criteria like distributional fidelity and risk metrics come into play. The paper does well in laying out how execution feedback, training traces, and diagnostic info feed into the RL policy for refinement.

On the soft spots, the abstract asserts better task performance, execution success, and decision traceability compared to AutoML and agent baselines, but it supplies zero quantitative results, no error bars, no specific baselines listed, and no dataset or exclusion criteria. That makes it impossible to judge whether the experiments actually support the claims. The full text might fill this in, but as presented, the evidence is missing. There are no equations or fitted parameters that look circular, and the approach seems straightforward without hidden assumptions in the description.

This kind of work is aimed at practitioners and researchers in agentic AI for data science and AutoML, particularly those dealing with financial modeling where traceability matters. A reader working on similar systems could pick up the blueprint concept as a way to add structure.

I would recommend sending this to peer review. The idea has enough potential that getting the experimental details reviewed would be useful, even if revisions are needed to strengthen the results section.

Referee Report

2 major / 0 minor

Summary. The paper introduces MOSAIC, a modular agentic framework for automated data science that constructs semantic task profiles, retrieves prior cases and code modules to build an intermediate blueprint representation of modeling components and constraints, grounds LLM code generation in retrieved evidence, and refines candidates via execution feedback and a failure-aware RL policy. It focuses on financial time-series forecasting and generation tasks and claims experimental improvements in task performance, execution success, and decision traceability over AutoML and agentic baselines.

Significance. If the experimental claims hold with proper controls and metrics, the work could advance LLM-based automated data science by shifting from unstructured synthesis to structured, memory-grounded, and execution-verified model selection, potentially improving reusability and verifiability in agentic workflows.

major comments (2)

[Abstract] Abstract: the central claim that 'Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability' is unsupported because the abstract (and the query-supplied text) supplies no quantitative results, error bars, baseline details, dataset descriptions, or exclusion criteria, making it impossible to determine whether the data support the claim.
[Abstract] Abstract: the description of blueprint construction, retrieval mechanism, and failure-aware reinforcement learning policy remains at a high level without specifying algorithms, similarity metrics, or how the staged search avoids the weaknesses of unconstrained synthesis, which is load-bearing for the claim that this grounds LLM decisions more reliably.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed comments on the abstract. We agree that abstracts must balance conciseness with sufficient support for claims and address each point below, proposing targeted revisions to the abstract where helpful while preserving its high-level nature.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability' is unsupported because the abstract (and the query-supplied text) supplies no quantitative results, error bars, baseline details, dataset descriptions, or exclusion criteria, making it impossible to determine whether the data support the claim.

Authors: Abstracts conventionally summarize contributions without full quantitative detail to remain readable; the supporting results (including metrics, baselines, datasets, and statistical details) appear in Sections 4–5. To address the concern directly, we will revise the abstract to incorporate 1–2 key quantitative highlights (e.g., relative gains in forecasting accuracy and execution success) drawn from the experimental tables, while staying within length limits. revision: yes
Referee: [Abstract] Abstract: the description of blueprint construction, retrieval mechanism, and failure-aware reinforcement learning policy remains at a high level without specifying algorithms, similarity metrics, or how the staged search avoids the weaknesses of unconstrained synthesis, which is load-bearing for the claim that this grounds LLM decisions more reliably.

Authors: The abstract is deliberately high-level; the concrete mechanisms—semantic task profiling, embedding-based retrieval, blueprint schema, staged search constraints, and the failure-aware RL update rule—are specified in Section 3, with pseudocode and similarity metrics. The abstract’s role is to outline the overall approach rather than replicate algorithmic detail. We can add a brief clause to the abstract emphasizing that the blueprint “grounds generation in retrieved modules and constraints” if the referee finds this improves clarity. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a system-description manuscript introducing the MOSAIC framework for structured agentic model selection. The abstract and supplied text contain no equations, no fitted parameters presented as predictions, no self-referential derivations, and no load-bearing self-citations that reduce the central claims to their own inputs by construction. Claims rest on the empirical comparison to AutoML and agentic baselines rather than on any mathematical chain that collapses into a tautology or renamed input. The derivation chain is therefore self-contained as an engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities. Concepts such as 'blueprint' and 'semantic task profile' appear introduced but cannot be audited for fitting, assumptions, or independent evidence without the full manuscript.

pith-pipeline@v0.9.1-grok · 5854 in / 1188 out tokens · 31149 ms · 2026-06-28T18:33:53.552347+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 13 canonical work pages · 6 internal anchors

[1]

Domingos

Pedro M. Domingos. A few useful things to know about machine learning.Commun. ACM, pages 78–87, 2012

2012
[2]

Auto-sklearn 2.0: Hands-free automl via meta-learning.JMLR, pages 1–61, 2022

Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Auto-sklearn 2.0: Hands-free automl via meta-learning.JMLR, pages 1–61, 2022

2022
[3]

Auto-weka: Combined selection and hyperparameter optimization of classification algorithms

Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. InKDD, pages 847–855, 2013

2013
[4]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR, 2023

2023
[5]

V oyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024

2024
[6]

Codeact: Code adaptive compute-efficient tuning framework for code llms.arXiv preprint arXiv:2408.02193, 2024

Weijie Lv, Xuan Xia, and Sheng-Jun Huang. Codeact: Code adaptive compute-efficient tuning framework for code llms.arXiv preprint arXiv:2408.02193, 2024

work page arXiv 2024
[7]

Ds-agent: Automated data science by empowering large language models with case-based reasoning

Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning. In ICML, pages 16813–16848, 2024

2024
[8]

Researchagent: Iterative research idea generation over scientific literature with large language models

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. In NAACL, pages 6709–6738, 2025

2025
[9]

Yihao Ang, Yifan Bao, Lei Jiang, Jiajie Tao, Anthony K. H. Tung, Lukasz Szpruch, and Hao Ni. Structured agentic workflows for financial time-series modeling with llms and reflective feedback. InICAIF, 2025

2025
[10]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InNeurIPS, pages 8634–8652, 2023

2023
[11]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebas- tian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algori...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing gener- alizable, interpretable knowledge with wake–sleep bayesian program learning.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2023

2023
[13]

Autogluon-timeseries: Automl for probabilistic time series forecasting

Oleksandr Shchur, Ali Caner Turkmen, Nick Erickson, Huibin Shen, Alexander Shirkov, Tony Hu, and Bernie Wang. Autogluon-timeseries: Automl for probabilistic time series forecasting. InICLR, pages 9–1, 2023

2023
[14]

Optuna: A next-generation hyperparameter optimization framework

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. InKDD, pages 2623–2631, 2019

2019
[15]

Neural architecture search with reinforcement learning

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. InICLR, 2017

2017
[16]

Neural architecture search: A survey

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, pages 1–21, 2019

2019
[17]

Darts: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In ICLR, 2018

2018
[18]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Agent laboratory: Using llm agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. InEMNLP, pages 5977–6043, 2025

2025
[20]

Automated composition of agents: A knapsack approach for agentic component selection.arXiv preprint arXiv:2510.16499, 2025

Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, and Monica Sunkara. Automated composition of agents: A knapsack approach for agentic component selection.arXiv preprint arXiv:2510.16499, 2025

work page arXiv 2025
[21]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Ja- cenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code.arXiv preprint arXiv:2502.13138, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Effilearner: Enhancing efficiency of generated code via self-optimization

Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie Zhang. Effilearner: Enhancing efficiency of generated code via self-optimization. In NeurIPS, 2024

2024
[23]

LLM4EFFI: leveraging large language models to enhance code efficiency and correctness

Tong Ye, Weigang Huang, Xuhong Zhang, Tengfei Ma, Peiyu Liu, Jianwei Yin, and Wenhai Wang. LLM4EFFI: leveraging large language models to enhance code efficiency and correctness. arXiv preprint arXiv:2502.18489, 2025

work page arXiv 2025
[24]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

2022
[25]

Eureka: Human-level reward design via coding large language model

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language model. InICLR, 2024

2024
[26]

Offline reinforcement learning with implicit q-learning.ICLR, 2022

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning.ICLR, 2022

2022
[27]

A closer look at invalid action masking in policy gradient algorithms

Shengyi Huang and Santiago Ontañón. A closer look at invalid action masking in policy gradient algorithms. InFLAIRS, 2022

2022
[28]

Safe reinforcement learning via shielding

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. InAAAI, pages 2669–2678, 2018

2018
[29]

Ctbench: Cryptocurrency time series generation benchmark

Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony KH Tung, Chen Jin, and Zhiyong Huang. Ctbench: Cryptocurrency time series generation benchmark. InICLR, 2026. 11

2026
[30]

Alvinn: An autonomous land vehicle in a neural network

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. InNeurIPS, 1988

1988
[31]

Rusu, Joel Veness, Marc G

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

2015
[32]

OpenAI GPT-5 System Card

OpenAI. Gpt-5 system card.https://arxiv.org/abs/2601.03267, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.https://arxiv.org/abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

System card: Claude opus 4 & claude sonnet 4

Anthropic. System card: Claude opus 4 & claude sonnet 4. https://www.anthropic.com/, 2025

2025
[35]

The amazon nova family of models: Technical report and model card.https://assets.amazon.science/, 2025

Amazon Artificial General Intelligence. The amazon nova family of models: Technical report and model card.https://assets.amazon.science/, 2025

2025
[36]

Sig-wasserstein gans for time series generation

Hao Ni, Lukasz Szpruch, Marc Sabate-Vidales, Baoren Xiao, Magnus Wiese, and Shujian Liao. Sig-wasserstein gans for time series generation. InICAIF, pages 1–8, 2021

2021
[37]

Pcf-gan: generating sequential data via the characteristic function of measures on the path space

Hang Lou, Siran Li, and Hao Ni. Pcf-gan: generating sequential data via the characteristic function of measures on the path space. InNeurIPS, pages 39755–39781, 2023

2023
[38]

Regime-switching Financial Time-Series Generation

Hao Ni, Lukasz Szpruch, and Jiajie Tao. Regime-switching Financial Time-Series Generation. https://github.com/tjj0502/hackathon_starting_kit, 2023. ICAIF 2023 Hackathon

2023
[39]

Crypto Market Simulation for Risk Es- timation

Hao Ni, Lukasz Szpruch, Jiajie Tao, and Yang Long. Crypto Market Simulation for Risk Es- timation. https://hackathon.deepintomlf.ai/competitions/40, 2024. Antalpha ICAIF 2024 Hackathon

2024
[40]

Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting

Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. InNeurIPS, pages 22419– 22430, 2021

2021
[41]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InICLR, 2023

2023
[42]

Timesnet: Temporal 2d-variation modeling for general time series analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. InICLR, 2023

2023
[43]

Are transformers effective for time series forecasting? InAAAI, pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InAAAI, pages 11121–11128, 2023

2023
[44]

Timemixer: Decomposable multiscale mixing for time series forecasting

Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting. In ICLR, 2024

2024
[45]

Time-series generative adversarial networks

Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. InNeurIPS, pages 5509–5519, 2019

2019
[46]

Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional gans.arXiv preprint arXiv:1706.02633, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Quant gans: deep genera- tion of financial time series.Quantitative Finance, pages 1419–1440, 2020

Magnus Wiese, Robert Knobloch, Ralf Korn, and Peter Kretschmer. Quant gans: deep genera- tion of financial time series.Quantitative Finance, pages 1419–1440, 2020

2020
[48]

Generating multivariate time series with common source coordinated gan (cosci-gan)

Ali Seyfi, Jean-Francois Rajotte, and Raymond Ng. Generating multivariate time series with common source coordinated gan (cosci-gan). InNeurIPS, pages 32777–32788, 2022

2022
[49]

Timevae: A variational auto-encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095, 2021

Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095, 2021

work page arXiv 2021
[50]

Diffusion-ts: Interpretable diffusion for general time series generation

Xinyu Yuan and Yan Qiao. Diffusion-ts: Interpretable diffusion for general time series genera- tion.arXiv preprint arXiv:2403.01742, 2024. 12

work page arXiv 2024
[51]

Fourierflow: Frequency-aware flow matching for generative turbulence modeling.arXiv preprint arXiv:2506.00862, 2025

Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, and Tailin Wu. Fourierflow: Frequency-aware flow matching for generative turbulence modeling.arXiv preprint arXiv:2506.00862, 2025

work page arXiv 2025
[52]

Deep latent state space models for time-series generation

Linqi Zhou, Michael Poli, Winnie Xu, Stefano Massaroli, and Stefano Ermon. Deep latent state space models for time-series generation. InICML, pages 42625–42643, 2023

2023
[53]

Fide: Frequency-inflated conditional diffusion model for extreme-aware time series generation

Asadullah Hill Galib, Pang-Ning Tan, and Lifeng Luo. Fide: Frequency-inflated conditional diffusion model for extreme-aware time series generation. InNeurIPS, pages 114434–114457, 2024

2024
[54]

Testing for unit roots in autoregressive-moving average models of unknown order.Biometrika, pages 599–607, 1984

Said E Said and David A Dickey. Testing for unit roots in autoregressive-moving average models of unknown order.Biometrika, pages 599–607, 1984. 13 A Task Generation Details We employ an automated task-generation module to construct diverse, executable time-series fore- casting and generation tasks, forming a scalable corpus of tasks that can be converted...

1984
[55]

and PatchTST [41](Transformer-based), TimesNet [42](temporal CNN), DLinear [43](linear), and TimeMixer [44](token-mixing). For generation, it includes ten models: TimeGAN [45], RCGAN [46], PCFGAN [37], QuantGAN [47], COSCIGAN [48] (GAN-based), TimeV AE [49] (V AE-based), DiffusionTS [50] (diffusion-based), FourierFlow [51] (flow-based), and LS4 [52], FIDE...

work page arXiv 2024

[1] [1]

Domingos

Pedro M. Domingos. A few useful things to know about machine learning.Commun. ACM, pages 78–87, 2012

2012

[2] [2]

Auto-sklearn 2.0: Hands-free automl via meta-learning.JMLR, pages 1–61, 2022

Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Auto-sklearn 2.0: Hands-free automl via meta-learning.JMLR, pages 1–61, 2022

2022

[3] [3]

Auto-weka: Combined selection and hyperparameter optimization of classification algorithms

Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. InKDD, pages 847–855, 2013

2013

[4] [4]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR, 2023

2023

[5] [5]

V oyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024

2024

[6] [6]

Codeact: Code adaptive compute-efficient tuning framework for code llms.arXiv preprint arXiv:2408.02193, 2024

Weijie Lv, Xuan Xia, and Sheng-Jun Huang. Codeact: Code adaptive compute-efficient tuning framework for code llms.arXiv preprint arXiv:2408.02193, 2024

work page arXiv 2024

[7] [7]

Ds-agent: Automated data science by empowering large language models with case-based reasoning

Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning. In ICML, pages 16813–16848, 2024

2024

[8] [8]

Researchagent: Iterative research idea generation over scientific literature with large language models

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. In NAACL, pages 6709–6738, 2025

2025

[9] [9]

Yihao Ang, Yifan Bao, Lei Jiang, Jiajie Tao, Anthony K. H. Tung, Lukasz Szpruch, and Hao Ni. Structured agentic workflows for financial time-series modeling with llms and reflective feedback. InICAIF, 2025

2025

[10] [10]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InNeurIPS, pages 8634–8652, 2023

2023

[11] [11]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebas- tian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algori...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing gener- alizable, interpretable knowledge with wake–sleep bayesian program learning.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2023

2023

[13] [13]

Autogluon-timeseries: Automl for probabilistic time series forecasting

Oleksandr Shchur, Ali Caner Turkmen, Nick Erickson, Huibin Shen, Alexander Shirkov, Tony Hu, and Bernie Wang. Autogluon-timeseries: Automl for probabilistic time series forecasting. InICLR, pages 9–1, 2023

2023

[14] [14]

Optuna: A next-generation hyperparameter optimization framework

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. InKDD, pages 2623–2631, 2019

2019

[15] [15]

Neural architecture search with reinforcement learning

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. InICLR, 2017

2017

[16] [16]

Neural architecture search: A survey

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, pages 1–21, 2019

2019

[17] [17]

Darts: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In ICLR, 2018

2018

[18] [18]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Agent laboratory: Using llm agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. InEMNLP, pages 5977–6043, 2025

2025

[20] [20]

Automated composition of agents: A knapsack approach for agentic component selection.arXiv preprint arXiv:2510.16499, 2025

Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, and Monica Sunkara. Automated composition of agents: A knapsack approach for agentic component selection.arXiv preprint arXiv:2510.16499, 2025

work page arXiv 2025

[21] [21]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Ja- cenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code.arXiv preprint arXiv:2502.13138, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Effilearner: Enhancing efficiency of generated code via self-optimization

Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie Zhang. Effilearner: Enhancing efficiency of generated code via self-optimization. In NeurIPS, 2024

2024

[23] [23]

LLM4EFFI: leveraging large language models to enhance code efficiency and correctness

Tong Ye, Weigang Huang, Xuhong Zhang, Tengfei Ma, Peiyu Liu, Jianwei Yin, and Wenhai Wang. LLM4EFFI: leveraging large language models to enhance code efficiency and correctness. arXiv preprint arXiv:2502.18489, 2025

work page arXiv 2025

[24] [24]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

2022

[25] [25]

Eureka: Human-level reward design via coding large language model

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language model. InICLR, 2024

2024

[26] [26]

Offline reinforcement learning with implicit q-learning.ICLR, 2022

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning.ICLR, 2022

2022

[27] [27]

A closer look at invalid action masking in policy gradient algorithms

Shengyi Huang and Santiago Ontañón. A closer look at invalid action masking in policy gradient algorithms. InFLAIRS, 2022

2022

[28] [28]

Safe reinforcement learning via shielding

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. InAAAI, pages 2669–2678, 2018

2018

[29] [29]

Ctbench: Cryptocurrency time series generation benchmark

Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony KH Tung, Chen Jin, and Zhiyong Huang. Ctbench: Cryptocurrency time series generation benchmark. InICLR, 2026. 11

2026

[30] [30]

Alvinn: An autonomous land vehicle in a neural network

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. InNeurIPS, 1988

1988

[31] [31]

Rusu, Joel Veness, Marc G

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

2015

[32] [32]

OpenAI GPT-5 System Card

OpenAI. Gpt-5 system card.https://arxiv.org/abs/2601.03267, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.https://arxiv.org/abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

System card: Claude opus 4 & claude sonnet 4

Anthropic. System card: Claude opus 4 & claude sonnet 4. https://www.anthropic.com/, 2025

2025

[35] [35]

The amazon nova family of models: Technical report and model card.https://assets.amazon.science/, 2025

Amazon Artificial General Intelligence. The amazon nova family of models: Technical report and model card.https://assets.amazon.science/, 2025

2025

[36] [36]

Sig-wasserstein gans for time series generation

Hao Ni, Lukasz Szpruch, Marc Sabate-Vidales, Baoren Xiao, Magnus Wiese, and Shujian Liao. Sig-wasserstein gans for time series generation. InICAIF, pages 1–8, 2021

2021

[37] [37]

Pcf-gan: generating sequential data via the characteristic function of measures on the path space

Hang Lou, Siran Li, and Hao Ni. Pcf-gan: generating sequential data via the characteristic function of measures on the path space. InNeurIPS, pages 39755–39781, 2023

2023

[38] [38]

Regime-switching Financial Time-Series Generation

Hao Ni, Lukasz Szpruch, and Jiajie Tao. Regime-switching Financial Time-Series Generation. https://github.com/tjj0502/hackathon_starting_kit, 2023. ICAIF 2023 Hackathon

2023

[39] [39]

Crypto Market Simulation for Risk Es- timation

Hao Ni, Lukasz Szpruch, Jiajie Tao, and Yang Long. Crypto Market Simulation for Risk Es- timation. https://hackathon.deepintomlf.ai/competitions/40, 2024. Antalpha ICAIF 2024 Hackathon

2024

[40] [40]

Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting

Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. InNeurIPS, pages 22419– 22430, 2021

2021

[41] [41]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InICLR, 2023

2023

[42] [42]

Timesnet: Temporal 2d-variation modeling for general time series analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. InICLR, 2023

2023

[43] [43]

Are transformers effective for time series forecasting? InAAAI, pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InAAAI, pages 11121–11128, 2023

2023

[44] [44]

Timemixer: Decomposable multiscale mixing for time series forecasting

Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting. In ICLR, 2024

2024

[45] [45]

Time-series generative adversarial networks

Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. InNeurIPS, pages 5509–5519, 2019

2019

[46] [46]

Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional gans.arXiv preprint arXiv:1706.02633, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[47] [47]

Quant gans: deep genera- tion of financial time series.Quantitative Finance, pages 1419–1440, 2020

Magnus Wiese, Robert Knobloch, Ralf Korn, and Peter Kretschmer. Quant gans: deep genera- tion of financial time series.Quantitative Finance, pages 1419–1440, 2020

2020

[48] [48]

Generating multivariate time series with common source coordinated gan (cosci-gan)

Ali Seyfi, Jean-Francois Rajotte, and Raymond Ng. Generating multivariate time series with common source coordinated gan (cosci-gan). InNeurIPS, pages 32777–32788, 2022

2022

[49] [49]

Timevae: A variational auto-encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095, 2021

Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095, 2021

work page arXiv 2021

[50] [50]

Diffusion-ts: Interpretable diffusion for general time series generation

Xinyu Yuan and Yan Qiao. Diffusion-ts: Interpretable diffusion for general time series genera- tion.arXiv preprint arXiv:2403.01742, 2024. 12

work page arXiv 2024

[51] [51]

Fourierflow: Frequency-aware flow matching for generative turbulence modeling.arXiv preprint arXiv:2506.00862, 2025

Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, and Tailin Wu. Fourierflow: Frequency-aware flow matching for generative turbulence modeling.arXiv preprint arXiv:2506.00862, 2025

work page arXiv 2025

[52] [52]

Deep latent state space models for time-series generation

Linqi Zhou, Michael Poli, Winnie Xu, Stefano Massaroli, and Stefano Ermon. Deep latent state space models for time-series generation. InICML, pages 42625–42643, 2023

2023

[53] [53]

Fide: Frequency-inflated conditional diffusion model for extreme-aware time series generation

Asadullah Hill Galib, Pang-Ning Tan, and Lifeng Luo. Fide: Frequency-inflated conditional diffusion model for extreme-aware time series generation. InNeurIPS, pages 114434–114457, 2024

2024

[54] [54]

Testing for unit roots in autoregressive-moving average models of unknown order.Biometrika, pages 599–607, 1984

Said E Said and David A Dickey. Testing for unit roots in autoregressive-moving average models of unknown order.Biometrika, pages 599–607, 1984. 13 A Task Generation Details We employ an automated task-generation module to construct diverse, executable time-series fore- casting and generation tasks, forming a scalable corpus of tasks that can be converted...

1984

[55] [55]

and PatchTST [41](Transformer-based), TimesNet [42](temporal CNN), DLinear [43](linear), and TimeMixer [44](token-mixing). For generation, it includes ten models: TimeGAN [45], RCGAN [46], PCFGAN [37], QuantGAN [47], COSCIGAN [48] (GAN-based), TimeV AE [49] (V AE-based), DiffusionTS [50] (diffusion-based), FourierFlow [51] (flow-based), and LS4 [52], FIDE...

work page arXiv 2024