
arxiv: 2605.20690 · v1 · pith:2LNMEWO5new · submitted 2026-05-20 · 💻 cs.AI

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

Pith reviewed 2026-05-21 05:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic discovery · data system composition · declarative services · typed contracts · runtime feedback · LLM agents · trading backend

The pith

Structured agentic discovery using four typed contracts lets data-system compositions converge where unbounded search fails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agentic discovery for composing heterogeneous data backends needs structure to succeed, because open-ended LLM iteration on failure logs does not reliably produce working stacks. Declarative Data Services decomposes the task into four successive typed contracts—intent, operator DAG, per-system skills, and runtime attribution—so that specialized sub-agents perform bounded searches while knowledge moves forward through inline skill citations and backward through typed error signals. On a trading-backend workload the method reaches consistent convergence and converts runtime failures into patches cited by later deployments. A reader would care because real data systems are assembled from multiple components whose composition knowledge is poorly captured in pretraining, and current agents lose direction in the resulting search space.

Core claim

Declarative Data Services owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches performed by sub-agents. The framework supplies channels for knowledge to flow forward as inline skill citations and for errors to route backward as typed signals. In a proof-of-life demonstration on a trading-backend workload, this architecture converges to working stacks where unbounded agentic discovery does not, and runtime failures become skill patches cited inline in the next deployment.

What carries the argument

The four typed contracts (intent, operator DAG, per-system skills, runtime attribution) that break the global composition search into bounded sub-searches and route knowledge via inline citations and typed error signals.
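One way to picture the four contracts is as plain typed records, sketched below. This is an illustrative sketch only: every class name, field, and the `order` helper is an assumption made for this example, not the paper's schema. The operator types `STORE` and `TRANSFORM` echo values that appear in the paper's skill excerpt; the operator names are hypothetical.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Intent:
    """L1 contract: declarative user intent (fields are illustrative)."""
    description: str
    constraints: tuple = ()


@dataclass(frozen=True)
class Operator:
    """One node of the L2 operator DAG."""
    name: str
    op_type: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class OperatorDAG:
    """L2 contract: operators plus their dependency edges."""
    operators: tuple

    def order(self) -> list:
        """Return operator names in dependency order, rejecting bad graphs."""
        names = {op.name for op in self.operators}
        graph = {}
        for op in self.operators:
            unknown = set(op.depends_on) - names
            if unknown:
                raise ValueError(f"{op.name} depends on unknown operators: {sorted(unknown)}")
            graph[op.name] = set(op.depends_on)
        try:
            return list(TopologicalSorter(graph).static_order())
        except CycleError as exc:
            raise ValueError("operator DAG contains a cycle") from exc


@dataclass(frozen=True)
class SkillCitation:
    """L3 contract: which per-system skill entry justified a choice."""
    system: str
    entry_id: str


@dataclass(frozen=True)
class RuntimeSignal:
    """L4 contract: a typed runtime observation attributed to an operator."""
    operator: str
    kind: str
    detail: str


# Hypothetical three-operator pipeline used only to exercise the DAG check.
dag = OperatorDAG((
    Operator("ingest", "TRANSFORM"),
    Operator("aggregate", "TRANSFORM", depends_on=("ingest",)),
    Operator("persist", "STORE", depends_on=("aggregate",)),
))
deploy_order = dag.order()
```

The point of the sketch is that each layer hands the next a value whose shape can be checked before any sub-agent searches over it, which is what makes each sub-search bounded.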

If this is right

  • Runtime failures are captured as reusable skill patches that later deployments cite directly.
  • Sub-agents can succeed at their narrower, typed search spaces even when the overall composition problem remains large.
  • Composition knowledge accumulates across deployments through the framework's citation and signal channels rather than depending solely on pretraining.
  • Declarative user intent can drive end-to-end composition of heterogeneous data systems without requiring the agent to maintain the entire search space in one context.
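The first and third points above describe a loop in which a runtime failure becomes a skill patch and the next deployment cites that patch inline. A minimal sketch of that bookkeeping follows, assuming a hypothetical `Skill` record, a made-up patch-ID scheme, and a placeholder lesson string; none of these come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkillPatch:
    """A lesson learned from one runtime failure (illustrative shape)."""
    patch_id: str
    lesson: str


@dataclass
class Skill:
    """Per-system skill that accumulates patches across deployments."""
    system: str
    patches: list = field(default_factory=list)

    def record_failure(self, lesson: str) -> SkillPatch:
        # Hypothetical ID scheme: <system>-<sequence number>.
        patch = SkillPatch(f"{self.system}-{len(self.patches) + 1:03d}", lesson)
        self.patches.append(patch)
        return patch

    def render_with_citations(self, base_config: str) -> str:
        # Each applied patch is cited inline as a comment in the generated config.
        lines = [base_config]
        lines += [f"# cites {p.patch_id}: {p.lesson}" for p in self.patches]
        return "\n".join(lines)


skill = Skill("clickhouse")
skill.record_failure("placeholder lesson from a failed deployment")
next_deployment = skill.render_with_citations("engine: placeholder")
```

Keeping the citation next to the generated setting is what would let a later reader trace a configuration choice back to the failure that motivated it.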

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered-contract pattern could be applied to composing infrastructure stacks or scientific workflows where components also interact through typed interfaces.
  • Runtime attribution might reduce reliance on exhaustive pretraining by letting the system learn system-specific behaviors from actual deployments.
  • Extending the contracts to include cost or latency objectives could turn the current convergence result into an optimization method.
  • The approach suggests that many agentic discovery tasks become tractable once the search is factored into contract-defined layers rather than left fully open-ended.

Load-bearing premise

That the four contracts can be defined and maintained so sub-agents reliably perform their bounded searches and that inline citations plus typed errors suffice to carry useful knowledge across iterations.

What would settle it

Run repeated trials of unbounded discovery versus DDS on the identical trading-backend workload and observe whether DDS produces a working stack in every trial while unbounded discovery continues to fail to converge.
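The comparison described above reduces to tallying convergence over repeated trials of each method. The sketch below assumes a hypothetical `run_trial(seed) -> bool` callable per method that returns whether a working stack was produced; the two stand-in callables are synthetic placeholders, not measurements from the paper.

```python
from typing import Callable


def convergence_rate(run_trial: Callable[[int], bool], trials: int) -> float:
    """Fraction of trials in which the method produced a working stack."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for seed in range(trials) if run_trial(seed))
    return successes / trials


def settles_claim(structured_rate: float, unbounded_rate: float) -> bool:
    """True only if the structured method always converges and the baseline does not."""
    return structured_rate == 1.0 and unbounded_rate < 1.0


# Synthetic stand-ins so the sketch runs; real trials would deploy actual stacks.
def always_converges(seed: int) -> bool:
    return True


def converges_on_even_seeds(seed: int) -> bool:
    return seed % 2 == 0


verdict = settles_claim(
    convergence_rate(always_converges, trials=4),
    convergence_rate(converges_on_even_seeds, trials=4),
)
```

In a real study the number of trials and the definition of a working stack would need to be fixed in advance for both methods.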

Figures

Figures reproduced from arXiv: 2605.20690 by Duo Lu, Shanshan Ye.

Figure 1: End-to-end view of DDS. The user states intent in natural language with concrete constraints; ...
Figure 2: The four DDS layers (L1–L4), each carrying a typed contract. L0 in the figure is an ...
Figure 3: The L4 attribution loop: deploy, observe, attribute, patch. Each runtime signal is routed to ...
Figure 4: Live-data proof of life on a DDS-generated stack. A Coinbase public WebSocket feed (20 ...
Figure 6: Trading-workload intent populating all six L1 dimensions (§3). The framework emits one ...
Original abstract

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.
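The abstract's phrase "errors route backward as typed signals" can be read as a dispatch from signal type to the layer that should repair it. The following is a rough sketch under that reading: the signal kinds, the routing table, and all names are assumptions for illustration, since the abstract does not enumerate the actual signal types.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    INTENT = "L1"
    OPERATOR_DAG = "L2"
    SKILLS = "L3"


class SignalKind(Enum):
    # Hypothetical signal types; the paper's own taxonomy is not given here.
    UNSATISFIABLE_CONSTRAINT = "unsatisfiable_constraint"
    MISSING_OPERATOR = "missing_operator"
    SYSTEM_MISCONFIGURED = "system_misconfigured"


# Assumed routing table: each typed signal goes back to the layer that owns the fix.
ROUTES = {
    SignalKind.UNSATISFIABLE_CONSTRAINT: Layer.INTENT,
    SignalKind.MISSING_OPERATOR: Layer.OPERATOR_DAG,
    SignalKind.SYSTEM_MISCONFIGURED: Layer.SKILLS,
}


@dataclass(frozen=True)
class TypedSignal:
    kind: SignalKind
    detail: str


def route_backward(signals):
    """Group runtime signals by the layer responsible for repairing them."""
    inbox = defaultdict(list)
    for signal in signals:
        inbox[ROUTES[signal.kind]].append(signal)
    return dict(inbox)


inbox = route_backward([
    TypedSignal(SignalKind.SYSTEM_MISCONFIGURED, "placeholder detail"),
    TypedSignal(SignalKind.MISSING_OPERATOR, "placeholder detail"),
    TypedSignal(SignalKind.SYSTEM_MISCONFIGURED, "another placeholder"),
])
```

Contrast this with the unbounded baseline, where the whole failure log goes to a single agent and nothing tells it which layer the failure belongs to.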

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent using four typed contracts (intent, operator DAG, per-system skills, runtime attribution). These contracts decompose the heterogeneous search space into bounded sub-searches performed by sub-agents, with forward knowledge flow via inline skill citations and backward propagation of typed error signals. As a proof-of-life demonstration on a trading-backend workload, DDS is reported to converge on a working stack where unbounded agentic discovery (even with iteration and composition knowledge) fails, converting runtime failures into reusable skill patches cited in subsequent deployments.

Significance. If the central claim holds, the work offers a structured alternative to unbounded LLM-driven search for practical multi-system data backend composition, where verifiers are deployment success and pretraining knowledge is uneven. The explicit layering of contracts and bidirectional knowledge channels could generalize to other heterogeneous composition tasks; the positioning as an early prototype reporting real-world lessons is a constructive contribution even at this stage.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation (proof-of-life section): The claim that DDS converges where unbounded discovery does not rests on a single workload run, reported without iteration counts, success rates, failure-mode distributions, baseline comparisons, or logs showing how the four contracts produced bounded sub-searches. This is load-bearing for the central architectural claim and leaves open whether convergence arises from the contract structure, workload simplicity, or unstated human tuning.
  2. [Architecture] Architecture (contract definitions): The assumption that the four typed contracts can be maintained so that sub-agents reliably bound their searches and that inline citations plus typed runtime attribution signals propagate root-cause knowledge (rather than generic errors) is asserted but not supported by ablation or tracing of signal flow across iterations in the reported case.
minor comments (2)
  1. [Introduction/Architecture] Add a dedicated subsection early in the paper that formally defines the interfaces and invariants of each of the four typed contracts to improve readability for readers unfamiliar with the layered approach.
  2. [Related Work] Expand the related-work discussion to include recent agentic discovery systems and data-system composition frameworks for clearer positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify areas where the proof-of-life demonstration would benefit from greater transparency. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation (proof-of-life section): The claim that DDS converges where unbounded discovery does not rests on a single workload run, reported without iteration counts, success rates, failure-mode distributions, baseline comparisons, or logs showing how the four contracts produced bounded sub-searches. This is load-bearing for the central architectural claim and leaves open whether convergence arises from the contract structure, workload simplicity, or unstated human tuning.

    Authors: We acknowledge that the current evaluation presents only a qualitative proof-of-life on one trading-backend workload and does not report quantitative metrics such as iteration counts, success rates, or detailed logs. The manuscript positions this as an early prototype illustrating that the contract structure enabled convergence where an unbounded baseline did not. In the revised manuscript we will expand the evaluation section to include available iteration counts, observed failure modes in the unbounded case, and a step-by-step description of how the four contracts produced bounded sub-searches in the reported run. We will also add a brief discussion of potential human tuning and workload characteristics to address the concern that convergence may not generalize from the contract design alone.
    revision: yes

  2. Referee: [Architecture] Architecture (contract definitions): The assumption that the four typed contracts can be maintained so that sub-agents reliably bound their searches and that inline citations plus typed runtime attribution signals propagate root-cause knowledge (rather than generic errors) is asserted but not supported by ablation or tracing of signal flow across iterations in the reported case.

    Authors: We agree that the manuscript asserts the utility of the four contracts and the bidirectional knowledge channels without providing explicit tracing or ablation evidence. The proof-of-life example shows the outcome but does not walk through signal propagation. In the revision we will add a new figure and accompanying text that traces the forward flow of inline skill citations and the backward propagation of typed error signals for the reported deployment. We will also include a short discussion of observed challenges in maintaining contract consistency and how root-cause attribution differed from generic error logs in the case study.
    revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal without derivation chain or self-referential reduction

full rationale

The manuscript proposes Declarative Data Services as an architectural framework that decomposes agentic search into four typed contracts (intent, operator DAG, per-system skills, runtime attribution) to bound sub-searches and route knowledge via inline citations and typed error signals. This is presented as an original design with a proof-of-life demonstration on a trading-backend workload rather than any numerical derivation, fitted-parameter prediction, or equation that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the description; the central claim of convergence where unbounded discovery fails is asserted via empirical illustration without reducing to a tautology or renamed known result. The proposal remains self-contained as an engineering architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that LLM agents can exploit the typed layers effectively and that runtime feedback can be turned into reusable skill patches without additional mechanisms.

axioms (1)
  • domain assumption: LLM agents can perform effective bounded sub-searches when supplied with typed contracts and bidirectional knowledge channels.
    Invoked to justify why decomposing the search space leads to convergence.
invented entities (1)
  • Four typed contracts (intent, operator DAG, per-system skills, runtime attribution) · no independent evidence
    purpose: Decompose global search into bounded sub-searches with knowledge flow
    New structure introduced by the paper; no independent evidence supplied beyond the single workload example.

pith-pipeline@v0.9.0 · 5734 in / 1232 out tokens · 30369 ms · 2026-05-21T05:14:05.342220+00:00 · methodology


