arxiv: 2606.05679 · v1 · pith:ZHCI3X5Rnew · submitted 2026-06-04 · 💻 cs.DB · cs.AI

Data Flow Control: Data Safety Policies for AI Agents

Pith reviewed 2026-06-27 23:26 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords: data flow control, data safety policies, query rewriting, provenance monomials, DBMS engines, AI agents, policy enforcement, data infrastructure

The pith

Data Flow Control enforces safety policies on AI-generated SQL queries inside the DBMS through query rewriting without materializing provenance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that safety constraints on how data may be combined and released are a data infrastructure problem rather than a prompt or post-hoc issue. It formalizes data safety as aggregate predicates over provenance monomials. Passant implements this as a portable rewriting layer that translates policies into modified queries. The rewritten queries run on unmodified engines and avoid storing full provenance. Experiments across five DBMS engines show the approach incurs roughly zero overhead while outperforming alternatives that require materialization.

Core claim

Data Flow Control formalizes data safety as aggregate predicates over provenance monomials. Passant enforces these predicates by rewriting queries in an optimizer-invariant way that requires no provenance materialization and no changes to the underlying DBMS, delivering near-zero overhead across DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer while outperforming materializing alternatives by orders of magnitude.

What carries the argument

Passant, a portable query rewriting layer that translates DFC policies expressed as aggregate provenance predicates into equivalent rewritten queries.
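
To make the rewriting idea concrete, the following is a minimal hand-written sketch, not Passant's actual output. It follows the pattern described in the Figure 5 caption (compute policy aggregates, filter violations, project out policy-related attributes) for a query of the shape γ(T ⋈ S). The tables, the `policy_n` column, and the "at least 2 distinct customers per group" policy are all hypothetical, and SQLite is used only because it ships with Python; it is not one of the five engines evaluated in the paper.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (cust TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE S (cust TEXT, amount INTEGER);
    INSERT INTO T VALUES ('a', 'east'), ('b', 'east'), ('c', 'west');
    INSERT INTO S VALUES ('a', 10), ('a', 5), ('b', 7), ('c', 20);
""")

# Original query Q: an aggregate over a join.
ORIGINAL = """
    SELECT T.region, SUM(S.amount) AS total
    FROM T JOIN S ON T.cust = S.cust
    GROUP BY T.region
    ORDER BY T.region
"""

# Hand-written rewrite (an assumption, not generated by Passant).
REWRITTEN = """
    SELECT region, total                            -- 3. project out the policy column
    FROM (
        SELECT T.region AS region,
               SUM(S.amount) AS total,
               COUNT(DISTINCT T.cust) AS policy_n   -- 1. policy aggregate
        FROM T JOIN S ON T.cust = S.cust
        GROUP BY T.region
    ) AS q
    WHERE policy_n >= 2                             -- 2. filter violating groups
    ORDER BY region
"""

original_rows = conn.execute(ORIGINAL).fetchall()
rewritten_rows = conn.execute(REWRITTEN).fetchall()
print(original_rows)   # both regions are returned
print(rewritten_rows)  # only the region that satisfies the policy
```

In this sketch the policy aggregate is computed in the same pass as the query's own aggregate, so no per-row provenance is stored. Whether a given engine plans the rewritten query as cheaply as the original is an empirical question that the paper's evaluation addresses.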

If this is right

  • AI agents can generate and run queries while data combination rules are guaranteed at execution time inside the engine.
  • The same rewritten-query mechanism applies without modification to five different DBMS engines.
  • Policy enforcement does not require storing or querying complete provenance records.
  • Overhead remains negligible relative to the original query cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could safely orchestrate longer data-analysis pipelines if DFC policies were attached to common analysis templates.
  • The rewriting technique may extend to other infrastructure-level constraints such as differential privacy or regulatory release rules.
  • Integration with query optimizers that already track provenance statistics could further reduce any residual cost.

Load-bearing premise

Data safety policies can always be expressed as aggregate predicates over provenance monomials that remain optimizer-invariant and can be enforced solely by query rewriting without materializing full provenance or modifying the DBMS.
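
As a rough illustration of the premise, the sketch below uses the standard provenance-semiring reading, in which a join multiplies tuple annotations and grouping sums them, so each output is annotated by a sum of monomials. The relations, the tuple identifiers, and the `min_distinct_customers` predicate are hypothetical, and the paper's formal definition of an aggregate predicate may differ from this simplified reading.

```python
from collections import defaultdict

# Hypothetical annotated relations: (provenance identifier, row).
T = [("t1", {"cust": "a", "region": "east"}),
     ("t2", {"cust": "b", "region": "east"}),
     ("t3", {"cust": "c", "region": "west"})]
S = [("s1", {"cust": "a", "amount": 10}),
     ("s2", {"cust": "a", "amount": 5}),
     ("s3", {"cust": "b", "amount": 7}),
     ("s4", {"cust": "c", "amount": 20})]

def join_monomials(left, right, key):
    """Join two annotated relations. Each result row carries one monomial:
    the product (here a sorted tuple) of the contributing tuple identifiers."""
    out = []
    for lid, lrow in left:
        for rid, rrow in right:
            if lrow[key] == rrow[key]:
                out.append((tuple(sorted((lid, rid))), {**lrow, **rrow}))
    return out

def group_rows(rows, group_key):
    """Group joined rows. The monomials collected for a group play the role
    of the terms of that output's provenance polynomial."""
    groups = defaultdict(list)
    for monomial, row in rows:
        groups[row[group_key]].append((monomial, row))
    return dict(groups)

def min_distinct_customers(rows, k):
    """Hypothetical aggregate predicate over the monomials of one output:
    they must mention at least k distinct tuples of T."""
    t_ids = {ident for monomial, _ in rows
             for ident in monomial if ident.startswith("t")}
    return len(t_ids) >= k

joined = join_monomials(T, S, "cust")
allowed = {region: min_distinct_customers(rows, k=2)
           for region, rows in group_rows(joined, "region").items()}
print(allowed)
```

The predicate here depends only on which input tuples appear in the monomials, not on the order in which the joins were evaluated, which is the kind of property the premise requires. The sketch materializes the monomials explicitly, which is precisely what the paper claims Passant avoids.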

What would settle it

A concrete data safety policy that cannot be expressed as an aggregate predicate over provenance monomials, or a rewritten query produced by Passant that fails to block a violating result on one of the five tested engines.

Figures

Figures reproduced from arXiv: 2606.05679 by Charlie Summers, Eugene Wu.

Figure 1: DFC vs LLM calls check 13 TPC-H queries (5 runs …
Figure 2: TaxAgent categorizes Receipts into Expenses. To ensure tax law is followed, DFC policies are defined with the PGN language. Passant rewrites Q with Full-Push into Q′ that also evaluates the policy. Q′ violates the policy because biz_use exceeds 50% for a Meal receipt. The row is not inserted (marked red to highlight the failure). Beyond agents, US Department of Education’s Every Student Succeeds Act req…
Figure 3: Provenance polynomials are sensitive to the physi…
Figure 4: The naive approach relies on a provenance-enabled …
Figure 5: Overview of enforcement strategies for Q = γ(T ⋈ S) and a policy with constraint B = σγΣ. The pattern is to compute policy aggregates, evaluate policy constraints and filter violations, then project out policy-related attributes. Post-process materializes provenance polynomials then must join with the input relations to access attributes needed to compute policy aggregates. Partial-Push optimizes this by push…
Figure 6: Relative overhead of policy enforcement methods …
Figure 7: Relative overhead to enforce 1 policy across 5 major …
Figure 9: Highlighting where (Red) Partial-Push outperforms Full-Push, or (Blue) Full-Push outperforms Partial-Push …
Figure 10: Scaling the number of sources in a Full-Push policy on a TPC-H Q1 variant. A policy with n sources is applied if join count is ≥ n. To do so, we scale both the number of sources in the policy and in the query from 1 to 8. For the query, we start with a modified Q1, γ(σ(lineitem)), that starts as-is only reading lineitem, then incrementally add 1 to 7 joins with other TPC-H relations by foreign key relat…
Figure 12: We compare No Policy, Full-Push, and GPT-5.2 to maintain a state machine. No Policy allows illegal state transitions. Full-Push and GPT-5.2 are both correct, and Full-Push is 2 orders of magnitude faster. T(id, state) is initialized with 1000 items all where state = A. We then run 1000 UPDATE statements on T where 70% are valid state transitions randomly sampled from the state machine, and 30% are inva…
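
The state-machine experiment in the Figure 12 caption can be approximated with a small sketch. This is a hand-written guard on each UPDATE, assuming a hypothetical `transitions` table with states A, B, and C; the paper's actual state machine and its policy definition are not given in the caption, and the rewrite Passant produces may look different.

```python
import sqlite3

# Hypothetical setup: two items in state A and a three-state machine A -> B -> C.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE transitions (src TEXT, dst TEXT);
    INSERT INTO T VALUES (1, 'A'), (2, 'A');
    INSERT INTO transitions VALUES ('A', 'B'), ('B', 'C');
""")

def guarded_update(conn, item_id, new_state):
    """Apply the update only if (current state -> new_state) is a legal
    transition. Returns True when a row was changed."""
    cur = conn.execute(
        """
        UPDATE T SET state = :new
        WHERE id = :id
          AND EXISTS (SELECT 1 FROM transitions
                      WHERE src = T.state AND dst = :new)
        """,
        {"id": item_id, "new": new_state},
    )
    return cur.rowcount == 1

results = [
    guarded_update(conn, 1, "B"),  # A -> B is in the transition table
    guarded_update(conn, 1, "C"),  # B -> C is in the transition table
    guarded_update(conn, 2, "C"),  # A -> C is not, so the row is left alone
]
final_states = conn.execute("SELECT id, state FROM T ORDER BY id").fetchall()
print(results, final_states)
```

The check runs inside the same statement as the write, so an illegal transition never reaches the table. That is the general idea behind moving enforcement into the engine rather than asking a model to validate each statement.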
Original abstract

Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem. This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines -- DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer -- Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. Data Flow Control is available open source at https://github.com/dataflowcontrol/data-flow-control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. It formalizes data safety as aggregate predicates over provenance monomials and presents Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines (DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer), Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude, positioning DFC as a step toward embedding data safety in data infrastructure rather than prompts or post-hoc checks.

Significance. If the central claims hold, the work is significant for systems research on AI agent data handling: it provides a declarative, optimizer-invariant policy language grounded in provenance and demonstrates portable enforcement via rewriting with negligible overhead. The cross-engine evaluation and open-source release are strengths that could influence practical adoption in DBMS-backed agent pipelines.

major comments (2)
  1. [§4] §4 (Policy Language): The claim that policies expressed as aggregate predicates over provenance monomials remain optimizer-invariant requires explicit proof that the rewriting rules commute with standard relational optimizations (e.g., join reordering, predicate pushdown); without this, the portability guarantee across engines is not fully substantiated by the presented formalism.
  2. [§6] §6 (Evaluation, Table 2): The reported ~0% overhead is load-bearing for the central performance claim, yet the methodology section does not detail how query plans were normalized across engines or whether the baseline alternatives included equivalent provenance tracking; this leaves open whether the orders-of-magnitude improvement is due to the rewriting technique or differences in baseline implementation.
minor comments (2)
  1. [Abstract] Abstract and §1: The term 'provenance monomials' is introduced without a brief inline definition or reference to the standard provenance semiring literature; adding one sentence would improve accessibility.
  2. [§5] §5 (Passant Implementation): The description of the rewriting algorithm would benefit from a small pseudocode listing or diagram showing the transformation from policy predicate to rewritten query.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation for minor revision. We address each major comment below.

  1. Referee: [§4] §4 (Policy Language): The claim that policies expressed as aggregate predicates over provenance monomials remain optimizer-invariant requires explicit proof that the rewriting rules commute with standard relational optimizations (e.g., join reordering, predicate pushdown); without this, the portability guarantee across engines is not fully substantiated by the presented formalism.

    Authors: We acknowledge the need for an explicit proof. In the revised version, we will add to Section 4 a detailed argument establishing that the rewriting rules commute with standard relational optimizations, thereby substantiating the optimizer-invariance and portability claims. revision: yes

  2. Referee: [§6] §6 (Evaluation, Table 2): The reported ~0% overhead is load-bearing for the central performance claim, yet the methodology section does not detail how query plans were normalized across engines or whether the baseline alternatives included equivalent provenance tracking; this leaves open whether the orders-of-magnitude improvement is due to the rewriting technique or differences in baseline implementation.

    Authors: The evaluation methodology can be clarified. We will revise Section 6 to include details on how query plans were normalized (e.g., by disabling certain optimizations where necessary and using consistent cost models) and confirm that the alternative baselines were equipped with comparable provenance tracking capabilities. This will better isolate the benefits of the Passant rewriting layer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained systems contribution

The paper introduces Data Flow Control as a new declarative framework and implements it via the Passant query-rewriting layer. Its central claims rest on formalizing policies as aggregate predicates over provenance monomials and demonstrating portable enforcement through rewriting that avoids materialization. These steps are presented as engineering and systems design choices rather than derivations that reduce by construction to fitted parameters or prior self-citations. Performance results (~0% overhead across five engines) are empirical measurements, not quantities defined by the paper's own equations. No load-bearing step matches any of the enumerated circularity patterns; the work is a self-contained systems artifact whose correctness can be evaluated against external benchmarks and open-source code.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper introduces the DFC framework and Passant as new artifacts while relying on standard assumptions from database systems and data provenance research; no free parameters or invented physical entities are described in the abstract.

axioms (2)
  • [domain assumption] Data safety policies can be captured as aggregate predicates over provenance monomials
    Central to the formalization of data safety stated in the abstract.
  • [domain assumption] Query rewriting can enforce such predicates without materializing provenance and while remaining optimizer-invariant
    Required for the claim that Passant works portably across engines with near-zero overhead.
invented entities (2)
  • Data Flow Control (DFC) [no independent evidence]
    purpose: Declarative framework for specifying and guaranteeing policy enforcement over tuple-level data flows
    New framework presented in the paper
  • Passant [no independent evidence]
    purpose: Portable query rewriting layer that enforces DFC policies
    New implementation artifact introduced by the authors

pith-pipeline@v0.9.1-grok · 5725 in / 1479 out tokens · 23655 ms · 2026-06-27T23:26:48.181862+00:00 · methodology

