PrismaDV: Automated Task-Aware Data Unit Test Generation
Pith reviewed 2026-05-09 22:10 UTC · model grok-4.3
The pith
PrismaDV generates executable data unit tests by jointly analyzing downstream task code and dataset profiles to capture implicit assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PrismaDV shows that a compound AI system can identify data access patterns and infer implicit data assumptions from downstream task code together with dataset profiles, then produce executable unit tests whose failures correspond to actual impacts on task correctness.
What carries the argument
The central mechanism is the compound pipeline that extracts data access patterns from task code, infers implicit assumptions by combining those patterns with dataset profiles, generates executable unit tests, and uses SIFTA to adapt its module prompts from the scarce execution outcomes of the tests and downstream tasks.
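As a rough illustration of what such a task-aware test might look like, a minimal pandas sketch follows; the task, column names, and assertions are hypothetical and not drawn from the paper.

```python
import pandas as pd

# Hypothetical downstream task; the column names and logic are illustrative,
# not taken from the paper.
def revenue_per_unit_by_region(df: pd.DataFrame) -> pd.Series:
    # Implicit assumptions: 'region' has no nulls (groupby silently drops them),
    # 'units_sold' is numeric and never zero, 'revenue' is numeric.
    return (df["revenue"] / df["units_sold"]).groupby(df["region"]).mean()

# The kind of task-aware data unit test PrismaDV is described as generating:
# it checks only the assumptions this particular task actually relies on.
def test_revenue_per_unit_assumptions(df: pd.DataFrame) -> None:
    assert df["region"].notna().all(), "nulls in 'region' are silently dropped by groupby"
    assert (df["units_sold"] != 0).all(), "zero 'units_sold' yields inf in the task output"
    assert pd.api.types.is_numeric_dtype(df["revenue"]), "'revenue' must be numeric"
```

A task-agnostic validator would instead check generic column properties regardless of whether the task reads them; the sketch ties every assertion to an operation the task performs.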
If this is right
- The generated tests more accurately reflect the end-to-end impact of data errors than task-agnostic or prior task-aware baselines.
- SIFTA learns prompts that outperform both hand-written prompts and those produced by generic prompt optimizers.
- Tests adapt automatically to specific datasets and tasks as execution feedback accumulates.
- The approach holds up across the two benchmarks, spanning 60 tasks over five different datasets.
Where Pith is reading between the lines
- Task-aware validation of this kind could be inserted into data pipelines to catch relevant errors before models or reports are produced.
- Self-improving test generators that refine themselves from their own run results become feasible.
- The same joint code-and-data analysis pattern may apply to other consumers such as database queries or API endpoints.
Load-bearing premise
That joint analysis of task code and dataset profiles is sufficient to accurately infer the implicit data assumptions that matter for end-to-end task correctness.
What would settle it
A dataset-task pair in which data errors that break the task go unflagged by the generated tests, or in which the tests flag data variations that leave the task outcome unchanged.
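A minimal sketch of the second failure mode, with a hypothetical task and test not taken from the paper: the test flags unsorted rows even though the task sorts its input itself, so its failures carry no end-to-end signal.

```python
import pandas as pd

# Hypothetical task: it sorts by date itself, so input row order is irrelevant.
def latest_price(df: pd.DataFrame) -> float:
    return df.sort_values("date")["price"].iloc[-1]

# An over-eager generated test: shuffled rows fail this check
# but leave the task outcome unchanged.
def test_rows_sorted_by_date(df: pd.DataFrame) -> None:
    assert df["date"].is_monotonic_increasing, "rows must arrive sorted by date"
```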
Original abstract
Data is a central resource for modern enterprises, and data validation is essential for ensuring the reliability of downstream applications. However, existing automated data unit testing frameworks are largely task-agnostic: they validate datasets without considering the semantics and requirements of the code that consumes the data. We present PrismaDV, a compound AI system that analyzes downstream task code together with dataset profiles to identify data access patterns, infer implicit data assumptions, and generate task-aware executable data unit tests. To further adapt the data unit tests over time to specific datasets and downstream tasks, we propose "Selective Informative Feedback for Task Adaptation" (SIFTA), a prompt-optimization framework that leverages the scarce outcomes from the execution of data unit tests and downstream tasks. We evaluate PrismaDV on two new benchmarks spanning 60 tasks across five datasets, where it consistently outperforms both task-agnostic and task-aware baselines in generating unit tests that reflect the end-to-end impact of data errors. Furthermore, we show that with SIFTA, we can automatically learn prompts for PrismaDV's modules that outperform prompts written by hand or generated from a generic prompt optimizer. We publicly release our benchmarks and prototype implementation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PrismaDV, a compound AI system that jointly analyzes downstream task code and dataset profiles to identify data access patterns, infer implicit assumptions, and generate executable task-aware data unit tests. It introduces SIFTA, a prompt-optimization method that uses scarce execution outcomes from unit tests and tasks to adapt prompts automatically. The system is evaluated on two newly introduced benchmarks spanning 60 tasks across five datasets, with claims of consistent outperformance over both task-agnostic and task-aware baselines in capturing the end-to-end effects of data errors. SIFTA is further shown to yield prompts superior to hand-written ones or those from generic optimizers. The benchmarks and prototype implementation are released publicly.
Significance. If the empirical claims hold under detailed scrutiny, the work would represent a meaningful step toward context-sensitive data validation that accounts for downstream task semantics, which is relevant for reliable data pipelines and ML systems. The release of new benchmarks and the SIFTA adaptation framework are positive contributions that could enable further research; the public code release supports reproducibility.
Major comments (2)
- [Evaluation / Results] Evaluation section (results on the two benchmarks): The abstract and summary claim consistent outperformance on 60 tasks, but no quantitative metrics (e.g., precision/recall of error detection, F1 scores, or end-to-end task accuracy deltas), statistical significance tests, or detailed baseline implementations are referenced. This makes it impossible to assess the magnitude or robustness of the reported gains and is load-bearing for the central empirical claim.
- [§3] §3 (PrismaDV architecture and inference of implicit assumptions): The description of how the compound AI system infers implicit data assumptions from static code analysis plus profiles does not address cases where critical assumptions (e.g., statistical correlations, business-rule invariants, or runtime error-propagation paths) are not explicitly visible in the code. Without concrete mechanisms, examples, or ablation showing recovery of such properties, the task-awareness advantage over baselines risks being an artifact of benchmark construction rather than a general property.
Minor comments (2)
- [Abstract] The abstract refers to 'two new benchmarks spanning 60 tasks across five datasets' but does not name the datasets or tasks; adding this information would improve clarity.
- [§4] Notation for SIFTA components (e.g., how 'scarce outcomes' are formalized as feedback signals) could be made more precise to aid reproducibility.
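One possible formalization, offered only to illustrate what more precise notation might look like (the signal definition below is an assumption, not taken from the paper): score each prompt candidate by agreement between test verdicts and observed task outcomes over the few executions available.

```python
from dataclasses import dataclass

# Sketch of one possible "scarce outcome" feedback record; the field names are
# assumptions, not the paper's notation. Each record pairs a generated test's
# verdict with whether the downstream task actually degraded on that data.
@dataclass
class Outcome:
    test_failed: bool
    task_degraded: bool

def prompt_score(outcomes: list[Outcome]) -> float:
    # Reward prompt candidates whose tests fail exactly when the task is affected.
    if not outcomes:
        return 0.0
    agreement = sum(o.test_failed == o.task_degraded for o in outcomes)
    return agreement / len(outcomes)
```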
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe these revisions will enhance the clarity and strength of our work.
Point-by-point responses
Referee: [Evaluation / Results] Evaluation section (results on the two benchmarks): The abstract and summary claim consistent outperformance on 60 tasks, but no quantitative metrics (e.g., precision/recall of error detection, F1 scores, or end-to-end task accuracy deltas), statistical significance tests, or detailed baseline implementations are referenced. This makes it impossible to assess the magnitude or robustness of the reported gains and is load-bearing for the central empirical claim.
Authors: We appreciate the referee's observation regarding the presentation of our evaluation results. While the evaluation section includes comparative results across the 60 tasks on the two benchmarks, we agree that the manuscript would benefit from more explicit quantitative metrics and details. In the revised manuscript, we will expand the evaluation section to include specific values for precision, recall, and F1 scores of the generated unit tests in detecting data errors, report deltas in end-to-end task accuracy, include statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values), and provide detailed descriptions of the baseline implementations, including how the task-aware baselines were adapted. This will allow readers to better assess the magnitude and robustness of the improvements. revision: yes
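For illustration, one way the promised precision, recall, and F1 could be scored against end-to-end impact (a sketch under assumed definitions, not the paper's evaluation protocol): treat each injected data error as a ground-truth positive only if it changes the downstream task outcome, and count whether any generated test flags it.

```python
# Sketch: scoring generated tests against ground-truth end-to-end impact.
# 'flagged' and 'impacts_task' are hypothetical boolean lists, one entry per injected error.
def precision_recall_f1(flagged: list[bool], impacts_task: list[bool]) -> tuple[float, float, float]:
    tp = sum(f and i for f, i in zip(flagged, impacts_task))
    fp = sum(f and not i for f, i in zip(flagged, impacts_task))
    fn = sum(i and not f for f, i in zip(flagged, impacts_task))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```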
Referee: [§3] §3 (PrismaDV architecture and inference of implicit assumptions): The description of how the compound AI system infers implicit data assumptions from static code analysis plus profiles does not address cases where critical assumptions (e.g., statistical correlations, business-rule invariants, or runtime error-propagation paths) are not explicitly visible in the code. Without concrete mechanisms, examples, or ablation showing recovery of such properties, the task-awareness advantage over baselines risks being an artifact of benchmark construction rather than a general property.
Authors: We thank the referee for this insightful comment on the architectural description. The PrismaDV system combines static analysis of downstream code with dataset profiles to identify access patterns and infer assumptions, with the LLM components aiding in synthesizing these into test cases. However, we acknowledge that the current §3 does not sufficiently detail the handling of implicit assumptions not directly visible in the code. In the revision, we will augment §3 with additional examples demonstrating inference of statistical correlations (via profile analysis), business-rule invariants (through contextual LLM reasoning), and error-propagation paths (leveraging task execution feedback). We will also add an ablation study that isolates the impact of these inference capabilities, to demonstrate that the advantages are general and not specific to the benchmark design. revision: yes
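To make the kind of mechanism promised here concrete, a hedged sketch of how a profile-derived statistical correlation could be compiled into an executable check; the profiling step, threshold, and function names are assumptions, not the paper's implementation.

```python
import pandas as pd

# Hypothetical mechanism: derive correlation-style assumptions from a profiled
# sample of the dataset and compile them into executable checks on new data.
# The 0.9 threshold and the drift tolerance are illustrative.
def correlation_tests(profile_df: pd.DataFrame, threshold: float = 0.9):
    corr = profile_df.select_dtypes("number").corr()
    tests = []
    for a in corr.columns:
        for b in corr.columns:
            if a < b and abs(corr.loc[a, b]) >= threshold:
                def check(new_df: pd.DataFrame, a=a, b=b) -> None:
                    # Re-check that the historically strong correlation still holds.
                    observed = new_df[[a, b]].corr().loc[a, b]
                    assert abs(observed) >= threshold - 0.1, (
                        f"correlation between {a!r} and {b!r} has drifted")
                tests.append(check)
    return tests
```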
Circularity Check
No circularity in empirical system evaluation
Rationale
The paper describes an empirical compound AI system (PrismaDV) and prompt optimizer (SIFTA) whose central claims rest on benchmark comparisons against task-agnostic and task-aware baselines across 60 tasks. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description that reduce any result to its inputs by construction. The evaluation uses newly introduced benchmarks and reports outperformance on end-to-end impact metrics, which are independent of internal redefinitions. This is a standard self-contained empirical contribution without load-bearing circular steps.