MLSkip: Data Skipping for ML Filters via Lightweight Metadata

Andreas Kipf; Andreas Zimmerer; Jan Van den Bussche; Mark Gerarts; Mihail Stoian; Pascal Ginter

arXiv: 2606.03946 · v1 · pith:5M4X7LVYnew · submitted 2026-06-02 · 💻 cs.DB · cs.LG · cs.LO

Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

Pith reviewed 2026-06-28 07:42 UTC · model grok-4.3

classification 💻 cs.DB · cs.LG · cs.LO

keywords data skipping · ML filters · Parquet metadata · convex hull · neural network verification · row group pruning · ReLU architectures

The pith

Parquet min-max metadata enables pruning of row groups for machine learning filter predicates via neural network verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard metadata already present in Parquet files is sufficient to decide which row groups can be skipped when a query applies a machine learning model as a filter. The decision is cast as a neural network verification query over each row group's value ranges, so the model need not be evaluated on row groups the verifier rules out. An improved metadata structure based on a size-bounded two-dimensional convex hull increases the amount of data that can be pruned while remaining compact. Integrated into DuckDB, the approach yields an end-to-end speedup of 1.07× over PyTorch.

Core claim

The authors demonstrate that min-max metadata already stored in Parquet files can be fed into neural network verification procedures to decide whether an entire row group can be pruned for a ReLU neural network acting as a filter predicate. On tables from standard benchmarks this yields 27.4% average pruning for filters with selectivity under 0.1%. Replacing or augmenting the metadata with a size-bounded two-dimensional convex hull raises pruning effectiveness to 38.31% at a cost of no more than 45 bytes per row group and column pair.
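The summary above does not say which verification procedure the authors use. As a minimal sketch of how min-max bounds can yield a skip decision, the following uses interval bound propagation, a standard sound but incomplete technique from the verification literature, on a tiny hand-written ReLU network. The network weights, the convention that a row qualifies when `model(x) > threshold`, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# One dense layer: (weights[out][in], biases[out]). Hypothetical representation.
Layer = Tuple[List[List[float]], List[float]]


def interval_bounds(layers: Sequence[Layer],
                    lo: Sequence[float],
                    hi: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Propagate the input box [lo, hi] through a ReLU MLP.

    ReLU is applied after every layer except the last one.
    """
    lo, hi = list(lo), list(hi)
    for idx, (weights, biases) in enumerate(layers):
        new_lo, new_hi = [], []
        for row, b in zip(weights, biases):
            # A non-negative weight is minimized at lo and maximized at hi;
            # a negative weight is the other way around.
            new_lo.append(b + sum(w * (lo[j] if w >= 0 else hi[j])
                                  for j, w in enumerate(row)))
            new_hi.append(b + sum(w * (hi[j] if w >= 0 else lo[j])
                                  for j, w in enumerate(row)))
        if idx < len(layers) - 1:  # hidden layer: apply ReLU to both bounds
            new_lo = [max(0.0, v) for v in new_lo]
            new_hi = [max(0.0, v) for v in new_hi]
        lo, hi = new_lo, new_hi
    return lo, hi


def can_skip_row_group(layers: Sequence[Layer],
                       col_min: Sequence[float],
                       col_max: Sequence[float],
                       threshold: float = 0.0) -> bool:
    """True if no point in the min-max box can satisfy model(x) > threshold."""
    _, out_hi = interval_bounds(layers, col_min, col_max)
    return out_hi[0] <= threshold


# Toy network (assumed): 2 inputs -> 2 hidden ReLU units -> 1 output score.
toy_layers: List[Layer] = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [-2.0]),
]

# Two row groups described only by their per-column min and max.
skip_a = can_skip_row_group(toy_layers, [0.0, 0.0], [1.0, 1.0])
skip_b = can_skip_row_group(toy_layers, [0.0, 0.0], [4.0, 4.0])
```

Because interval propagation over-approximates the reachable outputs, a `True` verdict is safe (no qualifying row is dropped), while a `False` verdict only means the group could not be ruled out.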

What carries the argument

Neural network verification applied to Parquet min-max metadata (and an enhanced 2D convex hull) to determine row-group pruning for ML filters.
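To illustrate why a convex hull over a column pair can prune more than the per-column min-max box, the sketch below computes a 2D hull with Andrew's monotone chain and compares the largest value a linear score can take over the box and over the hull. The linear score stands in for a real model, the data points are made up, and the size bound on the hull is not modeled; none of this is taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def max_score_over_box(w: Point, b: float, points: Sequence[Point]) -> float:
    """Upper bound of w.x + b using only per-column min and max."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (b + max(w[0] * min(xs), w[0] * max(xs))
              + max(w[1] * min(ys), w[1] * max(ys)))


def max_score_over_hull(w: Point, b: float, hull: Sequence[Point]) -> float:
    """A linear function attains its maximum over a polygon at a vertex."""
    return b + max(w[0] * x + w[1] * y for x, y in hull)


# Made-up row group with two negatively correlated columns.
rows = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
hull = convex_hull(rows)

# Hypothetical filter: a row qualifies when x + y - 4 > 0.
w, b = (1.0, 1.0), -4.0
box_bound = max_score_over_box(w, b, rows)     # box corner (3, 3) is reachable
hull_bound = max_score_over_hull(w, b, hull)   # only actual hull vertices count

skip_with_box = box_bound <= 0.0
skip_with_hull = hull_bound <= 0.0
```

In this example the box includes the corner (3, 3), which no row comes near, so the box cannot rule the group out, whereas the hull can. For a ReLU network the hull edges would presumably enter the verifier as linear input constraints rather than through vertex enumeration.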

Load-bearing premise

The time required to run verification on the metadata must remain smaller than the time saved by avoiding model evaluations on pruned row groups.

What would settle it

Compare the wall-clock time of verification plus model evaluation on remaining groups against the time of model evaluation on all groups for the same query.
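The comparison above can be written as a small cost model. The sketch assumes a uniform verification cost `t_verify` and evaluation cost `t_eval` per row group, which is a simplification, and all numbers in the example are hypothetical. Under these assumptions skipping pays off exactly when `t_verify < pruned_fraction * t_eval`.

```python
def time_with_skipping(n_groups: int, n_pruned: int,
                       t_verify: float, t_eval: float) -> float:
    """Verify every row group, then evaluate the model on the survivors."""
    return n_groups * t_verify + (n_groups - n_pruned) * t_eval


def time_without_skipping(n_groups: int, t_eval: float) -> float:
    """Evaluate the model on every row group."""
    return n_groups * t_eval


def skipping_pays_off(n_groups: int, n_pruned: int,
                      t_verify: float, t_eval: float) -> bool:
    return (time_with_skipping(n_groups, n_pruned, t_verify, t_eval)
            < time_without_skipping(n_groups, t_eval))


# Hypothetical workload: 1000 row groups, 274 pruned (27.4%),
# 10 ms of model evaluation per group.
cheap_verifier = skipping_pays_off(1000, 274, t_verify=1.0, t_eval=10.0)
costly_verifier = skipping_pays_off(1000, 274, t_verify=3.0, t_eval=10.0)
```

With a pruning rate of 27.4%, the break-even point in this model is a verification cost of 27.4% of the per-group evaluation cost: the 1 ms verifier wins, the 3 ms verifier loses.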

Figures

Figures reproduced from arXiv: 2606.03946 by Andreas Kipf, Andreas Zimmerer, Jan Van den Bussche, Mark Gerarts, Mihail Stoian, Pascal Ginter.

**Figure 2.** Data skipping behavior for varying filter selectivity under two row group sizes

**Figure 3.** Data skipping time for models on TPC-H and TPC

Original abstract

Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet's default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of 1.07× over PyTorch in DuckDB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper opens up data skipping for ML filters using Parquet min-max metadata and a new convex-hull structure, but leaves the pruning algorithm and verification cost unspecified.

The main point is that this work starts exploring row-group pruning for ML predicates by linking Parquet metadata to neural-network verification techniques. On ReLU models it reports 27.4% average pruning for sub-0.1% selectivity queries on TPC-H and TPC-DS tables, rising to 38.31% with a size-bounded 2D convex hull that stays under 45 bytes per row group and column pair, plus a 1.07× DuckDB speedup.

What is actually new is the application itself, since prior skipping work stayed with integer and string predicates, and the convex-hull metadata, which is drawn from the spatial-join literature and adapted here. The numbers come from standard benchmarks and the storage overhead looks modest.

The soft spots are the missing pieces. The abstract states that min-max metadata enables pruning yet supplies no algorithm, decision procedure, or pseudocode for turning the bounds into a skip decision. The pruning figures are aggregates only and labeled preliminary. Most critically, nothing addresses the verification cost of the NN techniques; if bound propagation or abstract interpretation per row group exceeds a few microseconds, the I/O savings disappear. The concern about verification cost therefore stands on the evidence given.

This is for database researchers working on AI-augmented query engines. A reader already following ML predicate optimization will pick up the direction and the baseline numbers. It deserves a serious referee because the topic is timely and the benchmarks are reproducible, even though the current version needs the algorithmic details and cost measurements filled in.

Recommendation: send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that Parquet's default min-max metadata, combined with neural network verification techniques, suffices to prune row groups for ReLU-based ML filter predicates, achieving 27.4% average pruning effectiveness on TPC-H/TPC-DS tables for filters with selectivity below 0.1%; an enhanced size-bounded 2D convex hull metadata raises this to 38.31% at ≤45 bytes per row group and column pair, yielding a 1.07× end-to-end speedup in DuckDB.

Significance. If a concrete, low-overhead verification procedure can be shown to produce the reported pruning rates without verification cost exceeding I/O savings, the work would usefully connect database metadata structures to NN verification for a new class of predicates. The lightweight convex-hull proposal and the use of existing Parquet metadata are pragmatic strengths, but the preliminary status and missing algorithmic details reduce immediate impact.

major comments (2)

[Abstract] Abstract: the central claim that 'Parquet's default min-max metadata is enough to enable pruning' is stated without any description of the decision procedure, bound-propagation rules, or algorithm that maps min-max intervals (or the convex hull) to a pruning verdict via NN verification.
[Results] Results and evaluation sections: the reported 27.4% and 38.31% pruning rates and 1.07× DuckDB speedup are aggregate numbers only; no per-row-group timing, verification runtime measurements, or comparison against I/O savings is provided, leaving the skeptic's concern about verification cost unaddressed.

minor comments (1)

The manuscript repeatedly labels the results 'preliminary'; a clearer statement of the exact scope of the experiments (network architectures, column pairs, row-group sizes) would help readers assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our preliminary work. We address each major comment below and will revise the manuscript to improve clarity on the decision procedure and evaluation details.

Referee: [Abstract] Abstract: the central claim that 'Parquet's default min-max metadata is enough to enable pruning' is stated without any description of the decision procedure, bound-propagation rules, or algorithm that maps min-max intervals (or the convex hull) to a pruning verdict via NN verification.

Authors: We agree the abstract would benefit from additional context on the approach. In the revision we will update the abstract to briefly describe the use of bound-propagation rules from neural network verification to derive pruning verdicts from min-max intervals (and the convex-hull variant) for ReLU networks. The body of the paper already contains the algorithmic mapping, but we will ensure the abstract makes the connection explicit without exceeding length limits.

revision: yes

Referee: [Results] Results and evaluation sections: the reported 27.4% and 38.31% pruning rates and 1.07× DuckDB speedup are aggregate numbers only; no per-row-group timing, verification runtime measurements, or comparison against I/O savings is provided, leaving the skeptic's concern about verification cost unaddressed.

Authors: The current evaluation focuses on aggregate pruning effectiveness and end-to-end speedup, which already incorporates verification overhead in the DuckDB measurements. We acknowledge that explicit per-row-group verification timings and a direct I/O-savings comparison would better address potential skepticism. In the revised version we will add a table or subsection with verification runtime breakdowns and show that they remain below the I/O savings for the reported selectivity range, using the observed 1.07× net speedup as supporting evidence.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

The paper's central claims consist of measured pruning rates (27.4% and 38.31%) obtained by applying neural-network verification techniques to Parquet min-max metadata and a proposed convex-hull structure on TPC-H and TPC-DS tables. These percentages are produced by direct experimentation rather than any derivation that reduces to fitted parameters, self-citations, or definitional equivalence. No equations or load-bearing steps in the provided text equate a reported result to its own inputs by construction. The work therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that ReLU networks admit verification routines usable with scalar min-max bounds and on the choice of a 45-byte size limit for the convex hull; no new physical entities are postulated.

free parameters (1)

convex hull size bound = 45 bytes
Chosen by hand to keep the metadata lightweight: at most 45 bytes per row group and column pair.

axioms (1)

domain assumption: ReLU architectures permit pruning decisions via neural network verification on min-max metadata
Results are stated only for ReLU architectures; the connection to verification is invoked without further proof.

invented entities (1)

size-bounded 2D convex hull metadata (no independent evidence)
purpose: To provide richer bounds than min-max for verification-based pruning of ML filters
New metadata structure proposed in the paper; no independent evidence outside the reported experiments.


[1] [1]

Paritosh Aggarwal, Bowei Chen, Anupam Datta, Benjamin Han, Boxin Jiang, Nitish Jindal, Zihan Li, Aaron Lin, Pawel Liskowski, Jay Tayade, et al . 2025. Cortex AISQL: A Production SQL Engine for Unstructured Data.arXiv preprint arXiv:2511.07663(2025). 4

Pith/arXiv arXiv 2025

[2] [2]

2021.Introduction to Neural Network Verification

Aws Albarghouthi. 2021.Introduction to Neural Network Verification. verified- deeplearning.com. arXiv:2109.10317 [cs.LG] http://verifieddeeplearning.com

arXiv 2021

[3] [3]

2026.Amazon Redshift ML

Amazon Web Services. 2026.Amazon Redshift ML. https://docs.aws.amazon. com/redshift/latest/dg/machine_learning.html

2026

[4] [4]

2026.Metadata

Apache Parquet. 2026.Metadata. https://parquet.apache.org/docs/file-format/ metadata/

2026

[5] [5]

Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1994. Multi-Step Processing of Spatial Joins. InProceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA, May 24-27, 1994, Richard T. Snodgrass and Marianne Winslett (Eds.). ACM Press, 197–208. https://doi.org/10.1145/191839.191880

work page doi:10.1145/191839.191880 1994

[6] [6]

Bergman, Vittorio Castelli, Chung-Sheng Li, Ming- Ling Lo, and John R

Yuan-Chi Chang, Lawrence D. Bergman, Vittorio Castelli, Chung-Sheng Li, Ming- Ling Lo, and John R. Smith. 2000. The Onion Technique: Indexing for Linear Optimization Queries. InProceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, Weidong Chen, Jeffrey F. Naughton, and Philip A. Bernstein (...

work page doi:10.1145/342009.335433 2000

[7] [7]

Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves- Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, et al. 2026. 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models.arXiv preprint arXiv:2603.15970 (2026)

Pith/arXiv arXiv 2026

[8] [8]

G.E. Collins. 1975. Quantifier elimination for real closed fields by cylindrical algebraic decomposition.Lecture Notes in Computer Science33 (1975), 134–183

1975

[9] [9]

2025.Demographia International Housing Affordability

Wendy Cox. 2025.Demographia International Housing Affordability. Technical Report. Frontier Centre for Public Policy, Canada. https://policycommons. net/artifacts/21033541/demographia-international-housing/21933951/ Re- trieved from https://coilink.org/20.500.12592/3hb8vjr on May 31, 2026. COI: 20.500.12592/3hb8vjr

arXiv 2025

[10] [10]

Databricks. 2026. ai_query Function. https://docs.databricks.com/aws/en/sql/ language-manual/functions/ai_query

2026

[11] [11]

Anas Dorbani, Sunny Yasser, Jimmy Lin, and Amine Mhedhbi. 2025. Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB.Proc. VLDB Endow.18, 12 (Sept. 2025), 5415–5418. https://doi.org/10.14778/3750601. 3750685

work page doi:10.14778/3750601 2025

[12] [12]

Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyper- loglog: the analysis of a near-optimal cardinality estimation algorithm.Discrete mathematics & theoretical computer scienceProceedings (2007)

2007

[13] [13]

Freitag and Thomas Neumann

Michael J. Freitag and Thomas Neumann. 2019. Every Row Counts: Combining Sketches and Sampling for Accurate Group-By Result Esti- mates. In9th Biennial Conference on Innovative Data Systems Research, CIDR 2019, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org. https://vldb.org/cidrdb/2019/every-row-counts-combining- sketches-and-...

2019

[14] [14]

Mark Gerarts, Juno Steegmans, and Jan Van den Bussche. 2025. SQL4NN: Valida- tion and expressive querying of models as data. InProceedings of the Workshop on Data Management for End-to-End Machine Learning. 1–5

2025

[15] [15]

Google Cloud. 2026. ML.PREDICT Function. https://cloud.google.com/bigquery/ docs/reference/standard-sql/bigqueryml-syntax-predict. Google Cloud Docu- mentation, accessed 2026-05-31

2026

[16] [16]

Martin Grohe, Christoph Standke, Juno Steegmans, and Jan Van den Bussche

[17] [17]

In28th International Conference on Database Theory, ICDT 2025, March 25–28, 2025, Barcelona, Spain (LIPIcs), Sudeepa Roy and Ahmet Kara (Eds.), Vol

Query Languages for Neural Networks. In28th International Conference on Database Theory, ICDT 2025, March 25–28, 2025, Barcelona, Spain (LIPIcs), Sudeepa Roy and Ahmet Kara (Eds.), Vol. 328. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 9:1–9:18. https://doi.org/10.4230/LIPICS.ICDT.2025.9

work page doi:10.4230/lipics.icdt.2025.9 2025

[18] [18]

Yunyan Guo, Guoliang Li, Ruilin Hu, and Yong Wang. 2025. In-database query optimization on SQL with ML predicates.VLDB J.34, 1 (2025), 12

2025

[19] [19]

Dong He, Supun Chathuranga Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstanti- nos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes.Proc. VLDB Endow.15, 11 (2022), 2811–2825. https: //doi.org/10.14778/3551793.3551833

work page doi:10.14778/3551793.3551833 2022

[20] [20]

Saehan Jo and Immanuel Trummer. 2024. ThalamusDB: Approximate Query Processing on Multi-Modal Data.Proc. ACM Manag. Data2, 3 (2024), 186. https: //doi.org/10.1145/3654989

work page doi:10.1145/3654989 2024

[21] [21]

Michael Jungmair, André Kohn, and Jana Giceva. 2022. Designing an Open Framework for Query Optimization and Compilation.Proceedings of the VLDB Endowment15, 11 (2022), 2389–2401

2022

[22] [22]

Gaurav Tarlok Kakkar, Jiashen Cao, Aubhro Sengupta, Joy Arulraj, and Hyesoon Kim. 2025. Aero: Adaptive Query Processing of ML Queries.Proc. ACM Manag. Data3, 3 (2025), 174:1–174:27

2025

[23] [23]

Konstantinos Karanasos, Matteo Interlandi, Doris Xin, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Supun Nakandala, Subru Krishnan, Markus Weimer, Yuan Yu, Raghu Ramakrishnan, and Carlo Curino. 2019. Extending Relational Query Processing with ML Inference.CoRRabs/1911.00231 (2019)

arXiv 2019

[24] [24]

Barrett, David L

Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer

[25] [25]

Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CA V 2017, Heidelberg, Germany, July 24–28, 2017, Proceedings, Part I (Lecture Notes in Computer Science), Rupak Majumdar and Viktor Kuncak (Eds.), Vol. 10426. Springer, 97–117. https: //doi.org/10.1007/978-3-319-63387-9_5

work page doi:10.1007/978-3-319-63387-9_5 2017

[26]

Guy Katz, Derek A. Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljic, David L. Dill, Mykel J. Kochenderfer, and Clark W. Barrett. 2019. The Marabou Framework for Verification and Analysis of Deep Neural Networks. In Computer Aided Verification - 31st International Conference, C... doi:10.1007/978-3-030-25540-4_26

[27]

Konstantin Kaulen, Tobias Ladner, Stanley Bak, Christopher Brix, Hai Duong, Thomas Flinkow, Taylor T. Johnson, Lukas Koller, Edoardo Manino, ThanhVu H. Nguyen, and Haoze Wu. 2025. The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and Results. CoRR abs/2512.19007 (2025). https://doi.org/10.48550/ARXIV.2512.19007 arXi...

[28]

Suhas Kotha, Christopher Brix, J. Zico Kolter, Krishnamurthy Dvijotham, and Huan Zhang. 2023. Provably Bounding Neural Network Preimages. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Glob...

[29]

Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher A. Strong, Clark W. Barrett, and Mykel J. Kochenderfer. 2021. Algorithms for Verifying Deep Neural Networks. Found. Trends Optim. 4, 3-4 (2021), 244–404. https://doi.org/10.1561/2400000035

[30]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, et al. 2025. Palimpzest: Optimizing AI-powered analytics with declarative query processing. In Proceedings of the Conference on Innovative Database Research (CIDR). 2

[31]

Microsoft. 2026. PREDICT (Transact-SQL) - SQL Machine Learning. https://learn.microsoft.com/en-us/sql/t-sql/queries/predict-transact-sql?view=sql-server-ver17

[32]

Oracle. 2026. Oracle Machine Learning for SQL (OML4SQL). https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/index.html

[33]

Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. CoRR abs/2407.11418 (2024). https://doi.org/10.48550/ARXIV.2407.11418 arXiv:2407.11418

[34]

Maximilian Rieger, Moritz Sichert, and Thomas Neumann. 2022. Integrating deep learning frameworks into main-memory databases. In Proceedings of the VLDB 2022 Applied AI for Database Systems and Applications Workshop co-located with (VLDB 2022) (AIDB Workshop Proceedings). https://drive.google.com/file/d/1GfZH3Y1sQKgplnnpTEM_E4skWdhmyrfe/edit

[35]

SciPy Developers. 2025. scipy.spatial.ConvexHull. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html. Accessed: 2026-05-31

[36]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2025. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. Proc. VLDB Endow. 18, 9 (Sept. 2025), 3035–3048. https://doi.org/10.14778/3746405.3746426

[37]

Snowflake. 2026. Snowflake Model Registry. https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/overview

[38]

Transaction Processing Performance Council. [n.d.]. TPC Benchmark H (Decision Support) Standard Specification. https://www.tpc.org/tpch/. Version 3.0.1

[39]

Transaction Processing Performance Council. [n.d.]. TPC-DS Benchmark Standard Specification. https://www.tpc.org/tpcds/. Version 4.0.0

[40]

Joseph A. Vincent and Mac Schwager. 2025. Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for Deep-Learned Control Systems. IEEE Trans. Neural Networks Learn. Syst. 36, 10 (2025), 19225–19239. https://doi.org/10.1109/TNNLS.2025.3571720

[41]

Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. 2021. Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Robustness Verification. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Decemb...

[42]

Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, and Rathijit Sen. 2025. Terabyte-Scale Analytics in the Blink of an Eye. Proc. VLDB Endow. 19, 2 (2025), 141–155. https://www.vldb.org/pvldb/vol19/p141-sen.pdf

[43]

Haoze Wu, Omri Isac, Aleksandar Zeljic, Teruhiro Tagomori, Matthew L. Daggitt, Wen Kokke, Idan Refaeli, Guy Amir, Kyle Julian, Shahaf Bassan, Pei Huang, Ori Lahav, Min Wu, Min Zhang, Ekaterina Komendantskaya, Guy Katz, and Clark W. Barrett. 2024. Marabou 2.0: A Versatile Formal Analyzer of Neural Networks. In Computer Aided Verification - 36th Internatio... doi:10.1007/978-3-031-65630-9_13

[44]

Weiming Xiang, Hoang-Dung Tran, and Taylor T. Johnson. 2017. Reachable Set Computation and Safety Verification for Neural Networks with ReLU Activations. CoRR abs/1712.08163 (2017). arXiv:1712.08163 http://arxiv.org/abs/1712.08163

[45]

Qi Zhu, Jigao Luo, and Andrew Lamb. [n.d.]. Embedding User-Defined Indexes in Apache Parquet Files. https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ Apache DataFusion Blog

[46]

Andreas Zimmerer, Damien Dam, Jan Kossmann, Juliane Waack, Ismail Oukid, and Andreas Kipf. 2025. Pruning in Snowflake: Working Smarter, Not Harder. In Companion of the 2025 International Conference on Management of Data, SIGMOD/PODS 2025, Berlin, Germany, June 22-27, 2025, Volker Markl, Joseph M. Hellerstein, and Azza Abouzied (Eds.). ACM, 757–770. https...