ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Hengrui Zhang; Huanchen Zhang; Yihao Liu; Yulong Hui

arXiv: 2509.12610 · v2 · pith:QFYNVNFQnew · submitted 2025-09-16 · 💻 cs.DB · cs.AI · cs.LG

Pith reviewed 2026-05-22 12:35 UTC · model grok-4.3

classification 💻 cs.DB · cs.AI · cs.LG

keywords LLM predicates · document collections · semantic filtering · contrastive learning · proxy model · adaptive cascade · unstructured data · query optimization

The pith

ScaleDoc speeds up semantic predicates on large document collections by using an offline LLM representation phase and an online proxy model that filters most documents before invoking the full LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ScaleDoc to handle the high cost of applying LLMs to predicates over huge unstructured document sets. It runs an LLM once offline to produce semantic representations for every document. For each new query the system trains a small proxy model on those representations using contrastive learning so that the proxy can assign reliable scores and discard the clear majority of documents. Only the uncertain cases reach the full LLM, and an adaptive cascade chooses the filtering threshold to stay within a user-specified accuracy target. Evaluations on three datasets show more than 2x end-to-end speedup and up to 85 percent fewer LLM calls.

Core claim

ScaleDoc decouples predicate execution into an offline phase that uses an LLM to generate semantic representations for each document and an online phase that trains a lightweight contrastive-learning proxy model on those representations; the proxy produces decision scores that, together with an adaptive cascade, filter the bulk of documents while meeting accuracy targets and forwarding only ambiguous cases to the LLM.
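The online routing step in this claim can be sketched as follows. This is a minimal illustration, assuming (hypothetically) that the cascade uses a lower and an upper threshold on the proxy score; the paper's actual filtering policy is not given on this page, and `cascade_filter`, `low`, `high`, and `fake_llm` are illustrative names only.

```python
from typing import Callable, List, Sequence


def cascade_filter(
    scores: Sequence[float],
    low: float,
    high: float,
    llm_predicate: Callable[[int], bool],
) -> List[bool]:
    """Route each document by its proxy score (assumed two-threshold rule).

    score <= low  -> rejected by the proxy, no LLM call
    score >= high -> accepted by the proxy, no LLM call
    otherwise     -> ambiguous, forwarded to the LLM
    """
    decisions = []
    for doc_id, score in enumerate(scores):
        if score <= low:
            decisions.append(False)
        elif score >= high:
            decisions.append(True)
        else:
            decisions.append(llm_predicate(doc_id))
    return decisions


# Toy run: a stand-in "LLM" that records which documents it was asked about.
llm_calls: List[int] = []


def fake_llm(doc_id: int) -> bool:
    llm_calls.append(doc_id)
    return doc_id % 2 == 0


scores = [0.05, 0.95, 0.50, 0.10, 0.60, 0.90]
decisions = cascade_filter(scores, low=0.2, high=0.8, llm_predicate=fake_llm)
```

In this toy run only the two documents with mid-range scores reach the stand-in LLM; the other four are decided by the proxy alone.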

What carries the argument

Contrastive-learning-based proxy model trained on offline semantic representations, combined with an adaptive cascade that selects the filtering policy to satisfy accuracy constraints.

If this is right

Semantic predicates over document collections become feasible at scales where full LLM invocation per document would be prohibitive.
Query latency drops by more than half while preserving the accuracy level users specify.
LLM invocations are limited to a small, query-dependent fraction of the collection rather than the entire set.
The same offline representations can be reused across many ad-hoc queries without re-running the LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The offline representation step could be applied to other expensive models besides LLMs, such as large vision or multimodal models.
The approach might combine with traditional database indexes to handle mixed structured and semantic predicates in a single system.
Further gains could come from sharing proxy training across similar queries or from distilling the proxy into an even smaller model.

Load-bearing premise

The contrastive-learning proxy model produces decision scores accurate enough to filter the majority of documents without violating the target accuracy.

What would settle it

Run the proxy on a held-out dataset. The claim fails if either fewer than half the documents are filtered or end-to-end accuracy falls below the chosen target even after the cascade adjusts its threshold.
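The check above could be scripted roughly as below. This is a sketch under stated assumptions: accuracy is measured as agreement with reference labels (for example, full-LLM decisions on the held-out set), and `settling_check` and its parameter names are hypothetical, not taken from the paper.

```python
from typing import Dict, Sequence


def settling_check(
    final_decisions: Sequence[bool],
    reference_labels: Sequence[bool],
    sent_to_llm: Sequence[bool],
    accuracy_target: float,
) -> Dict[str, object]:
    """Compute the two quantities the held-out test depends on."""
    n = len(reference_labels)
    if n == 0 or len(final_decisions) != n or len(sent_to_llm) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    # Fraction of documents decided by the proxy alone (never sent to the LLM).
    filtered_fraction = sum(1 for sent in sent_to_llm if not sent) / n
    # Agreement between the cascade's final output and the reference labels.
    accuracy = sum(d == y for d, y in zip(final_decisions, reference_labels)) / n
    return {
        "filtered_fraction": filtered_fraction,
        "accuracy": accuracy,
        "claim_fails": filtered_fraction < 0.5 or accuracy < accuracy_target,
    }


report = settling_check(
    final_decisions=[True, False, True, True],
    reference_labels=[True, False, False, True],
    sent_to_llm=[False, False, False, True],
    accuracy_target=0.7,
)
```

Here three of four documents are filtered and three of four decisions match the reference, so the toy run does not trip the failure condition at a 0.7 target but would at 0.9.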

Figures

Figures reproduced from arXiv: 2509.12610 by Hengrui Zhang, Huanchen Zhang, Yihao Liu, Yulong Hui.

**Figure 1.** A detailed workflow of ScaleDoc: ScaleDoc efficiently adapts pre-computed embedding semantics for query-specific online processing, through its offline-online structure. The online process comprises a query-aware lightweight encoder and a subsequent cascade workflow.

**Figure 2.** Example score distributions of different lightweight …

**Figure 3.** Training Phase 1: Semantic Monotonicity. The primary goal of Phase 1 is to build the foundational semantic relationship between the documents and the query predicate. To achieve this, we use a contrastive loss, L_qsim, inspired by Dense Passage Retrieval (DPR) [17]. In our training, the query embedding z_q acts as an anchor. The objective is to pull positive document embeddings (d+) closer to the anchor whi…

**Figure 3.** Illustration of the objectives adopted in training …

**Figure 4.** End-to-end latencies and data reduction rate …

**Figure 5.** Breakdown for different pipelines, measuring aver…

**Figure 8.** Embedding relocation mapping during Query …

**Figure 9.** Average score distribution of positives and negatives …

**Figure 10.** Zero-shot cascade accuracy and data reduction rate …

**Figure 11.** Accuracy and Latency with different hyperparam…

Original abstract

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous documents and ad-hoc queries, while Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScaleDoc's offline embeddings plus per-query contrastive proxy and adaptive cascade offer a practical route to cheaper semantic predicates on documents, but the proxy calibration evidence needs to be checked to back the 85% reduction claim.

Here's the quick take on ScaleDoc: it decouples LLM predicate evaluation into an offline representation phase and an online phase where a contrastive proxy is trained per query on those representations to filter documents, with an adaptive cascade to hit accuracy goals while skipping most LLM calls. The paper does a solid job explaining why this matters for large document collections and how the two innovations (the contrastive framework and the adaptive cascade) aim to make the filtering reliable. The reported results of more than 2x speedup and up to 85% reduction in LLM use on three datasets show the approach can deliver real efficiency gains if the numbers hold.

Where it feels thinner is in the details around the proxy's performance. The key link is whether that contrastive proxy actually produces well-calibrated scores to separate clear-cut documents from ambiguous ones at scale. Without seeing specifics on calibration, threshold tuning, or accuracy trade-offs in the experiments, it's hard to tell if the reductions come while truly meeting the targets or if there's some slack. The per-query training also raises questions about added latency that aren't addressed in the high-level description.

This is the kind of paper for folks in databases and data systems who are trying to bring semantic capabilities to unstructured data workloads without blowing up the compute budget. A reader focused on practical LLM integration in query engines would pick up useful ideas here.

I'd send it out for peer review. The core idea is sound enough and the problem is relevant, so referees could help tighten the evaluation and confirm the claims.

Referee Report

2 major / 2 minor

Summary. The paper introduces ScaleDoc, a system for executing LLM-based predicates over large document collections. It decouples the process into an offline phase that uses an LLM to generate semantic representations for all documents and an online phase that, for each ad-hoc query, trains a lightweight proxy model via contrastive learning on those fixed representations; the proxy produces decision scores that feed an adaptive cascade, which filters the majority of documents while forwarding only ambiguous cases to the LLM to meet a user-specified accuracy target. The central claims are a >2× end-to-end speedup and up to 85% reduction in LLM invocations, demonstrated on three datasets.

Significance. If the proxy reliably separates clear from ambiguous documents at scale while preserving accuracy, ScaleDoc would make semantic predicates practical for large-scale document workloads in database systems, substantially lowering inference costs. The offline/online decoupling and per-query proxy training are technically interesting directions for scaling LLM-augmented data processing.

major comments (2)

[Evaluation section] The abstract and evaluation report concrete numbers (2× speedup, ≤85% LLM reduction) but supply no experimental setup details (dataset sizes and characteristics, query workload, baseline systems, number of runs, statistical significance, or error analysis), making it impossible to assess whether the claimed gains are achieved at the stated accuracy targets.
[§4, contrastive proxy and adaptive cascade] The manuscript provides no quantitative evidence on proxy calibration, decision-score distributions, threshold selection, or accuracy-vs-filtering trade-off curves. These measurements are load-bearing for the central claim that the cascade meets accuracy targets while correctly filtering the majority of documents.
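To make the threshold-selection question concrete, here is a brute-force sketch of how one point on an accuracy-vs-filtering curve might be computed from a labeled calibration sample. It is not the paper's adaptive cascade, whose mechanism is not described on this page; it assumes a two-threshold routing rule and that documents forwarded to the LLM are always judged correctly. `pick_thresholds` and its parameters are hypothetical names.

```python
from itertools import combinations
from typing import Sequence, Tuple


def pick_thresholds(
    scores: Sequence[float],
    labels: Sequence[bool],
    accuracy_target: float,
) -> Tuple[float, float, float]:
    """Return (low, high, filtered_fraction) for a labeled sample.

    Assumed routing rule: score <= low is rejected, score >= high is
    accepted, anything in between goes to the LLM and counts as correct.
    """
    n = len(scores)
    if n == 0 or n != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    # Infinite sentinels allow "reject nothing" and "accept nothing".
    candidates = [float("-inf")] + sorted(set(scores)) + [float("inf")]
    best = (float("-inf"), float("inf"), 0.0)
    for low, high in combinations(candidates, 2):  # sorted input, so low < high
        errors = 0
        filtered = 0
        for score, label in zip(scores, labels):
            if score <= low:
                filtered += 1
                errors += label  # a positive document was rejected
            elif score >= high:
                filtered += 1
                errors += not label  # a negative document was accepted
        accuracy = 1.0 - errors / n
        if accuracy >= accuracy_target and filtered / n > best[2]:
            best = (low, high, filtered / n)
    return best


scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]
strict = pick_thresholds(scores, labels, accuracy_target=1.0)
relaxed = pick_thresholds(scores, labels, accuracy_target=0.8)
```

Sweeping `accuracy_target` and recording the returned filtered fraction gives the kind of trade-off curve the referee asks for; on this toy sample, loosening the target lets the proxy decide more documents on its own.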

minor comments (2)

[Abstract] The abstract refers to 'three datasets' without naming them or giving high-level statistics (size, domain, predicate types).
[§4.1] Notation for 'predicating decision scores' and the exact form of the contrastive loss could be stated more precisely to aid reproducibility.
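For orientation on the second minor comment, the Figure 3 excerpt describes the Phase 1 loss as inspired by Dense Passage Retrieval, with the query embedding as anchor. The sketch below implements the standard DPR-style objective (negative log-likelihood of the positive under a softmax over dot-product similarities). Whether the paper's loss uses dot product, cosine similarity, or a temperature is not stated on this page, so this is an assumption, and `dpr_style_loss` is an illustrative name.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dpr_style_loss(
    z_q: Sequence[float],
    d_pos: Sequence[float],
    d_negs: Sequence[Sequence[float]],
) -> float:
    """-log( exp(s+) / (exp(s+) + sum_j exp(s_j-)) ), with s = dot(z_q, d)."""
    pos = dot(z_q, d_pos)
    logits = [pos] + [dot(z_q, d) for d in d_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - pos


anchor = [1.0, 0.0]
# The positive is aligned with the anchor, the negative is orthogonal.
good = dpr_style_loss(anchor, d_pos=[2.0, 0.0], d_negs=[[0.0, 1.0]])
# Roles swapped: the positive is orthogonal, the negative is aligned.
bad = dpr_style_loss(anchor, d_pos=[0.0, 1.0], d_negs=[[2.0, 0.0]])
tie = dpr_style_loss(anchor, d_pos=[1.0, 0.0], d_negs=[[1.0, 0.0]])
```

The loss is small when the positive document sits closer to the query anchor than the negatives and grows when the roles are reversed, which is the pull and push behaviour the excerpt describes.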

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript would benefit from expanded experimental details and additional quantitative analyses to better support the central claims. We will revise the paper accordingly and respond to each major comment below.

Referee: [Evaluation section] The abstract and evaluation report concrete numbers (2× speedup, ≤85% LLM reduction) but supply no experimental setup details (dataset sizes and characteristics, query workload, baseline systems, number of runs, statistical significance, or error analysis), making it impossible to assess whether the claimed gains are achieved at the stated accuracy targets.

Authors: We acknowledge that the Evaluation section requires more comprehensive details to enable proper assessment of the results. In the revised manuscript, we will expand this section to describe dataset sizes and characteristics, the query workload, baseline systems, number of runs, statistical significance testing, and error analysis. These additions will allow readers to evaluate the reported >2× speedup and up to 85% LLM reduction at the target accuracy levels.

revision: yes

Referee: [§4, contrastive proxy and adaptive cascade] The manuscript provides no quantitative evidence on proxy calibration, decision-score distributions, threshold selection, or accuracy-vs-filtering trade-off curves. These measurements are load-bearing for the central claim that the cascade meets accuracy targets while correctly filtering the majority of documents.

Authors: We agree that quantitative evidence on these aspects is important for validating the proxy and cascade. In the revision, we will augment §4 with analyses including proxy calibration metrics, decision-score distributions, threshold selection methodology, and accuracy-vs-filtering trade-off curves. This will provide direct support for how the adaptive cascade meets accuracy targets while filtering the majority of documents.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical system for scaling LLM predicates via offline embeddings and a per-query contrastive proxy plus adaptive cascade. Performance numbers (2× speedup, ≤85% LLM reduction) are reported from evaluations on three datasets rather than derived as predictions from fitted parameters. The proxy is trained on fixed representations for each new predicate, supplying independent grounding instead of reducing to self-definition or prior fits by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the abstract or described method. The contrastive framework and accuracy targets are design choices whose effectiveness is measured externally, not tautological.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that LLM-generated semantic representations contain enough signal for a lightweight proxy to learn reliable predicate decisions; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)

accuracy target
Adaptive cascade is designed to meet specific accuracy targets that are tunable to control the filtering policy.

axioms (1)

domain assumption: LLM-generated semantic representations capture predicate-relevant information sufficiently for proxy model training
This underpins the entire offline representation phase and online filtering effectiveness.

pith-pipeline@v0.9.0 · 5764 in / 1208 out tokens · 60929 ms · 2026-05-22T12:35:44.878714+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We propose ScaleDoc, a novel system that decouples execution into a one-time offline representation phase and an optimized online query phase.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PLOP: Cost-Based Placement of Semantic Operators in Hybrid Query Plans
cs.DB · 2026-04 · conditional · novelty 7.0

PLOP is a cost-based optimizer that finds optimal placements for semantic LLM operators in hybrid query plans via dynamic programming, delivering up to 1.5x speedup and 4.29x cost reduction on 44 benchmark queries whi...

Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
cs.DC · 2026-04 · unverdicted · novelty 5.0

BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwid...

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

[2]

Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment 17, 2 (2023), 92–105

[3]

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. In First Conference on Language Modeling. https://openreview.net/forum?id=IW1PR7vEBf

[4]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176 [cs.LG] https://arxiv.org/abs/2305.05176

[5]

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607

[6]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

[7]

Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Greg Kondrak and Taro Watanabe (Eds.). Asian Federation of Natural Language Processing, Taipei, Taiwan, 30...

[8]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Domini... doi:10.18653/v1/2021.emnlp-main.552

[9]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

[10]

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

[11]

Chuxuan Hu, Austin Peters, and Daniel Kang. 2025. LEAP: LLM-powered End-to-end Automatic Library for Processing Social Science Queries on Unstructured Data. arXiv preprint arXiv:2501.03892 (2025)

[12]

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient Attentions for Long Document Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1419–1436. https://doi.org/... doi:10.18653/v1/2021.naacl-main.112

[13]

Yulong Hui, Yao Lu, and Huanchen Zhang. [n.d.]. UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[14]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

[15]

Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proceedings of the VLDB Endowment 10, 11 (2017)

[16]

Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia

[17]

Approximate selection with guarantees using proxies. arXiv preprint arXiv:2004.00827 (2020)

[18]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. [n.d.]. Dense Passage Retrieval for Open-Domain Question Answering

[19]

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17, 8 (2024), 2104–2114

[20]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems 33 (2020), 18661–18673

[21]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

[22]

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428 (2024)

[23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742

[24]

Zhenwen Li and Tao Xie. 2024. Using LLM to select the right SQL Query from candidates. arXiv preprint arXiv:2401.02115 (2024)

[25]

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281 (2023)

[26]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano

[27]

A declarative system for optimizing ai workloads. arXiv preprint arXiv:2405.14696 (2024)

[28]

Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina Semnani, Chen Yu, and Monica Lam. 2024. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexic... doi:10.18653/v1/2024.findings-naacl.283

[29]

Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018. Accelerating machine learning inference with probabilistic predicates. In Proceedings of the 2018 International Conference on Management of Data. 1493–1508

[30]

Kyle Luoma and Arun Kumar. 2025. SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference. Proceedings of the ACM on Management of Data 3, 1 (2025), 1–26

[31]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. 2022. Large Dual Encoders Are Generalizable Retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 9844–9855

[32]

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexi...

[33]

Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data. arXiv preprint arXiv:2407.11418 (2024)

[34]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.

[35]

Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

[36]

Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Viktor Schlegel, Stefan Winkler, See-Kiong Ng, and Soujanya Poria. 2024. A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the CHATGPT Era and Beyond. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yv...

[37]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992

[38]

Ricardo Salazar-Díaz, Boris Glavic, and Tilmann Rabl. 2024. Inferdb: In-database machine learning inference using indexes. Proceedings of the VLDB Endowment 17, 8 (2024), 1830–1842

[39]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G Parameswaran, and Eugene Wu. 2024. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. arXiv preprint arXiv:2410.12189 (2024)

[40]

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 2204–2213. https://doi.org/10.18653... doi:10.18653/v1/p19-1212

[41]

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. In Annual Meeting of the Association for Computational Linguistics-ACL 2023 (09/07/2023-14/07/2023, Toronto, Canada)

[42]

Zhihui Yang, Zuozhi Wang, Yicong Huang, Yao Lu, Chen Li, and X Sean Wang

[43]

Optimizing machine learning inference queries with correlative proxy models. Proceedings of the VLDB Endowment 15, 10 (2022), 2032–2044

[44]

Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. 2025. Self-Enhancing Video Data Management System for Compositional Events with Large Language Models. Proc. ACM Manag. Data 3, 3, Article 215 (June 2025), 29 pages. https://doi.org/10.1145/3725352

[45]

Shuo Zhang, Zezhou Huang, and Eugene Wu. 2024. Data cleaning using large language models. arXiv preprint arXiv:2410.15547 (2024)

[46]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI'24). USENIX Association, USA, Article...


[1] [1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

work page 2024

[2] [2]

Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Ho- jel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment17, 2 (2023), 92–105

work page 2023

[3] [3]

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. InFirst Conference on Language Modeling. https://openreview.net/forum?id=IW1PR7vEBf

work page 2024

[4] [4]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176 [cs.LG] https://arxiv.org/abs/2305.05176

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. InInter- national conference on machine learning. PmLR, 1597–1607

work page 2020

[6] [6]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

work page 2022

[7] [7]

Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Greg Kondrak and Taro Watanabe (Eds.). Asian Federation of Natural Language Processing, Taipei, Taiwan, 30...

[8]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Domini...

doi:10.18653/v1/2021.emnlp-main.552

[9]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

[10]

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

[11]

Chuxuan Hu, Austin Peters, and Daniel Kang. 2025. LEAP: LLM-powered End-to-end Automatic Library for Processing Social Science Queries on Unstructured Data. arXiv preprint arXiv:2501.03892 (2025)

[12]

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient Attentions for Long Document Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1419–1436. https://doi.org/...

doi:10.18653/v1/2021.naacl-main.112

[13]

Yulong Hui, Yao Lu, and Huanchen Zhang. [n.d.]. UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[14]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

[15]

Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proceedings of the VLDB Endowment 10, 11 (2017)

[16]

Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia. 2020. Approximate selection with guarantees using proxies. arXiv preprint arXiv:2004.00827 (2020)

[18]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. [n.d.]. Dense Passage Retrieval for Open-Domain Question Answering

[19]

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17, 8 (2024), 2104–2114

[20]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems 33 (2020), 18661–18673

[21]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

[22]

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428 (2024)

[23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742

[24]

Zhenwen Li and Tao Xie. 2024. Using LLM to select the right SQL Query from candidates. arXiv preprint arXiv:2401.02115 (2024)

[25]

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281 (2023)

[26]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024. A declarative system for optimizing ai workloads. arXiv preprint arXiv:2405.14696 (2024)

[28]

Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina Semnani, Chen Yu, and Monica Lam. 2024. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexic...

doi:10.18653/v1/2024.findings-naacl.283

[29]

Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018. Accelerating machine learning inference with probabilistic predicates. In Proceedings of the 2018 International Conference on Management of Data. 1493–1508

[30]

Kyle Luoma and Arun Kumar. 2025. SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference. Proceedings of the ACM on Management of Data 3, 1 (2025), 1–26

[31]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. 2022. Large Dual Encoders Are Generalizable Retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 9844–9855

[32]

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexi...

[33]

Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data. arXiv preprint arXiv:2407.11418 (2024)

[34]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

[36]

Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Viktor Schlegel, Stefan Winkler, See-Kiong Ng, and Soujanya Poria. 2024. A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the CHATGPT Era and Beyond. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yv...

[37]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992

[38]

Ricardo Salazar-Díaz, Boris Glavic, and Tilmann Rabl. 2024. Inferdb: In-database machine learning inference using indexes. Proceedings of the VLDB Endowment 17, 8 (2024), 1830–1842

[39]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G Parameswaran, and Eugene Wu. 2024. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. arXiv preprint arXiv:2410.12189 (2024)

[40]

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 2204–2213. https://doi.org/10.18653...

doi:10.18653/v1/p19-1212

[41]

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. In Annual Meeting of the Association for Computational Linguistics-ACL 2023 (09/07/2023-14/07/2023, Toronto, Canada)

[42]

Zhihui Yang, Zuozhi Wang, Yicong Huang, Yao Lu, Chen Li, and X Sean Wang. 2022. Optimizing machine learning inference queries with correlative proxy models. Proceedings of the VLDB Endowment 15, 10 (2022), 2032–2044

[44]

Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. 2025. Self-Enhancing Video Data Management System for Compositional Events with Large Language Models. Proc. ACM Manag. Data 3, 3, Article 215 (June 2025), 29 pages. https://doi.org/10.1145/3725352

[45]

Shuo Zhang, Zezhou Huang, and Eugene Wu. 2024. Data cleaning using large language models. arXiv preprint arXiv:2410.15547 (2024)

[46]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI '24). USENIX Association, USA, Article...