Towards Cost-effective LLMs Routing with Batch Prompting

Haotian Xu; Jiadong Xie; Kangfei Zhao

arxiv: 2605.28268 · v1 · pith:O5KSVDFGnew · submitted 2026-05-27 · 💻 cs.DB

Pith reviewed 2026-06-29 09:37 UTC · model grok-4.3

classification 💻 cs.DB

keywords LLM routing · batch prompting · cost optimization · Pareto frontier · utility estimation · greedy scheduling · NP-hard problem · model serving

The pith

RoBatch jointly chooses the model and batch size for each query to reach a better cost-performance frontier than routing or batching alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that routing queries to different models and packing multiple queries into one prompt are complementary but have been treated separately until now. It defines the Route with Batching Problem as the task of assigning both a target model and a batch size to every query while respecting a total cost budget, and shows the problem is NP-hard. RoBatch solves it in two stages. The first builds a proxy utility model that estimates performance without batching and then applies a model-specific correction for batching degradation. The second runs a greedy scheduler that upgrades assignments along the cost-utility frontier until the budget is spent. Experiments on six benchmarks with the Qwen3 and Gemma3 families show that the combined approach consistently reaches a better Pareto frontier than either technique used by itself.

Core claim

RoBatch solves the Route with Batching Problem in two steps. It first builds a batch-aware proxy utility model that decomposes utility into a no-batching estimate plus a model-specific degradation recalibration term. It then runs a greedy scheduling algorithm that progressively upgrades each query's model and batch size assignment along the cost-utility Pareto frontier until the budget is exhausted. Across six benchmarks and two LLM families, this produces a superior cost-performance frontier compared with pure routing or pure batch-prompting baselines.
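The greedy stage can be sketched as follows. This is a minimal illustration and not the paper's algorithm: the candidate states, their costs and utilities, the gain-per-cost upgrade rule, and the function names are all assumptions made for the example, and the grouping of queries into actual batches is ignored.

```python
import heapq


def pareto_frontier(states):
    """Return the non-dominated (cost, utility, label) states, cheapest first."""
    frontier = []
    # Sort by cost ascending; among equal costs, try the higher utility first.
    for cost, utility, label in sorted(states, key=lambda s: (s[0], -s[1])):
        if not frontier or utility > frontier[-1][1]:
            frontier.append((cost, utility, label))
    return frontier


def greedy_schedule(candidates, budget):
    """Pick one state per query by repeatedly taking the best affordable upgrade.

    candidates: {query_id: [(cost, utility, label), ...]} with hypothetical
    proxy utilities; label is any tag such as (model, batch_size).
    Returns ({query_id: label}, total_cost).
    """
    frontiers = {q: pareto_frontier(s) for q, s in candidates.items()}
    position = {q: 0 for q in frontiers}  # every query starts at its cheapest state
    spent = sum(f[0][0] for f in frontiers.values())
    if spent > budget:
        raise ValueError("budget does not cover the cheapest assignment")

    heap = []  # max-heap on utility gain per unit of extra cost

    def push_next_upgrade(q):
        i, f = position[q], frontiers[q]
        if i + 1 < len(f):
            extra_cost = f[i + 1][0] - f[i][0]
            gain = f[i + 1][1] - f[i][1]
            heapq.heappush(heap, (-gain / extra_cost, q))

    for q in frontiers:
        push_next_upgrade(q)

    while heap:
        _, q = heapq.heappop(heap)
        i, f = position[q], frontiers[q]
        extra_cost = f[i + 1][0] - f[i][0]
        if spent + extra_cost <= budget:
            spent += extra_cost
            position[q] = i + 1
            push_next_upgrade(q)
        # An unaffordable upgrade is dropped, so that query stays where it is.

    return {q: frontiers[q][position[q]][2] for q in frontiers}, spent
```

Each query holds at most one pending upgrade in the heap, so the loop does work proportional to the total number of frontier states times a logarithmic heap cost. Dropping a query once its next upgrade is unaffordable is one simple stopping choice among several.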

What carries the argument

The batch-aware proxy utility model that decomposes utility estimation into a no-batching term plus a model-specific degradation recalibration.
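A short sketch may make the decomposition concrete. The additive form, the lookup table of penalties, the clipping to [0, 1], and every name and number below are assumptions made for illustration; this page does not specify how the paper parameterizes the recalibration term.

```python
def proxy_utility(base_utility, degradation, model, batch_size):
    """Batch-aware estimate: no-batching utility minus a model-specific penalty.

    base_utility: estimated utility of the query on `model` with batch size 1.
    degradation:  {model: {batch_size: penalty}}, a hypothetical calibration table.
    """
    penalty = degradation[model][batch_size]
    # Clip so the estimate stays a valid accuracy-like score.
    return max(0.0, min(1.0, base_utility - penalty))


# Hypothetical calibration: the smaller model degrades faster as batches grow.
DEGRADATION = {
    "small": {1: 0.0, 4: 0.125, 8: 0.25},
    "large": {1: 0.0, 4: 0.0, 8: 0.125},
}
```

With this table, a query whose no-batching estimate on the small model is 0.75 would be scored 0.5 at batch size 8. The point of the structure is that the per-query estimator is trained once, while the penalty table is calibrated separately per model.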

If this is right

Jointly optimizing model assignment and batch size produces a strictly better cost-utility Pareto frontier than optimizing either dimension separately.
The greedy scheduler can exhaust any given cost budget while remaining on the frontier defined by the proxy model.
The same two-stage structure works across different LLM families without requiring family-specific redesign.
The formulation shows that the Route with Batching Problem is NP-hard yet admits a practical greedy approximation that outperforms the independent baselines.
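One way to write the problem down, using notation that is assumed here rather than taken from the paper: let $Q$ be the query set, $M$ the model pool, $\mathcal{B}$ the allowed batch sizes, $u(q, m, b)$ the utility of query $q$ on model $m$ at batch size $b$, $c(q, m, b)$ the cost share attributed to that query, and $B$ the total budget.

```latex
\begin{aligned}
\max_{\{(m_q,\, b_q)\}_{q \in Q}} \quad & \sum_{q \in Q} u(q, m_q, b_q) \\
\text{s.t.} \quad & \sum_{q \in Q} c(q, m_q, b_q) \le B, \\
& m_q \in M, \quad b_q \in \mathcal{B} \quad \forall q \in Q .
\end{aligned}
```

Choosing exactly one state per query under a shared budget resembles a multiple-choice knapsack problem, which is consistent with the NP-hardness result. The paper's actual reduction is not reproduced on this page.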

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The proxy decomposition could be reused as a building block when adding other serving techniques such as quantization or prefix caching.
If the recalibration term proves stable across query domains, the same modeling stage might support online adaptation without retraining the utility estimator.
Extending the scheduler to handle dynamic arrival of queries would turn the static budget allocation into a streaming decision process.

Load-bearing premise

The batch-aware proxy utility model accurately predicts the actual utility loss that occurs when multiple queries are batched together.

What would settle it

A controlled experiment on held-out queries where the measured performance drop after batching deviates significantly from the proxy model's predicted degradation term for one or more models.

Figures

Figures reproduced from arXiv: 2605.28268 by Haotian Xu, Jiadong Xie, Kangfei Zhao.

**Figure 2.** The Impact of Routing on Avg. Acc. and Cost

**Figure 4.** The Impact of Batching on Cost

**Figure 5.** RCU and Avg. Acc. Curves of Different Batch Sizes

**Figure 6.** The Candidate States and Pareto Frontier of a Query

**Figure 7.** Overall Cost–Accuracy Trade-Off on Six Benchmarks with the Gemma3 and Qwen3 Model Families

**Figure 8.** Ablation Studies: Comparison of RoBatch with Router-Only and Batch-Only Counterparts

**Figure 9.** Accuracy of Different Coreset Sizes

**Figure 10.** Sensitivity of MLP and KNN Hyper-Parameters

**Figure 11.** Comparison of Latency and Scalability

**Figure 12.** Latency Breakdown of RoBatch. Panels: (a) MMLU, (b) AGNews, (c) IMDB. x-axis: # Queries; y-axis: Time (seconds). Components: Router Prediction, Proxy Utility Computation, Greedy Scheduling.

Original abstract

Large Language Model (LLM) serving systems must balance task performance against monetary cost. Two prominent optimization techniques have emerged independently: LLM routing, which directs each query to the most cost-effective model in a model pool, and batch prompting, which packs multiple queries into a single invocation to amortize the fixed cost of the shared system prompt. These two techniques are logically complementary; i.e., routing optimizes the model assignment dimension while batching optimizes the query aggregation dimension, jointly reshaping the landscape of model utility and monetary cost. However, existing approaches explore only one side of this decision space. On the basis of empirical studies on their impacts, we are motivated to jointly optimize these two dimensions in this paper. We formulate the Route with Batching Problem, which jointly determines the target model and batch size for each query under a total cost budget, and prove it NP-hard. To solve this challenging problem, we propose RoBatch, a unified two-stage framework. In the modeling stage, RoBatch constructs a batch-aware proxy utility model that decomposes combinatorial utility estimation into utility estimation without batching and recalibration of model-specific utility degradation with batching. In the routing stage, RoBatch employs a greedy scheduling algorithm that progressively upgrades the assignment of the target model and batch size for queries along the cost-utility Pareto frontier until the budget is exhausted. Extensive experiments on six benchmarks across two LLM families (Qwen3 and Gemma3) demonstrate that RoBatch consistently achieves a superior cost-performance Pareto frontier compared with LLM routing and batch prompting baselines.
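The amortization argument for batch prompting can be checked with a small calculation. This sketch assumes linear per-token pricing, equal-length queries, and input tokens only; the function name and the token counts are hypothetical and are not taken from the paper.

```python
def per_query_input_cost(system_tokens, query_tokens, batch_size, price_per_token):
    """Input cost attributed to one query when `batch_size` queries share one prompt.

    The shared system prompt is paid once per invocation, so each query carries
    1 / batch_size of it plus its own tokens.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (system_tokens / batch_size + query_tokens) * price_per_token


# Hypothetical workload: a 400-token system prompt and 100-token queries.
unbatched = per_query_input_cost(400, 100, batch_size=1, price_per_token=1.0)
batched = per_query_input_cost(400, 100, batch_size=4, price_per_token=1.0)
```

Under these assumptions the per-query cost falls from 500 to 200 token-units at batch size 4, and the saving flattens as the batch grows because the query's own tokens are never amortized. The accuracy side of the trade-off is what the proxy utility model has to capture.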

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RoBatch jointly optimizes routing and batching but depends on an unvalidated proxy for batch degradation effects.

RoBatch jointly optimizes routing and batching for LLMs and reports better cost-performance results than separate methods, but its success depends on an untested proxy for how batching affects utility.

The new element is the Route with Batching Problem, which assigns both a model and a batch size to each query while respecting a total cost budget. The authors prove this is NP-hard and then give RoBatch as a two-stage solution. The modeling stage creates a batch-aware proxy that takes the estimated utility without batching and adds a model-specific term for degradation under batching. The routing stage runs a greedy scheduler that keeps upgrading assignments along the cost-utility Pareto frontier until the budget runs out.

This decomposition makes the problem manageable, and the experiments on six benchmarks with Qwen3 and Gemma3 models show the method beats the baselines on the frontier. That is concrete evidence that combining the two techniques can help.

The main concern is the proxy itself. The paper says the recalibration term comes from empirical studies, but the abstract does not include any check on whether the proxy matches real batched performance. If batching causes effects like context interference that the term misses, then the claimed gains may not appear in deployment. This is the spot where more evidence would strengthen the paper.

The math and algorithm look standard for this type of scheduling problem. No obvious circularity in the evaluation.

This work is aimed at researchers and practitioners focused on efficient LLM serving. It has enough novelty and empirical support to merit peer review, where the main questions would be about proxy validation and experimental design.

I would send it to referees.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Route with Batching Problem, proves it NP-hard, and proposes RoBatch, a two-stage framework. The modeling stage builds a batch-aware proxy utility model that decomposes utility estimation into a no-batching term and a model-specific recalibration for batch-induced degradation. The routing stage uses a greedy algorithm to assign models and batch sizes along the cost-utility Pareto frontier under a budget. Experiments across six benchmarks and two LLM families (Qwen3, Gemma3) claim RoBatch achieves a superior Pareto frontier compared to separate routing and batch-prompting baselines.

Significance. If the proxy model accurately captures real utility degradation under batching, this work would meaningfully advance cost-effective LLM serving by unifying routing and batching optimizations. Strengths include the NP-hardness proof, the greedy scheduler, and evaluation on multiple benchmarks across LLM families. The approach addresses a practical gap in jointly optimizing model assignment and query aggregation.

major comments (1)

[Modeling stage] The description of the batch-aware proxy utility model (decomposing into no-batching utility plus recalibration term) lacks any quantitative validation, such as correlation coefficients, mean absolute error, or error bars, between the proxy predictions and empirically measured utilities on batched query sets. This is load-bearing for the central empirical claim, as the greedy scheduler in the routing stage relies directly on these proxy values to trace the Pareto frontier.

minor comments (1)

[Abstract] The abstract refers to 'empirical studies on their impacts' motivating the work but does not provide specific citations to those studies.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the NP-hardness proof, the greedy scheduler, and the practical value of jointly optimizing routing and batching. We address the single major comment below and will strengthen the manuscript accordingly.

Referee: [Modeling stage] The description of the batch-aware proxy utility model (decomposing into no-batching utility plus recalibration term) lacks any quantitative validation, such as correlation coefficients, mean absolute error, or error bars, between the proxy predictions and empirically measured utilities on batched query sets. This is load-bearing for the central empirical claim, as the greedy scheduler in the routing stage relies directly on these proxy values to trace the Pareto frontier.

Authors: We agree that the absence of direct quantitative validation for the batch-aware proxy utility model is a gap. While the manuscript demonstrates end-to-end Pareto improvements, it does not report correlation coefficients, MAE, or error bars comparing proxy predictions against measured utilities on batched query sets. In the revised version we will add a dedicated validation subsection (under the modeling stage) that reports these metrics (Pearson and Spearman correlations, and MAE with error bars) across the six benchmarks and both LLM families (Qwen3, Gemma3). The validation will use held-out batched query sets to check whether the decomposition (no-batching term + model-specific recalibration) accurately tracks observed utility degradation. This addition will directly support the scheduler's use of the proxy values.

Revision: yes
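The metrics named in this exchange can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names are hypothetical, and any inputs would have to come from real held-out batched runs, which this page does not provide.

```python
from math import sqrt


def mae(predicted, measured):
    """Mean absolute error between proxy predictions and measured utilities."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Reporting these per model and per batch size, rather than pooled, would show whether the recalibration term holds up where degradation is steepest. For a greedy scheduler the rank correlation is arguably the more relevant number, since upgrades depend on the ordering of candidate states more than on their absolute values.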

Circularity Check

0 steps flagged

No significant circularity detected

The derivation proceeds from an NP-hard formulation of the Route with Batching Problem, through an empirically motivated decomposition in the modeling stage that produces a proxy utility model, to a greedy scheduler in the routing stage, with final claims resting on separate benchmark experiments across Qwen3 and Gemma3. No quoted equation or step equates a claimed prediction or result to its own fitted inputs by construction, nor does any load-bearing premise reduce to a self-citation chain or imported uniqueness theorem. The proxy decomposition is presented as a modeling choice whose accuracy is assessed through the benchmark experiments rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proxy utility model introduces an implicit decomposition assumption that utility without batching plus a recalibration term suffices; no explicit free parameters or invented entities are named in the abstract, but the NP-hardness reduction and greedy upgrade rule rest on standard combinatorial optimization assumptions.

axioms (1)

Domain assumption: the utility degradation from batching can be captured by a model-specific recalibration factor that is independent of the specific query mix. Stated in the modeling stage description of the batch-aware proxy.

pith-pipeline@v0.9.1-grok · 5805 in / 1281 out tokens · 19769 ms · 2026-06-29T09:37:12.429259+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages · 8 internal anchors

[1] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. 2024. AutoMix: Automatically Mixing Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Informa...

[2] Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. 2026. State of AI: An Empirical 100 Trillion Token Study with OpenRouter. CoRR abs/2601.10088 (2026)

[3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015. The Association for Computational Linguistics, 632–642

[4] Edward Y. Chang and Longling Geng. 2025. SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning. Proc. VLDB Endow. 18, 12 (2025), 4874–4886

[5] Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, and Junchen Jiang. 2026. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project. arXiv preprint arXiv:2603.21354 (2026)

[6] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. CoRR abs/2402.03216 (2024)

[7] Lingjiao Chen, Matei Zaharia, and James Zou. 2024. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. Trans. Mach. Learn. Res. 2024 (2024)

[8] Zhoujun Cheng, Jungo Kasai, and Tao Yu. 2023. Batch Prompting: Efficient Inference with Large Language Model APIs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023 - Industry Track, Mingxuan Wang and Imed Zitouni (Eds.). Association for Computational Linguistics, 792–810

[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. CoRR abs/2110.14168 (2021)

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers). Associ...

[11] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. In The Twelfth International Conference on Learning Representations, ICLR 2024, May 7-11, 2024. OpenReview.net
[12] William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP 2005. Asian Federation of Natural Language Processing

[13] Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, and Xiaoyong Du. 2024. Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. In 40th IEEE International Conference on Data Engineering, ICDE 2024, May 13-16, 2024. IEEE, 3696–3709

[14] Teofilo F. Gonzalez. 1985. Clustering to Minimize the Maximum Intercluster Distance. Theor. Comput. Sci. 38 (1985), 293–306

[15] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In 9th International Conference on Learning Representations, ICLR

[16] Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, and Liang Lin. 2025. RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, 3860–3887

[17] Zhaoxuan Ji, Xinlu Wang, Zhaojing Luo, Zhongle Xie, and Meihui Zhang. 2025. Optimized Batch Prompting for Cost-effective LLMs. Proc. VLDB Endow. 18, 7 (2025), 2172–2184

[18] Richard M. Karp. 1972. Reducibility Among Combinatorial Problems. In Proceedings of a symposium on the Complexity of Computer Computations (The IBM Research Symposia Series). Plenum Press, New York, 85–103

[19] Guoliang Li, Jiayi Wang, Chenyang Zhang, and Jiannan Wang. 2025. Data+AI: LLM4Data and Data4LLM. In Companion of the 2025 International Conference on Management of Data, SIGMOD/PODS 2025. ACM, 837–843

[20] Hui Lin and Jeff A. Bilmes. 2009. How to select a good training-data subset for transcription: submodular active selection for sequences. In 10th Annual Conference of the International Speech Communication Association, INTERSPEECH

[21] Jianzhe Lin, Maurice Diesendruck, Liang Du, and Robin Abraham. 2024. BatchPrompt: Accomplish more with less. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net

[22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019)

[23] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. The Association for Computer Linguistics, 142–150

[24] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. 2025. RouteLLM: Learning to Route LLMs from Preference Data. In The Thirteenth International Conference on Learning Representations, ICLR 2025, April 24-28, 2025. OpenReview.net
[25] Liana Patel, Siddharth Jha, Melissa Z. Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, and Matei Zaharia. 2025. Semantic Operators and Their Optimization: Towards AI-Based Data Analytics with Accuracy Guarantees. Proc. VLDB Endow. 18, 11 (2025), 4171–4184

[26] Kangkang Qi, Dongyang Xie, Wenbo Li, Hao Zhang, Yuanyuan Zhu, Jeffrey Xu Yu, and Kangfei Zhao. 2026. Sema: A High-performance System for LLM-based Semantic Query Processing. arXiv preprint arXiv:2603.11622 (2026)

[27] Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2025. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. Proc. VLDB Endow. 18, 9 (2025), 3035–3048

[28] Shreya Shankar, Sepanta Zeighami, and Aditya G. Parameswaran. 2026. Task Cascades for Efficient Unstructured Data Processing. CoRR abs/2601.05536 (2026)

[29] Gemma Team. 2025. Gemma 3 Technical Report. CoRR abs/2503.19786 (2025)

[30] Qwen Team. 2025. Qwen3 Technical Report. CoRR abs/2505.09388 (2025)

[31] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR abs/2212.03533 (2022)

[32] Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. Association for Computational Linguistics, 8696–8708

[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022

[34] Max Welling. 2009. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009 (ACM International Conference Proceeding Series), Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman (Eds.). ACM, 1121–1128

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net

[36] Sepanta Zeighami, Shreya Shankar, and Aditya G. Parameswaran. 2025. Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees. Proc. ACM Manag. Data 3, 6 (2025), 1–26

[37] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. 649–657

[38] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

[39] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. CoRR abs/2506.05176 (2025)

[40] Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2024. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Trans. Knowl. Data Eng. 36, 11 (2024), 6889–6907

[41] Jun-Peng Zhu, Peng Cai, Kai Xu, Li Li, Yishen Sun, Shuai Zhou, Haihuang Su, Liu Tang, and Qi Liu. 2024. AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models. Proc. VLDB Endow. 17, 12 (2024), 3920–3933

[1] [1]

Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappa- ganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. 2024. AutoMix: Automatically Mixing Language Models. InAdvances in Neural Infor- mation Processing Systems 38: Annual Conference on Neural Informa...

2024

[2] [2]

Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. 2026. State of AI: An Empirical 100 Trillion Token Study with OpenRouter. CoRRabs/2601.10088 (2026)

work page arXiv 2026

[3] [3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Man- ning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP 2015. The Association for Computational Linguistics, 632–642

2015

[4] [4]

Chang and Longling Geng

Edward Y. Chang and Longling Geng. 2025. SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning.Proc. VLDB Endow.18, 12 (2025), 4874–4886

2025

[5] [5]

Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, and Junchen Jiang. 2026. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project.arXiv preprint arXiv:2603.21354(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation.CoRRabs/2402.03216 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Lingjiao Chen, Matei Zaharia, and James Zou. 2024. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.Trans. Mach. Learn. Res.2024 (2024)

2024

[8] [8]

Zhoujun Cheng, Jungo Kasai, and Tao Yu. 2023. Batch Prompting: Efficient Inference with Large Language Model APIs. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023 - Industry Track, Mingxuan Wang and Imed Zitouni (Eds.). Association for Computational Linguistics, 792–810

2023

[9] [9]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.CoRRabs/2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers). Associ...

2019

[11] [11]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. InThe Twelfth International Conference on Learning Representations, ICLR 2024, May 7-11, 2024. OpenReview.net

2024

[12] [12]

Dolan and Chris Brockett

William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. InProceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP 2005. Asian Federation of Natural Language Processing

2005

[13] [13]

Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, and Xiaoyong Du. 2024. Cost-Effective In-Context Learning for Entity Resolu- tion: A Design Space Exploration. In40th IEEE International Conference on Data Engineering, ICDE 2024, May 13-16, 2024. IEEE, 3696–3709

2024

[14] [14]

Gonzalez

Teofilo F. Gonzalez. 1985. Clustering to Minimize the Maximum Intercluster Distance.Theor. Comput. Sci.38 (1985), 293–306

1985

[15] [15]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Un- derstanding. In9th International Conference on Learning Representations, ICLR

2021

[16] [16]

Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, and Liang Lin. 2025. RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs. InFindings of the Association for Computational Linguistics: EMNLP 2025. Association for Compu- tational Linguistics, 3860–3887

2025

[17] [17]

Zhaoxuan Ji, Xinlu Wang, Zhaojing Luo, Zhongle Xie, and Meihui Zhang. 2025. Optimized Batch Prompting for Cost-effective LLMs.Proc. VLDB Endow.18, 7 (2025), 2172–2184

2025

[18] [18]

Richard M. Karp. 1972. Reducibility Among Combinatorial Problems. InPro- ceedings of a symposium on the Complexity of Computer Computations (The IBM Research Symposia Series). Plenum Press, New York, 85–103

1972

[19] [19]

Guoliang Li, Jiayi Wang, Chenyang Zhang, and Jiannan Wang. 2025. Data+AI: LLM4Data and Data4LLM. InCompanion of the 2025 International Conference on Management of Data, SIGMOD/PODS 2025. ACM, 837–843

[20]

Hui Lin and Jeff A. Bilmes. 2009. How to select a good training-data subset for transcription: submodular active selection for sequences. In 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009.

[21]

Jianzhe Lin, Maurice Diesendruck, Liang Du, and Robin Abraham. 2024. BatchPrompt: Accomplish more with less. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net.

[22]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019).

[23]

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. The Association for Computer Linguistics, 142–150.

[24]

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. 2025. RouteLLM: Learning to Route LLMs from Preference Data. In The Thirteenth International Conference on Learning Representations, ICLR 2025, April 24-28, 2025. OpenReview.net.

[25]

Liana Patel, Siddharth Jha, Melissa Z. Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, and Matei Zaharia. 2025. Semantic Operators and Their Optimization: Towards AI-Based Data Analytics with Accuracy Guarantees. Proc. VLDB Endow. 18, 11 (2025), 4171–4184.

[26]

Kangkang Qi, Dongyang Xie, Wenbo Li, Hao Zhang, Yuanyuan Zhu, Jeffrey Xu Yu, and Kangfei Zhao. 2026. Sema: A High-performance System for LLM-based Semantic Query Processing. arXiv preprint arXiv:2603.11622 (2026).

[27]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2025. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. Proc. VLDB Endow. 18, 9 (2025), 3035–3048.

[28]

Shreya Shankar, Sepanta Zeighami, and Aditya G. Parameswaran. 2026. Task Cascades for Efficient Unstructured Data Processing. CoRR abs/2601.05536 (2026).

[29]

Gemma Team. 2025. Gemma 3 Technical Report. CoRR abs/2503.19786 (2025).

[30]

Qwen Team. 2025. Qwen3 Technical Report. CoRR abs/2505.09388 (2025).

[31]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR abs/2212.03533 (2022).

[32]

Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. Association for Computational Linguistics, 8696–8708.

[33]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022.

[34]

Max Welling. 2009. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009 (ACM International Conference Proceeding Series), Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman (Eds.). ACM, 1121–1128.

[35]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

[36]

Sepanta Zeighami, Shreya Shankar, and Aditya G. Parameswaran. 2025. Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees. Proc. ACM Manag. Data 3, 6 (2025), 1–26.

[37]

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. 649–657.

[38]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. CoRR abs/2506.05176 (2025).

[40]

Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2024. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Trans. Knowl. Data Eng. 36, 11 (2024), 6889–6907.

[41]

Jun-Peng Zhu, Peng Cai, Kai Xu, Li Li, Yishen Sun, Shuai Zhou, Haihuang Su, Liu Tang, and Qi Liu. 2024. AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models. Proc. VLDB Endow. 17, 12 (2024), 3920–3933.
