arxiv: 2512.05525 · v2 · submitted 2025-12-05 · 💻 cs.DB · cs.LG

Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement

Pith reviewed 2026-05-17 01:29 UTC · model grok-4.3

classification 💻 cs.DB cs.LG
keywords just-in-time model replacement · large language models · model search · transfer learning · cost reduction · energy efficiency · recurring tasks · database systems

The pith

Large language models can be transparently swapped for cheaper alternatives once recurring tasks are detected in usage patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a vision for just-in-time model replacement that monitors calls to a large language model and, upon spotting a repeated task, substitutes a smaller model tailored to that task. This substitution happens without changes to the user experience or additional development work by the original callers. The authors position model search and transfer learning as practical ways to locate or adapt the replacement model quickly. A prototype called Poodle is used to show concrete reductions in cost and energy for example workloads while keeping the flexibility that makes large models attractive in the first place.

Core claim

Just-in-time model replacement (JITR) detects recurring tasks from patterns in large-language-model calls and replaces the expensive model with a cheaper one that performs adequately for the specific task; the replacement is performed transparently using model search and transfer learning, and the Poodle prototype demonstrates measurable savings on representative tasks.

What carries the argument

Just-in-time model replacement (JITR), the mechanism that monitors call patterns to identify recurring tasks and then substitutes a lower-cost model found or adapted via search and transfer learning.
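
The mechanism described above can be sketched as a thin proxy in front of the LLM. This is a minimal illustration, not the Poodle implementation: the class name `JITRProxy`, the call-count `threshold`, and the `build_replacement` callback (a stand-in for model search plus transfer learning) are all assumptions made for the example. The task classifier here only normalizes the system prompt, whereas the base prompt in Figure 3 asks the LLM itself to infer the task type.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

# Hypothetical type: a "model" maps (system_prompt, user_message) to a reply.
Model = Callable[[str, str], str]


class JITRProxy:
    """Routes calls to the LLM until a task recurs often enough to be replaced."""

    def __init__(self, llm: Model, classify_task: Callable[[str], str],
                 build_replacement: Callable[[str], Model], threshold: int = 3):
        self.llm = llm
        self.classify_task = classify_task
        self.build_replacement = build_replacement
        self.threshold = threshold  # assumed trigger, not taken from the paper
        self.counts: Counter = Counter()
        self.replacements: Dict[str, Model] = {}

    def call(self, system_prompt: str, user_message: str) -> Tuple[str, str]:
        """Return (reply, served_by); served_by is 'llm' or 'small'."""
        task = self.classify_task(system_prompt)
        if task in self.replacements:
            return self.replacements[task](system_prompt, user_message), "small"
        self.counts[task] += 1
        if self.counts[task] >= self.threshold:
            # Stand-in for model search plus transfer learning.
            self.replacements[task] = self.build_replacement(task)
        return self.llm(system_prompt, user_message), "llm"


# Toy stand-ins so the sketch runs without any external service.
def fake_llm(system_prompt: str, user_message: str) -> str:
    return "negative" if "bad" in user_message else "positive"


def fake_builder(task: str) -> Model:
    return lambda system_prompt, user_message: (
        "negative" if "bad" in user_message else "positive")


proxy = JITRProxy(fake_llm, lambda s: s.strip().lower(), fake_builder, threshold=3)
served = [proxy.call("Classify sentiment", m)[1] for m in ["good", "bad", "ok", "bad"]]
```

The caller's interface does not change when the swap happens, which is the transparency property the paper emphasizes. A real system would presumably also validate the replacement before routing traffic to it.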

If this is right

  • Organizations using large models for repetitive automation can lower ongoing compute and energy costs without rewriting applications.
  • Model search combined with transfer learning becomes a standard step for turning general-purpose LLM usage into efficient per-task deployments.
  • Development teams retain the low-effort prompting workflow of large models while the system handles the switch to smaller ones behind the scenes.
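
Whether the savings in the first point materialize depends on how quickly the one-time cost of finding and adapting a replacement is amortized. The following back-of-the-envelope sketch makes that trade-off explicit. The function names and all numbers used with them are hypothetical and are not taken from the paper.

```python
import math


def break_even_calls(build_cost: float, llm_cost_per_call: float,
                     small_cost_per_call: float) -> int:
    """Smallest number of calls after which replacing is strictly cheaper overall."""
    saving = llm_cost_per_call - small_cost_per_call
    if saving <= 0:
        raise ValueError("replacement must be cheaper per call than the LLM")
    # n * saving > build_cost  <=>  n > build_cost / saving
    return math.floor(build_cost / saving) + 1


def relative_saving(llm_cost_per_call: float, small_cost_per_call: float) -> float:
    """Per-call cost reduction as a fraction of the LLM's per-call cost."""
    return 1.0 - small_cost_per_call / llm_cost_per_call
```

For instance, with an assumed build cost of 10 units and per-call costs of 3 (LLM) and 1 (replacement), the replacement pays for itself from the sixth call onward.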

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detection-and-replacement logic could be applied inside database engines that expose LLM calls as part of query processing.
  • Over repeated replacements a library of task-specific small models would accumulate, reducing the need for fresh searches in future cases.
  • The approach might extend to other high-cost AI services beyond language models if usage patterns can be grouped similarly.

Load-bearing premise

Recurring tasks can be reliably recognized from patterns in calls to the large model, and suitable cheaper models can be located or created without large additional effort.

What would settle it

Measurements from a production workload showing either low accuracy in detecting the same task across calls or no meaningful performance parity between the original model and the replacement on that task.
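
One simple way to quantify the parity part of such a measurement is the agreement rate between the original model and its replacement on logged inputs. The sketch below assumes both models can be called as plain functions from text to label. It measures agreement with the LLM, not accuracy against ground truth, and the function name is hypothetical.

```python
from typing import Callable, Iterable


def agreement_rate(inputs: Iterable[str], llm: Callable[[str], str],
                   replacement: Callable[[str], str]) -> float:
    """Fraction of inputs on which the replacement matches the LLM's output."""
    total = 0
    matches = 0
    for text in inputs:
        total += 1
        if llm(text) == replacement(text):
            matches += 1
    if total == 0:
        raise ValueError("need at least one input")
    return matches / total
```

A low agreement rate on a task that the system classified as recurring would be evidence against the premise. A high rate would still need to be checked against labeled data, since the LLM's own outputs may be wrong.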

Figures

Figures reproduced from arXiv: 2512.05525 by Boris Glavic, Nils Strassenburg, Tilmann Rabl.

Figure 1: Sentiment classification use case. (1) a recurring …
Figure 2: The just-in-time model replacement workflow.
Figure 3: Base Prompt. You are a powerful language model. You are given a user request consisting of a system prompt and a user message. Your task is as follows: 1. Identify the type of input the user is providing: one of ["text", "image", "table", "other"] 2. Infer what task you are expected to perform, choosing from: ["sentiment classification", "summarization", "translation", "question answering", "information ext…
Figure 4: Wrapper Prompt. { "input_type": "text", "task_type": "sentiment classification", "user_response": "sentiment": "negative" }
Figure 6: Just-in-time model replacement architecture.
Figure 8: Time for different model development approaches.
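
The wrapper response shown in Figure 4 carries the inferred input type and task type alongside the actual answer, which is what lets a monitoring layer group calls by task. A minimal parsing sketch follows. It assumes the response is valid JSON and that `user_response` is a nested object; the figure as extracted does not show this unambiguously. The names `WrappedReply` and `parse_wrapped_reply` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class WrappedReply:
    input_type: str
    task_type: str
    user_response: Any  # assumed to be a nested JSON value


def parse_wrapped_reply(raw: str) -> WrappedReply:
    """Parse the LLM's wrapped JSON reply and check the three expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"input_type", "task_type", "user_response"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return WrappedReply(data["input_type"], data["task_type"], data["user_response"])


raw = ('{"input_type": "text", "task_type": "sentiment classification", '
       '"user_response": {"sentiment": "negative"}}')
reply = parse_wrapped_reply(raw)
```

Only `user_response` would be returned to the caller. The other two fields serve as metadata for detecting recurring tasks.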
Original abstract

Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models. LLMs require few, if any, training examples and can be utilized by users without expertise in model development. However, this comes at the cost of substantially higher resource and energy consumption compared to smaller models, which often achieve similar predictive performance for simple tasks. In this paper, we present our vision for just-in-time model replacement (JITR), where, upon identifying a recurring task in calls to an LLM, the model is replaced transparently with a cheaper alternative that performs well for this specific task. JITR retains the ease of use and low development effort of LLMs, while saving significant cost and energy. We discuss the main challenges in realizing our vision regarding the identification of recurring tasks and the creation of a custom model. Specifically, we argue that model search and transfer learning will play a crucial role in JITR to efficiently identify and fine-tune models for a recurring task. Using our JITR prototype Poodle, we achieve significant savings for exemplary tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a vision for Just-in-Time Model Replacement (JITR) in which recurring tasks are detected from patterns in calls to a large language model and the model is transparently swapped for a cheaper, task-specific alternative discovered via model search and transfer learning. The approach is illustrated by a prototype named Poodle that is reported to deliver significant cost and energy savings on exemplary tasks while preserving the low development effort of LLMs.

Significance. If the core mechanisms of reliable task identification and low-overhead model adaptation can be realized, JITR would provide a practical bridge between the generality of LLMs and the efficiency of smaller models, with clear relevance to data-management workloads that increasingly rely on LLM automation. The explicit framing of model search and transfer learning as enabling technologies is a constructive contribution to making the vision concrete.

major comments (2)
  1. [Abstract / prototype description] The claim that the Poodle prototype 'achieve[s] significant savings for exemplary tasks' is presented without any quantitative results, metrics (e.g., cost reduction percentages, energy measurements, accuracy comparisons), experimental setup, or baseline data. This evidence is load-bearing for the central claim that JITR delivers practical benefits.
  2. [Discussion of main challenges] While task identification from call patterns and the role of model search/transfer learning are identified as key challenges, the manuscript supplies no concrete algorithms, pseudocode, preliminary results, or even a high-level workflow for how recurring tasks are detected or how suitable replacement models are located and adapted. These omissions leave the feasibility argument unsupported.
minor comments (2)
  1. The abstract would be strengthened by naming one or two concrete exemplary tasks (e.g., a specific data-cleaning or query-rewriting pattern) to illustrate the scope of applicability.
  2. Notation for 'recurring task' and 'cheaper alternative' should be introduced more formally when first used, even in a vision paper, to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential relevance of JITR to data-management workloads. We agree that the manuscript would benefit from additional concrete evidence and details to support the central claims and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [Abstract / prototype description] The claim that the Poodle prototype 'achieve[s] significant savings for exemplary tasks' is presented without any quantitative results, metrics (e.g., cost reduction percentages, energy measurements, accuracy comparisons), experimental setup, or baseline data. This evidence is load-bearing for the central claim that JITR delivers practical benefits.

    Authors: We agree that the current abstract and prototype description would be strengthened by quantitative support. In the revised manuscript we will add preliminary experimental results from the Poodle prototype, including specific cost-reduction percentages, energy measurements, accuracy comparisons to the original LLM, a description of the experimental setup, and baseline data on the exemplary tasks. revision: yes

  2. Referee: [Discussion of main challenges] While task identification from call patterns and the role of model search/transfer learning are identified as key challenges, the manuscript supplies no concrete algorithms, pseudocode, preliminary results, or even a high-level workflow for how recurring tasks are detected or how suitable replacement models are located and adapted. These omissions leave the feasibility argument unsupported.

    Authors: The paper is positioned as a vision paper that identifies the core challenges rather than presenting a fully engineered solution. To address the concern, we will augment the challenges section with a high-level workflow diagram, pseudocode sketches for recurring-task detection from call patterns and for model search combined with transfer learning, and preliminary implementation results from the Poodle prototype that illustrate how these steps operate in practice. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a vision paper presenting the JITR concept and a prototype implementation. It contains no equations, derivations, fitted parameters, or predictions that reduce to inputs by construction. The argument is conceptual, discusses challenges explicitly, and relies on exemplary task savings from the prototype without self-referential fitting or load-bearing self-citations that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is introduced; the paper relies on standard domain assumptions that task patterns exist in LLM usage and that smaller models can match performance on narrow tasks.
