FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
Pith reviewed 2026-05-11 19:50 UTC · model grok-4.3
The pith
FrugalGPT learns to select LLM combinations for each query to match or beat the best single model's accuracy at much lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FrugalGPT is a flexible instantiation of the LLM cascade strategy that learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. The paper's experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost.
What carries the argument
The LLM cascade mechanism, which routes queries to one or more LLMs chosen to balance accuracy and cost, with FrugalGPT learning the routing policy from query and performance data.
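The cascade loop itself is simple enough to sketch. Below is a minimal illustrative version, not the paper's actual implementation: the model names, prices, scorer, and acceptance threshold are all placeholders. The idea is to query the cheapest model first, accept its answer if a scorer is confident enough, and otherwise escalate to the next model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_query: float            # illustrative flat price in dollars
    generate: Callable[[str], str]   # stand-in for an API call

def cascade_answer(query, models, scorer, threshold=0.8):
    """Query models from cheapest to most expensive; accept the first
    answer whose scorer confidence clears the threshold."""
    total_cost = 0.0
    answer = None
    for model in sorted(models, key=lambda m: m.cost_per_query):
        answer = model.generate(query)
        total_cost += model.cost_per_query
        if scorer(query, answer) >= threshold:
            break  # confident enough: skip the more expensive models
    return answer, total_cost
```

In the paper, the scorer is learned from query and performance data and the per-model thresholds are optimized jointly under a cost budget; here both are fixed placeholders.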
If this is right
- Organizations querying LLMs at scale can achieve equivalent performance without incurring the full cost of premium models.
- Users can exploit the heterogeneous pricing of different LLM providers by selectively routing queries.
- The cascade approach can be combined with prompt adaptation and model approximation for additional savings.
- LLM usage becomes more sustainable for large collections of queries and text processing tasks.
Where Pith is reading between the lines
- Routing decisions might need updating if the types of queries shift substantially from the training data.
- The method could extend to choosing among open-source models hosted locally versus paid APIs.
- Similar cascading could apply to other paid AI services like image generators with varying costs and qualities.
Load-bearing premise
A router trained on queries and performance data from one set will continue to pick good LLM combinations for new queries whose cost-accuracy profiles are similar to the training distribution.
What would settle it
Run FrugalGPT on a new collection of queries that differ in topic or complexity from the training queries, such as moving from general web queries to domain-specific technical questions, and check if the claimed cost savings and accuracy levels are maintained.
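Such a check reduces to re-running a frozen router on the shifted query set and comparing accuracy and cost against the in-distribution numbers. A minimal evaluation harness might look like the sketch below, with exact-match correctness as a stand-in for the paper's task metrics:

```python
def evaluate_cascade(router, queries, labels):
    """Return (accuracy, total_cost) of a routing policy on a query set.
    `router` maps a query to (answer, cost); correctness is exact match."""
    correct = 0
    total_cost = 0.0
    for query, label in zip(queries, labels):
        answer, cost = router(query)
        correct += (answer == label)
        total_cost += cost
    return correct / len(queries), total_cost

# Evaluate the same frozen router on both query pools:
#   acc_in,  cost_in  = evaluate_cascade(router, web_queries,    web_labels)
#   acc_out, cost_out = evaluate_cascade(router, domain_queries, domain_labels)
# A large drop in acc_out at similar cost would falsify the premise.
```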
Original abstract
There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reviews the heterogeneous costs of popular LLM APIs and proposes three high-level strategies for cost reduction: prompt adaptation, LLM approximation, and LLM cascades. As a concrete instantiation, it introduces FrugalGPT, which trains a router to select per-query cascades of LLMs. Experiments on real tasks show that FrugalGPT can match the accuracy of the strongest single model (GPT-4) with up to 98% cost reduction or improve accuracy by 4% at the same cost as GPT-4.
Significance. The work has clear practical significance for sustainable LLM usage if the router generalizes. It supplies concrete, task-level numbers rather than abstract bounds, reviews real API pricing, and demonstrates a simple, trainable cascade that improves upon single-model baselines. The empirical results on held-out data from the training distributions constitute a solid starting point for cost-aware inference research.
major comments (2)
- Section 4 (Experiments): The router is trained and evaluated exclusively on held-out splits drawn from the same query pools used to collect performance labels. No cross-domain, temporal-shift, or out-of-distribution experiments are reported, which directly bears on whether the reported 98% cost reduction or +4% accuracy gains will hold for new queries whose cost-accuracy profiles differ from the training distribution.
- Section 3.2 (Router training): The paper does not provide an ablation on the router's input features or on the sensitivity of the learned policy to the exact set of training queries. This makes it hard to determine how much of the headline gains are due to the cascade structure versus query-specific overfitting.
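For intuition on what such an ablation would probe, here is a toy acceptance scorer: a from-scratch logistic regression mapping query/answer features to an accept probability. This is a stand-in sketch, not the paper's learned router, and the features, learning rate, and epoch count are all illustrative choices.

```python
import numpy as np

def train_accept_scorer(features, accepted, lr=0.5, epochs=2000):
    """Fit a tiny logistic-regression scorer: given query/answer features,
    predict the probability that a model's answer should be accepted."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(accepted, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logit
        grad = p - y                            # dL/dlogit for log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return lambda x: 1.0 / (1.0 + np.exp(-(np.asarray(x) @ w + b)))
```

A feature ablation would retrain this scorer with individual feature columns removed and compare the resulting cascade's cost-accuracy curve, isolating how much each input contributes to the headline gains.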
minor comments (3)
- The introduction would benefit from a short table summarizing the three strategy categories and their key trade-offs before diving into FrugalGPT.
- Figure 2 (or equivalent results figure): axis labels and legend entries should explicitly state whether cost is measured in dollars per 1k tokens or total dollars for the evaluated query set.
- A brief discussion of how the router would be retrained or updated when new LLMs become available would clarify the method's long-term practicality.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address each major comment below and indicate revisions to the manuscript.
Point-by-point responses
-
Referee: Section 4 (Experiments): The router is trained and evaluated exclusively on held-out splits drawn from the same query pools used to collect performance labels. No cross-domain, temporal-shift, or out-of-distribution experiments are reported, which directly bears on whether the reported 98% cost reduction or +4% accuracy gains will hold for new queries whose cost-accuracy profiles differ from the training distribution.
Authors: We agree this is a valid limitation of the current evaluation. The experiments use held-out splits from the same real-world task distributions to demonstrate feasibility and concrete cost-accuracy tradeoffs. In the revised manuscript we have added an explicit limitations paragraph in Section 4 and the conclusion that notes the in-distribution focus and calls for future cross-domain and temporal-shift studies. We maintain that the multi-task results still offer a solid empirical foundation for the cascade approach within comparable query regimes. revision: partial
-
Referee: Section 3.2 (Router training): The paper does not provide an ablation on the router's input features or on the sensitivity of the learned policy to the exact set of training queries. This makes it hard to determine how much of the headline gains are due to the cascade structure versus query-specific overfitting.
Authors: We acknowledge the absence of these ablations in the original submission. The router employs lightweight, query-derived features chosen for practicality across API calls. To strengthen the analysis, the revised version includes a new sensitivity study (added to Section 3.2 and the appendix) that retrains the router on multiple random subsets of the training queries and reports stable performance, supporting that gains arise primarily from the learned cascade policy rather than overfitting to a specific query set. A full feature ablation is noted as future work given space constraints but is not required to substantiate the core claims. revision: partial
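The subset-retraining study described above can be sketched generically. In the sketch below, `fit` and `evaluate` are user-supplied stand-ins; the paper's actual router trainer and task metrics are not reproduced here.

```python
import random

def subset_sensitivity(train_queries, train_labels, fit, evaluate,
                       n_runs=5, frac=0.8, seed=0):
    """Retrain a router on several random subsets of the training queries
    and report the spread of held-out scores. `fit(queries, labels)`
    returns a router; `evaluate(router)` returns its held-out score."""
    rng = random.Random(seed)
    scores = []
    n = int(frac * len(train_queries))
    for _ in range(n_runs):
        idx = rng.sample(range(len(train_queries)), n)
        router = fit([train_queries[i] for i in idx],
                     [train_labels[i] for i in idx])
        scores.append(evaluate(router))
    return min(scores), max(scores)  # a tight range suggests stability
```

A tight min-max range across subsets would support the rebuttal's claim that the gains come from the cascade policy rather than overfitting to a particular query set.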
Circularity Check
No significant circularity: FrugalGPT's claims rest on empirical router evaluation, not on definitional or self-referential reduction.
full rationale
The paper trains a router on query features and LLM performance/cost labels collected from a fixed query pool, then evaluates the resulting cascade selections on held-out splits drawn from the same pool. This is a standard train/test split in supervised learning; the reported accuracy and cost numbers are measured outcomes on unseen queries rather than quantities forced by the training procedure itself. No equations define the router output to equal its own training targets, no uniqueness theorem is invoked via self-citation, and no ansatz or renaming is smuggled in. The central performance claims (98% cost reduction or +4% accuracy) are therefore falsifiable experimental results, not tautologies.
Axiom & Free-Parameter Ledger
free parameters (1)
- Router model parameters
axioms (1)
- domain assumption: Individual LLM cost and accuracy are sufficiently consistent across queries to allow a router trained on past data to generalize.
Forward citations
Cited by 48 Pith papers
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
A Regime Theory of Controller Class Selection for LLM Action Decisions
A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on mul...
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems
The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.
-
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...
-
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
-
Two-dimensional early exit optimisation of LLM inference
Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
-
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
SpecKV uses a small MLP trained on draft model confidence and entropy to dynamically choose the optimal speculation length gamma, achieving 56% better performance than fixed gamma=4 across various tasks and compressio...
-
Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines
Agent Capsules is an adaptive runtime for multi-agent LLM pipelines that selectively compounds agent executions under quality constraints, delivering 19-68% token reductions at parity or better quality versus LangGrap...
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
-
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
Privacy-Preserving LLMs Routing
PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification
L2D-Clinical improves F1 by 1.7 points on ADE detection and 9.3 points on MIMIC treatment classification by deferring 7-17% of cases from BERT to LLM.
-
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories
A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.
-
SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
SARE adaptively switches between fast retrieval and self-reflective reasoning for training-free fine-grained visual recognition, claiming state-of-the-art accuracy on 14 datasets with substantially lower computation.
-
RouteLLM: Learning to Route LLMs with Preference Data
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
-
Complexity Horizons of Compressed Models in Analog Circuit Analysis
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
-
Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents
Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.
-
A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.
-
Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench
TF-IDF SVM routing on RAGRouter-Bench reaches 0.928 macro F1 and 93.2 percent accuracy while simulating 28.1 percent token savings, outperforming sentence embeddings.
-
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
-
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
-
Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.
-
Qualixar OS: A Universal Operating System for AI Agent Orchestration
Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...
-
Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.
-
Latency and Cost of Multi-Agent Intelligent Tutoring at Scale
Priority PayGo keeps multi-agent tutoring responses under 4 seconds even at 50 concurrent users, while costs stay below textbook prices per student.
-
Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench
Pangu-ACE improves educational response quality on EduBench from 0.457 to 0.538 and format validity from 0.707 to 0.866 by routing 19.7% of samples to a 1B model while escalating the rest to 7B.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.