pith. machine review for the scientific record.

arxiv: 2305.05176 · v1 · submitted 2023-05-09 · 💻 cs.LG · cs.AI · cs.CL · cs.SE

Recognition: 1 theorem link

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 19:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.SE
keywords large language models · cost reduction · LLM cascade · query routing · API pricing · inference optimization · FrugalGPT

The pith

FrugalGPT learns to select LLM combinations for each query to match or beat the best single model's accuracy at much lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model APIs vary widely in price, making bulk use expensive. The paper outlines three ways to cut costs: adapting prompts, approximating expensive models, and cascading multiple models. FrugalGPT implements the cascade by training a system to pick the right LLM or sequence for each input based on past performance data. A sympathetic reader would care because this could make advanced AI capabilities more accessible and sustainable for high-volume applications. Experiments indicate the system can match GPT-4 level results with 98 percent lower cost or gain 4 percent accuracy at equal cost.
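The abstract's observation that API fees differ by two orders of magnitude is easy to make concrete with back-of-envelope arithmetic. The sketch below uses hypothetical per-1k-token prices for three illustrative tiers (not actual 2023 API rates) to show how the same bulk workload spans a 100x cost range:

```python
# Back-of-envelope cost comparison for a bulk workload, using
# hypothetical per-1k-token prices (not actual API rates).
PRICES_PER_1K_TOKENS = {
    "premium-llm": 0.06,    # assumed flagship-tier price
    "mid-llm":     0.002,   # assumed mid-tier price
    "budget-llm":  0.0006,  # assumed budget-tier price
}

def workload_cost(model: str, queries: int, tokens_per_query: int) -> float:
    """Total dollars to run `queries` queries of `tokens_per_query` tokens each."""
    return PRICES_PER_1K_TOKENS[model] * tokens_per_query / 1000 * queries

# One million 500-token queries: the same job spans two orders of magnitude.
for model in PRICES_PER_1K_TOKENS:
    print(f"{model}: ${workload_cost(model, 1_000_000, 500):,.0f}")
```

Under these assumed prices the premium tier costs 100x the budget tier for an identical workload, which is the gap the three cost-reduction strategies aim to exploit.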

Core claim

FrugalGPT is a flexible instantiation of the LLM cascade strategy that learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost.

What carries the argument

The LLM cascade mechanism, which routes queries to one or more LLMs chosen to balance accuracy and cost, with FrugalGPT learning the routing policy from query and performance data.
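The cascade idea above can be sketched in a few lines: try cheaper models first and escalate only when a learned scorer distrusts the answer. The model list, the `call_llm` stub, the scorer, and the threshold values below are illustrative assumptions, not FrugalGPT's actual components:

```python
# Minimal LLM-cascade sketch: cheapest model first, escalate on low score.
from typing import Callable

# Hypothetical cascade order with an assumed dollar cost per call.
CASCADE = [("budget-llm", 0.0003), ("mid-llm", 0.001), ("premium-llm", 0.03)]

def call_llm(model: str, query: str) -> str:
    """Stand-in for a real API call."""
    return f"<answer from {model}>"

def cascade_answer(query: str,
                   scorer: Callable[[str, str], float],
                   thresholds: list[float]) -> tuple[str, float]:
    """Return (answer, total_cost); stop at the first answer the scorer accepts."""
    total = 0.0
    answer = ""
    for (model, cost), tau in zip(CASCADE, thresholds):
        answer = call_llm(model, query)
        total += cost
        if scorer(query, answer) >= tau:  # learned acceptance score vs. threshold
            break
    return answer, total

# Toy scorer that always trusts the first model: only the cheapest call is paid.
ans, spent = cascade_answer("What is 2+2?", lambda q, a: 1.0, [0.9, 0.9, 0.0])
```

In this framing, "learning the routing policy" amounts to fitting the scorer and thresholds on past query and performance data so that easy queries stop early and hard ones escalate.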

If this is right

  • Organizations querying LLMs at scale can achieve equivalent performance without incurring the full cost of premium models.
  • Users can exploit the heterogeneous pricing of different LLM providers by selectively routing queries.
  • The cascade approach can be combined with prompt adaptation and model approximation for additional savings.
  • LLM usage becomes more sustainable for large collections of queries and text processing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Routing decisions might need updating if the types of queries shift substantially from the training data.
  • The method could extend to choosing among open-source models hosted locally versus paid APIs.
  • Similar cascading could apply to other paid AI services like image generators with varying costs and qualities.

Load-bearing premise

A router trained on queries and performance data from one set will continue to pick good LLM combinations for new queries whose cost-accuracy profiles are similar to the training distribution.

What would settle it

Run FrugalGPT on a new collection of queries that differ in topic or complexity from the training queries, such as moving from general web queries to domain-specific technical questions, and check if the claimed cost savings and accuracy levels are maintained.
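The settling experiment above reduces to a simple harness: evaluate the same routing policy on an in-distribution query set and on a shifted one, then compare accuracy and cost. Everything here (`route`, the query sets, the labels, and the tolerance parameters) is a hypothetical placeholder for illustration:

```python
# Sketch of a distribution-shift check for a query-routing policy.
def evaluate(route, queries, labels):
    """Return (accuracy, total_cost) of a routing policy on labeled queries."""
    correct, cost = 0, 0.0
    for q, gold in zip(queries, labels):
        answer, spent = route(q)
        correct += int(answer == gold)
        cost += spent
    return correct / len(queries), cost

def check_shift(route, in_dist, shifted, max_acc_drop=0.02, max_cost_ratio=1.5):
    """Flag whether claimed savings survive a topic/complexity shift.

    `in_dist` and `shifted` are (queries, labels) pairs; the tolerances are
    arbitrary illustrative choices.
    """
    acc_in, cost_in = evaluate(route, *in_dist)
    acc_out, cost_out = evaluate(route, *shifted)
    holds = (acc_in - acc_out) <= max_acc_drop and cost_out <= max_cost_ratio * cost_in
    return holds, (acc_in, acc_out, cost_in, cost_out)
```

A negative result from such a harness would localize the failure: accuracy dropping while cost stays flat points at the router's acceptance scores, while cost ballooning points at over-escalation on the shifted queries.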

Original abstract

There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper reviews the heterogeneous costs of popular LLM APIs and proposes three high-level strategies for cost reduction: prompt adaptation, LLM approximation, and LLM cascades. As a concrete instantiation, it introduces FrugalGPT, which trains a router to select per-query cascades of LLMs. Experiments on real tasks show that FrugalGPT can match the accuracy of the strongest single model (GPT-4) with up to 98% cost reduction or improve accuracy by 4% at the same cost as GPT-4.

Significance. The work has clear practical significance for sustainable LLM usage if the router generalizes. It supplies concrete, task-level numbers rather than abstract bounds, reviews real API pricing, and demonstrates a simple, trainable cascade that improves upon single-model baselines. The empirical results on held-out data from the training distributions constitute a solid starting point for cost-aware inference research.

major comments (2)
  1. Section 4 (Experiments): The router is trained and evaluated exclusively on held-out splits drawn from the same query pools used to collect performance labels. No cross-domain, temporal-shift, or out-of-distribution experiments are reported, which directly bears on whether the reported 98% cost reduction or +4% accuracy gains will hold for new queries whose cost-accuracy profiles differ from the training distribution.
  2. Section 3.2 (Router training): The paper does not provide an ablation on the router's input features or on the sensitivity of the learned policy to the exact set of training queries. This makes it hard to determine how much of the headline gains are due to the cascade structure versus query-specific overfitting.
minor comments (3)
  1. The introduction would benefit from a short table summarizing the three strategy categories and their key trade-offs before diving into FrugalGPT.
  2. Figure 2 (or equivalent results figure): axis labels and legend entries should explicitly state whether cost is measured in dollars per 1k tokens or total dollars for the evaluated query set.
  3. A brief discussion of how the router would be retrained or updated when new LLMs become available would clarify the method's long-term practicality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below and indicate revisions to the manuscript.

Point-by-point responses
  1. Referee: Section 4 (Experiments): The router is trained and evaluated exclusively on held-out splits drawn from the same query pools used to collect performance labels. No cross-domain, temporal-shift, or out-of-distribution experiments are reported, which directly bears on whether the reported 98% cost reduction or +4% accuracy gains will hold for new queries whose cost-accuracy profiles differ from the training distribution.

    Authors: We agree this is a valid limitation of the current evaluation. The experiments use held-out splits from the same real-world task distributions to demonstrate feasibility and concrete cost-accuracy tradeoffs. In the revised manuscript we have added an explicit limitations paragraph in Section 4 and the conclusion that notes the in-distribution focus and calls for future cross-domain and temporal-shift studies. We maintain that the multi-task results still offer a solid empirical foundation for the cascade approach within comparable query regimes. revision: partial

  2. Referee: Section 3.2 (Router training): The paper does not provide an ablation on the router's input features or on the sensitivity of the learned policy to the exact set of training queries. This makes it hard to determine how much of the headline gains are due to the cascade structure versus query-specific overfitting.

    Authors: We acknowledge the absence of these ablations in the original submission. The router employs lightweight, query-derived features chosen for practicality across API calls. To strengthen the analysis, the revised version includes a new sensitivity study (added to Section 3.2 and the appendix) that retrains the router on multiple random subsets of the training queries and reports stable performance, supporting that gains arise primarily from the learned cascade policy rather than overfitting to a specific query set. A full feature ablation is noted as future work given space constraints but is not required to substantiate the core claims. revision: partial
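The subset-sensitivity study the rebuttal describes can be sketched as follows: retrain the router on several random subsets of the training queries and measure how much test performance varies. The toy "router" below (a majority-label memorizer) is a hypothetical stand-in for the real trainable model:

```python
# Sketch of a training-subset sensitivity study for a learned router.
import random

def train_router(examples):
    """Toy 'router': memorize the majority label (stand-in for real training)."""
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def subset_sensitivity(examples, test, n_trials=5, frac=0.8, seed=0):
    """Retrain on random subsets; return the spread of test accuracy."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_trials):
        subset = rng.sample(examples, int(frac * len(examples)))
        model = train_router(subset)
        accs.append(sum(model == y for _, y in test) / len(test))
    return max(accs) - min(accs)  # small spread suggests a stable policy
```

A small accuracy spread across retrains supports the rebuttal's claim that the gains come from the cascade policy rather than overfitting to one particular training set; a large spread would point the other way.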

Circularity Check

0 steps flagged

No significant circularity: FrugalGPT claims rest on empirical router evaluation, not definitional or self-referential reduction.

Full rationale

The paper trains a router on query features and LLM performance/cost labels collected from a fixed query pool, then evaluates the resulting cascade selections on held-out splits drawn from the same pool. This is a standard train/test split in supervised learning; the reported accuracy and cost numbers are measured outcomes on unseen queries rather than quantities forced by the training procedure itself. No equations define the router output to equal its own training targets, no uniqueness theorem is invoked via self-citation, and no ansatz or renaming is smuggled in. The central performance claims (98% cost reduction or +4% accuracy) are therefore falsifiable experimental results, not tautologies.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of a trained router plus the assumption that LLM cost and accuracy profiles are stable enough to be learned from a finite training set. No new physical entities or untestable mathematical axioms are introduced.

free parameters (1)
  • Router model parameters
    Weights of the small model that decides which LLM(s) to call for each query; fitted on performance data.
axioms (1)
  • domain assumption Individual LLM cost and accuracy are sufficiently consistent across queries to allow a router trained on past data to generalize.
    Invoked when claiming that the learned policy will deliver the reported savings on new queries.

pith-pipeline@v0.9.0 · 5521 in / 1273 out tokens · 46775 ms · 2026-05-11T19:50:29.170892+00:00 · methodology

discussion (0)


Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  3. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  4. A Regime Theory of Controller Class Selection for LLM Action Decisions

    cs.AI 2026-05 unverdicted novelty 7.0

    A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on mul...

  5. Inference-Time Budget Control for LLM Search Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.

  6. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  7. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  8. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  9. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  10. Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems

    cs.AI 2026-05 conditional novelty 7.0

    The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.

  11. RouteProfile: Elucidating the Design Space of LLM Profiles for Routing

    cs.NI 2026-04 unverdicted novelty 7.0

    RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...

  12. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  13. Two-dimensional early exit optimisation of LLM inference

    cs.CL 2026-03 unverdicted novelty 7.0

    Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.

  14. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  15. GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.

  16. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  17. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  18. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  19. SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

  20. SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

    cs.LG 2026-05 conditional novelty 6.0

    SpecKV uses a small MLP trained on draft model confidence and entropy to dynamically choose the optimal speculation length gamma, achieving 56% better performance than fixed gamma=4 across various tasks and compressio...

  21. Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines

    cs.CL 2026-05 unverdicted novelty 6.0

    Agent Capsules is an adaptive runtime for multi-agent LLM pipelines that selectively compounds agent executions under quality constraints, delivering 19-68% token reductions at parity or better quality versus LangGrap...

  22. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  23. ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

    cs.LG 2026-04 unverdicted novelty 6.0

    ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.

  24. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  25. Privacy-Preserving LLMs Routing

    cs.CR 2026-04 unverdicted novelty 6.0

    PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.

  26. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

  27. L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

    cs.CL 2026-04 unverdicted novelty 6.0

    L2D-Clinical improves F1 by 1.7 points on ADE detection and 9.3 points on MIMIC treatment classification by deferring 7-17% of cases from BERT to LLM.

  28. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  29. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  30. ExecTune: Effective Steering of Black-Box LLMs with Guide Models

    cs.LG 2026-04 unverdicted novelty 6.0

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...

  31. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  32. Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

    cs.CY 2026-04 conditional novelty 6.0

    A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.

  33. SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

    cs.CV 2026-03 unverdicted novelty 6.0

    SARE adaptively switches between fast retrieval and self-reflective reasoning for training-free fine-grained visual recognition, claiming state-of-the-art accuracy on 14 datasets with substantially lower computation.

  34. RouteLLM: Learning to Route LLMs with Preference Data

    cs.LG 2024-06 unverdicted novelty 6.0

    Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.

  35. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  36. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.

  37. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  38. Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.

  39. A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

    cs.DC 2026-04 unverdicted novelty 5.0

    A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.

  40. Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench

    cs.IR 2026-04 unverdicted novelty 5.0

    TF-IDF SVM routing on RAGRouter-Bench reaches 0.928 macro F1 and 93.2 percent accuracy while simulating 28.1 percent token savings, outperforming sentence embeddings.

  41. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG 2026-03 unverdicted novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

  42. Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol

    cs.DC 2026-03 unverdicted novelty 5.0

    An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.

  43. Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

    cs.SE 2026-04 accept novelty 4.0

    Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.

  44. Qualixar OS: A Universal Operating System for AI Agent Orchestration

    cs.AI 2026-04 unverdicted novelty 4.0

    Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...

  45. Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

    cs.CY 2026-03 unverdicted novelty 4.0

    Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.

  46. Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

    cs.CY 2026-04 unverdicted novelty 3.0

    Priority PayGo keeps multi-agent tutoring responses under 4 seconds even at 50 concurrent users, while costs stay below textbook prices per student.

  47. Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench

    cs.CL 2026-04 unverdicted novelty 3.0

    Pangu-ACE improves educational response quality on EduBench from 0.457 to 0.538 and format validity from 0.707 to 0.866 by routing 19.7% of samples to a 1B model while escalating the rest to 7B.

  48. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 47 Pith papers · 6 internal anchors

  1. [ANC+22] Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel J. Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022.
  2. [BGMMS21] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
  3. [BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  4. [CCZZ21] Lingjiao Chen, Tracy Cai, Matei Zaharia, and James Zou. Did the model change? Efficiently assessing machine learning API shifts. arXiv preprint arXiv:2107.14203, 2021.
  5. [Cha] ChatGPT announcement. https://openai.com/blog/chatgpt. Accessed: 2023-03-31.
  6. [CoH] Cohere LLM API. https://cohere.com/. Accessed: 2023-03-31.
  7. [DGSG22] Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing complex questions. arXiv preprint arXiv:2212.04092, 2022.
  8. [FFA] Forefront AI LLM API. https://beta.forefront.ai/. Accessed: 2023-03-31.
  9. [GDMR22] Ashit Gupta, Anirudh Deodhar, Tathagata Mukherjee, and Venkataramana Runkana. Semi-supervised cascaded clustering for classification of noisy label data. arXiv preprint arXiv:2205.02209, 2022.
  10. [HMD15] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  11. [KFA23] Eldar Kurtic, Elias Frantar, and Dan Alistarh. ZipLM: Hardware-aware structured pruning of language models. arXiv preprint arXiv:2302.04089, 2023.
  12. [KSL+22] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024, 2022.
  13. [KTF+22] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
  14. [LLL+21] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. arXiv preprint arXiv:2110.08387, 2021.
  15. [LSZ+21] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804, 2021.
  16. [MDL+23] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
  17. [Ope23] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  18. [SDCW19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  19. [SK21] Ankur Sinha and Tanmay Khandait. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2, pages 589–601. Springer, 2021.
  20. [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  21. [WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
  22. [XLS+22] Guangxuan Xiao, Ji Lin, Mickael Seznec, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.
  23. [YLW+23] Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, and Yuxiong He. A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302, 2023.
  24. [ZSH+22] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.