AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

Bo Wang; Dawei Yin; Hao Sun; Hengyi Cai; Jiayi Wu; Shuaiqiang Wang; Xiaochi Wei; Yan Zhang; Yue Feng

arxiv: 2410.13181 · v3 · pith:D4GX7OZOnew · submitted 2024-10-17 · 💻 cs.CL

Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin

Pith reviewed 2026-05-23 18:44 UTC · model grok-4.3

keywords: adaptive switching, cloud-local collaboration, LLM agents, introspective error detection, reasoning tasks, computational efficiency, hybrid LLM systems, question answering benchmarks

The pith

A smaller local LLM improves its reasoning by switching to a larger cloud LLM only after detecting its own errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AdaSwitch as a collaborative setup with a local agent powered by a smaller LLM for routine steps and a cloud agent with a larger LLM for harder ones. The local agent uses introspection to spot its own mistakes and then requests help from the cloud agent. This design seeks to combine local efficiency with cloud accuracy for better overall task results at reduced cost. Experiments on seven benchmarks in mathematical reasoning and complex question answering show the local agent gains performance and sometimes matches the cloud agent while using far less compute.

Core claim

AdaSwitch enables a local agent instantiated with a smaller LLM to handle less complex reasoning steps while a cloud agent with a larger LLM manages intricate ones through an adaptive mechanism. The local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, integrating the strengths of both locally-deployed and cloud-based LLMs to enhance task completion performance and efficiency.

What carries the argument

The adaptive switching mechanism driven by the local agent's introspective error identification that triggers requests for cloud agent assistance.
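A minimal sketch of how such a switching loop could be wired, assuming a step-wise agent interface. The `Agent` type, the `solve_with_switching` function, and the `[HELP]` and `Answer:` markers are hypothetical illustrations, not taken from the paper; the toy agents stand in for real LLM calls.

```python
from typing import Callable, List, Tuple

# An agent maps (question, steps so far) to the next reasoning step.
# Hypothetical interface, used only for this illustration.
Agent = Callable[[str, List[str]], str]

HELP = "[HELP]"    # assumed marker the local agent emits when it suspects an error
FINAL = "Answer:"  # assumed prefix of the final step


def solve_with_switching(question: str, local: Agent, cloud: Agent,
                         max_steps: int = 8) -> Tuple[List[str], int]:
    """Run the local agent step by step and defer to the cloud agent on request.

    Returns the accepted steps and the number of cloud calls.
    """
    steps: List[str] = []
    cloud_calls = 0
    for _ in range(max_steps):
        step = local(question, steps)
        if step.startswith(HELP):
            # The local agent flagged its own step, so the cloud agent writes it.
            step = cloud(question, steps)
            cloud_calls += 1
        steps.append(step)
        if step.startswith(FINAL):
            break
    return steps, cloud_calls


# Toy stand-ins for LLM calls.
def toy_local(question: str, steps: List[str]) -> str:
    if len(steps) == 0:
        return "12 apples minus 5 eaten leaves 7"
    if len(steps) == 1:
        return HELP  # the local agent is unsure about the second operation
    return "Answer: 21"


def toy_cloud(question: str, steps: List[str]) -> str:
    return "7 apples tripled gives 21"


steps, cloud_calls = solve_with_switching("toy question", toy_local, toy_cloud)
```

In this toy run only the second step is escalated, so the cloud agent is called once and the local agent finishes the answer. Whether a real small model emits the help request at the right moments is exactly the premise questioned further down.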

If this is right

The local agent's performance improves across the tested reasoning and question-answering tasks.
Results can reach levels competitive with the cloud agent alone on some benchmarks.
Computational overhead drops substantially compared to exclusive use of the cloud agent.
The framework operates effectively with different sizes of LLMs for local and cloud agents.
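The overhead claim can be made concrete with a back-of-the-envelope cost model. The sketch below assumes that every step is first attempted locally and that a fraction of steps is then redone by the cloud agent; the function names and the unit costs are illustrative assumptions, not figures from the paper.

```python
def relative_cost(escalation_rate: float, local_cost: float, cloud_cost: float) -> float:
    """Per-step cost of the hybrid as a fraction of always calling the cloud agent.

    Assumes each step is attempted locally and a fraction `escalation_rate`
    of steps is additionally handled by the cloud agent.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    if local_cost < 0 or cloud_cost <= 0:
        raise ValueError("local_cost must be non-negative and cloud_cost positive")
    hybrid = local_cost + escalation_rate * cloud_cost
    return hybrid / cloud_cost


def break_even_rate(local_cost: float, cloud_cost: float) -> float:
    """Escalation rate above which the hybrid costs more than cloud-only."""
    return 1.0 - local_cost / cloud_cost


# Illustrative numbers only: a local step costs 1 unit, a cloud step 20 units.
ratio = relative_cost(escalation_rate=0.25, local_cost=1.0, cloud_cost=20.0)
```

With these made-up numbers the hybrid costs 30% of the cloud-only setup. The model also shows that savings shrink as the escalation rate rises, which is why the rate of help requests matters as much as their accuracy.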

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The switching logic could extend to other settings where one model must decide when to defer to a more capable but costlier one.
If error detection holds across domains, it offers a route to lower API expenses in deployed LLM services by routing only difficult cases.
The method points toward layered agent systems in which capability differences are managed through self-assessment rather than fixed routing rules.

Load-bearing premise

The smaller local LLM can reliably detect its own reasoning errors to decide when assistance from the cloud agent is needed.

What would settle it

A test set of problems on which the local agent, working alone, produces wrong answers. If the agent consistently fails to detect those errors and does not request the cloud agent, the premise does not hold.

Figures

Figures reproduced from arXiv: 2410.13181 by Bo Wang, Dawei Yin, Hao Sun, Hengyi Cai, Jiayi Wu, Shuaiqiang Wang, Xiaochi Wei, Yan Zhang, Yue Feng.

**Figure 1.** A brief illustration of the AdaSwitch framework, in which the local agent and the cloud agent alternate to collaboratively answer the given question.

**Figure 2.** An illustration of AdaSwitch. 1) Self-Practicing: The local agent practices on the training dataset to build basic reasoning ability. 2) Collaborative Examination: The local agent takes an exam to expose its weaknesses, during which the cloud agent is used to correct its mistakes. 3) Reflective Learning: The local agent is trained on the mistake-correction trajectories generated in the sec…

**Figure 3.** We conduct an ablation study by removing …

**Figure 4.** Cost-effectiveness analysis on the GSM8K dataset. From left to right, the cost of the methods increases; from bottom to top, their accuracy increases.

**Figure 5.** Case studies of solving mathematical reasoning and complex QA reasoning problems. Blue text …

Original abstract

Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning and complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaSwitch's core claim hinges on small models reliably spotting their own errors, but the abstract gives no evidence this works.

The paper's main move is to let a small local LLM run first, have it flag its own mistakes through introspection, and then hand off to a larger cloud model only when needed. That framing for cloud-local collaboration is the piece that stands out from prior size-tradeoff work. They test the setup on seven benchmarks covering math and QA, using different model pairs, and report that the hybrid beats the local baseline and sometimes matches the cloud one at lower cost. That empirical spread is useful to see even if the numbers themselves are not in the abstract. The approach is straightforward to implement on top of existing agents, which is a practical plus.

The soft spot is exactly the one the stress test flags: the local agent's error detection is load-bearing and unproven. Smaller models are not strong at meta-reasoning, yet the paper treats their self-diagnosis as reliable without showing switch accuracy, ablations on the detection prompt, or failure cases. If those checks are missing from the full text, the gains could come from prompting artifacts or implicit fallback rather than true adaptive collaboration.

The citation pattern looks standard for the area. This is the kind of systems paper that a reading group could discuss for the deployment angle, but only if the full version supplies the missing validation on detection reliability. I would send it to review because the idea is concrete and the evaluation scope is reasonable, even though the central mechanism needs tighter evidence before the results can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AdaSwitch, a collaborative framework in which a local agent instantiated with a smaller LLM handles simpler reasoning steps while using introspection to detect its own errors and proactively switch to a cloud agent with a larger LLM for more complex steps. The framework is evaluated across seven benchmarks spanning mathematical reasoning and complex question answering, using various LLM pairs for the local and cloud agents. The central empirical claim is that AdaSwitch improves local-agent performance and can achieve results competitive with the cloud agent at substantially lower computational cost.

Significance. If the adaptive switching mechanism is shown to rest on reliable self-error detection rather than ancillary factors, the work would offer a practical route to balancing LLM performance against inference cost. The multi-benchmark evaluation across different model sizes is a clear strength that would support broader applicability if the load-bearing introspection step is validated.

major comments (2)

[Abstract and §3 (Method)] The claim that the local agent “introspectively identifies errors” is load-bearing for the adaptive mechanism, yet the manuscript supplies neither the precise prompting template used for self-detection nor any quantitative metric (e.g., precision/recall of switch decisions against ground-truth errors) that would confirm the smaller model can perform this meta-reasoning reliably.

[§4 (Experiments)] No ablation or control is reported that isolates the contribution of the adaptive switch (e.g., always-on cloud fallback, random switching, or a non-introspective heuristic). Without these, observed gains cannot be attributed to the claimed collaborative mechanism rather than prompting or dataset effects.
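The precision/recall check requested in the first comment could be computed along these lines. This is a sketch under stated assumptions: it presumes step-level ground-truth labels are available, and the function name `switch_detection_metrics` and the record layout are hypothetical; the sample records are invented for illustration and are not results from the paper.

```python
from typing import Dict, Iterable, Tuple


def switch_detection_metrics(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Score switch decisions against ground-truth step errors.

    Each record is (step_was_wrong, agent_requested_help).
    """
    tp = fp = fn = tn = 0
    for was_wrong, asked in records:
        if was_wrong and asked:
            tp += 1  # error caught, cloud call justified
        elif asked:
            fp += 1  # correct step escalated, unnecessary cloud call
        elif was_wrong:
            fn += 1  # silent failure, error went undetected
        else:
            tn += 1  # correct step kept local
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Invented sample: three wrong steps (two flagged) and three correct steps (one flagged).
records = [(True, True), (True, True), (True, False),
           (False, True), (False, False), (False, False)]
metrics = switch_detection_metrics(records)
```

Low recall would mean errors pass silently, which undermines accuracy; low precision would mean wasted cloud calls, which undermines the cost argument. Reporting both separates the two failure modes.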

minor comments (1)

[Tables] Table captions and axis labels should explicitly state the exact local and cloud model pairs used in each row so that computational-overhead comparisons are immediately interpretable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the validation of our adaptive mechanism. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract and §3 (Method)] The claim that the local agent “introspectively identifies errors” is load-bearing for the adaptive mechanism, yet the manuscript supplies neither the precise prompting template used for self-detection nor any quantitative metric (e.g., precision/recall of switch decisions against ground-truth errors) that would confirm the smaller model can perform this meta-reasoning reliably.

Authors: We agree that the exact prompting template and quantitative metrics for the self-detection step are necessary to substantiate the load-bearing claim. In the revised manuscript we will add the full introspection prompt template to Section 3 and report precision/recall of switch decisions against ground-truth errors on a held-out subset of each benchmark.

Revision: yes

Referee: [§4 (Experiments)] No ablation or control is reported that isolates the contribution of the adaptive switch (e.g., always-on cloud fallback, random switching, or a non-introspective heuristic). Without these, observed gains cannot be attributed to the claimed collaborative mechanism rather than prompting or dataset effects.

Authors: We acknowledge that the current experiments lack controls isolating the adaptive switch. The revised version will include the requested ablations (always-on cloud fallback, random switching, and non-introspective heuristics) across the same model pairs and benchmarks to attribute gains specifically to the introspection-based mechanism.

Revision: yes
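The controls discussed in this exchange can be framed as interchangeable switching policies scored on the same steps. The sketch below is a toy harness, not the authors' evaluation code: the policy names, the `evaluate` function, the step data, and the simplifying assumption that the cloud agent always produces a correct step are all illustrative.

```python
import random
from typing import Callable, Dict, List, Tuple

# One toy step: (local_step_is_correct, local_agent_flagged_it).
Step = Tuple[bool, bool]
# A policy returns True when the step should go to the cloud agent.
Policy = Callable[[Step, random.Random], bool]


def introspective(step: Step, rng: random.Random) -> bool:
    return step[1]  # escalate only when the local agent flagged the step


def always_cloud(step: Step, rng: random.Random) -> bool:
    return True


def never_cloud(step: Step, rng: random.Random) -> bool:
    return False


def random_at_rate(rate: float) -> Policy:
    def policy(step: Step, rng: random.Random) -> bool:
        return rng.random() < rate  # escalate at a fixed rate, ignoring the step
    return policy


def evaluate(policy: Policy, steps: List[Step], seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    correct = calls = 0
    for step in steps:
        use_cloud = policy(step, rng)
        calls += use_cloud
        # Simplifying assumption: the cloud agent always fixes the step.
        correct += use_cloud or step[0]
    return {"accuracy": correct / len(steps), "cloud_rate": calls / len(steps)}


# Invented data: 6 correct and unflagged, 1 correct but flagged,
# 2 wrong and flagged, 1 wrong and unflagged.
steps = [(True, False)] * 6 + [(True, True)] + [(False, True)] * 2 + [(False, False)]

results = {
    "introspective": evaluate(introspective, steps),
    "always_cloud": evaluate(always_cloud, steps),
    "never_cloud": evaluate(never_cloud, steps),
    # Random control matched to the introspective policy's cloud rate.
    "random_matched": evaluate(random_at_rate(0.3), steps),
}
```

The comparison of interest is between the introspective policy and the random policy at a matched cloud rate: if the two reach similar accuracy, the gains come from extra cloud calls rather than from error detection.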

Circularity Check

0 steps flagged

No significant circularity: empirical framework with no derivation chain

The paper describes an empirical collaborative framework evaluated on 7 benchmarks using various LLMs, with performance claims resting on observed task completion improvements rather than any mathematical derivation, equations, or fitted parameters. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the adaptive switching mechanism is presented as an implemented heuristic whose reliability is assessed externally via benchmarks, rendering the work self-contained against independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that smaller LLMs can manage simple steps and that error introspection is feasible; full details of any additional assumptions are unavailable from the abstract alone.

axioms (1)

domain assumption: Smaller local LLMs can handle less complex reasoning steps while larger cloud LLMs manage intricate ones.
This division of labor is stated as the basis for the collaborative framework in the abstract.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 12 internal anchors

[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[4] Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. 2023. Learning from mistakes makes LLM better reasoner. arXiv preprint arXiv:2310.20689.

[5] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[7] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[8] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435.

[9] Tobias Groot and Matias Valdenegro-Toro. 2024. Overconfidence is key: Verbalized uncertainty evaluation in large language and vision-language models. arXiv preprint arXiv:2405.02917.

[10] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.

[11] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2022. Making large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336.

[12] Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, and Lawrence Carin. 2020. MixKD: Towards efficient distillation of large-scale language models. arXiv preprint arXiv:2011.00593.

[13] Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2024. SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks. Advances in Neural Information Processing Systems, 36.

[14] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978.

[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[16] Aman Madaan, Pranjal Aggarwal, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. 2023. AutoMix: Automatically mixing language models. arXiv preprint arXiv:2310.12963.

[17] Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772.

[18] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191.

[19] Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.

[20] Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, and David Sontag. 2024a. Learning to decode collaboratively with multiple language models. arXiv preprint arXiv:2403.03870.

[21] Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. 2024b. Small LLMs are weak tool learners: A multi-LLM agent. arXiv preprint arXiv:2401.07324.

[22] Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, and Fei Huang. 2024. Retrieved in-context principles from previous mistakes. arXiv preprint arXiv:2407.05682.

[23] Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.

[24] Yongqi Tong, Dawei Li, Sizhe Wang, Yujia Wang, Fei Teng, and Jingbo Shang. 2024. Can LLMs learn from previous mistakes? Investigating LLMs' errors to boost for reasoning. arXiv preprint arXiv:2403.20046.

[25] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539-554.

[26] Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, and Timothy Baldwin. 2024. Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651.

[27] Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, Jiahai Wang, Minghui Song, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, et al. 2023. Democratizing reasoning ability: Tailored learning from large language model. arXiv preprint arXiv:2310.13332.

[28] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087-38099. PMLR.

[29] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

[30] Haoyan Yang, Yixuan Wang, Xingyin Xu, Hanyuan Zhang, and Yirong Bian. 2024. Can we trust LLMs? Mitigate overconfidence bias in LLMs through knowledge transfer. arXiv preprint arXiv:2405.16856.

[31] Zeyuan Yang, Peng Li, and Yang Liu. 2023. Failures pave the way: Enhancing large language models through tuning-free rule accumulation. arXiv preprint arXiv:2310.15746.

[32] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

[33] Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Lumos: Learning agents with unified data, modular design, and open-source LLMs. arXiv preprint arXiv:2311.05657.

[34] Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, and Uri Alon. 2024. In-context principle learning from mistakes. arXiv preprint arXiv:2402.05403.
