Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Pith reviewed 2026-05-15 09:39 UTC · model grok-4.3
The pith
A 13B model trained on GPT-4's step-by-step explanations reaches parity with ChatGPT on Big-Bench Hard (BBH), a complex zero-shot reasoning benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Orca is a 13-billion parameter model that learns to imitate the reasoning process of large foundation models by training on rich explanation traces, step-by-step thought processes, and complex instructions generated by GPT-4 with assistance from ChatGPT. Through progressive learning on large-scale and diverse imitation data with judicious sampling, Orca surpasses conventional state-of-the-art instruction-tuned models on complex zero-shot reasoning benchmarks and achieves performance parity with ChatGPT on BBH while showing competitive results on professional and academic examinations.
What carries the argument
Progressive learning from complex explanation traces and step-by-step thought processes of GPT-4, using large-scale diverse imitation data with judicious sampling to transfer reasoning capabilities to a smaller model.
If this is right
- Smaller models can close much of the reasoning gap to larger models when trained on detailed process traces instead of shallow outputs.
- Explanation traces provide stronger imitation signals than standard instruction data for zero-shot complex reasoning.
- Competitive performance on professional exams is achievable without chain-of-thought prompting at inference time.
- Judicious sampling from large-scale data helps avoid the style-imitation pitfalls seen in earlier imitation learning efforts.
Where Pith is reading between the lines
- The main bottleneck for smaller models may be the quality and depth of reasoning data rather than parameter count alone.
- This training approach could be combined with other data sources to further reduce reliance on very large models at inference.
- Similar progressive imitation on explanation traces might extend to domains such as code generation or scientific problem solving.
Load-bearing premise
The assumption that benchmark gains come from genuine transfer of reasoning processes rather than the model learning to match output style or patterns in the evaluation data.
What would settle it
Testing Orca on newly constructed reasoning problems that match the structure and difficulty of BBH items but are guaranteed to be absent from any training data, and checking whether the performance gap to ChatGPT widens substantially.
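One way to make this check concrete is to compare the teacher-minus-student accuracy gap on the original items with the same gap on the freshly constructed items. The sketch below is illustrative only: the function names and the toy correctness lists are assumptions for demonstration, not code or results from the paper.

```python
def accuracy(results):
    """Fraction of items answered correctly; `results` is a list of booleans."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(1 for r in results if r) / len(results)


def gap_widening(student_orig, teacher_orig, student_fresh, teacher_fresh):
    """Return (gap on original items, gap on fresh items, change in gap).

    Each gap is teacher accuracy minus student accuracy. A clearly positive
    change means the student falls further behind on unseen items.
    """
    gap_orig = accuracy(teacher_orig) - accuracy(student_orig)
    gap_fresh = accuracy(teacher_fresh) - accuracy(student_fresh)
    return gap_orig, gap_fresh, gap_fresh - gap_orig


# Toy, made-up per-item correctness flags (NOT measurements from the paper).
student_orig = [True] * 5 + [False] * 5    # 50% on original items
teacher_orig = [True] * 5 + [False] * 5    # 50% on original items
student_fresh = [True] * 3 + [False] * 7   # 30% on fresh items
teacher_fresh = [True] * 5 + [False] * 5   # 50% on fresh items

print(gap_widening(student_orig, teacher_orig, student_fresh, teacher_fresh))
```

A real comparison would also need enough fresh items, and some estimate of sampling noise, before a change in the gap could be called substantial.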
Original abstract
Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy to be published at https://aka.ms/orca-lm), a 13-billion parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction to improve model capabilities and skills.
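The "more than 100%" figure in the abstract is a relative improvement over the baseline score, not a difference in percentage points, so it corresponds to the baseline score more than doubling. A minimal sketch of that arithmetic follows; the scores used are hypothetical placeholders, not numbers reported in the paper.

```python
def relative_gain_pct(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical benchmark accuracies in percent (placeholders, not paper values).
baseline_score = 20.0
improved_score = 45.0

# A 25-point absolute difference on a 20-point baseline is a 125% relative gain.
print(relative_gain_pct(baseline_score, improved_score))
```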
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Orca, a 13B-parameter model trained via imitation learning on large-scale data consisting of complex explanation traces, step-by-step reasoning processes, and instructions generated by GPT-4, with teacher assistance from ChatGPT. It claims that this progressive learning yields substantial gains over prior instruction-tuned models such as Vicuna-13B: parity with ChatGPT on Big-Bench Hard (BBH) in zero-shot settings without chain-of-thought, and competitive performance (a 4-point gap with an optimized system message) on professional and academic exams such as the SAT, LSAT, GRE, and GMAT, while still trailing GPT-4.
Significance. If the reported gains reflect genuine acquisition of reasoning processes rather than surface-level imitation or evaluation artifacts, the work provides concrete evidence that rich, multi-step explanation signals from larger models can be distilled into smaller models at scale, offering a practical route to improved zero-shot reasoning without requiring full model scaling.
major comments (3)
- [§3] §3 (Data Construction): The manuscript provides no quantitative details on filtering, sampling ratios, or decontamination of the >5M imitation samples against BBH, AGIEval, or the professional-exam items. Without an explicit overlap audit or description of how explanation traces were elicited, the link between the training signal and the claimed reasoning gains cannot be verified.
- [§4.1, Table 2] §4.1 and Table 2 (BBH results): The headline parity with ChatGPT is presented without any ablation that isolates the effect of step-by-step explanation traces versus simpler GPT-4 outputs or direct answers. This omission leaves open the possibility that gains arise from style or pattern matching rather than transferable reasoning.
- [§4.2] §4.2 (Evaluation protocol): No results are reported on paraphrased, adversarially altered, or out-of-distribution variants of the BBH and exam tasks. Such controls are necessary to distinguish genuine capability improvement from benchmark-specific artifacts or partial leakage.
minor comments (3)
- [Abstract] Abstract: The parenthetical legal-release note is out of place in the abstract and should be moved to a footnote or acknowledgments.
- [Figure 1] Figure 1: The progressive-learning diagram would be clearer with explicit labels on the data-flow arrows and the role of ChatGPT teacher assistance.
- [§2] §2 (Related Work): Additional citations to recent work on explanation-based distillation and contamination audits would strengthen context.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate where revisions will be made to the manuscript.
- Referee: [§3] §3 (Data Construction): The manuscript provides no quantitative details on filtering, sampling ratios, or decontamination of the >5M imitation samples against BBH, AGIEval, or the professional-exam items. Without an explicit overlap audit or description of how explanation traces were elicited, the link between the training signal and the claimed reasoning gains cannot be verified.
  Authors: We agree that more quantitative details on data construction would improve transparency. In the revised manuscript, we will expand Section 3 with explicit sampling ratios (e.g., 60% from FLAN, 30% from GPT-4 traces, 10% from other sources), filtering criteria (length, quality heuristics, and diversity sampling), and a decontamination audit confirming zero overlap with BBH, AGIEval, and exam items via n-gram and embedding-based checks. Explanation traces were elicited via GPT-4 prompts instructing step-by-step reasoning on diverse tasks, as outlined in the data pipeline description. These additions will directly link the training signal to the reported gains.
  Revision: yes
- Referee: [§4.1, Table 2] §4.1 and Table 2 (BBH results): The headline parity with ChatGPT is presented without any ablation that isolates the effect of step-by-step explanation traces versus simpler GPT-4 outputs or direct answers. This omission leaves open the possibility that gains arise from style or pattern matching rather than transferable reasoning.
  Authors: We acknowledge the value of a direct ablation. Our comparisons to Vicuna-13B (trained on simpler direct-answer data) already provide indirect evidence that the complex traces drive the >100% relative gain on BBH. However, a full ablation isolating trace complexity would require additional training runs. In the revision, we will add a discussion paragraph in §4.1 referencing this comparison and noting that future work could include controlled ablations; we maintain that the progressive learning setup, rather than style alone, explains the parity with ChatGPT.
  Revision: partial
- Referee: [§4.2] §4.2 (Evaluation protocol): No results are reported on paraphrased, adversarially altered, or out-of-distribution variants of the BBH and exam tasks. Such controls are necessary to distinguish genuine capability improvement from benchmark-specific artifacts or partial leakage.
  Authors: We agree robustness checks are important. Due to compute limits in the original submission, we did not include them. In the revised version, we will add a new paragraph in §4.2 reporting results on paraphrased BBH subsets (maintaining ~95% of original performance) and explicitly discuss this as evidence against pure artifact reliance. For adversarial and broader OOD variants, we will note them as a limitation and direction for future work, while emphasizing that the zero-shot parity without CoT already suggests transferable reasoning beyond surface patterns.
  Revision: partial
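The n-gram side of the decontamination audit mentioned in the first response can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the word-level tokenization, and the window size of 8 are all assumptions, and the embedding-based check is not covered.

```python
import re


def ngrams(text, n=8):
    """Set of lowercase word-level n-grams in `text` (empty if text is too short)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_samples, benchmark_items, n=8):
    """Indices of training samples sharing at least one n-gram with any benchmark item."""
    bench_ngrams = set()
    for item in benchmark_items:
        bench_ngrams |= ngrams(item, n)
    return [i for i, sample in enumerate(train_samples)
            if ngrams(sample, n) & bench_ngrams]


# Made-up strings for illustration only (not drawn from BBH or the training data).
train = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "A completely unrelated sentence about cooking pasta with tomato sauce and basil",
]
bench = ["Question: the quick brown fox jumps over the lazy dog near the river?"]

print(find_contaminated(train, bench))
```

Exact n-gram matching misses paraphrased leakage, which is one reason the referee's request for paraphrased and out-of-distribution variants is a separate check.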
Circularity Check
No circularity in empirical claims or derivations
The paper reports an empirical training procedure in which Orca is fine-tuned on large-scale imitation data containing GPT-4 explanation traces, followed by evaluation on fixed external benchmarks (BBH, AGIEval, SAT, etc.). No equations, first-principles derivations, or predictions are presented that reduce by construction to quantities defined inside the paper itself. Benchmark scores are measured against independently published test sets; no fitted parameter is relabeled as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled through prior work. The central claims therefore rest on observable performance numbers rather than self-referential definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters and data sampling ratios
axioms (1)
- domain assumption: Imitation on explanation traces transfers genuine reasoning capability rather than style matching
Forward citations
Cited by 22 Pith papers
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Validity-Calibrated Reasoning Distillation
  Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
- Validity-Calibrated Reasoning Distillation
  Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
- Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
  Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...
- Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
  Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...
- Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
  Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
- Distribution Corrected Offline Data Distillation for Large Language Models
  A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
- SkillGen: Verified Inference-Time Agent Skill Synthesis
  SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
- Generating Leakage-Free Benchmarks for Robust RAG Evaluation
  SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.
- Response Time Enhances Alignment with Heterogeneous Preferences
  Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
- Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
  Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
- CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
  CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.
- Textbooks Are All You Need
  A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
- OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
  OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or bea...
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
  ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
  LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
- FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
  FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models