Pith · machine review for the scientific record

arxiv: 2304.12244 · v3 · submitted 2023-04-24 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords instruction tuning · large language models · evol-instruct · fine-tuning · complex instructions · LLaMA · instruction following

The pith

Evolving instructions with an LLM produces training data that lets a fine-tuned LLaMA rival ChatGPT on complex tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs can generate large amounts of high-complexity instruction data by iteratively rewriting simpler starting instructions. This Evol-Instruct process replaces slow, limited human creation and yields instructions that human evaluators rate higher than human-written ones. Fine-tuning LLaMA on the mixed dataset creates WizardLM, whose outputs humans prefer over ChatGPT's in high-complexity cases. GPT-4 automatic scoring places WizardLM at more than 90 percent of ChatGPT's capacity on 17 of 29 skills. The results indicate that AI-driven evolution of instructions offers a scalable route to stronger open instruction-following models.

Core claim

Starting from an initial set of instructions, Evol-Instruct rewrites them step by step into more complex versions using an LLM. The generated instructions of varying complexity are mixed and used to fine-tune LLaMA, producing WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's test set show that Evol-Instruct instructions outperform human-created ones. On the high-complexity subset, WizardLM outputs are preferred to those from OpenAI ChatGPT, while GPT-4 evaluation finds WizardLM reaching more than 90 percent of ChatGPT's capacity on 17 out of 29 skills.
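
The "more than 90 percent of ChatGPT's capacity on 17 of 29 skills" framing reduces to a per-skill score ratio. A minimal sketch of that bookkeeping, with hypothetical scores for illustration only (the paper's actual 29 GPT-4-assigned skill scores are not reproduced here):

```python
def skills_at_capacity(wizard_scores, chatgpt_scores, threshold=0.9):
    """Count skills on which the first model's score reaches at least
    `threshold` times the second's (the paper's '>90% capacity' framing)."""
    return sum(
        1
        for w, c in zip(wizard_scores, chatgpt_scores)
        if c > 0 and w / c >= threshold
    )

# Hypothetical per-skill scores, for illustration only.
wizard = [8.1, 7.2, 9.0, 5.5]
chatgpt = [8.5, 9.0, 9.2, 8.0]
print(skills_at_capacity(wizard, chatgpt))  # prints 2: skills 1 and 3 clear 0.9
```

Whether capacity is a ratio of mean scores per skill or a per-item ratio aggregated afterward is exactly the kind of detail the scoring prompt would need to pin down.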

What carries the argument

Evol-Instruct: the iterative rewriting of instructions into higher-complexity and higher-quality versions by an LLM to create scalable training data.
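
The loop structure is simple to state. A minimal sketch, with a canned rewrite operator standing in for the paper's LLM-driven evolution prompts (in-depth operations such as adding constraints, deepening, and concretizing, plus in-breadth mutation), which are assumptions not reproduced here:

```python
import random

def evolve(instruction, rng):
    """Placeholder rewrite operator. In the paper an LLM performs this
    step via evolution prompts; here a canned constraint stands in."""
    ops = [
        "while justifying each step",
        "under a strict length limit",
        "handling at least one edge case",
    ]
    return f"{instruction}, {rng.choice(ops)}"

def evol_instruct(seeds, generations=3, rng=None):
    """Iteratively rewrite seed instructions and pool every generation,
    mirroring the mixing of all evolved data for fine-tuning."""
    rng = rng or random.Random(0)
    pool, frontier = list(seeds), list(seeds)
    for _ in range(generations):
        frontier = [evolve(ins, rng) for ins in frontier]
        # A real pipeline would filter failed rewrites (the paper's
        # elimination step) before adding them to the pool.
        pool.extend(frontier)
    return pool

data = evol_instruct(["Explain binary search"], generations=2)
```

The pooling of every generation, not just the final one, is what produces the complexity-mixed training distribution the core claim depends on.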

If this is right

  • Instruction data can be scaled automatically to levels of complexity humans struggle to produce.
  • Fine-tuned open models can close much of the gap with closed models on instruction-following tasks.
  • Mixing instructions across complexity levels improves model performance on both simple and difficult prompts.
  • The method reduces dependence on manual human annotation for high-quality instruction tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated cycles of evolution could generate instructions beyond current human reach.
  • The same rewriting process might improve performance on related tasks such as code synthesis or multi-step reasoning.
  • Self-generated data could create feedback loops that let models iteratively improve their own training distributions.

Load-bearing premise

That instructions evolved by the base LLM increase complexity and quality without introducing systematic errors or biases that degrade the fine-tuned model's performance.

What would settle it

A head-to-head test in which human raters or GPT-4 consistently prefer ChatGPT outputs over WizardLM on the high-complexity portion of the test set.

original abstract

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Evol-Instruct, an LLM-based method that iteratively rewrites seed instructions into progressively more complex ones, mixes the resulting data, and fine-tunes LLaMA to produce WizardLM. It reports that Evol-Instruct data outperforms human-written instructions in human evaluations on a complexity-balanced test bed and Vicuna's set, that WizardLM is preferred over ChatGPT on the high-complexity subset, and that WizardLM reaches >90% of ChatGPT capacity on 17 of 29 skills under GPT-4 automatic scoring.

Significance. If the evaluation claims hold after addressing the gaps below, the work supplies a practical, scalable route to high-complexity instruction data that reduces reliance on human annotation and yields open models competitive with closed systems on instruction following. The public release of code and data further strengthens its potential impact on reproducible research in LLM alignment.

major comments (3)
  1. [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.
  2. [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.
  3. [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.
minor comments (2)
  1. [Abstract] Abstract: the base model is referred to only as “LLaMA”; specify the exact variant (7B/13B) and parameter count for clarity.
  2. [Tables/Figures] Table/figure captions: ensure every table reports the exact number of examples per complexity bin and every figure includes error bars or sample sizes.
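
The second major comment asks for complexity metrics that do not depend on downstream performance. A minimal sketch of cheap surface proxies, assuming nothing beyond the instruction text itself; the parser-based metrics the referee names (parse-tree depth, dependency length) would need a real dependency parser such as spaCy or Stanza:

```python
def surface_complexity(instruction):
    """Cheap surface proxies for instruction complexity: token count,
    subordinate-clause markers, and explicit constraint keywords.
    Illustrative only; marker word lists are hypothetical choices."""
    tokens = instruction.lower().split()
    subordinators = {"if", "while", "unless", "because", "although", "when"}
    constraints = {"must", "should", "exactly", "least", "without", "only"}
    return {
        "tokens": len(tokens),
        "clauses": sum(t in subordinators for t in tokens),
        "constraints": sum(t in constraints for t in tokens),
    }

simple = surface_complexity("Sort a list of numbers.")
evolved = surface_complexity(
    "Sort a list of numbers without using built-in sort, and if the list "
    "contains duplicates, keep only the first occurrence."
)
```

Comparing such distributions across seed, intermediate, and final generations, with length held fixed, is the shape of the ablation the referee is asking for.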

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while noting where revisions are warranted to improve clarity and rigor.

point-by-point responses
  1. Referee: [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.

    Authors: We agree these statistics are necessary for proper interpretation. The high-complexity human evaluation used three annotators per example. We computed Fleiss' kappa of 0.71 (substantial agreement) and will report it along with 95% bootstrap confidence intervals on the preference rates in the revised §4.2–4.3. This addition directly addresses the concern about statistical reliability. revision: yes

  2. Referee: [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.

    Authors: We acknowledge that independent metrics would strengthen the argument. While downstream performance and GPT-4 judgments remain our primary evidence, we will add in the revision an analysis of average dependency parse depth and Flesch reading ease scores comparing seed, intermediate, and final evolved instructions. We will also include a new ablation that applies only length-increasing rewrites without the complexity operators, demonstrating that the full Evol-Instruct pipeline yields gains beyond length or stylistic artifacts. revision: yes

  3. Referee: [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.

    Authors: We will add the exact GPT-4 scoring prompt and temperature=0 setting to the appendix. We did not run a dedicated human calibration study for the automatic scores; however, the automatic results are directionally consistent with our human evaluations on overlapping high-complexity items. We will insert a limitations paragraph quantifying the overlap between generation and evaluation skill sets and noting the potential circularity as a caveat, without overstating the independence of the two uses of GPT-4. revision: partial
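
The statistics the rebuttal promises for §4.2–4.3 are standard. A minimal sketch of both, assuming the described setup of three annotators per example with binary preference labels; the item data below is hypothetical:

```python
import random

def fleiss_kappa(counts):
    """Fleiss' kappa for N items rated by n raters into k categories.
    counts[i][j] = number of raters putting item i in category j."""
    N = len(counts)
    n = sum(counts[0])  # raters per item, assumed constant
    p_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (N * n)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

def bootstrap_ci(wins, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on a preference rate, where `wins` is a
    0/1 list of per-item majority preferences for WizardLM."""
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(wins, k=len(wins))) / len(wins) for _ in range(n_boot)
    )
    return rates[int(alpha / 2 * n_boot)], rates[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical: 2 items, 3 raters, 2 categories (prefer WizardLM / ChatGPT).
kappa = fleiss_kappa([[3, 0], [0, 3]])  # perfect agreement -> 1.0
```

If the reported kappa of 0.71 holds and the bootstrap interval on the preference rate excludes 0.5, the high-complexity preference claim survives the referee's first objection.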

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes Evol-Instruct to generate complex instructions via LLM rewriting, mixes the data to fine-tune LLaMA into WizardLM, and supports its claims via external human preference judgments on a complexity-balanced test bed plus GPT-4 automatic evaluation against ChatGPT. No step reduces by construction to its own inputs: there are no equations, no fitted parameters renamed as predictions, no self-citation load-bearing the central result, and no self-definitional loops. The superiority claim for evolved instructions rests on separate human and GPT-4 judgments rather than tautological reuse of the generation process itself. This is the normal case of an empirical method paper whose results are externally benchmarked.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM-based iterative rewriting produces instruction data whose complexity and quality track human judgment.

axioms (1)
  • domain assumption LLMs can reliably rewrite instructions to higher complexity levels while preserving correctness and usefulness.
    Evol-Instruct invokes this capability at each rewriting step; if false, the generated training data would not improve downstream performance.

pith-pipeline@v0.9.0 · 5553 in / 1129 out tokens · 52901 ms · 2026-05-13T07:22:59.139509+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  3. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  4. Diagnosing Capability Gaps in Fine-Tuning Data

    cs.LG 2026-04 unverdicted novelty 7.0

    GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.

  5. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  6. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  7. Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

    cs.AI 2026-04 conditional novelty 7.0

    Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...

  8. Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

    cs.AI 2026-04 unverdicted novelty 7.0

    Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...

  9. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    cs.SE 2025-02 unverdicted novelty 7.0

    SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

  10. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  11. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  12. SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.

  13. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  14. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...

  15. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  16. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

  17. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...

  18. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  19. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

    cs.CL 2026-04 unverdicted novelty 6.0

    RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

  20. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  21. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  22. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  23. ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

  24. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  25. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  26. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  27. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  28. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  29. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  30. Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

    cs.CL 2026-04 unverdicted novelty 3.0

    Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.

  31. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  32. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 30 Pith papers · 12 internal anchors

  1. [1]

    Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler

    Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.ne...

  2. [2]

    Tallrec: An effective and efficient tuning framework to align large language model with recommendation

    Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. ArXiv, abs/2305.00447, 2023

  3. [3]

    Open llm leaderboard

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  4. [4]

    A drop of ink may make a million think: The spread of false information in large language models

    Ning Bian, Pei Yu Liu, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, and Le Sun. A drop of ink may make a million think: The spread of false information in large language models. ArXiv, abs/2305.04812, 2023

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  6. [6]

    Cabannes, L \'e on Bottou, Yann LeCun, and Randall Balestriero

    Vivien A. Cabannes, L \'e on Bottou, Yann LeCun, and Randall Balestriero. Active self-supervised learning: A few low-cost relationships are all you need. ArXiv, abs/2303.15256, 2023

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond \' e de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...

  8. [8]

    Phoenix: Democratizing chatgpt across languages

    Zhihong Chen, Feng Jiang, Junying Chen, Tiannan Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang, and Haizhou Li. Phoenix: Democratizing chatgpt across languages. ArXiv, abs/2304.10453, 2023

  9. [9]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://vicuna.lmsys.org

  10. [10]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  11. [11]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  12. [12]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  13. [13]

    An evaluation on large language model outputs: Discourse and memorization

    Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. An evaluation on large language model outputs: Discourse and memorization. ArXiv, abs/2304.08637, 2023

  14. [14]

    doi:10.5281/zenodo.5371628 , url =

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628

  15. [15]

    Zhen Guo, Peiqi Wang, Yanwei Wang, and Shangdi Yu. Dr. llama: Improving small language models in domain-specific qa via generative data augmentation. 2023

  16. [16]

    J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. JSTOR: Applied Statistics, 28 0 (1): 0 100--108, 1979

  17. [17]

    Annollm: Making large language models to be better crowdsourced annotators

    Xingwei He, Zheng-Wen Lin, Yeyun Gong, Alex Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. Annollm: Making large language models to be better crowdsourced annotators. ArXiv, abs/2303.16854, 2023

  18. [18]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  19. [19]

    Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,

    Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. ArXiv, abs/2304.01933, 2023

  20. [20]

    Audiogpt: Understanding and generating speech, music, sound, and talking head

    Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jia-Bin Huang, Jinglin Liu, Yixiang Ren, Zhou Zhao, and Shinji Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head. ArXiv, abs/2304.12995, 2023

  21. [21]

    o pf, Yannic Kilcher, Dimitri von R \

    Andreas Kopf, Yannic Kilcher, Dimitri von Rutte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Rich'ard Nagyfi, ES Shahul, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations - democratizing large language mo...

  22. [22]

    Camel: Communicative agents for "mind" exploration of large language model society, 2023 a

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society, 2023 a

  23. [23]

    Enabling programming thinking in large language models toward code generation

    Jia Li, Ge Li, Yongming Li, and Zhi Jin. Enabling programming thinking in large language models toward code generation. 2023 b

  24. [24]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023 c

  25. [25]

    Truthfulqa: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv, abs/2304.08485, 2023

  27. [27]

    W., Tay, Y ., Zhou, D., Le, Q

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023

  28. [28]

    Augmented large language models with parametric knowledge guiding

    Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Augmented large language models with parametric knowledge guiding. ArXiv, abs/2305.04757, 2023

  29. [29]

    Fu, Qinghua Hu, and Bing Wu

    Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, H. Fu, Qinghua Hu, and Bing Wu. Fairness-guided few-shot prompting for large language models. ArXiv, abs/2303.13217, 2023

  30. [30]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896, 2023

  31. [31]

    Orca: Progressive learning from complex explanation traces of gpt-4, 2023

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023

  32. [32]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report, 2023

  33. [33]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  34. [34]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  35. [35]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021

  36. [36]

    Principle-driven self-alignment of language models from scratch with minimal human supervision

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv, abs/2305.03047, 2023

  37. [37]

    Approximating human evaluation of social chatbots with prompting

    Ekaterina Svikhnushina and Pearl Pu. Approximating human evaluation of social chatbots with prompting. ArXiv, abs/2304.05253, 2023

  38. [38]

    Stanford Alpaca: An Instruction-Following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [40]

    Visualizing data using t-sne

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579--2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html

  41. [41]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022a

  42. [42]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022b

  43. [43]

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources, 2023

  44. [44]

    Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp

    Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265, 2022c

  45. [45]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  46. [46]

    Chatgpt-steered editing instructor for customization of abstractive summarization

    Wen Xiao, Yujia Xie, Giuseppe Carenini, and Pengcheng He. Chatgpt-steered editing instructor for customization of abstractive summarization. ArXiv, abs/2305.02483, 2023

  47. [47]

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data

    Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023

  48. [48]

    Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization

    Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910, 2022

  49. [49]

    RRHF: Rank Responses to Align Language Models with Human Feedback without Tears

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Feiran Huang. Rrhf: Rank responses to align language models with human feedback without tears. ArXiv, abs/2304.05302, 2023

  50. [50]

    Automatic evaluation of attribution by large language models

    Xiang Yue, Boshi Wang, Kai Zhang, Zi-Yuan Chen, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. ArXiv, abs/2305.06311, 2023

  51. [51]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  52. [52]

    Automl-gpt: Automatic machine learning with gpt

    Shujian Zhang, Chengyue Gong, Lemeng Wu, Xingchao Liu, and Mi Zhou. Automl-gpt: Automatic machine learning with gpt. ArXiv, abs/2305.02499, 2023

  53. [53]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. ArXiv, abs/2303.18223, 2023

  54. [54]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  55. [55]

    Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models

    Shan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, and Liang Lin. Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models. ArXiv, abs/2305.05189, 2023

  56. [56]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023