
arXiv: 2301.13688 · v2 · pith:YWXZK3YDnew · submitted 2023-01-31 · cs.AI · cs.CL · cs.LG

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Pith reviewed 2026-05-24 09:11 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG
keywords instruction tuning · task balancing · mixed prompting · Flan-T5 · zero-shot · few-shot · chain-of-thought · data collection

The pith

Task balancing and mixed zero-shot, few-shot, and chain-of-thought prompts during instruction tuning improve performance by over 2 percent in every setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper breaks down the development of Flan 2022 through ablation studies on its collection of tasks and methods. It establishes that task balancing and enrichment techniques, previously overlooked, drive large gains, with models trained on mixed prompt formats outperforming those trained on single formats. These choices also produce instruction-tuned models that reach higher performance with less additional finetuning than their base counterparts on new tasks. The work releases the full collection of datasets, templates, and methods to support further research.

Core claim

Ablation studies on the Flan Collection show that task balancing and enrichment techniques are critical to effective instruction tuning. Training with mixed prompt settings that combine zero-shot, few-shot, and chain-of-thought formats improves performance by over 2 percent in all evaluation settings. When finetuned on single downstream tasks, Flan-T5 converges faster and to higher performance than T5.
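To make the mixed-prompt idea concrete, here is a minimal sketch of rendering each training example in one randomly chosen format. The template strings, the function names, and the 0.5/0.4/0.1 format weights are illustrative assumptions, not the templates or proportions used in the Flan Collection; the released repository holds the real ones.

```python
import random

# Hypothetical templates, for illustration only.
def zero_shot(question):
    return f"Q: {question}\nA:"

def few_shot(question, exemplars):
    # Prepend solved exemplars before the question to be answered.
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars)
    return shots + f"Q: {question}\nA:"

def chain_of_thought(question):
    return f"Q: {question}\nA: Let's think step by step."

def render_mixed(question, exemplars, rng, weights=(0.5, 0.4, 0.1)):
    """Pick one prompt format per example; the weights are assumed values."""
    fmt = rng.choices(["zero_shot", "few_shot", "cot"], weights=weights, k=1)[0]
    if fmt == "zero_shot":
        return fmt, zero_shot(question)
    if fmt == "few_shot":
        return fmt, few_shot(question, exemplars)
    return fmt, chain_of_thought(question)

rng = random.Random(0)
exemplars = [("What is 2 + 2?", "4")]
batch = [render_mixed("What is 3 + 5?", exemplars, rng) for _ in range(1000)]
counts = {name: sum(1 for f, _ in batch if f == name)
          for name in ("zero_shot", "few_shot", "cot")}
```

A single-format baseline corresponds to putting all the weight on one format, which is what makes the comparison in the ablations possible.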

What carries the argument

The Flan Collection of tasks, templates, and methods, which supports controlled ablations that isolate the contributions of task balancing and mixed prompt training.
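One common way to balance tasks of very different sizes is examples-proportional mixing with a cap on each task's effective size. The sketch below illustrates that general scheme; whether the Flan Collection uses exactly this rule is an assumption here, and the task names, sizes, and cap value are hypothetical.

```python
def capped_mixing_rates(task_sizes, cap):
    """Sampling rate per task, proportional to min(size, cap)."""
    capped = {task: min(n, cap) for task, n in task_sizes.items()}
    total = sum(capped.values())
    return {task: c / total for task, c in capped.items()}

# Hypothetical task sizes, chosen to show the effect of the cap.
sizes = {"big_task": 1_000_000, "medium_task": 30_000, "small_task": 10_000}

uncapped = capped_mixing_rates(sizes, cap=max(sizes.values()))
balanced = capped_mixing_rates(sizes, cap=30_000)
```

Without a cap the largest task would dominate the mixture; with the cap it is sampled at the same rate as the medium task, while the small task keeps a share proportional to its size.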

If this is right

  • Flan-T5 outperforms prior instruction-tuned models by 3-17 percent or more across evaluation settings.
  • Models trained with mixed prompts perform better than single-format models even when tested in one fixed format.
  • Instruction-tuned models serve as more computationally efficient starting checkpoints for new downstream tasks.
  • Public release of the Flan 2022 collection allows other researchers to replicate and extend the design decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency advantage of instruction-tuned starting points could compound when applied repeatedly across many sequential tasks.
  • Optimal mixtures of prompt types might differ by domain and could be tuned automatically in future data pipelines.
  • Wider adoption of mixed-prompt training might reduce the need for separate specialized models for different inference modes.

Load-bearing premise

The ablation studies accurately isolate the effects of task balancing and mixed prompting without major confounding from total compute, model scale, or evaluation choices.

What would settle it

Re-running the key ablations while holding total training tokens fixed but removing task balancing or mixed prompts, then checking whether the reported gains disappear.

Figures

Figures reproduced from arXiv: 2301.13688 by Adam Roberts, Albert Webson, Barret Zoph, Denny Zhou, Hyung Won Chung, Jason Wei, Le Hou, Quoc V. Le, Shayne Longpre, Tu Vu, Yi Tay.

Figure 1: Comparing public instruction tuning collections on Held-In, Held-Out (BIG-Bench Hard (Suzgun et al., 2022) and MMLU (Hendrycks et al., 2020)), and Chain-of-Thought evaluation suites, detailed in Appendix A.3. All models except OPT-IML-Max (175B) are T5-XL with 3B parameters. Green text indicates absolute improvement over the next best comparable T5-XL (3B) model. …
Figure 2: A Timeline of Public Instruction Tuning Collections specifies the collection release date, detailed information on the finetuned models (the base model, their size, and whether the model itself is Public (P) or Not Public (NP)), what prompt specification they were trained for (zero-shot, few-shot, or Chain-of-Thought), the number of tasks contained in the Flan 2022 Collection (released with this work), and…
Figure 3: Training jointly with zero-shot and few-shot prompt templates improves performance on both Held-In and Held-Out tasks. The stars indicate the peak performance in each setting.
Figure 4: Performance Scaling Laws for the number of finetuning tasks and model sizes. Held-In performance (left) and Held-Out MMLU performance (right) are shown. The gold star indicates the peak performance for that model size. Surprisingly, only T5-Small appears to exceed its Held-Out task performance before 1836 tasks, while larger model sizes continue to improve. These results suggest (a) even T5-Base may not h…
Figure 5: Flan-T5 Outperforms T5 on Single-Task Finetuning. We compare single-task finetuned T5, single-task finetuned Flan-T5, and Flan-T5 without any further finetuning. are not weighted significantly: 4%, 2%, 2%, 2% respectively. We believe example templatization and the mixed prompt formats may pose the largest differences with OPT-IML's instruction tuning. Our template repository was significantly updated from F…
Figure 6: Flan-T5 converges faster than T5 on single-task finetuning for each of 5 Held-Out tasks from Flan finetuning.
Figure 7: Input Inversions permutations for a Zero-Shot Chain-of-Thought example. Each is accompanied by a corresponding instruction template that prompts the model with what the input is, and what to predict as the targets.
Original abstract

We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents the Flan Collection and reports ablation studies on its tasks and methods that isolate design decisions enabling Flan-T5 to outperform prior instruction-tuned models by 3-17%+ across settings. Key findings are that task balancing and enrichment are critical, mixed prompting (zero-shot, few-shot, and chain-of-thought) yields 2%+ gains in all settings, and Flan-T5 converges higher and faster than T5 on downstream tasks; the collection of datasets, templates, and methods is released publicly.

Significance. If the ablation results hold after appropriate controls, the work would usefully highlight overlooked factors in instruction tuning and supply a reusable public resource. The public release of the full collection is a concrete strength that supports reproducibility and further experimentation.

major comments (1)
  1. Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our ablation design. The single major comment raises a valid point about experimental controls that we will address directly in revision.

Point-by-point responses
  1. Referee: Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.

    Authors: We agree that explicit controls for total training volume are necessary to isolate the effect of prompt mixing. In the reported ablations, the total number of training examples (and thus tokens) was held constant across the mixed-prompt and single-prompt conditions by fixing the overall training budget and sampling examples from the prompt mixture while preserving the task-balanced per-task counts used in the main experiments. Task balancing was applied uniformly and is orthogonal to the prompt-type mixture variable. We will revise the ablation sections (and abstract if space permits) to state this control explicitly, including the exact training example counts used in each condition, so that readers can verify the isolation of the mixing effect. revision: yes
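The control described in the rebuttal can be sketched as a sampling plan: fix the total example budget, split it across tasks once, then split each task's count across prompt formats. This is an illustrative sketch only; the function names, task shares, format shares, and budget are hypothetical, and largest-remainder rounding is a choice made here so the integer counts sum exactly to the budget.

```python
def allocate(budget, shares):
    """Split an integer budget in proportion to shares (largest-remainder rounding)."""
    total = sum(shares.values())
    exact = {k: budget * s / total for k, s in shares.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = budget - sum(counts.values())
    # Hand the remaining units to the keys with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

def plan(budget, task_shares, format_shares):
    """Per-task counts depend only on task_shares; formats subdivide each task."""
    per_task = allocate(budget, task_shares)
    return {task: allocate(n, format_shares) for task, n in per_task.items()}

def per_task_totals(p):
    return {task: sum(c.values()) for task, c in p.items()}

task_shares = {"task_a": 3, "task_b": 1}  # hypothetical balanced task weights
mixed = plan(10_000, task_shares, {"zero_shot": 1, "few_shot": 1, "cot": 1})
single = plan(10_000, task_shares, {"zero_shot": 1})
```

Because the format split happens inside each task's allocation, the mixed and single-format conditions see identical per-task counts and identical totals, so any difference in results cannot come from data volume.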

Circularity Check

0 steps flagged

Empirical ablation study with no derivation chain or self-referential reductions

full rationale

The paper reports results from ablation experiments on the Flan Collection of public tasks and prompting methods. All central claims (e.g., benefits of task balancing, mixed zero/few-shot/CoT prompting yielding 2%+ gains, faster convergence of Flan-T5) are grounded in measured performance differences across conditions rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that substitutes for evidence. The citation to Chung et al. (2022) is to prior related work whose development is being analyzed here, but the new ablations are independent and externally falsifiable on public data. No equations, uniqueness theorems, or ansatzes appear. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning study relying on standard assumptions of supervised fine-tuning and evaluation on public benchmarks; no new free parameters, axioms, or invented entities are introduced in the abstract.



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Static Personas: Situational Personality Steering for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    IRIS is a neuron-based Identify-Retrieve-Steer method for situational personality control in LLMs that outperforms baselines on PersonalityBench and the new SPBench.

  2. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  3. WizardLM: Empowering large pre-trained language models to follow complex instructions

    cs.CL 2023-04 conditional novelty 7.0

    WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

  4. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  5. MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

  6. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  7. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  8. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  9. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  10. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  11. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    cs.CL 2023-05 conditional novelty 6.0

    UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

  12. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  13. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  14. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  15. Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

    cs.CL 2025-08 unverdicted novelty 5.0

    Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

  16. AppAgent: Multimodal Agents as Smartphone Users

    cs.CV 2023-12 unverdicted novelty 5.0

    AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

  17. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  18. A Survey on Knowledge Distillation of Large Language Models

    cs.CL 2024-02 accept novelty 3.0

    A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

  19. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  20. Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

    cs.DC 2025-03 unverdicted novelty 2.0

    Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 20 Pith papers · 28 internal anchors

  1. [1]

    URL https://aclanthology.org/2021.emnlp-main.468. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi...

  2. [2]

    Ext5: Towards extreme multi-task scaling for transfer learning

    Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952,

  3. [3]

    Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, D...

  4. [4]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258,

  5. [6]

    PaLM: Scaling Language Modeling with Pathways

    URL https://arxiv.org/abs/2204.02311. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,

  6. [7]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

  7. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  8. [10]

    Training Verifiers to Solve Math Word Problems

    URL https://arxiv.org/abs/2110.14168. Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume

  9. [11]

    Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li

    URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf. Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li. Evaluating factuality in text simplification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7331–7345, Dublin, Ireland, May

  10. [12]

    doi: 10.18653/v1/2022.acl-long.506

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.506. URL https://aclanthology.org/2022.acl-long.506. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL,

  11. [13]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    URL https://aclanthology.org/N19-1423. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

  12. [14]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  13. [15]

    arXiv preprint arXiv:2210.08726 (2023)

    Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726,

  14. [16]

    Improving alignment of dialogue agents via targeted human judgements

    Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375,

  15. [17]

    Improving zero and few-shot generalization in dialogue through instruction tuning

    Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P Bigham. Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673,

  16. [18]

    URL https://openreview.net/forum?id=d7KBjmI3GmQ. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vi...

  17. [19]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689,

  18. [21]

    LoRA: Low-Rank Adaptation of Large Language Models

    URL https://arxiv.org/abs/2106.09685. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJC...

  19. [22]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608,

  20. [24]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    URL https://arxiv.org/abs/2212.12017. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages...

  21. [25]

    Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher

    URL https://aclanthology.org/D19-1259. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering, text classification, and regression via span extraction. arXiv preprint arXiv:1904.09286,

  22. [26]

    UnifiedQA: Crossing format boundaries with a single QA system

    Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020,

  23. [27]

    URL https://aclanthology.org/2020.findings-emnlp

  24. [28]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,

  25. [29]

    URL https://aclanthology.org/2021

    doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 5...

  26. [30]

    doi: 10.18653/v1/2020.acl-main.703

    Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantita...

  27. [31]

    Solving Quantitative Reasoning Problems with Language Models

    URL https://arxiv.org/abs/2206.14858. Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In ICML,

  28. [32]

    Wanli: Worker and ai collaboration for natural language inference dataset creation

    Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation. arXiv preprint arXiv:2201.05955, 2022a. URL https://arxiv.org/abs/2201.05955. Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuni...

  29. [33]

    How effective is task-agnostic data augmentation for pretrained transformers?

    Shayne Longpre, Yu Wang, and Chris DuBois. How effective is task-agnostic data augmentation for pretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4401–4411,

  30. [34]

    Entity-based knowledge conflicts in question answering

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063,

  31. [35]

    On faithfulness and factuality in abstractive summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July

  32. [36]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,

  33. [37]

    The radicalization risks of gpt-3 and advanced neural language models

    Kris McGuffie and Alex Newhouse. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807,

  34. [38]

    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi

    URL https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL,

  35. [39]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    URL https://aclanthology.org/2022.naacl-main.201. Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773,

  36. [40]

    Crosslingual generalization through multitask finetuning

    Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786,

  37. [41]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

  38. [43]

    Training language models to follow instructions with human feedback

    URL https://arxiv.org/abs/2203.02155. Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2855–2870,

  39. [44]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal

    URL https://aclanthology.org/2021.eacl-main.249. Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094,

  40. [45]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffman...

  41. [46]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    URL https://arxiv.org/abs/1910.10683. Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789,

  42. [48]

    URL https://arxiv.org/abs/2211.00295. Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jann...

  43. [49]

    Alexey Romanov and Chaitanya Shivade

    URL https://arxiv.org/abs/2203.17189. Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1586–1596,

  44. [50]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al

    URL https://aclanthology.org/D18-1187. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR 2022,

  45. [51]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    URL https://arxiv.org/abs/2110.08207. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3...

  46. [52]

    doi: 10.18653/v1/D19-1339

    Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, ...

  47. [53]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al

    URL https://arxiv.org/abs/2212.13138. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

  48. [54]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    URL https://arxiv.org/abs/2206.04615. Mirac Suzgun, Nathan Scales, Nathaneal Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,

  49. [55]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    URL https://arxiv.org/abs/2210.09261. Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Alexandra Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, et al. You reap what you sow: On the challenges of bias evaluation under multilingual settings. Challenges & Perspectives in Creating Large Language Models, page 26,

  50. [56]

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,

  51. [57]

    Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a. URL https://arxiv.org/abs/2205.05131. Yi Tay, Jason Wei, Hyung Won Chung, David R. So, Siamak Shakeri, Xavier Garcia, Vinh Q. Tran, Hauixiu S...

  52. [58]

    LaMDA: Language Models for Dialog Applications

    URL https://arxiv.org/abs/2201.08239. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926,

  53. [59]

    Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer

    URL https://aclanthology.org/2020.emnlp-main.635. Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5039–5059,

  54. [60]

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh

    URL https://aclanthology.org/2022.acl-long.346. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

  55. [61]

    doi: 10.18653/v1/D19-1221

    Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221. Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May

  56. [62]

    What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a

    Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, a...

  57. [63]

    Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren

    URL https://arxiv.org/abs/2212.10773. Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In EMNLP,

  58. [64]

    Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo

    URL https://arxiv.org/abs/2104.08835. Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo. Guess the instruction! Making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969,

  59. [65]

    GLM-130B: An Open Bilingual Pre-trained Model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414,

  60. [66]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

  61. [67]

    A.2 Single-Task Finetuning

    Appendix Table of Contents. A Experimental Details: A.1 Instruction Tuning; A.2 Single-Task Finetuning; A.3 Evaluation ...

  62. [68]

    These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchmarks, often used to evaluate LLMs on QA and NLI

    A.3 Evaluation. For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks (BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams) and 4 natural language inference (NLI) tasks (ANLI R1, R2, R3, and RTE). These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchma...

  63. [69]

    Table 3: Datasets used for Various Finetuning and Evaluation Experiments. ST-FT stands for Single Task Finetuning. For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets which have been prepared with prompts which request step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA. For the Held-Out...