The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Pith reviewed 2026-05-24 09:11 UTC · model grok-4.3
The pith
Task balancing and enrichment matter for instruction tuning, and training on a mix of zero-shot, few-shot, and chain-of-thought prompts improves performance by over 2 percent in every evaluation setting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ablation studies on the Flan Collection show that task balancing and enrichment techniques are critical to effective instruction tuning. Training with mixed prompt settings that combine zero-shot, few-shot, and chain-of-thought formats yields gains of over 2 percent in all evaluation settings. On single downstream tasks, Flan-T5 needs less finetuning than T5 and converges both faster and to higher performance.
What carries the argument
The Flan Collection of tasks, templates, and methods, which supports controlled ablations that isolate the contributions of task balancing and mixed prompt training.
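To make "mixed prompt training" concrete, the following is a minimal sketch of rendering one example in a randomly chosen format. The template wording, the function names, and the 0.5/0.3/0.2 mixture weights are all assumptions made for illustration; they are not the templates or ratios from the released Flan Collection.

```python
import random

# All templates below are hypothetical stand-ins, not the released Flan templates.
def zero_shot(question):
    return f"Q: {question}\nA:"

def few_shot(question, exemplars):
    # exemplars: list of (question, answer) pairs shown before the target question
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\n{zero_shot(question)}"

def chain_of_thought(question):
    # Asks for reasoning before the answer; the wording is illustrative only.
    return f"Q: {question}\nExplain step by step, then answer.\nA:"

def render_mixed(question, exemplars, rng, weights=(0.5, 0.3, 0.2)):
    """Pick one prompt format at random so a training set mixes all three."""
    kind = rng.choices(["zero_shot", "few_shot", "cot"], weights=weights, k=1)[0]
    if kind == "zero_shot":
        return kind, zero_shot(question)
    if kind == "few_shot":
        return kind, few_shot(question, exemplars)
    return kind, chain_of_thought(question)

rng = random.Random(0)
exemplars = [("Is ice frozen water?", "yes")]
for _ in range(3):
    kind, prompt = render_mixed("Is the sky green?", exemplars, rng)
    print(kind, repr(prompt))
```

A single-format baseline corresponds to putting all the weight on one format, for example `weights=(1, 0, 0)`.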
If this is right
- Flan-T5 outperforms prior instruction-tuned models by 3-17 percent or more across evaluation settings.
- Models trained with mixed prompts perform better than single-format models even when tested in one fixed format.
- Instruction-tuned models serve as more computationally efficient starting checkpoints for new downstream tasks.
- Public release of the Flan 2022 collection allows other researchers to replicate and extend the design decisions.
Where Pith is reading between the lines
- The efficiency advantage of instruction-tuned starting points could compound when applied repeatedly across many sequential tasks.
- Optimal mixtures of prompt types might differ by domain and could be tuned automatically in future data pipelines.
- Wider adoption of mixed-prompt training might reduce the need for separate specialized models for different inference modes.
Load-bearing premise
The ablation studies accurately isolate the effects of task balancing and mixed prompting without major confounding from total compute, model scale, or evaluation choices.
What would settle it
Re-running the key ablations while holding total training tokens fixed but removing task balancing or mixed prompts, then checking whether the reported 3-17 percent gains disappear.
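The check described above can be sketched as a small comparison that refuses to report gains when the two runs did not see comparable amounts of data. The function name, the 1 percent tolerance, and the scores are hypothetical placeholders, not values from the paper.

```python
def gains_at_matched_budget(full, ablated, tokens_full, tokens_ablated, rel_tol=0.01):
    """Per-setting gain of the full recipe over an ablation.

    Raises if the two runs used noticeably different token budgets, since the
    comparison would then mix the effect of the recipe with the effect of scale.
    """
    if abs(tokens_full - tokens_ablated) > rel_tol * max(tokens_full, tokens_ablated):
        raise ValueError("token budgets differ; the comparison is confounded")
    return {setting: full[setting] - ablated[setting] for setting in full}

# Placeholder scores for illustration only; these are not results from the paper.
full = {"held_in": 70.0, "held_out": 50.0, "cot": 40.0}
ablated = {"held_in": 69.0, "held_out": 47.0, "cot": 36.0}

gains = gains_at_matched_budget(full, ablated, 1_000_000, 1_000_000)
print(gains, "smallest gain:", min(gains.values()))
```

If the smallest gain stays clearly positive at a matched budget, scale alone is an unlikely explanation; if it collapses toward zero, the premise above is in trouble.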
Original abstract
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the Flan Collection and reports ablation studies on its tasks and methods that isolate design decisions enabling Flan-T5 to outperform prior instruction-tuned models by 3-17%+ across settings. Key findings are that task balancing and enrichment are critical, mixed prompting (zero-shot, few-shot, and chain-of-thought) yields 2%+ gains in all settings, and Flan-T5 converges higher and faster than T5 on downstream tasks; the collection of datasets, templates, and methods is released publicly.
Significance. If the ablation results hold after appropriate controls, the work would usefully highlight overlooked factors in instruction tuning and supply a reusable public resource. The public release of the full collection is a concrete strength that supports reproducibility and further experimentation.
major comments (1)
- Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our ablation design. The single major comment raises a valid point about experimental controls that we will address directly in revision.
-
Referee: Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.
Authors: We agree that explicit controls for total training volume are necessary to isolate the effect of prompt mixing. In the reported ablations, the total number of training examples (and thus tokens) was held constant across the mixed-prompt and single-prompt conditions by fixing the overall training budget and sampling examples from the prompt mixture while preserving the task-balanced per-task counts used in the main experiments. Task balancing was applied uniformly and is orthogonal to the prompt-type mixture variable. We will revise the ablation sections (and abstract if space permits) to state this control explicitly, including the exact training example counts used in each condition, so that readers can verify the isolation of the mixing effect.
Revision: yes
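The control described in this response can be sketched as follows: fix one example budget, split it across tasks once, and vary only the prompt-type mixture between conditions. This is an illustrative sketch, not the authors' pipeline; the task names, weights, and helper functions are hypothetical, and the budget here counts examples rather than tokens.

```python
import random
from collections import Counter

def allocate_budget(task_weights, total_examples):
    """Split a fixed example budget across tasks in proportion to integer weights."""
    total_w = sum(task_weights.values())
    counts = {t: total_examples * w // total_w for t, w in task_weights.items()}
    # Integer division can leave a few examples unassigned; give them to the
    # highest-weight tasks so the total matches the budget exactly.
    leftover = total_examples - sum(counts.values())
    for t in sorted(task_weights, key=task_weights.get, reverse=True)[:leftover]:
        counts[t] += 1
    return counts

def build_schedule(task_counts, prompt_mix, rng):
    """Assign a prompt type to every example slot; per-task counts stay unchanged."""
    kinds = list(prompt_mix)
    weights = [prompt_mix[k] for k in kinds]
    schedule = []
    for task, n in task_counts.items():
        schedule.extend((task, kind) for kind in rng.choices(kinds, weights=weights, k=n))
    rng.shuffle(schedule)
    return schedule

def per_task_counts(schedule):
    return dict(Counter(task for task, _ in schedule))

# Hypothetical tasks and weights, chosen only to illustrate the control.
counts = allocate_budget({"nli": 2, "qa": 1, "cot_math": 1}, 1000)
mixed = build_schedule(counts, {"zero_shot": 0.5, "few_shot": 0.3, "cot": 0.2}, random.Random(0))
single = build_schedule(counts, {"zero_shot": 1.0}, random.Random(0))

print(len(mixed), len(single))
print(per_task_counts(mixed) == per_task_counts(single))
```

Because both schedules are built from the same `counts`, any difference between the two conditions comes from the prompt-type assignment alone.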
Circularity Check
Empirical ablation study with no derivation chain or self-referential reductions
The paper reports results from ablation experiments on the Flan Collection of public tasks and prompting methods. All central claims (e.g., benefits of task balancing, mixed zero/few-shot/CoT prompting yielding 2%+ gains, faster convergence of Flan-T5) are grounded in measured performance differences across conditions rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that substitutes for evidence. The citation to Chung et al. (2022) is to prior related work whose development is being analyzed here, but the new ablations are independent and externally falsifiable on public data. No equations, uniqueness theorems, or ansatzes appear. The study is self-contained against external benchmarks.
Forward citations
Cited by 20 Pith papers
-
Beyond Static Personas: Situational Personality Steering for Large Language Models
IRIS is a neuron-based Identify-Retrieve-Steer method for situational personality control in LLMs that outperforms baselines on PersonalityBench and the new SPBench.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
Language Is Not All You Need: Aligning Perception with Language Models
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
-
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
-
AppAgent: Multimodal Agents as Smartphone Users
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
Reference graph
Works this paper leans on
-
[1]
URL https://aclanthology.org/2021.emnlp-main.468. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi...
-
[2]
Ext5: Towards extreme multi-task scaling for transfer learning
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952,
-
[3]
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, D...
doi: 10.18653/v1/2022.acl-demo.9
-
[4]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258,
-
[6]
PaLM: Scaling Language Modeling with Pathways
URL https://arxiv.org/abs/2204.02311. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,
-
[7]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,
-
[8]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
-
[10]
Training Verifiers to Solve Math Word Problems
URL https://arxiv.org/abs/2110.14168. Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume
-
[11]
Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li
URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf. Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li. Evaluating factuality in text simplification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7331–7345, Dublin, Ireland, May
-
[12]
doi: 10.18653/v1/2022.acl-long.506
Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.506. URL https://aclanthology.org/2022.acl-long.506. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL,
-
[13]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
URL https://aclanthology.org/N19-1423. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,
-
[14]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
-
[15]
arXiv preprint arXiv:2210.08726 (2023)
Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726,
-
[16]
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375,
-
[17]
Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P Bigham. Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673,
-
[18]
URL https://openreview.net/forum?id=d7KBjmI3GmQ. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vi...
-
[19]
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689,
-
[21]
LoRA: Low-Rank Adaptation of Large Language Models
URL https://arxiv.org/abs/2106.09685. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJC...
-
[22]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608,
-
[24]
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
URL https://arxiv.org/abs/2212.12017. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages...
-
[25]
Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher
URL https://aclanthology.org/D19-1259. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering, text classification, and regression via span extraction. arXiv preprint arXiv:1904.09286,
-
[26]
UnifiedQA: Crossing format boundaries with a single QA system
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020,
-
[27]
URL https://aclanthology.org/2020.findings-emnlp
-
[28]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,
-
[29]
URL https://aclanthology.org/2021
doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 5...
-
[30]
doi: 10.18653/v1/2020.acl-main.703
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantita...
-
[31]
Solving Quantitative Reasoning Problems with Language Models
URL https://arxiv.org/abs/2206.14858. Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In ICML,
-
[32]
Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation. arXiv preprint arXiv:2201.05955, 2022a. URL https://arxiv.org/abs/2201.05955. Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuni...
-
[33]
Shayne Longpre, Yu Wang, and Chris DuBois. How effective is task-agnostic data augmentation for pretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4401–4411,
-
[34]
Entity-based knowledge conflicts in question answering
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063,
-
[35]
On faithfulness and factuality in abstractive summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July
-
[36]
The Natural Language Decathlon: Multitask Learning as Question Answering
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,
-
[37]
The radicalization risks of gpt-3 and advanced neural language models
Kris McGuffie and Alex Newhouse. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807,
-
[38]
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi
URL https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL,
-
[39]
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
URL https://aclanthology.org/2022.naacl-main.201. Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773,
-
[40]
Crosslingual generalization through multitask finetuning
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786,
-
[41]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,
-
[43]
Training language models to follow instructions with human feedback
URL https://arxiv.org/abs/2203.02155. Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2855–2870,
-
[44]
Arkil Patel, Satwik Bhattamishra, and Navin Goyal
URL https://aclanthology.org/2021.eacl-main.249. Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094,
-
[45]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffman...
-
[46]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
URL https://arxiv.org/abs/1910.10683. Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789,
-
[48]
URL https://arxiv.org/abs/2211.00295. Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jann...
-
[49]
Alexey Romanov and Chaitanya Shivade
URL https://arxiv.org/abs/2203.17189. Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1586–1596,
-
[50]
URL https://aclanthology.org/D18-1187. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR 2022,
-
[51]
Multitask Prompted Training Enables Zero-Shot Task Generalization
URL https://arxiv.org/abs/2110.08207. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3...
-
[52]
Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, ...
-
[53]
URL https://arxiv.org/abs/2212.13138. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,
-
[54]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
URL https://arxiv.org/abs/2206.04615. Mirac Suzgun, Nathan Scales, Nathaneal Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,
-
[55]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
URL https://arxiv.org/abs/2210.09261. Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Alexandra Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, et al. You reap what you sow: On the challenges of bias evaluation under multilingual settings. Challenges & Perspectives in Creating Large Language Models, page 26,
-
[56]
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,
-
[57]
Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a
Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a. URL https://arxiv.org/abs/2205.05131. Yi Tay, Jason Wei, Hyung Won Chung, David R. So, Siamak Shakeri, Xavier Garcia, Vinh Q. Tran, Hauixiu S...
-
[58]
LaMDA: Language Models for Dialog Applications
URL https://arxiv.org/abs/2201.08239. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926,
-
[59]
Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer
URL https://aclanthology.org/2020.emnlp-main.635. Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5039–5059,
-
[60]
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh
URL https://aclanthology.org/2022.acl-long.346. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...
-
[61]
Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221. Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May
-
[62]
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, a...
-
[63]
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren
URL https://arxiv.org/abs/2212.10773. Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. Crossfit: A few-shot learning challenge for cross-task generalization in NLP. In EMNLP,
-
[64]
Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo
URL https://arxiv.org/abs/2104.08835. Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo. Guess the instruction! making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969,
-
[65]
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414,
-
[66]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,
-
[67]
Appendix Table of Contents
A Experimental Details
A.1 Instruction Tuning
A.2 Single-Task Finetuning
A.3 Evaluation ...
-
[68]
A.3 Evaluation
For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks, BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams, and 4 natural language inference (NLI) tasks, including ANLI R1, R2, R3, and RTE. These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchma...
-
[69]
Table 3: Datasets used for Various Finetuning and Evaluation Experiments. ST-FT stands for Single Task Finetuning. For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets which have been prepared with prompts which request step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA. For the Held-Out...