The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Adam Roberts; Albert Webson; Barret Zoph; Denny Zhou; Hyung Won Chung; Jason Wei; Le Hou; Quoc V. Le; Shayne Longpre; Tu Vu

arXiv: 2301.13688 · v2 · pith:YWXZK3YDnew · submitted 2023-01-31 · 💻 cs.AI · cs.CL · cs.LG

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts

Pith reviewed 2026-05-24 09:11 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords instruction tuning · task balancing · mixed prompting · Flan-T5 · zero-shot · few-shot · chain-of-thought · data collection

The pith

Task balancing and training on a mix of zero-shot, few-shot, and chain-of-thought prompts during instruction tuning improve performance by over 2 percent in every evaluation setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper breaks down the development of Flan 2022 through ablation studies on its collection of tasks and methods. It establishes that task balancing and enrichment techniques, previously overlooked, drive large gains, with models trained on mixed prompt formats outperforming those trained on single formats. These choices also produce instruction-tuned models that reach higher performance with less additional finetuning than their base counterparts on new tasks. The work releases the full collection of datasets, templates, and methods to support further research.

Core claim

Ablation studies on the Flan Collection show that task balancing and enrichment techniques are critical to effective instruction tuning. Training with mixed prompt settings that combine zero-shot, few-shot, and chain-of-thought formats yields gains of over 2 percent in all evaluation settings. On single downstream tasks, Flan-T5 converges faster and to higher performance than T5, so it needs less finetuning.
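The summary above does not say which balancing scheme the paper uses. Purely as an illustration, the sketch below shows one common scheme, examples-proportional mixing with a per-task cap, in which no task counts for more than `cap` examples when sampling rates are computed. The function name, task names, sizes, and cap value are all hypothetical and are not taken from the Flan Collection.

```python
from typing import Dict


def capped_mixing_rates(task_sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Return per-task sampling probabilities proportional to min(size, cap).

    Assumption: this is a generic balancing heuristic chosen for illustration,
    not the scheme confirmed for Flan 2022.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    # Clip each task's size so very large tasks cannot dominate the mixture.
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    if total == 0:
        raise ValueError("at least one task must contain examples")
    return {name: value / total for name, value in capped.items()}


# Hypothetical task sizes, for illustration only.
sizes = {"nli": 400_000, "qa": 90_000, "cot_math": 10_000}
rates = capped_mixing_rates(sizes, cap=30_000)
```

With these made-up sizes, the two large tasks are clipped to the same weight while the small task keeps its natural share, which is the sense in which the mixture is "balanced".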

What carries the argument

The Flan Collection of tasks, templates, and methods, which supports controlled ablations that isolate the contributions of task balancing and mixed prompt training.

If this is right

Flan-T5 outperforms prior instruction-tuned models by 3-17 percent or more across evaluation settings.
Models trained with mixed prompts perform better than single-format models even when tested in one fixed format.
Instruction-tuned models serve as more computationally efficient starting checkpoints for new downstream tasks.
Public release of the Flan 2022 collection allows other researchers to replicate and extend the design decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The efficiency advantage of instruction-tuned starting points could compound when applied repeatedly across many sequential tasks.
Optimal mixtures of prompt types might differ by domain and could be tuned automatically in future data pipelines.
Wider adoption of mixed-prompt training might reduce the need for separate specialized models for different inference modes.

Load-bearing premise

The ablation studies accurately isolate the effects of task balancing and mixed prompting without major confounding from total compute, model scale, or evaluation choices.

What would settle it

Re-running the key ablations with total training tokens held fixed, once with and once without task balancing or mixed prompts, then checking whether the reported gains (2 percent or more from mixed prompting, 3-17 percent over prior work) shrink or disappear.

Figures

Figures reproduced from arXiv: 2301.13688 by Adam Roberts, Albert Webson, Barret Zoph, Denny Zhou, Hyung Won Chung, Jason Wei, Le Hou, Quoc V. Le, Shayne Longpre, Tu Vu, Yi Tay.

**Figure 1.** Comparing public instruction tuning collections on Held-In, Held-Out (BIG-Bench Hard (Suzgun et al., 2022) and MMLU (Hendrycks et al., 2020)), and Chain-of-Thought evaluation suites, detailed in Appendix A.3. All models except OPT-IML-Max (175B) are T5-XL with 3B parameters. Green text indicates absolute improvement over the next best comparable T5-XL (3B) model.

**Figure 2.** A Timeline of Public Instruction Tuning Collections specifies the collection release date, detailed information on the finetuned models (the base model, their size, and whether the model itself is Public (P) or Not Public (NP)), what prompt specification they were trained for (zero-shot, few-shot, or Chain-of-Thought), the number of tasks contained in the Flan 2022 Collection (released with this work), and…

**Figure 3.** Training jointly with zero-shot and few-shot prompt templates improves performance on both Held-In and Held-Out tasks. The stars indicate the peak performance in each setting.

**Figure 4.** Performance Scaling Laws for the number of finetuning tasks and model sizes. Held-In performance (left) and Held-Out MMLU performance (right) are shown. The gold star indicates the peak performance for that model size. Surprisingly, only T5-Small appears to exceed its Held-Out task performance before 1836 tasks, while larger model sizes continue to improve. These results suggest (a) even T5-Base may not h…

**Figure 5.** Flan-T5 Outperforms T5 on Single-Task Finetuning. We compare single-task finetuned T5, single-task finetuned Flan-T5, and Flan-T5 without any further finetuning.

**Figure 6.** Flan-T5 converges faster than T5 on single-task finetuning for each of 5 Held-Out tasks from Flan finetuning.

**Figure 7.** Input Inversions permutations for a Zero-Shot Chain-of-Thought example. Each is accompanied by a corresponding instruction template that prompts the model with what the input is, and what to predict as the targets.

Original abstract

We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Flan Collection supplies a useful public dataset release and ablations on mixed prompting, but the reported gains may reflect extra data volume rather than the mixing itself.

The paper's core offering is the public Flan Collection of tasks, templates, and methods, plus ablations that highlight task balancing and mixed zero-shot/few-shot/CoT prompting as drivers of 2%+ gains over prior setups. It also reports that Flan-T5 reaches higher performance with less additional fine-tuning than base T5 on downstream tasks. The release at the GitHub repo is the clearest practical value, since it lets others reuse the exact collection without rebuilding from scratch. The empirical breakdown of design choices in Flan 2022 development gives concrete pointers on what mattered in their runs. The mixed-prompt result is presented as holding across evaluation settings, which is a straightforward takeaway if the controls are tight.

The main soft spot is the ablation setup. The abstract does not state that total training examples or tokens were held fixed when comparing mixed prompting to single-prompt baselines. Task balancing changes per-task counts by design, so the same risk applies there. Without explicit normalization for data scale or compute, the gains could partly trace to volume differences rather than the prompt mixture or balancing strategy. The convergence claim is easier to check once the data is out, but it still needs the same scrutiny on experimental details.

This work is aimed at groups running instruction-tuning experiments who want ready datasets or quick empirical signals on prompting mixtures. Readers focused on practical fine-tuning improvements will find the collection and reported patterns worth examining. It has enough of a concrete artifact and testable claims to go to peer review, though referees should be asked to verify the data-volume controls in the ablations.

Referee Report

1 major / 0 minor

Summary. The paper presents the Flan Collection and reports ablation studies on its tasks and methods that isolate design decisions enabling Flan-T5 to outperform prior instruction-tuned models by 3-17%+ across settings. Key findings are that task balancing and enrichment are critical, mixed prompting (zero-shot, few-shot, and chain-of-thought) yields 2%+ gains in all settings, and Flan-T5 converges higher and faster than T5 on downstream tasks; the collection of datasets, templates, and methods is released publicly.

Significance. If the ablation results hold after appropriate controls, the work would usefully highlight overlooked factors in instruction tuning and supply a reusable public resource. The public release of the full collection is a concrete strength that supports reproducibility and further experimentation.

major comments (1)

Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our ablation design. The single major comment raises a valid point about experimental controls that we will address directly in revision.

Referee: Abstract and ablation sections: the claim that mixed prompt settings yield 2%+ gains across all settings rests on the ablations isolating the effect of the mixture. The manuscript does not state that total training examples or tokens are held fixed across the mixed-prompt condition and the single-prompt baselines; because task balancing inherently changes per-task counts, any imbalance in total data volume could confound the reported gains with scale rather than the mixing strategy itself.

Authors: We agree that explicit controls for total training volume are necessary to isolate the effect of prompt mixing. In the reported ablations, the total number of training examples (and thus tokens) was held constant across the mixed-prompt and single-prompt conditions by fixing the overall training budget and sampling examples from the prompt mixture while preserving the task-balanced per-task counts used in the main experiments. Task balancing was applied uniformly and is orthogonal to the prompt-type mixture variable. We will revise the ablation sections (and abstract if space permits) to state this control explicitly, including the exact training example counts used in each condition, so that readers can verify the isolation of the mixing effect.

Revision: yes
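The control described in this response can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function `sample_fixed_budget`, the task names, the per-task counts, and the format weights are all hypothetical. Each task keeps a fixed number of training slots, and only the prompt format assigned to each slot varies between conditions.

```python
import random
from typing import Dict, List, Tuple


def sample_fixed_budget(
    per_task_counts: Dict[str, int],
    format_weights: Dict[str, float],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Assign a prompt format to every training slot without changing the budget.

    Assumption: per_task_counts already encodes the task-balanced budget, so the
    total number of examples is identical for every choice of format_weights.
    """
    rng = random.Random(seed)
    formats = list(format_weights)
    weights = [format_weights[f] for f in formats]
    plan: List[Tuple[str, str]] = []
    for task, count in per_task_counts.items():
        # Draw exactly `count` formats for this task, so per-task counts are preserved.
        chosen = rng.choices(formats, weights=weights, k=count)
        plan.extend((task, fmt) for fmt in chosen)
    return plan


# Hypothetical budget and mixtures, for illustration only.
counts = {"nli": 300, "qa": 300, "cot_math": 100}
mixed = sample_fixed_budget(counts, {"zero_shot": 0.5, "few_shot": 0.4, "cot": 0.1})
zero_only = sample_fixed_budget(counts, {"zero_shot": 1.0})
```

Both conditions contain the same number of examples per task. Few-shot and chain-of-thought prompts are typically longer than zero-shot ones, though, so equal example counts need not mean equal token counts. A token-level control would have to count tokens after templating.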

Circularity Check

0 steps flagged

Empirical ablation study with no derivation chain or self-referential reductions

The paper reports results from ablation experiments on the Flan Collection of public tasks and prompting methods. All central claims (e.g., benefits of task balancing, mixed zero/few-shot/CoT prompting yielding 2%+ gains, faster convergence of Flan-T5) are grounded in measured performance differences across conditions rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that substitutes for evidence. The citation to Chung et al. (2022) is to prior related work whose development is being analyzed here, but the new ablations are independent and externally falsifiable on public data. No equations, uniqueness theorems, or ansatzes appear. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning study relying on standard assumptions of supervised fine-tuning and evaluation on public benchmarks; no new free parameters, axioms, or invented entities are introduced in the abstract.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Static Personas: Situational Personality Steering for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

IRIS is a neuron-based Identify-Retrieve-Steer method for situational personality control in LLMs that outperforms baselines on PersonalityBench and the new SPBench.

QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

WizardLM: Empowering large pre-trained language models to follow complex instructions
cs.CL 2023-04 conditional novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

Language Is Not All You Need: Aligning Perception with Language Models
cs.CL 2023-02 conditional novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
cs.LG 2026-05 unverdicted novelty 6.0

MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
cs.CV 2025-05 unverdicted novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Kosmos-2: Grounding Multimodal Large Language Models to the World
cs.CL 2023-06 unverdicted novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
cs.CL 2023-06 accept novelty 6.0

GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
cs.CL 2023-05 conditional novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
cs.AI 2023-03 conditional novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
cs.CL 2023-03 unverdicted novelty 6.0

HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
cs.LG 2026-04 unverdicted novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
cs.CL 2025-08 unverdicted novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

AppAgent: Multimodal Agents as Smartphone Users
cs.CV 2023-12 unverdicted novelty 5.0

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

A Survey on Knowledge Distillation of Large Language Models
cs.CL 2024-02 accept novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC 2025-03 unverdicted novelty 2.0

Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 20 Pith papers · 28 internal anchors

[1]

URL https://aclanthology.org/2021.emnlp-main.468. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi...

[2]

Ext5: Towards extreme multi-task scaling for transfer learning

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952,

[3]

Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, D...

doi: 10.18653/v1/2022.acl-demo.9

[4]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258,

[6]

PaLM: Scaling Language Modeling with Pathways

URL https://arxiv.org/abs/2204.02311. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,

[7]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

[8]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

[10]

Training Verifiers to Solve Math Word Problems

URL https://arxiv.org/abs/2110.14168. Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume

[11]

Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li

URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf. Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li. Evaluating factuality in text simplification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7331–7345, Dublin, Ireland, May

[12]

doi: 10.18653/v1/2022.acl-long.506

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.506. URL https://aclanthology.org/2022.acl-long.506. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL,

[13]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

URL https://aclanthology.org/N19-1423. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

[14]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

[15]

arXiv preprint arXiv:2210.08726 (2023)

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726,

[16]

Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375,

[17]

Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673,

Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P Bigham. Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673,

[18]

URL https://openreview.net/forum?id=d7KBjmI3GmQ. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vi...

[19]

Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689,

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689,

[21]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/2106.09685. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJC...

[22]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608,

[24]

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

URL https://arxiv.org/abs/2212.12017. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages...

[25]

Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher

URL https://aclanthology.org/D19-1259. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering, text classification, and regression via span extraction. arXiv preprint arXiv:1904.09286,

[26]

UnifiedQA: Crossing format boundaries with a single QA system

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020,

[27]

URL https://aclanthology.org/2020.findings-emnlp

[28]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,

[29]

URL https://aclanthology.org/2021

doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 5...

[30]

doi: 10.18653/v1/2020.acl-main.703

Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantita...

work page doi:10.18653/v1/2020.acl-main.703 2020
[31]

Solving Quantitative Reasoning Problems with Language Models

URL https://arxiv.org/abs/2206.14858. Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. InICML,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Wanli: Worker and ai collaboration for natural language inference dataset creation.arXiv preprint arXiv:2201.05955 , 2022a

Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation.arXiv preprint arXiv:2201.05955 , 2022a. URL https: //arxiv.org/abs/2201.05955. Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raﬀel. Few-shot parameter-eﬃcient ﬁne-tuni...

work page arXiv
[33]

Howeﬀectiveistask-agnosticdataaugmentationforpretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 4401–4411,

ShayneLongpre,YuWang,andChrisDuBois. Howeﬀectiveistask-agnosticdataaugmentationforpretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 4401–4411,

work page 2020
[34]

Entity- based knowledge conﬂicts in question answering

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity- based knowledge conﬂicts in question answering. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063,

work page 2021
[35]

On faithfulness and factuality in abstractive summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July

work page 1906
[36]

The Natural Language Decathlon: Multitask Learning as Question Answering

Association for Computational Linguistics. doi: 10.18653/ v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering.arXiv preprint arXiv:1806.08730,

[37]

The radicalization risks of GPT-3 and advanced neural language models

Kris McGuffie and Alex Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807,

[38]

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi

URL https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL,

[39]

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

URL https://aclanthology.org/2022.naacl-main.201. Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773,

[40]

Crosslingual generalization through multitask finetuning

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786,

[41]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

[43]

Training language models to follow instructions with human feedback

URL https://arxiv.org/abs/2203.02155. Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2855–2870,

[44]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal

URL https://aclanthology.org/2021.eacl-main.249. Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094,

[45]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffman...

[46]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL https://arxiv.org/abs/1910.10683. Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789,

[48]

URL https://arxiv.org/abs/2211.00295. Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jann...

[49]

Alexey Romanov and Chaitanya Shivade

URL https://arxiv.org/abs/2203.17189. Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1586–1596,

[50]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al

URL https://aclanthology.org/D18-1187. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR 2022,

[51]

Multitask Prompted Training Enables Zero-Shot Task Generalization

URL https://arxiv.org/abs/2110.08207. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3...

[52]

doi: 10.18653/v1/D19-1339

Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, ...

[53]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al

URL https://arxiv.org/abs/2212.13138. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

[54]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

URL https://arxiv.org/abs/2206.04615. Mirac Suzgun, Nathan Scales, Nathaneal Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,

[55]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

URL https://arxiv.org/abs/2210.09261. Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Alexandra Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, et al. You reap what you sow: On the challenges of bias evaluation under multilingual settings. Challenges & Perspectives in Creating Large Language Models, page 26,

[56]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,

[57]

Unifying language learning paradigms

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a. URL https://arxiv.org/abs/2205.05131. Yi Tay, Jason Wei, Hyung Won Chung, David R. So, Siamak Shakeri, Xavier Garcia, Vinh Q. Tran, Hauixiu S...

[58]

LaMDA: Language Models for Dialog Applications

URL https://arxiv.org/abs/2201.08239. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926,

[59]

Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer

URL https://aclanthology.org/2020.emnlp-main.635. Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5039–5059,

[60]

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh

URL https://aclanthology.org/2022.acl-long.346. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

[61]

doi: 10.18653/v1/D19-1221

Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221. Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May

[62]

What language model architecture and pretraining objective work best for zero-shot generalization?

Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, a...

[63]

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren

URL https://arxiv.org/abs/2212.10773. Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In EMNLP,

[64]

Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo

URL https://arxiv.org/abs/2104.08835. Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo. Guess the instruction! making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969,

[65]

GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414,

[66]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

[67]

Appendix Table of Contents

A Experimental Details
A.1 Instruction Tuning
A.2 Single-Task Finetuning
A.3 Evaluation ...

[68]

These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchmarks, often used to evaluate LLMs on QA and NLI

A.3 Evaluation

For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks, BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams, and 4 natural language inference (NLI) tasks, including ANLI R1, R2, R3, and RTE. These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchma...

[69]

Table 3: Datasets used for Various Finetuning and Evaluation Experiments. ST-FT stands for Single Task Finetuning.

For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets which have been prepared with prompts which request step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA. For the Held-Out...


[27] [31]

Solving Quantitative Reasoning Problems with Language Models

URL https://arxiv.org/abs/2206.14858. Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. InICML,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [32]

Wanli: Worker and ai collaboration for natural language inference dataset creation.arXiv preprint arXiv:2201.05955 , 2022a

Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation.arXiv preprint arXiv:2201.05955 , 2022a. URL https: //arxiv.org/abs/2201.05955. Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raﬀel. Few-shot parameter-eﬃcient ﬁne-tuni...

work page arXiv

[29] [33]

Howeﬀectiveistask-agnosticdataaugmentationforpretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 4401–4411,

ShayneLongpre,YuWang,andChrisDuBois. Howeﬀectiveistask-agnosticdataaugmentationforpretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 4401–4411,

work page 2020

[30] [34]

Entity- based knowledge conﬂicts in question answering

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity- based knowledge conﬂicts in question answering. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063,

work page 2021

[31] [35]

On faithfulness and factuality in abstractive summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July

work page 1906

[32] [36]

The Natural Language Decathlon: Multitask Learning as Question Answering

Association for Computational Linguistics. doi: 10.18653/ v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering.arXiv preprint arXiv:1806.08730,

work page internal anchor Pith review Pith/arXiv arXiv 2020

[33] [37]

The radicalization risks of gpt-3 and advanced neural language models

Kris McGuﬃe and Alex Newhouse. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807,

work page arXiv 2009

[34] [38]

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi

URL https://proceedings.neurips.cc/paper/2013/file/ 9aa42b31882ec039965f3c4923ce901b-Paper.pdf. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL,

work page 2013

[35] [39]

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

URLhttps://aclanthology.org/2022.naacl-main.201. Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions.arXiv preprint arXiv:2104.08773,

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [40]

Crosslingual generalization through multitask ﬁnetuning

15 Niklas Muennighoﬀ, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask ﬁnetuning. arXiv preprint arXiv:2211.01786,

work page arXiv

[37] [41]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeﬀ Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332,

work page internal anchor Pith review Pith/arXiv arXiv

[38] [43]

Training language models to follow instructions with human feedback

URLhttps://arxiv.org/abs/2203.02155. ZaranaParekh,JasonBaldridge,DanielCer,AustinWaters,andYinfeiYang. Crisscrossedcaptions: Extended intramodalandintermodalsemanticsimilarityjudgmentsforMS-COCO.In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL) , pages 2855–2870,

work page internal anchor Pith review Pith/arXiv arXiv

[39] [44]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal

URL https://aclanthology.org/2021.eacl-main.249. Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies, pages 2080–2094,

work page 2021

[40] [45]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Alec Radford, Jeﬀrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models areunsupervisedmultitasklearners. OpenAI blog,1(8):9,2019. URL https://d4mucfpksywv.cloudfront. net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoﬀman...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[41] [46]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URLhttps://arxiv.org/abs/1910.10683. 16 Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 784–789,

work page internal anchor Pith review Pith/arXiv arXiv 1910

[42] [48]

URLhttps://arxiv.org/ abs/2211.00295. AdamRoberts,HyungWonChung,AnselmLevskaya,GauravMishra,JamesBradbury,DanielAndor,Sharan Narang,BrianLester,ColinGaﬀney,AfrozMohiuddin,CurtisHawthorne,AitorLewkowycz,AlexSalcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, AakankshaChowdhery,JasmijnBastings,Jann...

work page arXiv

[43] [49]

Alexey Romanov and Chaitanya Shivade

URLhttps://arxiv.org/ abs/2203.17189. Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 1586–1596,

work page arXiv 2018

[44] [50]

VictorSanh,AlbertWebson,ColinRaﬀel,StephenH.Bach,LintangSutawika,ZaidAlyafeai,AntoineChaﬃn, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al

URLhttps://aclanthology.org/D18-1187. VictorSanh,AlbertWebson,ColinRaﬀel,StephenH.Bach,LintangSutawika,ZaidAlyafeai,AntoineChaﬃn, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR 2022,

work page 2022

[45] [51]

Multitask Prompted Training Enables Zero-Shot Task Generalization

URLhttps://arxiv.org/abs/2110.08207. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[46] [52]

doi: 10.18653/v1/D19-1339

Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, ...

work page doi:10.18653/v1/d19-1339

[47] [53]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al

URLhttps://arxiv.org/abs/2212.13138. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.arXiv preprint arXiv:2206.04615,

work page arXiv

[48] [54]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

URL https://arxiv.org/abs/2206.04615. MiracSuzgun,NathanScales,NathanealScharli,SebastianGehrmann,YiTay,HyungWonChung,Aakanksha Chowdhery,QuocV.Le,EdH.Chi,DennyZHou,andJasonWei. ChallengingBIG-Benchtasksandwhether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261,

work page internal anchor Pith review Pith/arXiv arXiv

[49] [55]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

URLhttps://arxiv.org/abs/ 2210.09261. ZeerakTalat,AurélieNévéol,StellaBiderman,MirunaClinciu,MananDey,ShayneLongpre,AlexandraSasha Luccioni10, Maraim Masoud11, Margaret Mitchell10, Dragomir Radev12, et al. You reap what you sow: On the challenges of bias evaluation under multilingual settings.Challenges & Perspectives in Creating Large Language Models, page 26,

work page internal anchor Pith review Pith/arXiv arXiv

[50] [56]

AlonTalmor,JonathanHerzig,NicholasLourie,andJonathanBerant.Commonsenseqa: Aquestionanswering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies, V olume 1 (Long and Short Papers), pages 4149–4158,


[51] [57]


Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a. URL https://arxiv.org/abs/2205.05131. Yi Tay, Jason Wei, Hyung Won Chung, David R. So, Siamak Shakeri, Xavier Garcia, Vinh Q. Tran, Hauixiu S...


[52] [58]

LaMDA: Language Models for Dialog Applications

URL https://arxiv.org/abs/2201.08239. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926,


[53] [59]


URL https://aclanthology.org/2020.emnlp-main.635. Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5039–5059,


[54] [60]


URL https://aclanthology.org/2022.acl-long.346. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...


[55] [61]


Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221. Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May


[56] [62]


Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, a...


[57] [63]


URL https://arxiv.org/abs/2212.10773. Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In EMNLP,


[58] [64]


URL https://arxiv.org/abs/2104.08835. Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo. Guess the instruction! Making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969,


[59] [65]

GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414,


[60] [66]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,


[61] [67]


Appendix Table of Contents
A Experimental Details
A.1 Instruction Tuning
A.2 Single-Task Finetuning
A.3 Evaluation ...


[62] [68]


A.3 Evaluation

For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks, BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams, and 4 natural language inference (NLI) tasks, including ANLI R1, R2, R3, and RTE. These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchma...


[63] [69]

Table 3: Datasets used for Various Finetuning and Evaluation Experiments. ST-FT stands for Single Task Finetuning. For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets that have been prepared with prompts requesting step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA. For the Held-Out...
