Recognition: 1 theorem link
Finetuned Language Models Are Zero-Shot Learners
Pith reviewed 2026-05-10 21:08 UTC · model grok-4.3
The pith
Finetuning language models on tasks described by instructions improves zero-shot performance on unseen tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instruction tuning substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
What carries the argument
Instruction tuning, the process of finetuning pretrained language models on a collection of tasks described via natural language instruction templates.
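A minimal sketch of this verbalization step, with invented NLI template wordings and `premise`/`hypothesis`/`label` field names standing in for the paper's actual templates (FLAN composes multiple templates per dataset):

```python
# Sketch of instruction-template verbalization. Template wordings and field
# names are illustrative, not the paper's exact templates.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?",
    "Premise: {premise}\nHypothesis: {hypothesis}\nCan the hypothesis be inferred from the premise?",
]

def verbalize(example: dict, template: str) -> dict:
    """Render a raw labeled example as an (input, target) pair for finetuning."""
    return {
        "input": template.format(**example),
        "target": example["label"],  # e.g. "yes" / "no"
    }

example = {
    "premise": "The dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}
for t in NLI_TEMPLATES:
    pair = verbalize(example, t)
    print(pair["input"], "->", pair["target"], "\n")
```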
Load-bearing premise
That the 25 evaluation tasks are truly unseen, with no overlap or leakage from the 60+ finetuning tasks, and that gains come specifically from the instruction format rather than from scale or data volume alone.
What would settle it
Demonstrating that FLAN performs no better than its base model on tasks with no possible overlap with the finetuning set, or that removing natural language instructions from the finetuning process eliminates the zero-shot gains.
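A hedged sketch of the first of these tests, assuming hypothetical paired 0/1 correctness vectors for the base model and FLAN on an evaluation set verified to be leak-free:

```python
# `base_correct` and `flan_correct` are hypothetical per-example correctness
# vectors (1 = correct), paired per example, on a task with no possible
# overlap with the finetuning mixture.
import random

def bootstrap_delta(base_correct, flan_correct, n_boot=2000, seed=0):
    """Bootstrap 95% CI for the paired accuracy gap (FLAN minus base)."""
    rng = random.Random(seed)
    n = len(base_correct)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(flan_correct[i] - base_correct[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# An interval excluding zero on a leak-free task would rule out the
# indirect-exposure explanation; an interval containing zero would not.
lo, hi = bootstrap_delta([1, 0, 0, 1, 0] * 40, [1, 1, 0, 1, 1] * 40)
print(f"95% CI for accuracy gap: [{lo:.3f}, {hi:.3f}]")
```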
read the original abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that instruction tuning—finetuning a 137B-parameter pretrained LM on over 60 NLP tasks verbalized with natural-language instructions—substantially boosts zero-shot performance on 25 held-out task types. FLAN improves over its base model and beats zero-shot 175B GPT-3 on 20 of the 25 tasks while also outperforming few-shot GPT-3 on ANLI, RTE, BoolQ, AI2-ARC, OpenBookQA, and StoryCloze. Ablations identify the number of finetuning datasets, model scale, and presence of instructions as key factors.
Significance. If the held-out status of the 25 tasks is rigorously verified, the result supplies concrete evidence that a simple, scalable finetuning procedure can improve zero-shot generalization across diverse NLP tasks without task-specific adaptation. The ablations on dataset count, scale, and instruction format are particularly valuable because they isolate the contribution of the proposed method.
major comments (2)
- [§3 and Appendix A] (Task Selection): The central claim that the 25 evaluation tasks are truly unseen rests on a partition by task type, yet the manuscript provides no explicit verification that the underlying datasets, data splits, or near-identical instruction templates do not overlap with any of the 60+ finetuning tasks. Without such checks (e.g., dataset ID matching or template similarity analysis; see the sketch after this list), performance gains could arise from indirect exposure rather than instruction-based zero-shot transfer, weakening the generalization interpretation.
- [§4.2 and Table 2] (Ablations): The ablation that removes instructions reports a large drop, but the comparison mixes the effect of instruction format with the effect of changing the input distribution; a controlled ablation that keeps the same task data but varies only the presence/absence of the instruction prefix would more cleanly isolate the claimed benefit of natural-language instructions.
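A hedged sketch of the template-similarity audit suggested in the first comment, using trigram Jaccard overlap with an illustrative threshold and invented template strings:

```python
# Invented template strings and threshold; the audit flags evaluation
# templates whose trigram overlap with any finetuning template is high.

def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_overlaps(train_templates, eval_templates, threshold=0.5):
    """Yield (eval_template, train_template, score) triples above threshold."""
    for e in eval_templates:
        for t in train_templates:
            score = jaccard(ngrams(e), ngrams(t))
            if score >= threshold:
                yield e, t, score

train = ["Premise: {premise} Hypothesis: {hypothesis} Is this entailment?"]
evals = ["Premise: {premise} Hypothesis: {hypothesis} Is this an entailment?"]
for e, t, s in flag_overlaps(train, evals):
    print(f"near-duplicate templates (Jaccard={s:.2f}):\n  eval:  {e}\n  train: {t}")
```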
minor comments (2)
- [Table 1]: The GPT-3 few-shot numbers are taken from the original GPT-3 paper; confirming that the same prompt templates and number of shots were used would strengthen the direct comparison.
- [Figure 3]: The scaling curves would benefit from error bars or multiple random seeds to indicate variability across runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the value of our ablations and the potential impact of instruction tuning. We address each major comment below, agreeing where the manuscript can be strengthened and outlining the revisions.
read point-by-point responses
-
Referee: [§3 and Appendix A] (Task Selection): The central claim that the 25 evaluation tasks are truly unseen rests on a partition by task type, yet the manuscript provides no explicit verification that the underlying datasets, data splits, or near-identical instruction templates do not overlap with any of the 60+ finetuning tasks. Without such checks (e.g., dataset ID matching or template similarity analysis), performance gains could arise from indirect exposure rather than instruction-based zero-shot transfer, weakening the generalization interpretation.
Authors: We appreciate the referee's emphasis on rigorously confirming the held-out status of the evaluation tasks. Our selection process partitioned tasks by type, ensuring the 25 held-out tasks belong to categories absent from the 60+ finetuning tasks (see §3 and Appendix A for the full lists). The underlying datasets are drawn from distinct sources with no shared data splits. That said, the manuscript does not include an explicit cross-check for dataset IDs or template similarity. In the revised version we will add a new subsection with this verification: we will list all dataset identifiers, confirm no overlap in splits, and provide a similarity analysis of the instruction templates to demonstrate they are non-overlapping. revision: yes
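A minimal sketch of the dataset-ID portion of that promised verification, with invented (dataset, split) identifiers standing in for the actual finetuning and evaluation lists:

```python
# Illustrative (dataset, split) identifiers; the real audit would enumerate
# the full identifier lists from the finetuning mixture and evaluation suite.

finetune_splits = {("snli", "train"), ("mnli", "train"), ("squad_v1", "train")}
eval_splits = {("anli_r1", "test"), ("rte", "validation"), ("boolq", "validation")}

overlap = finetune_splits & eval_splits
assert not overlap, f"held-out status violated by: {sorted(overlap)}"
print("no (dataset, split) overlap between finetuning and evaluation")
```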
-
Referee: [§4.2 and Table 2] (Ablations): The ablation that removes instructions reports a large drop, but the comparison mixes the effect of instruction format with the effect of changing the input distribution; a controlled ablation that keeps the same task data but varies only the presence/absence of the instruction prefix would more cleanly isolate the claimed benefit of natural-language instructions.
Authors: We agree that a more tightly controlled ablation would better isolate the contribution of the natural-language instruction prefix. The current ablation in §4.2 compares the full instruction-tuned format against a version that uses raw task inputs without any instructional framing, which does alter the input distribution. To address this directly, we will add a new controlled experiment in the revised manuscript: for the same set of finetuning tasks and examples, we will train and evaluate two variants that differ only in the presence or absence of the instruction prefix while keeping the remainder of the input identical. The results of this ablation will be reported alongside the existing Table 2. revision: yes
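A minimal sketch of that controlled ablation, with invented field names and an invented instruction wording: the same examples are rendered twice, differing only in the instruction prefix.

```python
# Field names and the instruction wording are illustrative; the two variants
# share identical task data and differ only in instructional framing.

def render(example: dict, with_instruction: bool) -> dict:
    body = f"{example['premise']}\n{example['hypothesis']}"
    if with_instruction:
        body = ("Does the first sentence entail the second? "
                "Answer yes or no.\n") + body
    return {"input": body, "target": example["label"]}

example = {"premise": "It rained all night.",
           "hypothesis": "The ground is wet.",
           "label": "yes"}
for flag in (True, False):
    print(render(example, with_instruction=flag))
```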
Circularity Check
No circularity: empirical evaluation on held-out tasks
full rationale
The paper reports direct empirical measurements of zero-shot performance after instruction tuning on 60+ tasks, evaluated on 25 partitioned unseen tasks. No equations, derivations, or first-principles results are presented that reduce to inputs by construction. Ablations on dataset count, scale, and instruction presence are independent measurements, not fitted parameters renamed as predictions. The partition into seen/unseen tasks is stated explicitly without self-referential definitions or load-bearing self-citations that collapse the claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models pretrained on text can be effectively finetuned using instruction templates for multiple tasks.
Forward citations
Cited by 60 Pith papers
-
Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization
GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
PAL: Program-aided Language Models
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
-
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
-
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...
-
ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
-
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
-
LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains
LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces
UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalizatio...
-
Improving Zero-Shot Offline RL via Behavioral Task Sampling
Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
-
Image Generators are Generalist Vision Learners
Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
-
Image Generators are Generalist Vision Learners
Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
-
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
-
x1: Learning to Think Adaptively Across Languages and Cultures
x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.
-
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...
-
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Task-Centric Personalized Federated Fine-Tuning of Language Models
FedRouter clusters adapters locally per task samples and globally across clients to create task-centric personalized models, improving generalization and reducing task interference in federated fine-tuning.
-
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.
-
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
RouteLLM: Learning to Route LLMs with Preference Data
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
Capabilities of Gemini Models in Medicine
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Inner Monologue: Embodied Reasoning through Planning with Language Models
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
-
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
Expert alignment in subjective LLM evaluations is difficult because expert judgments are heterogeneous, partly tacit, dimension-dependent, and temporally unstable.