pith. sign in

hub

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

20 Pith papers cite this work. Polarity classification is still indexing.

20 Pith papers citing it
abstract

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upperbound indicating significant room for more progress in this direction.

hub tools

citation-role summary

background 3

citation-polarity summary

roles

background 3

polarities

background 3

representative citing papers

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

LIMA: Less Is More for Alignment

cs.CL · 2023-05-18 · conditional · novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Gemini: A Family of Highly Capable Multimodal Models

cs.CL · 2023-12-19 · conditional · novelty 6.0

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

PandaGPT: One Model To Instruction-Follow Them All

cs.CL · 2023-05-25 · conditional · novelty 6.0

A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

Large Language Models: A Survey

cs.CL · 2024-02-09 · accept · novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

citing papers explorer

Showing 20 of 20 citing papers.