Recognition: 2 theorem links · Lean Theorem
Emergent Abilities of Large Language Models
Pith reviewed 2026-05-11 07:30 UTC · model grok-4.3
The pith
Certain abilities emerge in large language models only after they reach a critical size and cannot be predicted by extrapolating from smaller models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An ability is emergent if it is not present in smaller models but is present in larger models, so that emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
What carries the argument
Emergent abilities, defined as performance jumps that appear only at larger scales and resist prediction from smaller models.
If this is right
- Additional scaling of language models is expected to reveal more emergent abilities.
- The range of tasks language models can perform will grow in ways that are hard to anticipate.
- Evaluation of model progress must account for the possibility of sudden capability jumps rather than smooth curves.
Where Pith is reading between the lines
- Training strategies might need to prioritize reaching scales where emergence occurs rather than optimizing at smaller sizes.
- Unexpected capabilities could raise new questions about alignment and safety that smaller models do not reveal.
- New theoretical frameworks may be needed to understand why certain abilities appear at specific scales.
Load-bearing premise
The observed jumps in model performance are caused by increases in model scale rather than variations in training data, evaluation methods, or prompting techniques.
What would settle it
A demonstration that the same performance jump occurs even when all models are trained on identical data and evaluated with the same methods, or that the jump can be accurately predicted by a curve fitted to smaller model results.
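The second test described above (whether a curve fitted to smaller models predicts the larger one) can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: it assumes a straight-line trend of accuracy against log10(parameter count), and the helper names `fit_line` and `extrapolation_gap` as well as all numbers are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def extrapolation_gap(params, accuracy, n_small):
    """Fit on the n_small smallest models, then return, for each held-out
    larger model, (parameter count, observed minus extrapolated accuracy)."""
    pairs = sorted(zip(params, accuracy))
    small, large = pairs[:n_small], pairs[n_small:]
    a, b = fit_line([math.log10(p) for p, _ in small],
                    [acc for _, acc in small])
    return [(p, acc - (a + b * math.log10(p))) for p, acc in large]

# Hypothetical numbers, invented for illustration only (not from the paper).
params = [1e8, 1e9, 1e10, 1e11]
accuracy = [0.02, 0.03, 0.04, 0.45]

gaps = extrapolation_gap(params, accuracy, n_small=3)
for p, gap in gaps:
    print(f"{p:.0e} params: observed minus extrapolated accuracy = {gap:+.2f}")
```

A gap near zero would mean the fitted trend anticipated the larger model; a large positive gap is the pattern the paper calls emergent. The choice of a linear-in-log trend is itself an assumption, and a different functional form could give a different verdict.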
Original abstract
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large language models exhibit emergent abilities—capabilities absent in smaller models but present in larger ones—and that these cannot be predicted by extrapolating performance trends from smaller models. It supports this with empirical results across model families (e.g., PaLM 8B/62B/540B) on BIG-bench tasks, arithmetic problems, and other benchmarks, showing sharp performance discontinuities at certain scales, and concludes that further scaling may unlock additional unpredictable capabilities.
Significance. If the observations hold after addressing controls, the work is significant for highlighting potential limits of smooth scaling laws and providing a broad empirical catalog of tasks with discontinuous improvements. It gives credit to prior scaling observations while formalizing emergence as an observational phenomenon, which has influenced subsequent research on LLM capabilities.
major comments (2)
- [Definition of Emergence] Definition section (near abstract and §2): The binary definition of emergence (absent in smaller models, present in larger) is tied to specific performance thresholds and metrics; the manuscript does not test robustness to alternative thresholds, continuous metrics, or log-scale extrapolations, which is load-bearing for the claim that abilities 'cannot be predicted simply by extrapolating the performance of smaller models.'
- [Experimental Results] Experimental results (e.g., PaLM scaling curves in BIG-bench and arithmetic sections): The reported jumps do not include explicit documentation or controls confirming constancy of training data mixtures, tokenization, optimization details, or prompt formatting across sizes within each family. If these factors covary with scale, the discontinuities and resulting unpredictability claim cannot be attributed to parameter count alone.
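The threshold sensitivity raised in the first major comment can be made concrete with a short sketch. It assumes "emergence" is operationalized as the smallest scale at which a score crosses a chosen cutoff; the function `emergence_scale`, the cutoffs, and the scores are hypothetical and not taken from the manuscript.

```python
def emergence_scale(scales, scores, threshold):
    """Return the smallest scale whose score reaches the threshold, or None."""
    for scale, score in sorted(zip(scales, scores)):
        if score >= threshold:
            return scale
    return None

# Hypothetical scores, invented for illustration (chance level assumed 0.25).
scales = [1e9, 1e10, 1e11]
scores = [0.26, 0.33, 0.58]

# The same curve "emerges" at different scales depending on the cutoff.
by_threshold = {t: emergence_scale(scales, scores, t) for t in (0.30, 0.40, 0.60)}
for t, scale in by_threshold.items():
    print(f"threshold {t:.2f}: emergence at {scale}")
```

Reporting the emergence scale for a range of cutoffs, rather than a single one, is one simple way to show how much the binary definition depends on that choice.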
minor comments (2)
- [Figures] Figures showing performance vs. scale would benefit from explicit markers for emergence thresholds and discussion of variance across runs or seeds.
- [Methods] Clarify in the text whether all model sizes within a family share identical training recipes beyond parameter count.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below, indicating where revisions will be made to the manuscript.
Point-by-point responses
-
Referee: [Definition of Emergence] Definition section (near abstract and §2): The binary definition of emergence (absent in smaller models, present in larger) is tied to specific performance thresholds and metrics; the manuscript does not test robustness to alternative thresholds, continuous metrics, or log-scale extrapolations, which is load-bearing for the claim that abilities 'cannot be predicted simply by extrapolating the performance of smaller models.'
Authors: Our binary definition is intentionally observational, designed to identify capabilities that transition from near-random to substantially above-random performance in a manner not captured by smooth extrapolation from smaller models. The scaling curves in the paper (e.g., Figures 1-3) demonstrate this unpredictability directly, as performance deviates sharply from power-law trends fitted to smaller scales regardless of exact threshold. We agree that sensitivity to thresholds and metrics merits explicit discussion. In the revision we will add a paragraph in Section 2 clarifying that while the precise scale at which emergence occurs may vary with threshold choice, the core phenomenon of non-extrapolatable jumps persists across reasonable continuous and discrete metrics. This is a clarification rather than new experiments.
Revision: partial
-
Referee: [Experimental Results] Experimental results (e.g., PaLM scaling curves in BIG-bench and arithmetic sections): The reported jumps do not include explicit documentation or controls confirming constancy of training data mixtures, tokenization, optimization details, or prompt formatting across sizes within each family. If these factors covary with scale, the discontinuities and resulting unpredictability claim cannot be attributed to parameter count alone.
Authors: We agree that explicit documentation strengthens the attribution to scale. Within each model family (PaLM, GPT-3, etc.), the models share identical training data mixtures, tokenization, optimizer settings, and architectural hyperparameters by construction; only parameter count varies. Prompt formatting and evaluation protocols were also held fixed across scales for the reported results. We will add a concise paragraph in the experimental setup section (near the model families description) documenting these controls and noting that the same pattern of discontinuities appears across independent families. This revision directly addresses the concern.
Revision: yes
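The exchange above turns partly on whether a jump persists under both discrete and continuous metrics. One rough way to check this is to score the same model outputs under both kinds of metric. The sketch below assumes whitespace-tokenized string answers; the functions `exact_match` and `token_accuracy` and the example outputs are hypothetical and do not come from the paper.

```python
def exact_match(preds, targets):
    """Discrete metric: fraction of answers that match the target exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def token_accuracy(preds, targets):
    """Continuous metric: fraction of target tokens predicted correctly by
    position. Missing predicted tokens count as wrong."""
    correct = total = 0
    for p, t in zip(preds, targets):
        p_tok, t_tok = p.split(), t.split()
        total += len(t_tok)
        correct += sum(a == b for a, b in zip(p_tok, t_tok))
    return correct / total

# Hypothetical outputs, invented for illustration only.
targets = ["1 2 3", "4 5 6"]
small_model = ["1 2 9", "4 0 6"]   # mostly right, never fully right
large_model = ["1 2 3", "4 5 6"]

small_em, small_tok = exact_match(small_model, targets), token_accuracy(small_model, targets)
large_em, large_tok = exact_match(large_model, targets), token_accuracy(large_model, targets)
print(f"small: exact match {small_em:.2f}, token accuracy {small_tok:.2f}")
print(f"large: exact match {large_em:.2f}, token accuracy {large_tok:.2f}")
```

In this toy case the discrete metric moves from zero to one while the continuous metric rises more gradually, which is the kind of comparison the referee's robustness request would call for. Whether real scaling curves behave this way is an empirical question the sketch does not answer.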
Circularity Check
No circularity: observational definition with no reduction to fitted inputs or self-citations
Full rationale
The paper defines emergent abilities strictly as performance absent in smaller models but present in larger ones, then notes that such abilities cannot be predicted by extrapolating smaller-model performance. This 'thus' follows directly from the definition itself and does not constitute a derivation that reduces a claimed result back to its inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on empirical observations across model families and tasks rather than a closed logical loop or ansatz smuggled via citation. This is a standard non-circular empirical framing.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
An ability is emergent if it is not present in smaller models but is present in larger models. Emergent abilities would not have been directly predicted by extrapolating a scaling law... performance is near-random until a certain critical threshold of scale is reached, after which performance increases to substantially above random. This qualitative change is also known as a phase transition
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models
Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
On the Emergence of Syntax by Means of Local Interaction
A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
-
PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media
PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.
-
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
-
The Shrinking Lifespan of LLMs in Science
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
-
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
-
BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration
BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods...
-
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.
-
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
The Propagation Field: A Geometric Substrate Theory of Deep Learning
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
-
Mixture of Heterogeneous Grouped Experts for Language Modeling
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
-
LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models
LLM-AUG applies LLM in-context learning for embedding-space data augmentation in wireless ML, outperforming baselines and reaching near-oracle accuracy with only 15% labeled data on RadioML and IC datasets.
-
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
Heterogeneous LLM agents in supply chain simulations exhibit myopic self-interested behaviors that worsen inefficiencies, but information sharing mitigates these effects.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
-
The role of System 1 and System 2 semantic memory structure in human and LLM biases
Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.
-
Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.
-
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
MMEmb-R1 adaptively applies chain-of-thought reasoning to multimodal embeddings via pair-aware counterfactual selection and RL, reaching 71.2 on MMEB-V2 with a 4B model and lower latency.
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Jailbroken: How Does LLM Safety Training Fail?
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
LLM Jaggedness Unlocks Scientific Creativity
LLMs exhibit jagged scientific creativity across models, prompts, and domains, and this unevenness can be leveraged via model ensembles to outperform any single model on idea generation.
-
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
-
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
-
Optimized Deferral for Imbalanced Settings
MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...
-
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
-
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
LLM pipeline with generation-critic feedback reaches 61% accuracy on low-level goal extraction from requirements documents and outperforms standalone few-shot prompting, yet remains best suited as an accelerator for m...
-
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.
-
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
-
ARMove: Learning to Predict Human Mobility through Agentic Reasoning
ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
GRPO tuning on SLMs shows diminishing returns from hard math samples, with easier subsets matching full performance using 45% fewer steps and GSM8K training outperforming MATH training on numeric subsets.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
Reference graph
Works this paper leans on
-
[1]
Bootstrapping language acquisition
Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. Bootstrapping language acquisition. Cognition, 164:116-143, 2017. URL https://homepages.inf.ed.ac.uk/sgwater/papers/cognition17-bootstrapping.pdf
-
[2]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. URL https://arxiv.org/abs/2204.01691
-
[3]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. NeurIPS, 2022. URL https://arxiv.org/abs/2204.14198
- [4]
-
[5]
Ask me anything: A simple strategy for prompting language models
Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022. URL https://arxiv.org/abs/2210.02441
-
[6]
Efficient large scale language modeling with mixtures of experts
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021. URL https://arxiv.org/abs/2112.10684
-
[7]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. URL https://arxiv.org/abs/2112.00861
-
[8]
gpt2tc: Text completion and compression using GPT-2
Fabrice Bellard. gpt2tc: Text completion and compression using GPT-2, 2021. URL https://bellard.org/libnc/gpt2tc.html. Accessed Apr. 26, 2022
-
[9]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021. URL https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
-
[10]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
BIG-Bench. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615
-
[11]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258
-
[12]
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021. URL https://arxiv.org/abs/2112.04426
-
[13]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020. URL https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
-
[14]
Extracting training data from large language models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. USENIX Security, 2021. URL https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
-
[15]
Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022. URL https://arxiv.org/abs/2202.07646
-
[16]
Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent few-shot learning in transformers. arXiv preprint arXiv:2205.05055, 2022. URL https://arxiv.org/abs/2205.05055
-
[17]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311
-
[18]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. URL https://arxiv.org/abs/2210.11416
-
[19]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168
-
[20]
Wordcraft: A human-AI collaborative editor for story writing
Andy Coenen, Luke Davis, Daphne Ippolito, Emily Reif, and Ann Yuan. Wordcraft: A human-AI collaborative editor for story writing. arXiv preprint arXiv:2107.07430, 2021. URL https://arxiv.org/abs/2107.07430
-
[21]
Emergence in science and philosophy, volume 6
Antonella Corradini and Timothy O'Connor. Emergence in science and philosophy, volume 6. Routledge, 2010. URL https://books.google.com/books?hl=en&lr=&id=55RaBwAAQBAJ&oi=fnd&pg=PP1&dq=Emergence+in+science+and+philosophy&ots=2_8VNDXLfv&sig=1aisq_WouF95Cx58WWMZ0Gq3RNk
-
[22]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018. URL https://arxiv.org/abs/1807.03819
-
[23]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019. URL https://aclanthology.org/N19-1423
-
[24]
GLaM: Efficient scaling of language models with mixture-of-experts
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. ICML, 2021. URL https://arxiv.org/abs/2112.06905
-
[25]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv.org/abs/2101.03961
-
[26]
Emergent computation: Self-organizing, collective, and cooperative phenomena in natural and artificial computing networks
Stephanie Forrest. Emergent computation: Self-organizing, collective, and cooperative phenomena in natural and artificial computing networks. Physica D: Nonlinear Phenomena, 42(1-3):1--11, 1990. URL https://www.sciencedirect.com/science/article/abs/pii/016727899090063U
-
[27]
Predictability and surprise in large generative models
Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, et al. Predictability and surprise in large generative models. arXiv preprint arXiv:2202.07785, 2022. URL https://arxiv.org/abs/2202.07785
-
[28]
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. ACL, 2021. doi:10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295
-
[29]
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of EMNLP, 2020. doi:10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301
-
[30]
Adaptive Computation Time for Recurrent Neural Networks
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016. URL https://arxiv.org/abs/1603.08983
-
[31]
REALM: Retrieval-augmented language model pre-training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. ICML, 2020. URL https://arxiv.org/abs/2002.08909
-
[32]
David A. Harper and Paul A. Lewis. New perspectives on emergence in economics. New Perspectives on Emergence in Economics, pp. 2--3, 2012. URL https://www.sciencedirect.com/science/article/pii/S0167268112000200?casa_token=fLs2nCYo_64AAAAA:H2sSpSygJmEqXgmpM4jLyeppph3C4TgEsaSXm5RkOpT0r4q2A1x9Su3u4uycK4sIC6a8NdLiSw
-
[33]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021a. URL https://openreview.net/forum?id=d7KBjmI3GmQ
-
[34]
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916, 2021b. URL https://arxiv.org/abs/2109.13916
-
[35]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556
-
[36]
Surface form competition: Why the highest probability answer isn't always right
Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn't always right. EMNLP, 2021. URL https://aclanthology.org/2021.emnlp-main.564
-
[37]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022. URL https://arxiv.org/pdf/2201.07207
-
[38]
Bernardo A. Huberman and Tad Hogg. Phase transitions in artificial intelligence systems. Artificial Intelligence, 33(2):155--171, 1987. URL https://www.sciencedirect.com/science/article/abs/pii/0004370287900336
-
[39]
Emergent phenomena at oxide interfaces
Harold Y. Hwang, Yoh Iwasa, Masashi Kawasaki, Bernhard Keimer, Naoto Nagaosa, and Yoshinori Tokura. Emergent phenomena at oxide interfaces. Nature Materials, 11(2):103--113, 2012. URL https://www.nature.com/articles/nmat3223
-
[40]
Decoupled neural interfaces using synthetic gradients
Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. ICML, 2017. URL https://arxiv.org/abs/1608.05343
-
[41]
Dan Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall series in Artificial Intelligence. Pearson Prentice Hall, 2009. ISBN 9780131873216. URL https://books.google.com/books?id=fZmj5UNK8AQC
-
[42]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. URL https://arxiv.org/abs/2207.05221
-
[43]
Deduplicating training data mitigates privacy risks in language models
Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. ICML, 2022. URL https://arxiv.org/abs/2202.06539
-
[44]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361
-
[45]
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022. URL https://arxiv.org/abs/2205.11916
-
[46]
Can language models learn from explanations in context?
Andrew K. Lampinen, Ishita Dasgupta, Stephanie C.Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from explanations in context? Findings of EMNLP, 2022. URL https://arxiv.org/abs/2204.02329
-
[47]
Deduplicating training data makes language models better
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. ACL, 2022a. URL https://arxiv.org/abs/2107.06499
-
[48]
Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. CHI, 2022b. URL https://arxiv.org/abs/2201.06796
-
[49]
Gshard: Scaling giant models with conditional computation and automatic sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ICLR, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb
-
[50]
Semi-supervised learning for natural language
Percy Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005. URL https://www-cs.stanford.edu/~pliang/papers/meng-thesis.pdf
-
[51]
Percy Liang, Rishi Bommasani, Kathleen A. Creel, and Rob Reich. The time is now to develop community norms for the release of foundation models, 2022. URL https://crfm.stanford.edu/2022/05/17/community-norms.html
-
[52]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021. URL https://arxiv.org/abs/2109.07958
-
[53]
Christopher D. Manning. Human language understanding & reasoning. Daedalus, 151(2):127--138, 2022. URL https://www.amacad.org/publication/human-language-understanding-reasoning
-
[54]
Emergent linguistic structure in artificial neural networks trained by self-supervision
Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046--30054, 2020. URL https://www.pnas.org/doi/10.1073/pnas.1907367117
-
[55]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730
-
[56]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://huggingface.co/datasets/wikitext
-
[57]
Name tagging with word clusters and discriminative training
Scott Miller, Jethran Guinness, and Alex Zamanian. Name tagging with word clusters and discriminative training. In NAACL, 2004. URL https://aclanthology.org/N04-1043
-
[58]
Noisy channel language model prompting for few-shot text classification
Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. ACL, 2022a. URL https://arxiv.org/abs/2108.04106
-
[59]
Rethinking the role of demonstrations: What makes in-context learning work?
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022b. URL https://arxiv.org/abs/2202.12837
-
[60]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. URL https://openreview.net/forum?id=iedYJm92o0a
-
[61]
In-context learning and induction heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, et al. In-context learning and induction heads. Transformer Circuits, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
-
[62]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022. URL https://arxiv.org/abs/2203.02155
-
[63]
BLEU: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002. URL https://aclanthology.org/P02-1040.pdf
-
[64]
BBQ: A hand-built bias benchmark for question answering
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of ACL, 2022. URL https://arxiv.org/abs/2110.08193
-
[65]
Mapping language models to grounded conceptual spaces
Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. ICLR, 2022. URL https://openreview.net/forum?id=gJcEM8sxHK
-
[66]
Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022. URL https://arxiv.org/abs/2202.03286
-
[67]
WiC: the word-in-context dataset for evaluating context-sensitive meaning representations
Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. NAACL, 2019. URL https://aclanthology.org/N19-1128
-
[68]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8), 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
-
[69]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446
-
[70]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. URL https://jmlr.org/papers/v21/20-074.html
-
[71]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. URL https://arxiv.org/abs/2204.06125
-
[72]
Impact of pretraining term frequencies on few-shot reasoning
Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206, 2022. URL https://arxiv.org/abs/2202.07206
-
[73]
Prompt programming for large language models: Beyond the few-shot paradigm
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021. URL https://arxiv.org/abs/2102.07350
-
[74]
Social bias in elicited natural language inferences
Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 2017. URL https://aclanthology.org/W17-1609
-
[75]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4
-
[76]
A mathematical exploration of why language models help solve downstream tasks
Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models help solve downstream tasks. ICLR, 2021. URL https://arxiv.org/abs/2010.03648
-
[77]
It's not just size that matters: Small language models are also few-shot learners
Timo Schick and Hinrich Schütze. It's not just size that matters: Small language models are also few-shot learners. NAACL, June 2021. URL https://aclanthology.org/2021.naacl-main.185
-
[78]
Language models are multilingual chain-of-thought reasoners
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022. URL https://arxiv.org/abs/2210.03057
-
[79]
On the risks of emergent behavior in foundation models
Jacob Steinhardt. On the risks of emergent behavior in foundation models, October 2021. URL https://bounded-regret.ghost.io/on-the-risks-of-emergent-behavior-in-foundation-models/. Accessed Apr 13, 2022
-
[80]
Future ML systems will be qualitatively different
Jacob Steinhardt. Future ML systems will be qualitatively different, 2022. URL https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/. Accessed May 20, 2022