arxiv: 2206.07682 · v2 · submitted 2022-06-15 · 💻 cs.CL

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

Pith reviewed 2026-05-11 07:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords emergent abilities · large language models · model scaling · unpredictable performance · scaling laws · language model capabilities

The pith

Certain abilities emerge in large language models only after they reach a critical size and cannot be predicted by extrapolating from smaller models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models exhibit abilities that smaller models do not, and these abilities cannot be predicted by extrapolating the performance curves of smaller models. The paper calls such abilities emergent: they are absent at small scale and appear abruptly as model size increases. If this holds, continued scaling could reveal new capabilities without warning. This matters because it changes how we think about the limits and potential of scaling language models.

Core claim

An ability is emergent if it is not present in smaller models but is present in larger models, so that emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

What carries the argument

Emergent abilities, defined as performance jumps that appear only at larger scales and resist prediction from smaller models.
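One way to make this definition concrete is sketched below in Python. It is an illustration only, not the paper's procedure: the function name `emergence_scale`, the `margin` parameter, and all scores are hypothetical. It treats an ability as emergent at the first scale where performance rises clearly above chance after every smaller model stayed at or near chance.

```python
from typing import Optional, Sequence


def emergence_scale(
    scales: Sequence[float],
    scores: Sequence[float],
    chance: float,
    margin: float = 0.05,
) -> Optional[float]:
    """Return the smallest scale at which an ability counts as emergent.

    Assumed operationalization (not from the paper): every smaller model
    scores at most `chance + margin`, and this model and all larger ones
    score above `chance + margin`.
    """
    # Sort by scale so "smaller" and "larger" are well defined.
    pairs = sorted(zip(scales, scores))
    above = [score > chance + margin for _, score in pairs]
    for i, (scale, _) in enumerate(pairs):
        # Require at least one smaller model, none of which is above chance,
        # and require this model and every larger one to be above chance.
        if i > 0 and not any(above[:i]) and all(above[i:]):
            return scale
    return None


# Hypothetical scores on a 4-way multiple-choice task (chance = 0.25).
params = [1e8, 1e9, 1e10, 1e11]
flat_then_jump = [0.24, 0.26, 0.25, 0.61]  # made-up numbers
smooth = [0.35, 0.45, 0.55, 0.65]          # made-up numbers

print(emergence_scale(params, flat_then_jump, chance=0.25))
print(emergence_scale(params, smooth, chance=0.25))
```

Under this sketch the smoothly improving curve yields `None`, because the smallest model already performs above chance and so the ability is not absent at small scale. The result depends on the chosen `margin`.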

If this is right

Additional scaling of language models is expected to reveal more emergent abilities.
The range of tasks language models can perform will grow in ways that are hard to anticipate.
Evaluation of model progress must account for the possibility of sudden capability jumps rather than smooth curves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training strategies might need to prioritize reaching scales where emergence occurs rather than optimizing at smaller sizes.
Unexpected capabilities could raise new questions about alignment and safety that smaller models do not reveal.
New theoretical frameworks may be needed to understand why certain abilities appear at specific scales.

Load-bearing premise

The observed jumps in model performance are caused by increases in model scale rather than variations in training data, evaluation methods, or prompting techniques.

What would settle it

A demonstration that the same performance jump occurs even when all models are trained on identical data and evaluated with the same methods, or that the jump can be accurately predicted by a curve fitted to smaller model results.
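The second test, predicting the jump from a curve fitted to smaller models, can be sketched as follows. The functional form (score linear in log10 of parameter count), the function names, and the numbers are assumptions chosen for illustration; the paper does not prescribe them. The sketch fits the curve on the smaller models only and reports how far the observed large-model score lies from the extrapolation.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(params: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of score = a + b * log10(params)."""
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def extrapolation_gap(
    small_params: Sequence[float],
    small_scores: Sequence[float],
    large_params: float,
    large_score: float,
) -> float:
    """Observed large-model score minus the score extrapolated from small models."""
    a, b = fit_log_linear(small_params, small_scores)
    predicted = a + b * math.log10(large_params)
    return large_score - predicted


# Made-up numbers for illustration: three small models near chance,
# one large model well above it.
gap = extrapolation_gap([1e8, 1e9, 1e10], [0.24, 0.25, 0.26], 1e11, 0.61)
print(f"observed minus extrapolated: {gap:.2f}")
```

A gap near zero would mean the fitted curve predicted the large model well; a large positive gap is the pattern the paper describes as unpredictable. A different assumed curve family could give a different gap, which is why the choice of fit matters for this test.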

Original abstract

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that large language models exhibit emergent abilities—capabilities absent in smaller models but present in larger ones—and that these cannot be predicted by extrapolating performance trends from smaller models. It supports this with empirical results across model families (e.g., PaLM 8B/62B/540B) on BIG-bench tasks, arithmetic problems, and other benchmarks, showing sharp performance discontinuities at certain scales, and concludes that further scaling may unlock additional unpredictable capabilities.

Significance. If the observations hold after addressing controls, the work is significant for highlighting potential limits of smooth scaling laws and providing a broad empirical catalog of tasks with discontinuous improvements. It gives credit to prior scaling observations while formalizing emergence as an observational phenomenon, which has influenced subsequent research on LLM capabilities.

major comments (2)

[Definition of Emergence] Definition section (near abstract and §2): The binary definition of emergence (absent in smaller models, present in larger) is tied to specific performance thresholds and metrics; the manuscript does not test robustness to alternative thresholds, continuous metrics, or log-scale extrapolations, which is load-bearing for the claim that abilities 'cannot be predicted simply by extrapolating the performance of smaller models.'
[Experimental Results] Experimental results (e.g., PaLM scaling curves in BIG-bench and arithmetic sections): The reported jumps do not include explicit documentation or controls confirming constancy of training data mixtures, tokenization, optimization details, or prompt formatting across sizes within each family. If these factors covary with scale, the discontinuities and resulting unpredictability claim cannot be attributed to parameter count alone.
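The first major comment turns on how much the choice of metric can shape a scaling curve. The Python sketch below shows one way that could happen, under an assumption that is not taken from the paper: token errors are independent, so exact-match accuracy on an answer of `answer_length` tokens equals per-token accuracy raised to that power. The function name and the accuracy values are hypothetical.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is correct.

    Assumes independent token errors (an illustrative simplification).
    """
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies improving in equal steps with scale.
token_accuracies = [0.5, 0.6, 0.7, 0.8, 0.9]

for p in token_accuracies:
    # The continuous metric moves by 0.1 each step; the all-or-nothing
    # metric stays close to zero and then rises steeply.
    print(f"per-token={p:.1f}  exact-match(10 tokens)={exact_match_rate(p, 10):.4f}")
```

Whether a curve looks like a jump can therefore depend on whether the metric is all-or-nothing or continuous, which is why the referee asks for robustness checks across metrics and thresholds.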

minor comments (2)

[Figures] Figures showing performance vs. scale would benefit from explicit markers for emergence thresholds and discussion of variance across runs or seeds.
[Methods] Clarify in the text whether all model sizes within a family share identical training recipes beyond parameter count.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, indicating where revisions will be made to the manuscript.

Referee: [Definition of Emergence] Definition section (near abstract and §2): The binary definition of emergence (absent in smaller models, present in larger) is tied to specific performance thresholds and metrics; the manuscript does not test robustness to alternative thresholds, continuous metrics, or log-scale extrapolations, which is load-bearing for the claim that abilities 'cannot be predicted simply by extrapolating the performance of smaller models.'

Authors: Our binary definition is intentionally observational, designed to identify capabilities that transition from near-random to substantially above-random performance in a manner not captured by smooth extrapolation from smaller models. The scaling curves in the paper (e.g., Figures 1-3) demonstrate this unpredictability directly, as performance deviates sharply from power-law trends fitted to smaller scales regardless of exact threshold. We agree that sensitivity to thresholds and metrics merits explicit discussion. In the revision we will add a paragraph in Section 2 clarifying that while the precise scale at which emergence occurs may vary with threshold choice, the core phenomenon of non-extrapolatable jumps persists across reasonable continuous and discrete metrics. This is a clarification rather than new experiments. revision: partial
Referee: [Experimental Results] Experimental results (e.g., PaLM scaling curves in BIG-bench and arithmetic sections): The reported jumps do not include explicit documentation or controls confirming constancy of training data mixtures, tokenization, optimization details, or prompt formatting across sizes within each family. If these factors covary with scale, the discontinuities and resulting unpredictability claim cannot be attributed to parameter count alone.

Authors: We agree that explicit documentation strengthens the attribution to scale. Within each model family (PaLM, GPT-3, etc.), the models share identical training data mixtures, tokenization, optimizer settings, and architectural hyperparameters by construction; only parameter count varies. Prompt formatting and evaluation protocols were also held fixed across scales for the reported results. We will add a concise paragraph in the experimental setup section (near the model families description) documenting these controls and noting that the same pattern of discontinuities appears across independent families. This revision directly addresses the concern. revision: yes

Circularity Check

0 steps flagged

No circularity: observational definition with no reduction to fitted inputs or self-citations

The paper defines emergent abilities strictly as performance absent in smaller models but present in larger ones, then notes that such abilities cannot be predicted by extrapolating smaller-model performance. This 'thus' follows directly from the definition itself and does not constitute a derivation that reduces a claimed result back to its inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on empirical observations across model families and tasks rather than a closed logical loop or ansatz smuggled via citation. This is a standard non-circular empirical framing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the definition of emergence is presented as an observational criterion rather than a derived quantity.

pith-pipeline@v0.9.0 · 5423 in / 908 out tokens · 66926 ms · 2026-05-11T07:30:45.520908+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: An ability is emergent if it is not present in smaller models but is present in larger models. Emergent abilities would not have been directly predicted by extrapolating a scaling law... performance is near-random until a certain critical threshold of scale is reached, after which performance increases to substantially above random. This qualitative change is also known as a phase transition

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Progress measures for grokking via mechanistic interpretability
cs.LG 2023-01 accept novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
cs.LG 2022-11 conditional novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
cs.CL 2026-05 unverdicted novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models
cs.AI 2026-05 unverdicted novelty 7.0

Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
On the Emergence of Syntax by Means of Local Interaction
cs.CL 2026-04 unverdicted novelty 7.0

A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media
cs.SI 2026-04 unverdicted novelty 7.0

PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
The Shrinking Lifespan of LLMs in Science
cs.DL 2026-04 unverdicted novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
cs.CL 2026-04 unverdicted novelty 7.0

Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration
cs.CL 2026-04 unverdicted novelty 7.0

BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods...
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
cs.SE 2026-03 unverdicted novelty 7.0

StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
cs.SE 2026-03 accept novelty 7.0

LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
cs.LG 2024-03 unverdicted novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
cs.SE 2023-02 accept novelty 7.0

The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
cs.LG 2026-05 unverdicted novelty 6.0

Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
The Propagation Field: A Geometric Substrate Theory of Deep Learning
cs.LG 2026-05 unverdicted novelty 6.0

Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
cs.CL 2026-05 unverdicted novelty 6.0

SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
cs.CL 2026-05 unverdicted novelty 6.0

SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
cs.LG 2026-05 unverdicted novelty 6.0

A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG 2026-05 unverdicted novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
Mixture of Heterogeneous Grouped Experts for Language Modeling
cs.CL 2026-04 unverdicted novelty 6.0

MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
q-bio.NC 2026-04 unverdicted novelty 6.0

OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

LLM-AUG applies LLM in-context learning for embedding-space data augmentation in wireless ML, outperforming baselines and reaching near-oracle accuracy with only 15% labeled data on RadioML and IC datasets.
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
cs.MA 2026-04 unverdicted novelty 6.0

Heterogeneous LLM agents in supply chain simulations exhibit myopic self-interested behaviors that worsen inefficiencies, but information sharing mitigates these effects.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 6.0

LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
The role of System 1 and System 2 semantic memory structure in human and LLM biases
cs.CL 2026-04 unverdicted novelty 6.0

Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.
Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
cs.LG 2026-04 unverdicted novelty 6.0

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
cs.CV 2026-04 unverdicted novelty 6.0

MMEmb-R1 adaptively applies chain-of-thought reasoning to multimodal embeddings via pair-aware counterfactual selection and RL, reaching 71.2 on MMEB-V2 with a 4B model and lower latency.
Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
π₀: A Vision-Language-Action Flow Model for General Robot Control
cs.LG 2024-10 unverdicted novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
Jailbroken: How Does LLM Safety Training Fail?
cs.LG 2023-07 unverdicted novelty 6.0

LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
cs.CL 2023-03 unverdicted novelty 6.0

HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
LLM Jaggedness Unlocks Scientific Creativity
cs.AI 2026-05 unverdicted novelty 5.0

LLMs exhibit jagged scientific creativity across models, prompts, and domains, and this unevenness can be leveraged via model ensembles to outperform any single model on idea generation.
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0

Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
cs.AI 2026-05 unverdicted novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
Optimized Deferral for Imbalanced Settings
cs.LG 2026-04 unverdicted novelty 5.0

MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
cs.AI 2026-04 unverdicted novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
cs.SE 2026-04 conditional novelty 5.0

LLM pipeline with generation-critic feedback reaches 61% accuracy on low-level goal extraction from requirements documents and outperforms standalone few-shot prompting, yet remains best suited as an accelerator for m...
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
cs.CL 2026-04 unverdicted novelty 5.0

Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
cs.LG 2026-04 unverdicted novelty 5.0

Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
ARMove: Learning to Predict Human Mobility through Agentic Reasoning
cs.MA 2026-04 unverdicted novelty 5.0

ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
The Cartesian Cut in Agentic AI
cs.AI 2026-04 unverdicted novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
cs.LG 2026-04 unverdicted novelty 5.0

GRPO tuning on SLMs shows diminishing returns from hard math samples, with easier subsets matching full performance using 45% fewer steps and GSM8K training outperforming MATH training on numeric subsets.
Emerging Properties in Unified Multimodal Pretraining
cs.CV 2025-05 unverdicted novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
cs.AI 2026-05 unverdicted novelty 4.0

AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 61 Pith papers · 27 internal anchors

[1]

Bootstrapping language acquisition

Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. Bootstrapping language acquisition. Cognition, 164:116-143, 2017. URL https://homepages.inf.ed.ac.uk/sgwater/papers/cognition17-bootstrapping.pdf

[2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. URL https://arxiv.org/abs/2204.01691

[3]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. NeurIPS, 2022. URL https://arxiv.org/abs/2204.14198

[4]

More is different: Broken symmetry and the nature of the hierarchical structure of science

Philip W. Anderson. More is different: Broken symmetry and the nature of the hierarchical structure of science. Science, 177(4047):393-396, 1972. URL http://www.lanais.famaf.unc.edu.ar/cursos/em/Anderson-MoreDifferent-1972.pdf

[5]

Ask me anything: A simple strategy for prompting language models

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022. URL https://arxiv.org/abs/2210.02441

[6]

Efﬁcient large scale language modeling with mixtures of experts

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021. URL https://arxiv.org/abs/2112.10684

work page arXiv 2021
[7]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. URL https://arxiv.org/abs/2112.00861

work page internal anchor Pith review arXiv 2021
[8]

gpt2tc: Text completion and compression using GPT -2, 2021

Fabrice Bellard. gpt2tc: Text completion and compression using GPT -2, 2021. URL https://bellard.org/libnc/gpt2tc.html. Accessed Apr. 26, 2022

work page 2021
[9]

ISBN 978-1-4503-8309-7

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021. URL https://dl.acm.org/doi/pdf/10.1145/3442188.3445922

work page doi:10.1145/3442188.3445922 2021
[10]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

BIG-Bench . Beyond the imitation game: Measuring and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Improving language models by retrieving from trillions of tokens.Preprint arXiv:2112.04426,

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021. URL https://arxiv.org/abs/2112.04426

work page arXiv 2021
[13]

Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020. URL https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

work page 2020
[14]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. USENIX Security, 2021. URL https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting

work page 2021
[15]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022. URL https://arxiv.org/abs/2202.07646

work page internal anchor Pith review arXiv 2022
[16]

Chan, Adam Santoro, Andrew K

Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent few-shot learning in transformers. arXiv preprint arXiv:2205.05055, 2022. URL https://arxiv.org/abs/2205.05055

work page arXiv 2022
[17]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. Pa LM : Scaling language modeling with P athways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. URL https://arxiv.org/abs/2210.11416

work page internal anchor Pith review arXiv 2022
[19]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

Wordcraft: A human- AI collaborative editor for story writing

Andy Coenen, Luke Davis, Daphne Ippolito, Emily Reif, and Ann Yuan. Wordcraft: A human- AI collaborative editor for story writing. arXiv preprint arXiv:2107.07430, 2021. URL https://arxiv.org/abs/2107.07430

work page arXiv 2021
[21]

Emergence in science and philosophy, volume 6

Antonella Corradini and Timothy O'Connor. Emergence in science and philosophy, volume 6. Routledge, 2010. URL https://books.google.com/books?hl=en&lr=&id=55RaBwAAQBAJ&oi=fnd&pg=PP1&dq=Emergence+in+science+and+philosophy&ots=2_8VNDXLfv&sig=1aisq_WouF95Cx58WWMZ0Gq3RNk

work page 2010
[22]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and ukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018. URL https://arxiv.org/abs/1807.03819

work page internal anchor Pith review arXiv 2018
[23]

BERT : Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019. URL https://aclanthology.org/N19-1423

work page 2019
[24]

arXiv preprint arXiv:2112.06905 , year =

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. G LaM : Efficient scaling of language models with mixture-of-experts. ICML, 2021. URL https://arxiv.org/abs/2112.06905

work page arXiv 2021
[25]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv.org/abs/2101.03961

work page internal anchor Pith review arXiv 2021
[26]

Nonlinear total variation based noise removal algorithms,

Stephanie Forrest. Emergent computation: Self-organizing, collective, and cooperative phenomena in natural and artificial computing networks. Physica D: Nonlinear Phenomena, 42 0 (1-3): 0 1--11, 1990. URL https://www.sciencedirect.com/science/article/abs/pii/016727899090063U

work page arXiv 1990
[27]

arXiv preprint arXiv:2202.07785 , year=

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, et al. Predictability and surprise in large generative models. arXiv preprint arXiv:2202.07785, 2022. URL https://arxiv.org/abs/2202.07785

work page arXiv 2022
[28]

Making Pre-trained Language Models Better Few-shot Learners

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. ACL, 2021. doi:10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295

work page doi:10.18653/v1/2021.acl-long.295 2021
[29]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. R eal T oxicity P rompts: Evaluating neural toxic degeneration in language models. In Findings of EMNLP, 2020. doi:10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301

work page doi:10.18653/v1/2020.findings-emnlp.301 2020
[30]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016. URL https://arxiv.org/abs/1603.08983

work page internal anchor Pith review arXiv 2016
[31]

doi:10.48550/arXiv.2002.08909 , abstract =

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM : Retrieval-augmented language model pre-training. ICML, 2020. URL https://arxiv.org/abs/2002.08909

work page arXiv 2020
[32]

Harper and Paul A

David A. Harper and Paul A. Lewis. New perspectives on emergence in economics. New Perspectives on Emergence in Economics, pp.\ 2--3, 2012. URL https://www.sciencedirect.com/science/article/pii/S0167268112000200?casa_token=fLs2nCYo_64AAAAA:H2sSpSygJmEqXgmpM4jLyeppph3C4TgEsaSXm5RkOpT0r4q2A1x9Su3u4uycK4sIC6a8NdLiSw

work page 2012
[33]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[34]

2022 , month = jun, journal =

Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916, 2021 b . URL https://arxiv.org/abs/2109.13916

work page arXiv 2021
[35]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Surface form competition: Why the highest probability answer isn ' t always right

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn ' t always right. EMNLP, 2021. URL https://aclanthology.org/2021.emnlp-main.564

work page 2021
[37]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022. URL https://arxiv.org/pdf/2201.07207

work page arXiv 2022
[38]

Huberman and Tad Hogg

Bernardo A. Huberman and Tad Hogg. Phase transitions in artificial intelligence systems. Artificial Intelligence, 33 0 (2): 0 155--171, 1987. URL https://www.sciencedirect.com/science/article/abs/pii/0004370287900336

work page arXiv 1987
[39]

Hwang, Yoh Iwasa, Masashi Kawasaki, Bernhard Keimer, Naoto Nagaosa, and Yoshinori Tokura

Harold Y. Hwang, Yoh Iwasa, Masashi Kawasaki, Bernhard Keimer, Naoto Nagaosa, and Yoshinori Tokura. Emergent phenomena at oxide interfaces. Nature M aterials , 11 0 (2): 0 103--113, 2012. URL https://www.nature.com/articles/nmat3223

work page 2012
[40]

Decoupled neural interfaces using synthetic gradients

Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. ICML, 2017. URL https://arxiv.org/abs/1608.05343

work page arXiv 2017
[41]

Dan Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall series in Artificial Intelligence. Pearson Prentice Hall, 2009. ISBN 9780131873216. URL https://books.google.com/books?id=fZmj5UNK8AQC

work page 2009
[42]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. URL https://arxiv.org/abs/2207.05221

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Deduplicating training data mitigates privacy risks in language models

Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. ICML, 2022. URL https://arxiv.org/abs/2202.06539

work page arXiv 2022
[44]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[45]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022. URL https://arxiv.org/abs/2205.11916

work page internal anchor Pith review arXiv 2022
[46]

arXiv preprint arXiv:2204.02329 , year=

Andrew K. Lampinen, Ishita Dasgupta, Stephanie C.Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from explanations in context? Findings of EMNLP, 2022. URL https://arxiv.org/abs/2204.02329

work page arXiv 2022
[47]

Deduplicating training data makes language models better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. ACL, 2022 a . URL https://arxiv.org/abs/2107.06499

work page arXiv 2022
[48]

Coauthor: Designing a human- AI collaborative writing dataset for exploring language model capabilities

Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human- AI collaborative writing dataset for exploring language model capabilities. CHI, 2022 b . URL https://arxiv.org/abs/2201.06796

work page arXiv 2022
[49]

Gshard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ICLR, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb

work page 2021
[50]

Semi-supervised learning for natural language

Percy Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005. URL https://www-cs.stanford.edu/ pliang/papers/meng-thesis.pdf

work page 2005
[51]

Creel, and Rob Reich

Percy Liang, Rishi Bommasani, Kathleen A. Creel, and Rob Reich. The time is now to develop community norms for the release of foundation models, 2022. URL https://crfm.stanford.edu/2022/05/17/community-norms.html

work page 2022
[52]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthful QA : Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021. URL https://arxiv.org/abs/2109.07958

work page internal anchor Pith review arXiv 2021
[53]

Christopher D. Manning. Human language understanding & reasoning. Daedalus, 151 0 (2): 0 127--138, 2022. URL https://www.amacad.org/publication/human-language-understanding-reasoning

work page 2022
[54]

and Clark, Kevin and Hewitt, John and Khandelwal, Urvashi and Levy, Omer , year =

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117 0 (48): 0 30046--30054, 2020. URL https://www.pnas.org/doi/10.1073/pnas.1907367117

work page doi:10.1073/pnas.1907367117 2020
[55]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730

work page Pith review arXiv 2018
[56]

Pointer sentinel mixture models, 2016

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://huggingface.co/datasets/wikitext

work page 2016
[57]

Name tagging with word clusters and discriminative training

Scott Miller, Jethran Guinness, and Alex Zamanian. Name tagging with word clusters and discriminative training. In NAACL, 2004. URL https://aclanthology.org/N04-1043

work page 2004
[58]

Noisy channel language model prompting for few-shot text classification

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. ACL, 2022 a . URL https://arxiv.org/abs/2108.04106

work page arXiv 2022
[59]

Rethinking the role of demonstrations: What makes in-context learning work?, 2022

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022 b . URL https://arxiv.org/abs/2202.12837

work page arXiv 2022
[60]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. URL https://openreview.net/forum?id=iedYJm92o0a

work page internal anchor Pith review arXiv 2021
[61]

In-context learning and induction heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, et al. In-context learning and induction heads. Transformer Circuits, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

work page 2022
[62]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022. URL https://arxiv.org/abs/2203.02155

work page internal anchor Pith review Pith/arXiv arXiv 2022
[63]

BLEU : a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU : a method for automatic evaluation of machine translation. In ACL, 2002. URL https://aclanthology.org/P02-1040.pdf

work page 2002
[64]

BBQ: A hand-built bias benchmark for question answering.arXiv preprint arXiv:2110.08193,

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ : A hand-built bias benchmark for question answering. In Findings of ACL, 2022. URL https://arxiv.org/abs/2110.08193

work page arXiv 2022
[65]

Mapping language models to grounded conceptual spaces

Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. ICLR, 2022. URL https://openreview.net/forum?id=gJcEM8sxHK

work page 2022
[66]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022. URL https://arxiv.org/abs/2202.03286

work page Pith review arXiv 2022
[67]

W i C : the word-in-context dataset for evaluating context-sensitive meaning representations

Mohammad Taher Pilehvar and Jose Camacho-Collados. W i C : the word-in-context dataset for evaluating context-sensitive meaning representations. NAACL, 2019. URL https://aclanthology.org/N19-1128

work page 2019
[68]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8), 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

work page 2019
[69]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training G opher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446

work page internal anchor Pith review arXiv 2021
[70]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. URL https://jmlr.org/papers/v21/20-074.html

work page 2020
[71]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. URL https://arxiv.org/abs/2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022
[72]

Logan, Matt Gardner, and Sameer Singh

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206, 2022. URL https://arxiv.org/abs/2202.07206

work page arXiv 2022
[73]

Prompt programming for la rge language models: Beyond the few-shot paradigm,

Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021. URL https://arxiv.org/abs/2102.07350

work page arXiv 2021
[74]

Social bias in elicited natural language inferences

Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing , 2017. URL https://aclanthology.org/W17-1609

work page 2017
[75]

Multitask prompted training enables zero-shot task generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4

work page 2022
[76]

A mathematical exploration of why language models help solve downstream tasks

Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models help solve downstream tasks. ICLR, 2021. URL https://arxiv.org/abs/2010.03648

work page arXiv 2021
[77]

It ' s not just size that matters: Small language models are also few-shot learners

Timo Schick and Hinrich Sch \"u tze. It ' s not just size that matters: Small language models are also few-shot learners. NAACL, June 2021. URL https://aclanthology.org/2021.naacl-main.185

work page 2021
[78]

Language models are multi- lingual chain-of-thought reasoners,

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022. URL https://arxiv.org/abs/2210.03057

work page arXiv 2022
[79]

On the risks of emergent behavior in foundation models, October 2021

Jacob Steinhardt. On the risks of emergent behavior in foundation models, October 2021. URL https://bounded-regret.ghost.io/on-the-risks-of-emergent-behavior-in-foundation-models/. Accessed Apr 13, 2022

work page 2021
[80]

Future ml systems will be qualitatively different, 2022

Jacob Steinhardt. Future ml systems will be qualitatively different, 2022. URL https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/. Accessed May 20, 2022

work page 2022

Showing first 80 references.