super hub Canonical reference

Emergent Abilities of Large Language Models

Barret Zoph, Colin Raffel, Jason Wei, Rishi Bommasani, Sebastian Borgeaud, Yi Tay · 2022 · cs.CL · arXiv 2206.07682

Canonical reference. 86% of citing Pith papers cite this work as background.

121 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 121 citing papers more from Barret Zoph arXiv PDF

abstract

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 33 baseline 2

citation-polarity summary

background 30 support 3 baseline 2

claims ledger

abstract Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language model

authors

Barret Zoph Colin Raffel Jason Wei Rishi Bommasani Sebastian Borgeaud Yi Tay

co-cited works

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

On the Emergence of Syntax by Means of Local Interaction

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.

PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media

cs.SI · 2026-04-10 · unverdicted · novelty 7.0

PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.

A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

cs.AR · 2026-04-09 · conditional · novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

The Shrinking Lifespan of LLMs in Science

cs.DL · 2026-04-08 · unverdicted · novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.

Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

cs.SE · 2026-03-27 · unverdicted · novelty 7.0

StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.

FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

cs.AI · 2026-03-17 · unverdicted · novelty 7.0

FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.

Retrieval-Augmented Large Language Models for Evidence-Informed Guidance on Cannabidiol Use in Older Adults

cs.IR · 2026-01-16 · unverdicted · novelty 7.0

Retrieval-augmented LLMs produce more cautious and guideline-aligned recommendations on cannabidiol for older adults than standalone models, demonstrated via automated evaluation on 64 diverse scenarios.

A ghost mechanism: An analytical model of abrupt learning in recurrent networks

cs.LG · 2025-01-04 · unverdicted · novelty 7.0

The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

cs.CL · 2024-05-29 · unverdicted · novelty 7.0

Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

cs.LG · 2024-03-12 · unverdicted · novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

citing papers explorer

Showing 21 of 21 citing papers after filters.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 23 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models cs.AI · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment cs.AI · 2026-03-17 · unverdicted · none · ref 18 · internal anchor
FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 44 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 36 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs cs.AI · 2026-06-08 · unverdicted · none · ref 41 · internal anchor
EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.
LLM Jaggedness Unlocks Scientific Creativity cs.AI · 2026-05-11 · unverdicted · none · ref 23 · 2 links · internal anchor
Jagged capabilities in LLMs for scientific idea generation can be leveraged through inference-time ensembles to outperform individual models.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 90 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA cs.AI · 2026-05-05 · unverdicted · none · ref 88 · internal anchor
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 212 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 140 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning cs.AI · 2026-05-07 · unverdicted · none · ref 19 · internal anchor
Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator cs.AI · 2026-04-27 · unverdicted · none · ref 36 · internal anchor
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 38 · 3 links · internal anchor
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
The Cartesian Cut in Agentic AI cs.AI · 2026-04-09 · unverdicted · none · ref 64 · internal anchor
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 28 · internal anchor
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance cs.AI · 2026-05-02 · unverdicted · none · ref 39 · internal anchor
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding cs.AI · 2026-05-10 · unverdicted · none · ref 55 · internal anchor
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 215 · internal anchor
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 161 · internal anchor
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts cs.AI · 2024-03-06 · unverdicted · none · ref 71 · internal anchor
AI and NLP applied to educational artifacts within the Instructional Core Framework can identify advantages for teacher coaching, student support, and personalized learning.

Emergent Abilities of Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer