Emergent Abilities of Large Language Models

Barret Zoph, Colin Raffel, Jason Wei, Rishi Bommasani, Sebastian Borgeaud, Yi Tay · 2022 · cs.CL · arXiv 2206.07682

Canonical reference. 86% of citing Pith papers cite this work as background.

153 Pith papers citing it

Background 86% of classified citations

abstract

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

citation-role summary

background 34 · baseline 2

citation-polarity summary

background 31 · support 3 · baseline 2

authors

Barret Zoph, Colin Raffel, Jason Wei, Rishi Bommasani, Sebastian Borgeaud, Yi Tay

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders

cs.IR · 2026-06-20 · unverdicted · novelty 7.0

Transformer recommenders amplify popularity bias via spectral collapse when scaled; SPRINT constrains attention column-sums and feed-forward spectral norms to improve fairness and scaling behavior.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

EvoBrain introduces a continual learning method with Neuro-Spectral Task Normalization and Response-Affinity Distillation to enable unified EEG decoding across heterogeneous BCI tasks.

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

On the Emergence of Syntax by Means of Local Interaction

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.

PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media

cs.SI · 2026-04-10 · unverdicted · novelty 7.0

PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.

A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

cs.AR · 2026-04-09 · conditional · novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

The Shrinking Lifespan of LLMs in Science

cs.DL · 2026-04-08 · unverdicted · novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.

Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

cs.SE · 2026-03-27 · unverdicted · novelty 7.0

StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.

FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

cs.AI · 2026-03-17 · unverdicted · novelty 7.0

FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.

Retrieval-Augmented Large Language Models for Evidence-Informed Guidance on Cannabidiol Use in Older Adults

cs.IR · 2026-01-16 · unverdicted · novelty 7.0

Retrieval-augmented LLMs produce more cautious and guideline-aligned recommendations on cannabidiol for older adults than standalone models, demonstrated via automated evaluation on 64 diverse scenarios.

A ghost mechanism: An analytical model of abrupt learning in recurrent networks

cs.LG · 2025-01-04 · unverdicted · novelty 7.0

The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

citing papers explorer

Showing 23 of 23 citing papers after filters.

A ghost mechanism: An analytical model of abrupt learning in recurrent networks

cs.LG · 2025-01-04 · unverdicted · none · ref 18 · internal anchor

The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

cs.RO · 2025-11-11 · unverdicted · none · ref 57 · internal anchor

Scaling motion tracking models along size, data volume, and compute produces a foundation model for natural, robust humanoid whole-body control with downstream uses in kinematic planning and vision-language-action models.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

cs.LG · 2025-10-21 · unverdicted · none · ref 43 · internal anchor

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

Video models are zero-shot learners and reasoners

cs.LG · 2025-09-24 · unverdicted · none · ref 8 · internal anchor

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · none · ref 212 · internal anchor

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

cs.CV · 2025-05-22 · unverdicted · none · ref 65 · internal anchor

Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · none · ref 140 · internal anchor

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · none · ref 73 · internal anchor

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey

cs.IR · 2025-09-09 · unverdicted · none · ref 112 · internal anchor

A comprehensive survey that organizes query expansion methods in the PLM/LLM era along four design dimensions, synthesizes application patterns, and outlines future directions.

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

cs.CL · 2025-05-21 · conditional · none · ref 42 · internal anchor

Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.

Emerging Properties in Unified Multimodal Pretraining

cs.CV · 2025-05-20 · unverdicted · none · ref 82 · internal anchor

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

cs.LG · 2025-05-16 · unverdicted · none · ref 40 · internal anchor

TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · none · ref 157 · internal anchor

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Large Language Models for Multi-Robot Systems: A Survey

cs.RO · 2025-02-06 · unverdicted · none · ref 121 · internal anchor

A survey that categorizes LLM uses in multi-robot systems across task allocation, motion planning, action generation, and human interaction, while noting challenges and future research opportunities.

A Survey of Scaling in Large Language Model Reasoning

cs.AI · 2025-04-02 · unverdicted · none · ref 215 · internal anchor

A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

cs.AI · 2025-01-16 · unverdicted · none · ref 161 · internal anchor

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

The Role of Quantum Computing in Advancing Scientific High-Performance Computing: A perspective from the ADAC Institute

quant-ph · 2025-08-15 · unverdicted · none · ref 143 · internal anchor

A synthesis of expert insights from the ADAC Quantum Computing Working Group and member survey on the complementary roles of quantum and classical high-performance computing in future hybrid infrastructures.

Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems

q-bio.NC · 2025-07-14 · unverdicted · none · ref 140 · internal anchor

A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · none · ref 185 · internal anchor

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Bridging Language Models and Financial Analysis

q-fin.ST · 2025-03-14 · unverdicted · none · ref 105 · internal anchor

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.

Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

cs.CL · 2025-12-14 · unreviewed · ref 2 · internal anchor

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18 · unreviewed · ref 57 · internal anchor

The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models

cs.LG · 2025-07-25 · unreviewed · ref 49 · internal anchor
