Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
super hub Canonical reference
Emergent Abilities of Large Language Models
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language model
authors
co-cited works
representative citing papers
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
EvoBrain introduces a continual learning method with Neuro-Spectral Task Normalization and Response-Affinity Distillation to enable unified EEG decoding across heterogeneous BCI tasks.
Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.
StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.
FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.
Retrieval-augmented LLMs produce more cautious and guideline-aligned recommendations on cannabidiol for older adults than standalone models, demonstrated via automated evaluation on 64 diverse scenarios.
The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
citing papers explorer
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
LLM pipeline with generation-critic feedback reaches 61% accuracy on low-level goal extraction from requirements documents and outperforms standalone few-shot prompting, yet remains best suited as an accelerator for manual work.
-
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling
Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.
-
CHESS: Contextual Harnessing for Efficient SQL Synthesis
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.