OLMo: Accelerating the science of language models

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord + 2 more · 2024 · Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/2024.acl-long.841

16 Pith papers cite this work, alongside 57 external citations. Polarity classification is still indexing.

57 external citations · Crossref

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Disentangling MLP Neuron Weights in Vocabulary Space

cs.CL · 2026-04-07 · unverdicted · novelty 8.0

ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.

Output Vector Editing for Memorization Mitigation in Large Language Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

Output vector editing on MLP neurons suppresses memorization in LLMs by up to 87.9% on 6,831 sequences in OLMo-7B, a 2.7x gap over zero ablation, with an ensemble covering 96.5%.

PhantomBench: Benchmarking the Non-existential Threat of Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Variable-Width Transformers

cs.CL · 2026-06-16 · conditional · novelty 6.0

×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.

Unifying Local Communications and Local Updates for LLM Pretraining

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

GASLoC generalizes communication acceleration to the outer optimizer to enable gossip-based decentralized LLM pretraining that supports adaptive optimizers, local steps, and outperforms prior decentralized methods on standard tasks while matching DiLoCo in multi-step regimes.

What's in a Name? Morphological Shortcuts by LLMs in Pharmacology

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

LLMs induce pharmacological meaning primarily from affix cues in drug names, as revealed by a framework applied to 653 drugs and localized via activation patching to early-mid layers.

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

cs.CL · 2026-05-13 · conditional · novelty 6.0

OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

cs.DC · 2026-04-29 · unverdicted · novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput, converging 3.9-8% faster on average than fixing one while tuning the other.

How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

cs.LG · 2025-02-17 · unverdicted · novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

cs.CL · 2025-06-02 · unverdicted · novelty 5.0

Inflectional features stay linearly decodable across all layers while lexical identity weakens with depth in modern transformers.

citing papers explorer

Showing 16 of 16 citing papers.

Sumi: Open Uniform Diffusion Language Model from Scratch cs.CL · 2026-06-17 · unverdicted · none · ref 13
Disentangling MLP Neuron Weights in Vocabulary Space cs.CL · 2026-04-07 · unverdicted · none · ref 1
Output Vector Editing for Memorization Mitigation in Large Language Models cs.CL · 2026-06-17 · unverdicted · none · ref 29
PhantomBench: Benchmarking the Non-existential Threat of Language Models cs.CL · 2026-06-09 · unverdicted · none · ref 26
Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery cs.CL · 2026-06-04 · unverdicted · none · ref 36
Low-Resource Safety Failures Are Action Failures, Not Representation Failures cs.CL · 2026-05-31 · conditional · none · ref 9
BOOKMARKS: Efficient Active Storyline Memory for Role-playing cs.CL · 2026-05-13 · unverdicted · none · ref 54
ToxiREX: A Dataset on Toxic REasoning in ConteXt cs.CL · 2026-06-26 · unverdicted · none · ref 294
Variable-Width Transformers cs.CL · 2026-06-16 · conditional · none · ref 12
Unifying Local Communications and Local Updates for LLM Pretraining cs.LG · 2026-06-09 · unverdicted · none · ref 12
What's in a Name? Morphological Shortcuts by LLMs in Pharmacology cs.CL · 2026-06-04 · unverdicted · none · ref 52
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time cs.CL · 2026-05-13 · conditional · none · ref 32
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training cs.DC · 2026-04-29 · unverdicted · none · ref 12
How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models cs.CL · 2026-04-17 · unverdicted · none · ref 14
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws cs.LG · 2025-02-17 · unverdicted · none · ref 17
Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models cs.CL · 2025-06-02 · unverdicted · none · ref 15
