hub Mixed citations

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora · 2024 · cs.CL · arXiv 2501.00656

Mixed citation behavior. Most common role is background (46%).

68 Pith papers citing it

Background 46% of classified citations

open full Pith review browse 68 citing papers arXiv PDF

abstract

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from T\"ulu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 method 3 other 1

citation-polarity summary

background 6 unclear 3 use method 3 support 1

claims ledger

abstract We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which

co-cited works

representative citing papers

Demystifying the Silence of Correctness Bugs in PyTorch Compiler

cs.SE · 2026-04-09 · conditional · novelty 8.0

First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.

Spurious Rewards: Rethinking Training Signals in RLVR

cs.AI · 2025-06-12 · accept · novelty 8.0

Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

cs.CR · 2026-05-17 · unverdicted · novelty 7.0

Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

Implicit Representations of Grammaticality in Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

cs.CY · 2026-05-01 · unverdicted · novelty 7.0

Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · conditional · novelty 7.0 · 2 refs

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.

Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

cs.LG · 2026-04-26 · unverdicted · novelty 7.0

In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

cs.CL · 2026-04-11 · unverdicted · novelty 7.0

Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

Perceptrons and localization of attention's mean-field landscape

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic

cs.CL · 2025-10-19 · conditional · novelty 7.0

LLMs can compose surface-form tokens from base embeddings plus learned transformation vectors, freeing 10-40% of vocabulary slots while expanding coverage and preserving downstream performance across five languages.

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

cs.LG · 2025-07-21 · unverdicted · novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

Sampling from Your Language Model One Byte at a Time

cs.CL · 2025-06-17 · unverdicted · novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Pre-trained Large Language Models Learn Hidden Markov Models In-context

cs.LG · 2025-06-08 · unverdicted · novelty 7.0

Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.

Explaining Sources of Uncertainty in Automated Fact-Checking

cs.CL · 2025-05-23 · unverdicted · novelty 7.0

CLUE generates natural language explanations of model uncertainty in fact-checking by unsupervised identification of claim-evidence and inter-evidence conflicts and agreements, followed by prompting and attention steering.

Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

cs.CL · 2025-02-11 · unverdicted · novelty 7.0

Evaluation of 22 LLMs shows they are more susceptible to spin in medical abstracts than humans but can recognize and mitigate it when prompted.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythia and OLMo2 data.

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.

citing papers explorer

Showing 18 of 68 citing papers.

Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 94 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 132 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 70 · internal anchor
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
GiVA: Gradient-Informed Bases for Vector-Based Adaptation cs.CL · 2026-04-23 · unverdicted · none · ref 68 · internal anchor
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO · 2026-04-21 · unverdicted · none · ref 68 · internal anchor
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling cs.LG · 2026-04-21 · unverdicted · none · ref 16 · internal anchor
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
(How) Learning Rates Regulate Catastrophic Overtraining cs.LG · 2026-04-15 · unverdicted · none · ref 16 · internal anchor
Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction cs.LG · 2026-03-10 · unverdicted · none · ref 19 · internal anchor
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning cs.LG · 2026-02-16 · unverdicted · none · ref 16 · internal anchor
A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
What Is The Political Content in LLMs' Pre- and Post-Training Data? cs.CL · 2025-09-26 · unverdicted · none · ref 29 · internal anchor
Training data for open LLMs is systematically left-leaning, with pre-training corpora containing more political material than post-training data and model stances aligning with data distributions.
Data Compressibility Quantifies LLM Memorization cs.CL · 2025-07-08 · unverdicted · none · ref 21 · internal anchor
Set-level data entropy estimators show linear correlation with LLM memorization scores, forming the Entropy-Memorization Linearity.
Revisiting the Past: Data Unlearning with Model State History cs.LG · 2025-06-26 · unverdicted · none · ref 28 · internal anchor
MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
Tracing the ongoing emergence of human-like reasoning in Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 81 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG · 2026-05-06 · unverdicted · none · ref 15 · internal anchor
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 58 · internal anchor
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs cs.AI · 2026-05-15 · unreviewed · ref 35 · internal anchor
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale cs.CV · 2026-04-20 · unreviewed · ref 68 · internal anchor
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space cs.CL · 2026-03-15 · unreviewed · ref 22 · internal anchor

2 OLMo 2 Furious

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer