super hub Mixed citations

Title resolution pending

· 2025 · cs.CL · arXiv 2512.13961

Mixed citation behavior. The most common role is background (56% of classified citations).

133 Pith papers cite it.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

abstract

We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.

citation-role summary

background 13 · dataset 2 · method 2 · baseline 1

citation-polarity summary

background 10 · support 2 · use dataset 2 · use method 2 · baseline 1 · unclear 1

representative citing papers

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

cs.CL · 2026-07-02 · conditional · novelty 8.0

LACUNA is a new testbed that injects PII into predefined model parameters to benchmark the localization precision of LLM unlearning methods, revealing that SOTA approaches are imprecise despite strong output performance.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces the BonaFide benchmark of 3,066 ground-truth-labeled CoTs, showing that most faithfulness metrics perform near chance, exhibit biases, and scale poorly to longer chains.

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

cs.LG · 2026-05-16 · conditional · novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Pretraining Exposure Explains Popularity Judgments in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

STEB: Style Text Embedding Benchmark

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

STEB is a new benchmark of 96 datasets in 7 languages for evaluating style text embeddings on authorship, detection, and linguistic probing tasks.

MultiHashFormer: Hash-based Generative Language Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.

Detecting Functional Memorization in Code Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

The authors demonstrate functional memorization in code LLMs via a counterfactual midtraining comparison, using functional-equivalence metrics that go beyond textual overlap.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

A finetuned Qwen3-235B model organism achieves comparable train-time harmfulness to controls while sustaining a ~15 percentage point compliance gap across 700 RL steps by framing compliance as context-specific.

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.

Reinforcement Learning from Rich Feedback with Distributional DAgger

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

IndoBias is a dual-track culturally grounded benchmark revealing strong LLM bias in Indonesian prototypical sentences and higher ideology/religion bias in local languages, with Common Crawl pretraining adding more bias than curated sources.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

cs.CL · 2026-05-28 · conditional · novelty 7.0

REDIPO constructs DPO preference data from base-model generations rewritten by the instruct model to increase output diversity on NoveltyBench while preserving alignment metrics across three LLMs.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.

Learnability-Informed Fine-Tuning of Diffusion Language Models

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

citing papers explorer

Showing 50 of 54 citing papers after filters.

Large Language Models Lack Temporal Awareness of Medical Knowledge cs.LG · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Detecting Functional Memorization in Code Language Models cs.LG · 2026-06-11 · unverdicted · none · ref 26 · internal anchor
The authors demonstrate functional memorization in code LLMs via a counterfactual midtraining comparison, using functional-equivalence metrics that go beyond textual overlap.
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization cs.LG · 2026-06-10 · unverdicted · none · ref 18 · internal anchor
A finetuned Qwen3-235B model organism achieves comparable train-time harmfulness to controls while sustaining a ~15 percentage point compliance gap across 700 RL steps by framing compliance as context-specific.
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing cs.LG · 2026-06-05 · unverdicted · none · ref 12 · internal anchor
WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.
On the Geometry of On-Policy Distillation cs.LG · 2026-06-05 · unverdicted · none · ref 11 · internal anchor
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
Reinforcement Learning from Rich Feedback with Distributional DAgger cs.LG · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling cs.LG · 2026-06-03 · unverdicted · none · ref 25 · internal anchor
LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning cs.LG · 2026-05-23 · unverdicted · none · ref 22 · internal anchor
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 21 · internal anchor
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 88 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 29 · internal anchor
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models cs.LG · 2026-03-21 · unverdicted · none · ref 28 · internal anchor
LLM-ODE integrates large language models into genetic programming to guide symbolic search for governing equations of dynamical systems, outperforming classical GP on 91 test cases in efficiency and solution quality.
Addressing Over-Refusal in LLMs with Competing Rewards cs.LG · 2026-06-30 · unverdicted · none · ref 129 · internal anchor
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
DRIFT: Refining Instruction Data via On-Policy Data Attribution cs.LG · 2026-06-16 · unverdicted · none · ref 14 · internal anchor
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation cs.LG · 2026-06-10 · unverdicted · none · ref 19 · internal anchor
RLCSD contrasts teacher-student distributional gaps under correct versus wrong hints to suppress privilege-induced style drift and concentrate supervision on task tokens, outperforming GRPO and prior OPSD on Qwen3 and Olmo models.
Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier cs.LG · 2026-06-10 · unverdicted · none · ref 132 · internal anchor
PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests cs.LG · 2026-06-05 · unverdicted · none · ref 4 · internal anchor
CapCode constructs coding datasets with randomized tests that deliberately cap non-cheating performance below one, enabling detection of cheating via scores exceeding the cap, while CapReward reduces cheating in training.
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws cs.LG · 2026-06-05 · unverdicted · none · ref 26 · internal anchor
MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation cs.LG · 2026-06-03 · unverdicted · none · ref 24 · internal anchor
LLMs identify fabricated statistics in isolation (rates 0.76-1.00) but ignore numeric validity during synthesis, relying on a methodology-register representation that transfers across domains.
Sequential Data Poisoning in LLM Post-Training cs.LG · 2026-06-03 · unverdicted · none · ref 8 · internal anchor
Multiple adversaries poisoning different stages of LLM post-training produce additive or complementary effects that single-stage analyses underestimate.
RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training cs.LG · 2026-06-02 · unverdicted · none · ref 33 · internal anchor
Experiments indicate RL applied early in pre-training often matches full SFT-then-RL performance, targeted data composition outweighs scale for RL success, and averaging RL and SFT objectives outperforms sequential or single methods.
Enhancing LLM Metacognition via Cognitive Pairwise Training cs.LG · 2026-05-30 · unverdicted · none · ref 34 · internal anchor
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection cs.LG · 2026-05-27 · unverdicted · none · ref 16 · internal anchor
SHIFT selects compact RLVR training subsets using the magnitude of hidden-state change from a single inference rollout plus quality-weighted farthest-first coverage, outperforming training-free baselines on math reasoning and medical QA under low budgets.
MobileMoE: Scaling On-Device Mixture of Experts cs.LG · 2026-05-26 · unverdicted · none · ref 44 · internal anchor
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
Boundary-targeted Membership Inference Attacks on Safety Classifiers cs.LG · 2026-05-21 · unverdicted · none · ref 28 · 2 links · internal anchor
A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 44 · internal anchor
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories cs.LG · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 65 · internal anchor
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
The Evaluation Game: Beyond Static LLM Benchmarking cs.LG · 2026-05-19 · unverdicted · none · ref 56 · internal anchor
Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology cs.LG · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer cs.LG · 2026-05-12 · unverdicted · none · ref 14 · internal anchor
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Before the Last Token: Diagnosing Final-Token Safety Probe Failures cs.LG · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Scaling Laws for Mixture Pretraining Under Data Constraints cs.LG · 2026-05-12 · unverdicted · none · ref 25 · 2 links · internal anchor
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information cs.LG · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Remember to Forget: Gated Adaptive Positional Encoding cs.LG · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
Prescriptive Scaling Laws for Data Constrained Training cs.LG · 2026-05-02 · unverdicted · none · ref 9 · internal anchor
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 57 · internal anchor
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
TEMPO: Scaling Test-time Training for Large Reasoning Models cs.LG · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 30 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation cs.LG · 2026-04-17 · unverdicted · none · ref 36 · internal anchor
RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence analysis on LLMs up to 32B parameters.
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking cs.LG · 2026-04-16 · unverdicted · none · ref 7 · internal anchor
RLVR-trained LLMs exploit verifier weaknesses by producing non-generalizable outputs on rule-induction tasks, detectable via Isomorphic Perturbation Testing.
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR cs.LG · 2026-04-06 · unverdicted · none · ref 33 · internal anchor
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 10 · internal anchor
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
Watermarking for Proprietary Dataset Protection cs.LG · 2026-07-01 · unverdicted · none · ref 9 · internal anchor
Watermark-based dataset inference achieves membership detection performance comparable to loss-based methods when subset exposure is high, under alternate assumptions.
DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training cs.LG · 2026-06-29 · unverdicted · none · ref 15 · internal anchor
DRIFT is an online self-evolution policy optimization framework using Difficulty Routing, Rhythm Gating, success buffers, and two-stage curriculum learning that reports new SOTA results on five reasoning benchmarks.
From Drift to Coherence: Stabilizing Beliefs in LLMs cs.LG · 2026-06-16 · unverdicted · none · ref 14 · internal anchor
In multiple-choice QA, LLM beliefs drift early under repeated sampling but self-stabilize; seed-answer prompting and a self-consistency loss reduce drift while preserving accuracy.
Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal cs.LG · 2026-06-10 · unverdicted · none · ref 31 · internal anchor
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
A Predictive Law for On-Policy Self-Distillation From World Feedback cs.LG · 2026-05-28 · unverdicted · none · ref 16 · internal anchor
A linear relationship between initial student-self-teacher performance gap and OPSD improvement provides a predictive law across contexts and model families.
Efficient Pre-Training of LLMs through Truncated SVD Layers cs.LG · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
TSVD framework maintains low-rank orthonormal weights during LLM pretraining via truncated SVD, adaptive spectral rank selection, and caching to reduce compute while matching baseline performance.
Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay cs.LG · 2026-05-25 · unverdicted · none · ref 33 · internal anchor
Self-generated replay from language models nearly eliminates catastrophic forgetting during finetuning except when models are pretrained close to saturation.
