hub Mixed citations

An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211

22 I · 2013 · stat.ML · arXiv 1312.6211

Mixed citation behavior. Most common role is background (60%).

28 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 28 citing papers arXiv PDF

abstract

Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 other 1

citation-polarity summary

background 3 support 1 unclear 1

representative citing papers

MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

MIST fixes unreliable splits in streaming decision trees for class-incremental learning by replacing Hoeffding-style bounds with a K-independent McDiarmid radius on Gini, plus Bayesian parent-to-child inheritance and per-leaf quantile sketches.

SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.

NetTailor: Tuning the Architecture, Not Just the Weights

cs.CV · 2019-06-29 · unverdicted · novelty 7.0

NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for simpler tasks.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Temporal diversity in task distribution during training increases generalization bias over memorization in transformers for in-context linear regression.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

cs.LG · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

cs.CL · 2025-12-04 · conditional · novelty 6.0

SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.

Routing-Based Continual Learning for Multimodal Large Language Models

cs.LG · 2025-11-03 · unverdicted · novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

Diversity in Large Language Models under Supervised Fine-Tuning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.

Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

cs.LG · 2026-04-23 · conditional · novelty 6.0

Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.

Continuous Limits of Coupled Flows in Representation Learning

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

Discrete decentralized learning dynamics on manifolds converge uniformly to an overdamped Langevin SDE whose stationary states produce orthogonally disentangled, linearly separable features.

Label Leakage Attacks in Machine Unlearning: A Parameter and Inversion-Based Approach

cs.CR · 2026-04-08 · unverdicted · novelty 6.0

Parameter-difference and model-inversion attacks can identify forgotten classes after machine unlearning on standard image datasets.

Debiasing LLMs by Fine-tuning

q-fin.GN · 2026-04-03 · unverdicted · novelty 6.0

Supervised fine-tuning with LoRA on rational benchmark forecasts corrects extrapolation bias out-of-sample in LLM predictions for controlled experiments and cross-sectional stock returns.

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

cs.AI · 2023-08-01 · unverdicted · novelty 6.0

MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.

On the Stability of Growth in Structural Plasticity

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in complex tasks.

Activation Function Design Sustains Plasticity in Continual Learning

cs.LG · 2025-09-26 · unverdicted · novelty 5.0

Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.

FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning

cs.LG · 2025-06-29 · unverdicted · novelty 5.0

FedRef uses a temporally aggregated reference model and MAP regularization for server-side fine-tuning to reduce forgetting and drift in non-IID federated learning, showing better accuracy and lower client compute on image tasks.

Adaptive Compression-based Lifelong Learning

cs.CV · 2019-07-23 · unverdicted · novelty 5.0

Bayesian optimization enables adaptive network pruning rates in lifelong learning, performing heavier pruning on small/simple tasks and milder on large/complex ones.

Online Generalised Predictive Coding

stat.ML · 2026-05-04 · unverdicted · novelty 5.0

Online generalised predictive coding (ODEM) tracks latent states in nonlinear and chaotic generative models by separating temporal scales for fast Bayesian belief updating and slow parameter learning.

(How) Learning Rates Regulate Catastrophic Overtraining

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.

citing papers explorer

Showing 28 of 28 citing papers.

MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound cs.LG · 2026-05-12 · unverdicted · none · ref 25 · 2 links · internal anchor
MIST fixes unreliable splits in streaming decision trees for class-incremental learning by replacing Hoeffding-style bounds with a K-independent McDiarmid radius on Gini, plus Bayesian parent-to-child inheritance and per-leaf quantile sketches.
SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators cs.LG · 2026-03-20 · unverdicted · none · ref 58 · internal anchor
SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
NetTailor: Tuning the Architecture, Not Just the Weights cs.CV · 2019-06-29 · unverdicted · none · ref 16 · internal anchor
NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for simpler tasks.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 12
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling cs.LG · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
Temporal diversity in task distribution during training increases generalization bias over memorization in transformers for in-context linear regression.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning cs.LG · 2026-05-09 · unverdicted · none · ref 2 · 2 links · internal anchor
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
Robust Policy Optimization to Prevent Catastrophic Forgetting cs.LG · 2026-02-09 · unverdicted · none · ref 21 · internal anchor
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates cs.CL · 2025-12-04 · conditional · none · ref 26 · internal anchor
SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.
Routing-Based Continual Learning for Multimodal Large Language Models cs.LG · 2025-11-03 · unverdicted · none · ref 16 · internal anchor
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting cs.LG · 2026-05-04 · unverdicted · none · ref 69
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
Diversity in Large Language Models under Supervised Fine-Tuning cs.LG · 2026-04-30 · unverdicted · none · ref 26 · 2 links
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning cs.LG · 2026-04-29 · unverdicted · none · ref 7
NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.
Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks cs.LG · 2026-04-27 · unverdicted · none · ref 12
FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability cs.LG · 2026-04-23 · conditional · none · ref 17
Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.
Continuous Limits of Coupled Flows in Representation Learning cs.LG · 2026-04-18 · unverdicted · none · ref 42
Discrete decentralized learning dynamics on manifolds converge uniformly to an overdamped Langevin SDE whose stationary states produce orthogonally disentangled, linearly separable features.
Label Leakage Attacks in Machine Unlearning: A Parameter and Inversion-Based Approach cs.CR · 2026-04-08 · unverdicted · none · ref 37
Parameter-difference and model-inversion attacks can identify forgotten classes after machine unlearning on standard image datasets.
Debiasing LLMs by Fine-tuning q-fin.GN · 2026-04-03 · unverdicted · none · ref 13
Supervised fine-tuning with LoRA on rational benchmark forecasts corrects extrapolation bias out-of-sample in LLM predictions for controlled experiments and cross-sectional stock returns.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework cs.AI · 2023-08-01 · unverdicted · none · ref 255
MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
On the Stability of Growth in Structural Plasticity cs.LG · 2026-05-14 · unverdicted · none · ref 16 · internal anchor
Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in complex tasks.
Activation Function Design Sustains Plasticity in Continual Learning cs.LG · 2025-09-26 · unverdicted · none · ref 8 · internal anchor
Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.
FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning cs.LG · 2025-06-29 · unverdicted · none · ref 4 · internal anchor
FedRef uses a temporally aggregated reference model and MAP regularization for server-side fine-tuning to reduce forgetting and drift in non-IID federated learning, showing better accuracy and lower client compute on image tasks.
Adaptive Compression-based Lifelong Learning cs.CV · 2019-07-23 · unverdicted · none · ref 10 · internal anchor
Bayesian optimization enables adaptive network pruning rates in lifelong learning, performing heavier pruning on small/simple tasks and milder on large/complex ones.
Online Generalised Predictive Coding stat.ML · 2026-05-04 · unverdicted · none · ref 35
Online generalised predictive coding (ODEM) tracks latent states in nonlinear and chaotic generative models by separating temporal scales for fast Bayesian belief updating and slow parameter learning.
(How) Learning Rates Regulate Catastrophic Overtraining cs.LG · 2026-04-15 · unverdicted · none · ref 10
Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.
Plasticity Loss in Deep Reinforcement Learning: A Survey cs.AI · 2024-11-07 · unverdicted · none · ref 37 · internal anchor
Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.
End-to-End 3D-PointCloud Semantic Segmentation for Autonomous Driving cs.CV · 2019-06-26 · unverdicted · none · ref 20 · internal anchor
Proposes weighted self-incremental transfer learning to address class imbalance in 3D point cloud semantic segmentation and reports a new benchmark on the KITTI dataset.
Dynamic Distillation and Gradient Consistency for Robust Long-Tailed Incremental Learning cs.CV · 2026-05-05 · unverdicted · none · ref 9
Gradient consistency regularization and entropy-driven dynamic distillation improve accuracy by up to 5% in long-tailed incremental learning, with strong gains in majority-to-minority task ordering.
MPCS: Neuroplastic Continual Learning via Multi-Component Plasticity and Topology-Aware EWC cs.LG · 2026-05-04 · unverdicted · none · ref 12
MPCS integrates eleven plasticity mechanisms and reaches a Normalized Efficiency Score of 94.2 on a 31-task benchmark, with ablations showing that removing EWC and Hebbian updates yields higher performance at lower cost.

An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer