Provides the first theoretical formulation and bounds for certified unlearning in continual learning, decomposing excess risk and adapting gradient and Hessian methods with a low-storage hybrid.
hub Mixed citations
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Mixed citation behavior. Most common role is background (60%).
abstract
Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Three last-layer coefficients derived from gradient interference sources predict the forgetting rank order of past classes in rehearsal-based CIL, with self-induced interference as the strongest predictor.
MIST fixes unreliable splits in streaming decision trees for class-incremental learning by replacing Hoeffding-style bounds with a K-independent McDiarmid radius on Gini, plus Bayesian parent-to-child inheritance and per-leaf quantile sketches.
SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for simpler tasks.
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Framework learns latent scene embeddings from 24 trajectory datasets to produce transferability scores that correlate with cross-dataset model performance.
Dynamical isometry (Jacobian singular values near 1) preserves plasticity in continual learning; an isometry-promoting regularizer and decoupled AdamO optimizer match or beat prior methods on supervised and RL benchmarks.
Sequential training of multiple tasks followed by unsupervised sleep-like replay partially restores performance across all previously learned tasks in neural networks.
Polaris retrieves and integrates relevant models from a large library of checkpoints and adapters to enable scalable instruction-guided image generation and editing without additional training.
Temporal diversity in task distribution during training increases generalization bias over memorization in transformers for in-context linear regression.
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.
FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.
Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.
Discrete decentralized learning dynamics on manifolds converge uniformly to an overdamped Langevin SDE whose stationary states produce orthogonally disentangled, linearly separable features.
Parameter-difference and model-inversion attacks can identify forgotten classes after machine unlearning on standard image datasets.
Supervised fine-tuning with LoRA on rational benchmark forecasts corrects extrapolation bias out-of-sample in LLM predictions for controlled experiments and cross-sectional stock returns.
MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
citing papers explorer
No citing papers match the current filters.