Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs
4 Pith papers cite this work.
citing papers explorer
- Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
  Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
- Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
  Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
- Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
  TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
- What Is The Political Content in LLMs' Pre- and Post-Training Data?
  Training data for open LLMs is systematically left-leaning, with pre-training corpora containing more political material than post-training data and model stances aligning with data distributions.