Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics
Pith reviewed 2026-06-28 06:23 UTC · model grok-4.3
The pith
A science of AI must study training dynamics to understand how model behaviors emerge rather than analyzing them after training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models are not static objects but snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. A science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior, supporting prediction from early signals, intervention on trajectories, and design of training procedures for desired properties. Scaling laws demonstrate prediction for loss, and the challenge is to extend this to other behaviors while meeting standards from the history and philosophy of science.
What carries the argument
Training dynamics, the time-evolving processes during optimization that determine model behavior.
If this is right
- Model outcomes can be predicted from signals available early in training.
- Undesired behaviors can be corrected during training rather than after it.
- Training procedures can be designed to reliably produce models with specific capabilities, reduced biases, and improved safety.
- The predictive power of scaling laws extends from loss to capabilities, robustness, and safety-relevant properties.
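To make the scaling-law precedent concrete, here is a minimal sketch of what "prediction for loss" can look like: fit a power law to a few small-scale measurements and extrapolate to a larger budget. The data are synthetic, the constants `TRUE_A` and `TRUE_B` are invented for illustration, and the pure power-law form (with no irreducible-loss term) is a simplifying assumption rather than anything specified by the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a - b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

def predict_loss(a, b, x):
    """Evaluate the fitted law L(x) = a * x^(-b)."""
    return a * x ** (-b)

# Synthetic, noise-free "small run" measurements (hypothetical values).
TRUE_A, TRUE_B = 10.0, 0.1
small_compute = [1e3, 1e4, 1e5, 1e6]
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a_hat, b_hat = fit_power_law(small_compute, small_loss)
extrapolated = predict_loss(a_hat, b_hat, 1e9)  # forecast for a much larger budget
print(f"fitted exponent b = {b_hat:.3f}, predicted loss at 1e9 = {extrapolated:.3f}")
```

The open question the paper raises is whether any comparably simple functional form exists for capabilities, biases, robustness, or safety-relevant behaviors; this sketch only covers the loss case.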
Where Pith is reading between the lines
- Work on mechanistic interpretability could be reframed to track how circuits and features develop over the course of training.
- Problems in fairness and memorization might be better understood by observing their emergence rather than their final state.
- New benchmarks and experiments that evaluate models at multiple points during training would be needed to test these ideas.
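As a rough illustration of evaluating a model at multiple points during training, the sketch below takes a series of per-checkpoint scores and locates the interval with the largest increase, one crude way to ask when a behavior emerges. The checkpoint steps, the scores, and the helper name `largest_jump` are all hypothetical; a real study would need its own metric and a more careful notion of emergence.

```python
def largest_jump(checkpoints):
    """Given (step, metric) pairs sorted by step, return
    (step_before, step_after, delta) for the largest increase."""
    best = None
    for (s0, m0), (s1, m1) in zip(checkpoints, checkpoints[1:]):
        delta = m1 - m0
        if best is None or delta > best[2]:
            best = (s0, s1, delta)
    return best

# Hypothetical benchmark accuracy measured at saved checkpoints.
evals = [(1000, 0.02), (2000, 0.03), (4000, 0.05), (8000, 0.46), (16000, 0.51)]

onset = largest_jump(evals)
print(f"largest change between steps {onset[0]} and {onset[1]}: +{onset[2]:.2f}")
```

An analysis of the final checkpoint alone would report only the last score and would miss where in training the change occurred.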
Load-bearing premise
Theories of training dynamics can be developed that enable prediction, intervention, and design of model properties beyond what is currently possible with scaling laws for loss.
What would settle it
The position would be falsified if, in large models, early training signals predicted final capabilities or safety properties no better than post-training analysis does.
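One possible ingredient of such a test, sketched under stated assumptions: across a set of training runs, measure how well an early-checkpoint signal ranks the runs by their final score, using a Spearman rank correlation. The run data below are invented, the helper names are illustrative, and the rank function assumes no ties. A full test would also need a post-training baseline to compare against, which this sketch does not provide.

```python
import math

def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: one early-training signal and one final score per run.
early_signal = [0.12, 0.30, 0.25, 0.41, 0.18]
final_score = [0.50, 0.69, 0.71, 0.80, 0.55]

rho = spearman(early_signal, final_score)
print(f"rank correlation between early signal and final score: {rho:.2f}")
```

A correlation near zero across many runs and scales would count against the position; a consistently high one would support the prediction rung of the program.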
Original abstract
What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper claims that a science of AI must study training dynamics rather than rely on post-hoc fixes. Doing so is meant to enable progressively stronger forms of understanding: predicting outcomes from early training signals, intervening on trajectories, and designing training procedures for desired properties spanning capabilities, biases, robustness, and safety. The paper articulates requirements drawn from the history and philosophy of science, surveys progress in subfields such as mechanistic interpretability, fairness, memorization, and simplicity bias, and identifies concrete open problems. It notes that scaling laws have succeeded for loss and that the challenge is extending this success to other behaviors.
Significance. If the proposed research program can be realized, it would shift AI research from treating models as fixed artifacts to understanding them as outcomes of time-evolving processes, potentially enabling more reliable control over model properties at scale and extending the predictive success of scaling laws beyond loss.
major comments (1)
- Abstract: the claim that theories grounded in training dynamics will support prediction, intervention, and design for capabilities, biases, robustness, and safety (beyond loss) is asserted without any derivation, data, formal argument, or concrete example of how this extension would work; this feasibility assumption is load-bearing for the normative claim that such a science 'must' be pursued to achieve the described stronger forms of understanding.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our position paper. The primary concern is addressed point-by-point below. We agree that the abstract could better signal the supporting material in the body and will make a targeted revision.
- Referee: Abstract: the claim that theories grounded in training dynamics will support prediction, intervention, and design for capabilities, biases, robustness, and safety (beyond loss) is asserted without any derivation, data, formal argument, or concrete example of how this extension would work; this feasibility assumption is load-bearing for the normative claim that such a science 'must' be pursued to achieve the described stronger forms of understanding.
Authors: As a position paper, the manuscript advocates for a research program rather than deriving or empirically validating its feasibility. The abstract is intentionally concise and summarizes the core thesis; the body articulates requirements from the history and philosophy of science, surveys existing progress (e.g., circuit-level analysis during training in mechanistic interpretability, emergence of biases over the course of training in fairness research, dynamics of memorization, and simplicity bias), and frames extension beyond loss as an open challenge with scaling laws as a partial precedent. These sections supply the concrete examples and partial evidence the referee correctly notes are absent from the abstract alone. We will revise the abstract to briefly reference the surveyed subfields and to distinguish the established success on loss from the open extension to other behaviors, thereby making the load-bearing assumption more transparent without altering the position itself.
Revision: partial
Circularity Check
No significant circularity
This position paper advances a normative argument that AI research should prioritize theories of training dynamics to enable prediction, intervention, and design of model properties. It contains no equations, derivations, fitted parameters, or technical steps whose validity reduces to self-referential inputs. The abstract and structure rely on philosophy of science requirements and surveys of subfields without any self-citation load-bearing chains, ansatzes, or renamings that would make the central claim equivalent to its own premises by construction. The argument is self-contained as an aspirational research program.