pith. sign in

arxiv: 2603.06610 · v2 · pith:KNBHL44Vnew · submitted 2026-02-19 · 💻 cs.LG

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM forgettingpost-traininginstruction fine-tuningpreference optimizationmodel driftcapability evaluationrobustnessbehavioral taxonomy
0
0 comments X

The pith

Post-training of LLMs induces forgetting as drift in robustness and default behaviors beyond lost parametric knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the common accuracy-centric view of forgetting in LLM post-training misses broader effects on model behavior. Forgetting is defined instead as systematic drift that degrades how models respond in robustness, default behaviors, and other capabilities that shape user experience. To measure this, the authors introduce CapTrack, a framework built around a behavioral taxonomy and capability-specific metrics. Large-scale tests across post-training methods, domains, and models up to 80B parameters show that instruction fine-tuning drives the largest shifts while preference optimization tends to be more stable and can reverse some earlier losses. Drift patterns differ by model family with no single mitigation that works universally.

Core claim

Using CapTrack, the study finds that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

What carries the argument

CapTrack, a capability-centric framework that combines a behavioral taxonomy with an evaluation suite of capability-specific metrics to detect systematic model drift after post-training.

If this is right

  • Instruction fine-tuning produces stronger relative drift than preference optimization across the tested capabilities.
  • Preference optimization steps can partially restore capabilities that were lost during earlier instruction fine-tuning.
  • Drift patterns remain consistent within each model family even when training data and algorithms change.
  • No single post-training algorithm or mitigation strategy eliminates drift across all model families and domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • CapTrack could be extended to monitor drift during continual learning or domain adaptation without full retraining.
  • If default-behavior drift accumulates, it may explain gradual erosion of safety alignments in deployed models.
  • The taxonomy might be tested on new capabilities such as tool use or long-context reasoning to check coverage.

Load-bearing premise

The behavioral taxonomy and capability-specific metrics in CapTrack accurately capture the kinds of model drift that actually degrade real user experience.

What would settle it

A controlled experiment that applies post-training, runs the full CapTrack suite showing no measurable drift on any tracked capability, yet finds clear drops in independent user preference ratings or task performance outside the taxonomy.

read the original abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces CapTrack, a capability-centric framework combining a behavioral taxonomy with capability-specific metrics to evaluate forgetting in LLM post-training. It redefines forgetting as systematic model drift degrading behavior and user experience (beyond parametric knowledge loss) and reports results from a large-scale empirical study across post-training algorithms, domains, and model families up to 80B parameters. Key findings include pronounced drift in robustness and default behaviors, with instruction fine-tuning inducing the strongest relative drift while preference optimization is more conservative and can partially recover capabilities; model-family differences persist with no universal mitigation identified.

Significance. If the taxonomy and metrics hold, the work is significant for shifting the field from an accuracy-centric view of forgetting to one that accounts for robustness and behavioral consistency, which directly impact user experience in deployed foundation models. The scale of the study (multiple algorithms, domains, and large models) and the provision of a reusable evaluation suite are strengths that could inform more careful post-training design. Explicit credit is due for the empirical breadth and the attempt to make the evaluation multifaceted rather than single-metric.

minor comments (3)
  1. [Abstract, §1] Abstract and §1: The description of the behavioral taxonomy and the specific capability metrics used in CapTrack would benefit from one additional sentence each on construction/validation criteria and on how drift is quantified (e.g., reference to a table or figure showing the metric definitions), as this directly affects immediate assessability of the central empirical claims.
  2. [§6] The manuscript should include a short limitations subsection (or paragraph in §6) explicitly discussing potential confounds in the chosen domains and model families, as well as the degree to which the observed drifts generalize beyond the tested post-training setups.
  3. [Figures/Tables] Figure and table captions throughout would be improved by stating the exact number of runs or seeds used for each reported statistic, to allow readers to gauge the stability of the cross-algorithm and cross-family comparisons.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CapTrack, the recognition of its significance in shifting from accuracy-centric to capability-centric evaluation of forgetting, and the recommendation for minor revision. The empirical breadth across algorithms, domains, and model scales up to 80B parameters is a core strength we aimed to highlight.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This paper is a purely empirical evaluation study that introduces the CapTrack framework and reports results from large-scale experiments on LLM post-training. No derivation chain, equations, first-principles predictions, or fitted parameters exist that could reduce to inputs by construction. Claims rest on experimental measurements across algorithms, domains, and models rather than self-referential definitions or self-citation load-bearing steps. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5718 in / 999 out tokens · 27437 ms · 2026-05-25T06:39:17.976578+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 27 internal anchors

  1. [1]

    Qwen3 Technical Report

    URLhttps://arxiv.org/abs/2505.09388. Austin, J., Odena, A., Nye, M. I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C. J., Terry, M., Le, Q. V., and Sutton, C. Program synthesis with large language models.CoRR, abs/2108.07732,

  2. [2]

    Program Synthesis with Large Language Models

    URLhttps://arxiv.org/abs/2108.07732. Bai, Y., Tu, S., Zhang, J., 0015, H. P., Wang, X., Lv, X., Cao, S., Xu, J., 0001, L. H., Dong, Y., 0001, J. T., and Li, J. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. ACL, pp. 3639–3664,

  3. [3]

    URLhttps://aclanthology.org/2025.acl-long.183/. Bean, A. M., Seedat, N., Chen, S., and Schwarz, J. R. Scales++: Compute efficient evaluation subset selection with cognitive scales embeddings.CoRR, abs/2510.26384,

  4. [4]

    Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

    doi: 10.48550/arxiv.2510.26384. URL https://doi.org/10.48550/arxiv.2510.26384. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Ba...

  5. [6]

    URLhttps://arxiv.org/abs/2110.14168. Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning.CoRR, abs/2307.08691,

  6. [7]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    doi: 10.48550/arxiv.2307.08691. URLhttps://doi.org/10.48550/arxiv.2307.08691. Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and 0001, M. G. A dataset of information- seeking questions and answers anchored in research papers.NAACL-HLT, pp. 4599–4610,

  7. [8]

    URLhttps://doi.org/10.18653/v1/2021.naacl-main.365

    doi: 10.18653/v1/2021.naacl-main.365. URLhttps://doi.org/10.18653/v1/2021.naacl-main.365. Delange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence,

  8. [9]

    doi: 10.1109/tpami.2021.3057446

    ISSN 1939-3539. doi: 10.1109/tpami.2021.3057446. URL http://dx.doi.org/10.1109/TPAMI.2021.3057446. Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization.CoRR, abs/2402.01306,

  9. [10]

    KTO: Model Alignment as Prospect Theoretic Optimization

    doi: 10.48550/arxiv.2402.01306. URL https: //doi.org/10.48550/arxiv.2402.01306. Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., and Auli, M. Eli5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,

  10. [11]

    D., Shen, H., Ram, P., 0015, Y

    Fernando, H. D., Shen, H., Ram, P., 0015, Y. Z., Samulowitz, H., Baracaldo, N., and Chen, T. Mitigating forgetting in llm supervised fine-tuning and preference learning.CoRR, abs/2410.15483, October

  11. [12]

    D., Shen, H., Ram, P., 0015, Y

    doi: 10.48550/arxiv.2410.15483. URLhttps://doi.org/10.48550/arxiv.2410.15483. Garg, S., Singh, A., Singh, S., and Chopra, P. Ipo: Your language model is secretly a preference classifier.CoRR, abs/2502.16182,

  12. [13]

    URLhttps://doi.org/10.48550/arxiv.2502.16182

    doi: 10.48550/arxiv.2502.16182. URLhttps://doi.org/10.48550/arxiv.2502.16182. Google, G. T. Gemma 3 technical report,

  13. [14]

    Gemma 3 Technical Report

    URLhttps://arxiv.org/abs/2503.19786. Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., Narayana, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., Wu, J., Nudell, J., Niklaus, J., Nay, J. J., Choi, J. H., Tobia, K., Hagan, ...

  14. [15]

    URLhttps://doi.org/10.48550/arxiv.2308.11462

    doi: 10.48550/arxiv.2308.11462. URLhttps://doi.org/10.48550/arxiv.2308.11462. Haque, N. Catastrophic forgetting in llms: A comparative analysis across language tasks.CoRR, abs/2504.01241,

  15. [16]

    URLhttps://doi.org/10.48550/arxiv.2504.01241

    doi: 10.48550/arxiv.2504.01241. URLhttps://doi.org/10.48550/arxiv.2504.01241. Harmon, J., Hochlehnert, A., Bethge, M., and Prabhu, A. Mapping post-training forgetting in language models at scale.CoRR, abs/2510.17776,

  16. [17]

    URLhttps://doi.org/10

    doi: 10.48550/arxiv.2510.17776. URLhttps://doi.org/10. 48550/arxiv.2510.17776. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset.CoRR, abs/2103.03874,

  17. [18]

    Measuring Mathematical Problem Solving With the MATH Dataset

    URL https://arxiv.org/abs/2103.03874. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., 0089, Y. Z., and Ginsburg, B. Ruler: Whats the real context size of your long-context language models?CoRR, abs/2404.06654,

  18. [19]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    doi: 10.48550/arxiv.2404.06654. URLhttps://doi.org/10.48550/arxiv.2404.06654. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models.CoRR, abs/2106.09685,

  19. [20]

    LoRA: Low-Rank Adaptation of Large Language Models

    URLhttps://arxiv.org/abs/2106.09685. Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., and Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.CoRR, abs/2003.11080,

  20. [21]

    doi: 10.1073/pnas.1611835114

    ISSN 1091-6490. doi: 10.1073/pnas.1611835114. URLhttp://dx.doi.org/10.1073/ pnas.1611835114. Kotha, S., Springer, J. M., and Raghunathan, A. Understanding catastrophic forgetting in language models via implicit inference.CoRR, abs/2309.10105, September

  21. [22]

    URL https://doi.org/10.48550/arxiv.2309.10105

    doi: 10.48550/arxiv.2309.10105. URL https://doi.org/10.48550/arxiv.2309.10105. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

  22. [23]

    R., Stevens, K., Barhoum, A., Duc, N

    Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z. R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., ES, S., Suri, S., Glushkov, D., Dantuluri, A., Maguire, A., Schuhmann, C., Nguyen, H., and Mattick, A. Openassistant conversations - democratizing large language model alignment.CoRR, abs/2304.07327,

  23. [24]

    R., Stevens, K., Barhoum, A., Duc, N

    doi: 10.48550/arxiv.2304.07327. URLhttps://doi.org/10.48550/arxiv.2304.07327. Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirz...

  24. [25]

    Revisiting catastrophic forgetting in large language model tuning

    Li, H., Ding, L., Fang, M., and Tao, D. Revisiting catastrophic forgetting in large language model tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4297–4308. Association for Computational Linguistics,

  25. [26]

    URLhttps://aclanthology

    doi: 10.18653/v1/2024.findings-emnlp.249. URLhttps://aclanthology. org/2024.findings-emnlp.249/. Li, J., Li, J., Wang, Y., Chang, Y., and Wu, Y. Structflowbench: A structured flow benchmark for multi-turn instruction following.ACL, pp. 9322–9341, 2025a. URLhttps://aclanthology.org/2025.findings-acl.486/. Li, S. S., Mun, J., Brahman, F., Ilgen, J., Tsvetko...

  26. [27]

    URLhttps://doi.org/10.18653/v1/2022.acl-long.229

    doi: 10.18653/v1/2022.acl-long.229. URLhttps://doi.org/10.18653/v1/2022.acl-long.229. Lin, Y., Lin, H., Xiong, W., Diao, S., Liu, J., Zhang, J., Pan, R., Wang, H., Hu, W., Zhang, H., Dong, H., Pi, R., Zhao, H., Jiang, N., Ji, H., Yao, Y., and Zhang, T. Mitigating the alignment tax of rlhf,

  27. [28]

    13 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Liu, D

    URL https://arxiv.org/abs/2309.06256. 13 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Liu, D. and Niehues, J. Conditions for catastrophic forgetting in multilingual translation.CoRR, abs/2510.19546,

  28. [29]

    URLhttps://doi.org/10.48550/arxiv.2510.19546

    doi: 10.48550/arxiv.2510.19546. URLhttps://doi.org/10.48550/arxiv.2510.19546. Liu, J., Liu, H., Xiao, L., Wang, Z., Liu, K., Gao, S., Zhang, W., Zhang, S., and Chen, K. Are your llms capable of stable reasoning?ACL, pp. 17594–17632,

  29. [30]

    An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

    URLhttps://aclanthology.org/2025.findings-acl.905/. Luo, Y., Yang, Z., Meng, F., Li, Y., 0016, J. Z., and 0004, Y. Z. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.CoRR, abs/2308.08747, August

  30. [31]

    An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

    doi: 10.48550/arxiv.2308.08747. URLhttps://doi.org/10.48550/arxiv.2308.08747. Ma, Z., Huang, W., Zhang, J., Gupta, T., and Krishna, R. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,

  31. [32]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Mallen, A., Asai, A., Zhong, V., Das, R., Hajishirzi, H., and Khashabi, D. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.CoRR, abs/2212.10511,

  32. [33]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    doi: 10.48550/arxiv.2212.10511. URLhttps://doi.org/10.48550/arxiv.2212.10511. Mazeika, M., Phan, L., Yin, X., Zou, A., 0001, Z. W., Mu, N., Sakhaee, E., Li, N., Basart, S., 0026, B. L., Forsyth, D. A., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.CoRR, abs/2402.04249,

  33. [34]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    doi: 10.48550/arxiv.2402.04249. URL https://doi.org/10.48550/arxiv.2402.04249. McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 ofPsychology of Learning and Motivation, pp. 109–165. Academic Press,

  34. [35]

    URL https://www.sciencedirect.com/science/article/ pii/S0079742108605368

    doi: https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/ pii/S0079742108605368. Meta, L. T. The llama 3 herd of models,

  35. [36]

    The Llama 3 Herd of Models

    URLhttps://arxiv.org/abs/2407.21783. Niu, C., Wu, Y., Zhu, J., Xu, S., Shum, K., Zhong, R., Song, J., and Zhang, T. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volu...

  36. [37]

    doi: 10.18653/v1/2024.acl-long.585

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.585. URLhttps://aclanthology.org/2024.acl-long.585/. OpenAI. Openai o1 system card,

  37. [38]

    OpenAI o1 System Card

    URLhttps://arxiv.org/abs/2412.16720. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback,

  38. [39]

    Training language models to follow instructions with human feedback

    URLhttps://arxiv.org/abs/2203.02155. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71,

  39. [40]

    Zheng, Y

    ISSN 0893-6080. doi: https://doi.org/10.1016/j. neunet.2019.01.012. URLhttps://www.sciencedirect.com/science/article/pii/S0893608019300231. Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search,

  40. [41]

    Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C

    URLhttps://arxiv.org/abs/2305.03495. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model.CoRR, abs/2305.18290,

  41. [42]

    doi: 10.48550/arxiv.2305. 18290. URLhttps://doi.org/10.48550/arxiv.2305.18290. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300,

  42. [43]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    doi: 10.48550/arxiv.2402.03300. URLhttps://doi.org/10.48550/arxiv.2402.03300. Shi, F., Suzgun, M., Freitag, M., 0002, X. W., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., 0001, D. D., and Wei, J. Language models are multilingual chain-of-thought reasoners.CoRR, abs/2210.03057,

  43. [44]

    Language Models are Multilingual Chain-of-Thought Reasoners

    doi: 10.48550/arxiv.2210.03057. URLhttps://doi.org/10.48550/arxiv.2210.03057. Team, M.-A.-P., Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., Liu, M., Liang, Y., Jin, X., Wei, Z., Zheng, C., Deng, K., Guo, S., Jia, S., Jiang, S., Liao, Y., Li, R., Li, Q., Li, S., Li, Y., Li, Y., Ma, D., Ni, Y., 14 CapTrack: Multifaceted Evaluation of Forgetting in ...

  44. [45]

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    URLhttps://arxiv.org/abs/2502.14739. Thede, L., Roth, K., Hénaff, O. J., Bethge, M., and Akata, Z. Reflecting on the state of rehearsal-free continual learning with pretrained models.CoLLAs, pp. 1076–1093,

  45. [46]

    press/v274/thede25a.html

    URLhttps://proceedings.mlr. press/v274/thede25a.html. Tie, G., Zhao, Z., Song, D., Wei, F., Zhou, R., Dai, Y., Yin, W., Yang, Z., Yan, J., 0003, Y. S., Dai, Z., Xie, Y., Cao, Y., 0001, L. S., 0001, P. Z., 0001, L. H., Chen, H., 0006, Y. Z., Wen, Q., 0001, T. L., Gong, N. Z., Tang, J., Xiong, C., 0001, H.J., Yu, P.S., and0001, J.G. Asurveyonpost-trainingof...

  46. [47]

    press/v274/thede25a.html

    doi: 10.48550/arxiv.2503.06072. URLhttps://doi.org/10.48550/arxiv.2503.06072. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., and Chen, W. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,

  47. [48]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    URLhttps://arxiv.org/abs/2406.01574. Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Lopes, R. G., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.ICML, pp. 23965–23998,

  48. [49]

    Yan, F., Mao, H., Ji, C

    URL http://papers.nips.cc/paper_files/paper/2023/hash/ 1644c9af28ab7916874f6fd6228a9bcf-Abstract-Conference.html. Yan, F., Mao, H., Ji, C. C.-J., Zhang, T., Patil, S. G., Stoica, I., and Gonzalez, J. E. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard. html,

  49. [50]

    Detecting causal language use in science findings

    Yu, B., Li, Y., and Wang, J. Detecting causal language use in science findings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4664–4674, Hong Kong, China, November

  50. [51]

    doi: 10.18653/v1/D19-1473

    Association for Computational Linguistics. doi: 10.18653/v1/D19-1473. URLhttps://aclanthology.org/ D19-1473. Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y. Language models are super mario: Absorbing abilities from homologous models as a free lunch,

  51. [52]

    Yıldız, C., Ravichandran, N

    URLhttps://arxiv.org/abs/2311.03099. Yıldız, C., Ravichandran, N. K., Punia, P., Bethge, M., and Ermis, B. Investigating continual pretraining in large language models: Insights and implications.arXiv [cs.CL], February

  52. [53]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    doi: 10.1145/3777411. URLhttps://doi.org/10.1145/3777411. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, C.-Y., Zhuang, Y., Krishnamurthy, N., Chen, Z., 15 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Koyejo, S., Arik, S. O., Li, D. S., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXi...

  53. [54]

    Instruction-Following Evaluation for Large Language Models

    Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

  54. [55]

    Instruction-Following Evaluation for Large Language Models

    doi: 10.48550/arxiv.2311.07911. URL https://doi.org/10.48550/arxiv.2311.07911. 16 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Appendix This appendix provides additional methodological details, analyses, and results that support the findings presented in the main paper. Its primary purpose is to improve transparency and reproducibi...

  55. [56]

    All runs use a similar effective batch size, determined by the number of nodes, the per-device batch size, and the gradient accumulation rate

    is used where supported to improve memory efficiency. All runs use a similar effective batch size, determined by the number of nodes, the per-device batch size, and the gradient accumulation rate. Random seeds are fixed across runs to ensure reproducibility. Instruction Fine-Tuning.IFT is performed using standard next-token prediction with an NLL loss com...

  56. [57]

    As in the main paper, each spider plot summarizes capability-level changes across the CAN, WILL, and HOW categories, aggregated by model family

    D.1 Extended Spider Plot Results Figure 5 extends the spider plot analysis from the main paper by including the missing IFT+DPO configuration as well as the corresponding results for the medical domain. As in the main paper, each spider plot summarizes capability-level changes across the CAN, WILL, and HOW categories, aggregated by model family. These add...

  57. [58]

    Across all settings, the post-training algorithm and overall training budget are held fixed

    with models trained on a domain-specific legal mixture. Across all settings, the post-training algorithm and overall training budget are held fixed. This experiment is designed to isolate the contribution of the data source to forgetting behavior, complementing the aggregated analysis presented in the main paper. We report the full, non-aggregated results...

  58. [59]

    free lunch

    with density 0.1. For each method, we interpolate between the OOB model and the instruction fine-tuned (IFT) model using different merge weights, and compute stability and plasticity for each merged checkpoint. Figure13reportsstability-plasticitycurvesacrossmergingmethods. Acrossallmethods, weobserveaconsistent stability-plasticity trade-off: increasing r...