CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

Jonathan Richard Schwarz; Lukas Thede; Stefan Winzeck; Zeynep Akata

arxiv: 2603.06610 · v2 · pith:KNBHL44Vnew · submitted 2026-02-19 · 💻 cs.LG

Lukas Thede, Stefan Winzeck, Zeynep Akata, Jonathan Richard Schwarz

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM forgetting · post-training · instruction fine-tuning · preference optimization · model drift · capability evaluation · robustness · behavioral taxonomy

The pith

Post-training of LLMs induces forgetting as drift in robustness and default behaviors beyond lost parametric knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the common accuracy-centric view of forgetting in LLM post-training misses broader effects on model behavior. Forgetting is defined instead as systematic drift that degrades how models respond in robustness, default behaviors, and other capabilities that shape user experience. To measure this, the authors introduce CapTrack, a framework built around a behavioral taxonomy and capability-specific metrics. Large-scale tests across post-training methods, domains, and models up to 80B parameters show that instruction fine-tuning drives the largest shifts while preference optimization tends to be more stable and can reverse some earlier losses. Drift patterns differ by model family with no single mitigation that works universally.

Core claim

Using CapTrack, the study finds that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

What carries the argument

CapTrack, a capability-centric framework that combines a behavioral taxonomy with an evaluation suite of capability-specific metrics to detect systematic model drift after post-training.
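The idea of pairing a taxonomy with per-capability metrics can be sketched as follows. This is only an illustration, not the authors' implementation: the category names borrow the CAN, WILL, and HOW grouping that the paper's appendix mentions, while the capability names, their assignment to categories, and the relative-drift formula are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical taxonomy. Only the category names (CAN / WILL / HOW) come from
# the paper; the capabilities and their grouping are invented for illustration.
TAXONOMY = {
    "CAN": ["factual_knowledge", "reasoning"],
    "WILL": ["refusal_rate", "default_verbosity"],
    "HOW": ["format_adherence", "prompt_robustness"],
}


@dataclass
class DriftReport:
    capability: str
    category: str
    base: float
    post: float

    @property
    def relative_drift(self) -> float:
        # Assumed definition: signed change relative to the base score.
        # Base scores are assumed to be nonzero.
        return (self.post - self.base) / abs(self.base)


def track(base_scores: dict, post_scores: dict) -> list:
    """Pair base and post-trained scores for every capability in the taxonomy."""
    reports = []
    for category, capabilities in TAXONOMY.items():
        for cap in capabilities:
            reports.append(
                DriftReport(cap, category, base_scores[cap], post_scores[cap])
            )
    return reports
```

Keeping one report per capability, rather than a single aggregate score, is what lets drift in robustness or default behavior show up even when an accuracy-style metric is unchanged.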

If this is right

Instruction fine-tuning produces stronger relative drift than preference optimization across the tested capabilities.
Preference optimization steps can partially restore capabilities that were lost during earlier instruction fine-tuning.
Drift patterns remain consistent within each model family even when training data and algorithms change.
No single post-training algorithm or mitigation strategy eliminates drift across all model families and domains.
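One way to make "partially restore" concrete is a recovery fraction per capability. The sketch below is an assumed definition, not one taken from the paper, and the scores are made up: it measures how much of the loss incurred during instruction fine-tuning is won back by a subsequent preference-optimization stage.

```python
def recovered_fraction(base: float, after_ift: float, after_po: float) -> float:
    """Share of the IFT-induced loss recovered by a later preference-optimization stage.

    Assumed definition (not from the paper): 0.0 means no recovery,
    1.0 means the base score is fully restored, negative means further loss.
    """
    lost = base - after_ift
    if lost <= 0:
        raise ValueError("no loss after instruction fine-tuning; nothing to recover")
    return (after_po - after_ift) / lost


# Made-up scores for a single capability.
fraction = recovered_fraction(base=0.80, after_ift=0.60, after_po=0.70)
print(f"recovered about {fraction:.0%} of the lost score")
```

A value strictly between 0 and 1 corresponds to the "partial recovery" pattern described above.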

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CapTrack could be extended to monitor drift during continual learning or domain adaptation without full retraining.
If default-behavior drift accumulates, it may explain gradual erosion of safety alignments in deployed models.
The taxonomy might be tested on new capabilities such as tool use or long-context reasoning to check coverage.

Load-bearing premise

The behavioral taxonomy and capability-specific metrics in CapTrack accurately capture the kinds of model drift that actually degrade real user experience.

What would settle it

A controlled experiment that applies post-training, runs the full CapTrack suite showing no measurable drift on any tracked capability, yet finds clear drops in independent user preference ratings or task performance outside the taxonomy.
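The decision rule behind such an experiment can be written down in a few lines. This is a hedged sketch: the function name, the tolerance values, and the use of a single user-preference delta are all assumptions introduced here for illustration.

```python
def taxonomy_misses_degradation(
    capability_drifts: dict,
    user_pref_delta: float,
    drift_tol: float = 0.01,  # assumed threshold for "no measurable drift"
    pref_tol: float = 0.05,   # assumed threshold for a "clear drop"
) -> bool:
    """Return True when the suite reports no drift but users still see a drop.

    capability_drifts maps capability name -> relative drift after post-training.
    user_pref_delta is the change in an independent user-preference score.
    """
    no_measured_drift = all(abs(d) <= drift_tol for d in capability_drifts.values())
    clear_user_drop = user_pref_delta < -pref_tol
    return no_measured_drift and clear_user_drop
```

A True result would be evidence against the load-bearing premise, since the tracked capabilities would then fail to cover a degradation that users notice.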

Original abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CapTrack reframes forgetting as behavioral drift beyond facts and shows instruction tuning causes more of it than preference optimization, but the taxonomy's validity rests on unshown details.

The core move here is treating forgetting as systematic drift in robustness and default behaviors rather than just lost parametric knowledge. They build CapTrack around a behavioral taxonomy and capability-specific metrics, then run it across post-training methods, domains, and models up to 80B. That scale and the direct comparison between instruction fine-tuning and preference optimization are the concrete pieces that stand out. The finding that instruction tuning drives stronger relative drift while preference optimization stays more conservative and can recover some ground is the kind of comparative result that could shift how people choose alignment steps in practice. Model family differences persisting and no single mitigation working across the board also feel like useful negative results if the experiments hold up.

The empirical breadth is the main strength; covering multiple families and sizes gives the claims more weight than smaller studies usually manage. The soft spot is that the abstract gives no concrete metrics, controls, or statistical checks, so it is impossible to judge whether the taxonomy actually tracks user-visible degradation or just measures correlated but non-causal shifts. If the full paper does not include ablation on the taxonomy or external validation of the capability metrics, the central claims stay harder to trust.

This is aimed at people doing post-training and alignment work who already care about forgetting but want to move past accuracy-only checks. A reader already running similar evaluations could pull the framework and the comparative patterns without much trouble. It deserves a serious referee because the scope is large enough and the reframing is practical, even if the metric justification will need close attention in review.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces CapTrack, a capability-centric framework combining a behavioral taxonomy with capability-specific metrics to evaluate forgetting in LLM post-training. It redefines forgetting as systematic model drift degrading behavior and user experience (beyond parametric knowledge loss) and reports results from a large-scale empirical study across post-training algorithms, domains, and model families up to 80B parameters. Key findings include pronounced drift in robustness and default behaviors, with instruction fine-tuning inducing the strongest relative drift while preference optimization is more conservative and can partially recover capabilities; model-family differences persist with no universal mitigation identified.

Significance. If the taxonomy and metrics hold, the work is significant for shifting the field from an accuracy-centric view of forgetting to one that accounts for robustness and behavioral consistency, which directly impact user experience in deployed foundation models. The scale of the study (multiple algorithms, domains, and large models) and the provision of a reusable evaluation suite are strengths that could inform more careful post-training design. Explicit credit is due for the empirical breadth and the attempt to make the evaluation multifaceted rather than single-metric.

minor comments (3)

[Abstract, §1] The descriptions of the behavioral taxonomy and of the capability metrics used in CapTrack would each benefit from one additional sentence: one on construction and validation criteria, and one on how drift is quantified (e.g., a reference to a table or figure showing the metric definitions). This directly affects how readily the central empirical claims can be assessed.
[§6] The manuscript should include a short limitations subsection (or paragraph in §6) explicitly discussing potential confounds in the chosen domains and model families, as well as the degree to which the observed drifts generalize beyond the tested post-training setups.
[Figures/Tables] Figure and table captions throughout would be improved by stating the exact number of runs or seeds used for each reported statistic, to allow readers to gauge the stability of the cross-algorithm and cross-family comparisons.
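The reporting requested in the last comment could look like the sketch below. It assumes each statistic has one score per seed; the function names and the caption wording are hypothetical, and the numbers in the usage lines are invented.

```python
import statistics


def summarize(scores_by_seed: dict) -> dict:
    """Aggregate one statistic over seeds: count, mean, and sample std."""
    values = list(scores_by_seed.values())
    return {
        "n_seeds": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation is undefined for a single run.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def caption_suffix(summary: dict) -> str:
    """Text to append to a figure or table caption."""
    return (
        f"mean +/- std over n={summary['n_seeds']} seeds: "
        f"{summary['mean']:.3f} +/- {summary['std']:.3f}"
    )


# Invented scores keyed by seed.
print(caption_suffix(summarize({0: 0.70, 1: 0.72, 2: 0.74})))
```

Stating the seed count next to every statistic lets a reader tell a stable cross-family difference from run-to-run noise.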

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CapTrack, the recognition of its significance in shifting from accuracy-centric to capability-centric evaluation of forgetting, and the recommendation for minor revision. The empirical breadth across algorithms, domains, and model scales up to 80B parameters is a core strength we aimed to highlight.

Circularity Check

0 steps flagged

No significant circularity

This paper is a purely empirical evaluation study that introduces the CapTrack framework and reports results from large-scale experiments on LLM post-training. No derivation chain, equations, first-principles predictions, or fitted parameters exist that could reduce to inputs by construction. Claims rest on experimental measurements across algorithms, domains, and models rather than self-referential definitions or self-citation load-bearing steps. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5718 in / 999 out tokens · 27437 ms · 2026-05-25T06:39:17.976578+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: "We argue that this accuracy-centric view is insufficient... define forgetting as systematic model drift... CapTrack... behavioral taxonomy with an evaluation suite"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 27 internal anchors

[1]

Qwen3 Technical Report

URLhttps://arxiv.org/abs/2505.09388. Austin, J., Odena, A., Nye, M. I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C. J., Terry, M., Le, Q. V., and Sutton, C. Program synthesis with large language models.CoRR, abs/2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Program Synthesis with Large Language Models

URLhttps://arxiv.org/abs/2108.07732. Bai, Y., Tu, S., Zhang, J., 0015, H. P., Wang, X., Lv, X., Cao, S., Xu, J., 0001, L. H., Dong, Y., 0001, J. T., and Li, J. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. ACL, pp. 3639–3664,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

URLhttps://aclanthology.org/2025.acl-long.183/. Bean, A. M., Seedat, N., Chen, S., and Schwarz, J. R. Scales++: Compute efficient evaluation subset selection with cognitive scales embeddings.CoRR, abs/2510.26384,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

doi: 10.48550/arxiv.2510.26384. URL https://doi.org/10.48550/arxiv.2510.26384. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Ba...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.26384
[6]

URLhttps://arxiv.org/abs/2110.14168. Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning.CoRR, abs/2307.08691,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

doi: 10.48550/arxiv.2307.08691. URLhttps://doi.org/10.48550/arxiv.2307.08691. Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and 0001, M. G. A dataset of information- seeking questions and answers anchored in research papers.NAACL-HLT, pp. 4599–4610,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.08691
[8]

URLhttps://doi.org/10.18653/v1/2021.naacl-main.365

doi: 10.18653/v1/2021.naacl-main.365. URLhttps://doi.org/10.18653/v1/2021.naacl-main.365. Delange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence,

work page doi:10.18653/v1/2021.naacl-main.365 2021
[9]

doi: 10.1109/tpami.2021.3057446

ISSN 1939-3539. doi: 10.1109/tpami.2021.3057446. URL http://dx.doi.org/10.1109/TPAMI.2021.3057446. Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization.CoRR, abs/2402.01306,

work page doi:10.1109/tpami.2021.3057446 1939
[10]

KTO: Model Alignment as Prospect Theoretic Optimization

doi: 10.48550/arxiv.2402.01306. URL https: //doi.org/10.48550/arxiv.2402.01306. Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., and Auli, M. Eli5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.01306
[11]

D., Shen, H., Ram, P., 0015, Y

Fernando, H. D., Shen, H., Ram, P., 0015, Y. Z., Samulowitz, H., Baracaldo, N., and Chen, T. Mitigating forgetting in llm supervised fine-tuning and preference learning.CoRR, abs/2410.15483, October

work page arXiv
[12]

D., Shen, H., Ram, P., 0015, Y

doi: 10.48550/arxiv.2410.15483. URLhttps://doi.org/10.48550/arxiv.2410.15483. Garg, S., Singh, A., Singh, S., and Chopra, P. Ipo: Your language model is secretly a preference classifier.CoRR, abs/2502.16182,

work page doi:10.48550/arxiv.2410.15483
[13]

URLhttps://doi.org/10.48550/arxiv.2502.16182

doi: 10.48550/arxiv.2502.16182. URLhttps://doi.org/10.48550/arxiv.2502.16182. Google, G. T. Gemma 3 technical report,

work page doi:10.48550/arxiv.2502.16182
[14]

Gemma 3 Technical Report

URLhttps://arxiv.org/abs/2503.19786. Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., Narayana, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., Wu, J., Nudell, J., Niklaus, J., Nay, J. J., Choi, J. H., Tobia, K., Hagan, ...

work page internal anchor Pith review Pith/arXiv arXiv
[15]

URLhttps://doi.org/10.48550/arxiv.2308.11462

doi: 10.48550/arxiv.2308.11462. URLhttps://doi.org/10.48550/arxiv.2308.11462. Haque, N. Catastrophic forgetting in llms: A comparative analysis across language tasks.CoRR, abs/2504.01241,

work page doi:10.48550/arxiv.2308.11462
[16]

URLhttps://doi.org/10.48550/arxiv.2504.01241

doi: 10.48550/arxiv.2504.01241. URLhttps://doi.org/10.48550/arxiv.2504.01241. Harmon, J., Hochlehnert, A., Bethge, M., and Prabhu, A. Mapping post-training forgetting in language models at scale.CoRR, abs/2510.17776,

work page doi:10.48550/arxiv.2504.01241
[17]

URLhttps://doi.org/10

doi: 10.48550/arxiv.2510.17776. URLhttps://doi.org/10. 48550/arxiv.2510.17776. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset.CoRR, abs/2103.03874,

work page doi:10.48550/arxiv.2510.17776
[18]

Measuring Mathematical Problem Solving With the MATH Dataset

URL https://arxiv.org/abs/2103.03874. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., 0089, Y. Z., and Ginsburg, B. Ruler: Whats the real context size of your long-context language models?CoRR, abs/2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

RULER: What's the Real Context Size of Your Long-Context Language Models?

doi: 10.48550/arxiv.2404.06654. URLhttps://doi.org/10.48550/arxiv.2404.06654. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models.CoRR, abs/2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.06654
[20]

LoRA: Low-Rank Adaptation of Large Language Models

URLhttps://arxiv.org/abs/2106.09685. Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., and Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.CoRR, abs/2003.11080,

work page internal anchor Pith review Pith/arXiv arXiv 2003
[21]

doi: 10.1073/pnas.1611835114

ISSN 1091-6490. doi: 10.1073/pnas.1611835114. URLhttp://dx.doi.org/10.1073/ pnas.1611835114. Kotha, S., Springer, J. M., and Raghunathan, A. Understanding catastrophic forgetting in language models via implicit inference.CoRR, abs/2309.10105, September

work page doi:10.1073/pnas.1611835114
[22]

URL https://doi.org/10.48550/arxiv.2309.10105

doi: 10.48550/arxiv.2309.10105. URL https://doi.org/10.48550/arxiv.2309.10105. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

work page doi:10.48550/arxiv.2309.10105
[23]

R., Stevens, K., Barhoum, A., Duc, N

Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z. R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., ES, S., Suri, S., Glushkov, D., Dantuluri, A., Maguire, A., Schuhmann, C., Nguyen, H., and Mattick, A. Openassistant conversations - democratizing large language model alignment.CoRR, abs/2304.07327,

work page arXiv
[24]

R., Stevens, K., Barhoum, A., Duc, N

doi: 10.48550/arxiv.2304.07327. URLhttps://doi.org/10.48550/arxiv.2304.07327. Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirz...

work page doi:10.48550/arxiv.2304.07327
[25]

Revisiting catastrophic forgetting in large language model tuning

Li, H., Ding, L., Fang, M., and Tao, D. Revisiting catastrophic forgetting in large language model tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4297–4308. Association for Computational Linguistics,

work page 2024
[26]

URLhttps://aclanthology

doi: 10.18653/v1/2024.findings-emnlp.249. URLhttps://aclanthology. org/2024.findings-emnlp.249/. Li, J., Li, J., Wang, Y., Chang, Y., and Wu, Y. Structflowbench: A structured flow benchmark for multi-turn instruction following.ACL, pp. 9322–9341, 2025a. URLhttps://aclanthology.org/2025.findings-acl.486/. Li, S. S., Mun, J., Brahman, F., Ilgen, J., Tsvetko...

work page doi:10.18653/v1/2024.findings-emnlp.249 2024
[27]

URLhttps://doi.org/10.18653/v1/2022.acl-long.229

doi: 10.18653/v1/2022.acl-long.229. URLhttps://doi.org/10.18653/v1/2022.acl-long.229. Lin, Y., Lin, H., Xiong, W., Diao, S., Liu, J., Zhang, J., Pan, R., Wang, H., Hu, W., Zhang, H., Dong, H., Pi, R., Zhao, H., Jiang, N., Ji, H., Yao, Y., and Zhang, T. Mitigating the alignment tax of rlhf,

work page doi:10.18653/v1/2022.acl-long.229 2022
[28]

13 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Liu, D

URL https://arxiv.org/abs/2309.06256. 13 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Liu, D. and Niehues, J. Conditions for catastrophic forgetting in multilingual translation.CoRR, abs/2510.19546,

work page arXiv
[29]

URLhttps://doi.org/10.48550/arxiv.2510.19546

doi: 10.48550/arxiv.2510.19546. URLhttps://doi.org/10.48550/arxiv.2510.19546. Liu, J., Liu, H., Xiao, L., Wang, Z., Liu, K., Gao, S., Zhang, W., Zhang, S., and Chen, K. Are your llms capable of stable reasoning?ACL, pp. 17594–17632,

work page doi:10.48550/arxiv.2510.19546
[30]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

URLhttps://aclanthology.org/2025.findings-acl.905/. Luo, Y., Yang, Z., Meng, F., Li, Y., 0016, J. Z., and 0004, Y. Z. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.CoRR, abs/2308.08747, August

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

doi: 10.48550/arxiv.2308.08747. URLhttps://doi.org/10.48550/arxiv.2308.08747. Ma, Z., Huang, W., Zhang, J., Gupta, T., and Krishna, R. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.08747
[32]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Mallen, A., Asai, A., Zhong, V., Das, R., Hajishirzi, H., and Khashabi, D. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.CoRR, abs/2212.10511,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

doi: 10.48550/arxiv.2212.10511. URLhttps://doi.org/10.48550/arxiv.2212.10511. Mazeika, M., Phan, L., Yin, X., Zou, A., 0001, Z. W., Mu, N., Sakhaee, E., Li, N., Basart, S., 0026, B. L., Forsyth, D. A., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.CoRR, abs/2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.10511
[34]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

doi: 10.48550/arxiv.2402.04249. URL https://doi.org/10.48550/arxiv.2402.04249. McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 ofPsychology of Learning and Motivation, pp. 109–165. Academic Press,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.04249
[35]

URL https://www.sciencedirect.com/science/article/ pii/S0079742108605368

doi: https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/ pii/S0079742108605368. Meta, L. T. The llama 3 herd of models,

work page doi:10.1016/s0079-7421(08)60536-8
[36]

The Llama 3 Herd of Models

URLhttps://arxiv.org/abs/2407.21783. Niu, C., Wu, Y., Zhu, J., Xu, S., Shum, K., Zhong, R., Song, J., and Zhang, T. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volu...

work page internal anchor Pith review Pith/arXiv arXiv
[37]

doi: 10.18653/v1/2024.acl-long.585

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.585. URLhttps://aclanthology.org/2024.acl-long.585/. OpenAI. Openai o1 system card,

work page doi:10.18653/v1/2024.acl-long.585 2024
[38]

OpenAI o1 System Card

URLhttps://arxiv.org/abs/2412.16720. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Training language models to follow instructions with human feedback

URLhttps://arxiv.org/abs/2203.02155. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Zheng, Y

ISSN 0893-6080. doi: https://doi.org/10.1016/j. neunet.2019.01.012. URLhttps://www.sciencedirect.com/science/article/pii/S0893608019300231. Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search,

work page doi:10.1016/j 2019
[41]

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C

URLhttps://arxiv.org/abs/2305.03495. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model.CoRR, abs/2305.18290,

work page arXiv
[42]

doi: 10.48550/arxiv.2305. 18290. URLhttps://doi.org/10.48550/arxiv.2305.18290. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305
[43]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

doi: 10.48550/arxiv.2402.03300. URLhttps://doi.org/10.48550/arxiv.2402.03300. Shi, F., Suzgun, M., Freitag, M., 0002, X. W., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., 0001, D. D., and Wei, J. Language models are multilingual chain-of-thought reasoners.CoRR, abs/2210.03057,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300
[44]

Language Models are Multilingual Chain-of-Thought Reasoners

doi: 10.48550/arxiv.2210.03057. URLhttps://doi.org/10.48550/arxiv.2210.03057. Team, M.-A.-P., Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., Liu, M., Liang, Y., Jin, X., Wei, Z., Zheng, C., Deng, K., Guo, S., Jia, S., Jiang, S., Liao, Y., Li, R., Li, Q., Li, S., Li, Y., Li, Y., Ma, D., Ni, Y., 14 CapTrack: Multifaceted Evaluation of Forgetting in ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.03057
[45]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

URLhttps://arxiv.org/abs/2502.14739. Thede, L., Roth, K., Hénaff, O. J., Bethge, M., and Akata, Z. Reflecting on the state of rehearsal-free continual learning with pretrained models.CoLLAs, pp. 1076–1093,

work page internal anchor Pith review Pith/arXiv arXiv
[46]

press/v274/thede25a.html

URLhttps://proceedings.mlr. press/v274/thede25a.html. Tie, G., Zhao, Z., Song, D., Wei, F., Zhou, R., Dai, Y., Yin, W., Yang, Z., Yan, J., 0003, Y. S., Dai, Z., Xie, Y., Cao, Y., 0001, L. S., 0001, P. Z., 0001, L. H., Chen, H., 0006, Y. Z., Wen, Q., 0001, T. L., Gong, N. Z., Tang, J., Xiong, C., 0001, H.J., Yu, P.S., and0001, J.G. Asurveyonpost-trainingof...

work page arXiv
[47]

press/v274/thede25a.html

doi: 10.48550/arxiv.2503.06072. URLhttps://doi.org/10.48550/arxiv.2503.06072. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., and Chen, W. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,

work page doi:10.48550/arxiv.2503.06072
[48]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

URLhttps://arxiv.org/abs/2406.01574. Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Lopes, R. G., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.ICML, pp. 23965–23998,

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Yan, F., Mao, H., Ji, C

URL http://papers.nips.cc/paper_files/paper/2023/hash/ 1644c9af28ab7916874f6fd6228a9bcf-Abstract-Conference.html. Yan, F., Mao, H., Ji, C. C.-J., Zhang, T., Patil, S. G., Stoica, I., and Gonzalez, J. E. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard. html,

work page 2023
[50]

Detecting causal language use in science findings

Yu, B., Li, Y., and Wang, J. Detecting causal language use in science findings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4664–4674, Hong Kong, China, November

work page 2019
[51]

doi: 10.18653/v1/D19-1473

Association for Computational Linguistics. doi: 10.18653/v1/D19-1473. URLhttps://aclanthology.org/ D19-1473. Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y. Language models are super mario: Absorbing abilities from homologous models as a free lunch,

work page doi:10.18653/v1/d19-1473
[52]

Yıldız, C., Ravichandran, N

URLhttps://arxiv.org/abs/2311.03099. Yıldız, C., Ravichandran, N. K., Punia, P., Bethge, M., and Ermis, B. Investigating continual pretraining in large language models: Insights and implications.arXiv [cs.CL], February

work page arXiv
[53]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

doi: 10.1145/3777411. URLhttps://doi.org/10.1145/3777411. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, C.-Y., Zhuang, Y., Krishnamurthy, N., Chen, Z., 15 CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training Koyejo, S., Arik, S. O., Li, D. S., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3777411
[54]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

[55]

Instruction-Following Evaluation for Large Language Models

doi: 10.48550/arxiv.2311.07911. URL https://doi.org/10.48550/arxiv.2311.07911. Appendix. This appendix provides additional methodological details, analyses, and results that support the findings presented in the main paper. Its primary purpose is to improve transparency and reproducibi...

[56]

All runs use a similar effective batch size, determined by the number of nodes, the per-device batch size, and the gradient accumulation rate

is used where supported to improve memory efficiency. All runs use a similar effective batch size, determined by the number of nodes, the per-device batch size, and the gradient accumulation rate. Random seeds are fixed across runs to ensure reproducibility. Instruction Fine-Tuning. IFT is performed using standard next-token prediction with an NLL loss com...

[57]

As in the main paper, each spider plot summarizes capability-level changes across the CAN, WILL, and HOW categories, aggregated by model family

D.1 Extended Spider Plot Results Figure 5 extends the spider plot analysis from the main paper by including the missing IFT+DPO configuration as well as the corresponding results for the medical domain. As in the main paper, each spider plot summarizes capability-level changes across the CAN, WILL, and HOW categories, aggregated by model family. These add...

[58]

Across all settings, the post-training algorithm and overall training budget are held fixed

with models trained on a domain-specific legal mixture. Across all settings, the post-training algorithm and overall training budget are held fixed. This experiment is designed to isolate the contribution of the data source to forgetting behavior, complementing the aggregated analysis presented in the main paper. We report the full, non-aggregated results...

[59]

free lunch

with density 0.1. For each method, we interpolate between the OOB model and the instruction fine-tuned (IFT) model using different merge weights, and compute stability and plasticity for each merged checkpoint. Figure 13 reports stability-plasticity curves across merging methods. Across all methods, we observe a consistent stability-plasticity trade-off: increasing r...

