CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3
The pith
Post-training of LLMs induces forgetting as drift in robustness and default behaviors beyond lost parametric knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using CapTrack, the study finds that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.
What carries the argument
CapTrack, a capability-centric framework that combines a behavioral taxonomy with an evaluation suite of capability-specific metrics to detect systematic model drift after post-training.
If this is right
- Instruction fine-tuning produces stronger relative drift than preference optimization across the tested capabilities.
- Preference optimization steps can partially restore capabilities that were lost during earlier instruction fine-tuning.
- Drift patterns remain consistent within each model family even when training data and algorithms change.
- No single post-training algorithm or mitigation strategy eliminates drift across all model families and domains.
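To make "relative drift" in the bullets above concrete, here is a minimal sketch of one way such a quantity could be computed. This summary does not give CapTrack's actual metric definitions, so the function name `relative_drift`, the capability labels, and all scores below are hypothetical illustrations, not the paper's method or results.

```python
def relative_drift(base_scores, post_scores):
    """Per-capability relative change: (post - base) / |base|.

    Hypothetical definition for illustration only; CapTrack's own
    capability-specific metrics may be defined differently.
    """
    drift = {}
    for capability, base in base_scores.items():
        if base == 0:
            raise ValueError(f"base score for {capability!r} is zero")
        drift[capability] = (post_scores[capability] - base) / abs(base)
    return drift


# Made-up scores for a model before and after post-training.
base = {"knowledge": 0.80, "robustness": 0.50, "default_behavior": 0.40}
post = {"knowledge": 0.76, "robustness": 0.40, "default_behavior": 0.30}

drift = relative_drift(base, post)

# Print capabilities from most to least degraded.
for capability, value in sorted(drift.items(), key=lambda kv: kv[1]):
    print(f"{capability:>16}: {value:+.1%}")
```

Normalizing by the pre-training score is one way to compare capabilities measured on different scales. Whether the paper normalizes this way is an assumption.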
Where Pith is reading between the lines
- CapTrack could be extended to monitor drift during continual learning or domain adaptation without full retraining.
- If default-behavior drift accumulates, it may explain gradual erosion of safety alignments in deployed models.
- The taxonomy might be tested on new capabilities such as tool use or long-context reasoning to check coverage.
Load-bearing premise
The behavioral taxonomy and capability-specific metrics in CapTrack accurately capture the kinds of model drift that actually degrade real user experience.
What would settle it
A controlled experiment that applies post-training, runs the full CapTrack suite showing no measurable drift on any tracked capability, yet finds clear drops in independent user preference ratings or task performance outside the taxonomy.
Original abstract
Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CapTrack, a capability-centric framework combining a behavioral taxonomy with capability-specific metrics to evaluate forgetting in LLM post-training. It redefines forgetting as systematic model drift degrading behavior and user experience (beyond parametric knowledge loss) and reports results from a large-scale empirical study across post-training algorithms, domains, and model families up to 80B parameters. Key findings include pronounced drift in robustness and default behaviors, with instruction fine-tuning inducing the strongest relative drift while preference optimization is more conservative and can partially recover capabilities; model-family differences persist with no universal mitigation identified.
Significance. If the taxonomy and metrics hold, the work is significant for shifting the field from an accuracy-centric view of forgetting to one that accounts for robustness and behavioral consistency, which directly impact user experience in deployed foundation models. The scale of the study (multiple algorithms, domains, and large models) and the provision of a reusable evaluation suite are strengths that could inform more careful post-training design. Explicit credit is due for the empirical breadth and the attempt to make the evaluation multifaceted rather than single-metric.
minor comments (3)
- [Abstract, §1] The descriptions of the behavioral taxonomy and of the capability-specific metrics in CapTrack would each benefit from one additional sentence: one on construction and validation criteria, and one on how drift is quantified (e.g., a pointer to a table or figure showing the metric definitions). This directly affects how readily the central empirical claims can be assessed.
- [§6] The manuscript should include a short limitations subsection (or paragraph in §6) explicitly discussing potential confounds in the chosen domains and model families, as well as the degree to which the observed drifts generalize beyond the tested post-training setups.
- [Figures/Tables] Figure and table captions throughout would be improved by stating the exact number of runs or seeds used for each reported statistic, to allow readers to gauge the stability of the cross-algorithm and cross-family comparisons.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of CapTrack, the recognition of its significance in shifting from accuracy-centric to capability-centric evaluation of forgetting, and the recommendation for minor revision. The empirical breadth across algorithms, domains, and model scales up to 80B parameters is a core strength we aimed to highlight.
Circularity Check
No significant circularity
full rationale
This paper is a purely empirical evaluation study that introduces the CapTrack framework and reports results from large-scale experiments on LLM post-training. It contains no derivation chain, equations, first-principles predictions, or fitted parameters that could reduce to their inputs by construction. Its claims rest on experimental measurements across algorithms, domains, and models rather than on self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  "We argue that this accuracy-centric view is insufficient... define forgetting as systematic model drift... CapTrack... behavioral taxonomy with an evaluation suite"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  "forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix
This appendix provides additional methodological details, analyses, and results that support the findings presented in the main paper. Its primary purpose is to improve transparency and reproducibi...
... is used where supported to improve memory efficiency. All runs use a similar effective batch size, determined by the number of nodes, the per-device batch size, and the gradient accumulation rate. Random seeds are fixed across runs to ensure reproducibility.
Instruction Fine-Tuning. IFT is performed using standard next-token prediction with an NLL loss com...
[57]
D.1 Extended Spider Plot Results Figure 5 extends the spider plot analysis from the main paper by including the missing IFT+DPO configuration as well as the corresponding results for the medical domain. As in the main paper, each spider plot summarizes capability-level changes across the CAN, WILL, and HOW categories, aggregated by model family. These add...
work page 2023
-
[58]
Across all settings, the post-training algorithm and overall training budget are held fixed
with models trained on a domain-specific legal mixture. Across all settings, the post-training algorithm and overall training budget are held fixed. This experiment is designed to isolate the contribution of the data source to forgetting behavior, complementing the aggregated analysis presented in the main paper. We report the full, non-aggregated results...
work page 2022
... with density 0.1. For each method, we interpolate between the OOB model and the instruction fine-tuned (IFT) model using different merge weights, and compute stability and plasticity for each merged checkpoint. Figure 13 reports stability-plasticity curves across merging methods. Across all methods, we observe a consistent stability-plasticity trade-off: increasing r...
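The interpolation step above can be sketched with plain linear weight interpolation. This is a stand-in for illustration, not one of the merging methods evaluated in the paper (the "density 0.1" setting belongs to a method whose name is cut off in this excerpt). The parameter names and values are hypothetical, and the stability and plasticity metrics are not reproduced here.

```python
def interpolate(oob_params, ift_params, weight):
    """Return (1 - weight) * oob + weight * ift for every parameter.

    weight = 0.0 recovers the OOB model, weight = 1.0 the IFT model.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if oob_params.keys() != ift_params.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [(1.0 - weight) * o + weight * f
               for o, f in zip(oob_params[name], ift_params[name])]
        for name in oob_params
    }


# Toy "checkpoints": parameter name -> flat list of values (hypothetical).
oob = {"layer.weight": [0.0, 1.0, 2.0], "layer.bias": [0.5]}
ift = {"layer.weight": [1.0, 1.0, 0.0], "layer.bias": [1.5]}

# One merged checkpoint per merge weight; each would then be evaluated.
merge_weights = (0.0, 0.25, 0.5, 0.75, 1.0)
merged = {w: interpolate(oob, ift, w) for w in merge_weights}
```

Sweeping the merge weight produces a family of checkpoints between the two endpoints, which is the kind of set a stability-plasticity curve would be traced over.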