
arxiv: 2406.11354 · v3 · submitted 2024-06-17 · 💻 cs.CL · cs.AI · cs.CV

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

Pith reviewed 2026-05-24 00:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords catastrophic forgetting · large language models · multimodal LLMs · self-decompression · Tree Generation · supervised fine-tuning · knowledge preservation · instruction tuning

The pith

Tree Generation creates synthetic instruction data from an LLM that, when mixed into SFT, reduces language forgetting in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose prior knowledge when fine-tuned on new tasks, and multimodal versions suffer extra decline on pure language benchmarks. The paper presents Tree Generation as a way to unpack an LLM's existing knowledge into a reusable corpus of synthetic supervised fine-tuning examples. Adding this corpus during later instruction tuning measurably limits the drop in language performance. A sympathetic reader would see this as a practical route to update models on domain data without having to store or replay the entire original pretraining set.

Core claim

Tree Generation (TG) is a model-agnostic self-decompression procedure that converts the parametric knowledge inside an LLM into an explicit training corpus by producing synthetic instruction-response pairs. TG-SFT applies this corpus during supervised fine-tuning of multimodal LLMs; the resulting models exhibit substantially less degradation on language-only benchmarks than models trained on the same target data without the added corpus.

What carries the argument

Tree Generation (TG), a procedure that synthetically expands an LLM's internal knowledge into instruction-tuning examples for later reuse.
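
The summary above does not spell out the expansion algorithm, so the following is only a minimal sketch of one plausible reading: a breadth-first tree in which every node is an instruction, the model answers it, and the answer seeds follow-up instructions. The function `tree_generate`, its parameters `max_depth` and `branching`, and the two stand-in callables are hypothetical names introduced here for illustration; they are not taken from the paper.

```python
from collections import deque
from typing import Callable, Dict, List


def tree_generate(
    root_prompts: List[str],
    answer: Callable[[str], str],
    follow_ups: Callable[[str, str], List[str]],
    max_depth: int = 3,
    branching: int = 2,
) -> List[Dict[str, object]]:
    """Breadth-first expansion of instruction-response pairs (illustrative only)."""
    corpus: List[Dict[str, object]] = []
    queue = deque((prompt, 1) for prompt in root_prompts)
    while queue:
        instruction, depth = queue.popleft()
        response = answer(instruction)  # in a real run: query the source LLM
        corpus.append(
            {"instruction": instruction, "response": response, "depth": depth}
        )
        if depth < max_depth:
            # Keep at most `branching` children per node to bound the tree size.
            for child in follow_ups(instruction, response)[:branching]:
                queue.append((child, depth + 1))
    return corpus


# Stand-in callables so the sketch runs without a model (assumptions, not the paper's prompts).
def fake_answer(question: str) -> str:
    return f"answer to: {question}"


def fake_follow_ups(question: str, response: str) -> List[str]:
    return [f"{question} / sub-question {i}" for i in range(3)]


corpus = tree_generate(["What is type 2 diabetes?"], fake_answer, fake_follow_ups)
print(len(corpus), "synthetic pairs")
```

With one root, depth 3 and branching 2, the sketch yields a complete three-layer tree. A real run would replace the stand-ins with calls to the LLM being decompressed.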

If this is right

  • MLLMs can acquire visual capabilities while retaining more of their original language competence.
  • The same decompression step can be applied to plain LLMs before any domain-specific fine-tuning.
  • Once generated, the synthetic corpus can be stored and reused across multiple downstream fine-tuning runs without additional model queries.
  • The approach requires no changes to model architecture or training objective beyond data mixture.
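
Since the only intervention is the data mixture, the last point can be illustrated with a small mixing helper. This is a sketch under stated assumptions: `mix_corpora` and its `tg_fraction` parameter are hypothetical, and the 20% share used below is an arbitrary placeholder, not a ratio reported in the paper.

```python
import random
from typing import List, Sequence


def mix_corpora(
    target: Sequence[dict],
    tg: Sequence[dict],
    tg_fraction: float,
    seed: int = 0,
) -> List[dict]:
    """Return all target examples plus enough TG examples that the TG share
    of the mixed set is roughly `tg_fraction` (capped by the TG corpus size)."""
    if not 0.0 <= tg_fraction < 1.0:
        raise ValueError("tg_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_tg = round(len(target) * tg_fraction / (1.0 - tg_fraction))
    n_tg = min(n_tg, len(tg))
    mixed = list(target) + rng.sample(list(tg), n_tg)
    rng.shuffle(mixed)  # interleave so batches contain both sources
    return mixed


# Toy records standing in for real SFT examples.
target = [{"source": "target", "id": i} for i in range(80)]
tg = [{"source": "tg", "id": i} for i in range(100)]
mixed = mix_corpora(target, tg, tg_fraction=0.2)
print(len(mixed), sum(ex["source"] == "tg" for ex in mixed))
```

Because the TG corpus is generated once and stored, the same `tg` list could be reused across several fine-tuning runs with different target sets.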

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continual-learning pipelines could replace replay buffers with periodically regenerated TG corpora.
  • The method might extend to other modalities if the base model can be prompted to produce cross-modal instruction pairs.
  • If the synthetic data proves high-fidelity, pretraining checkpoints could be discarded after a single TG pass, lowering storage costs.

Load-bearing premise

The synthetic examples generated by Tree Generation accurately capture the original LLM's knowledge without distortion or loss of fidelity.

What would settle it

Training an MLLM on target data plus the TG corpus and observing no improvement, or a larger drop, on language benchmarks relative to target data alone would falsify the central claim.
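
The test described above reduces to comparing two forgetting scores. A minimal sketch follows, assuming forgetting is measured as the mean drop in benchmark score relative to the base LLM. The helper names, the benchmark labels, and every number are placeholders chosen for illustration; none are results from the paper.

```python
from typing import Dict


def mean_forgetting(base: Dict[str, float], tuned: Dict[str, float]) -> float:
    """Average drop (base minus tuned) over benchmarks present in both tables."""
    shared = sorted(set(base) & set(tuned))
    if not shared:
        raise ValueError("no shared benchmarks to compare")
    return sum(base[name] - tuned[name] for name in shared) / len(shared)


def claim_survives(
    base: Dict[str, float],
    sft_only: Dict[str, float],
    sft_plus_tg: Dict[str, float],
    margin: float = 0.0,
) -> bool:
    """True if adding the TG corpus reduces forgetting by more than `margin`."""
    reduction = mean_forgetting(base, sft_only) - mean_forgetting(base, sft_plus_tg)
    return reduction > margin


# Placeholder scores (hypothetical, for illustration only).
base = {"bench_a": 50.0, "bench_b": 60.0}
sft_only = {"bench_a": 44.0, "bench_b": 54.0}
sft_plus_tg = {"bench_a": 48.0, "bench_b": 58.0}

print(mean_forgetting(base, sft_only), mean_forgetting(base, sft_plus_tg))
print(claim_survives(base, sft_only, sft_plus_tg))
```

In practice the `margin` would be set from run-to-run variance, so that a difference within noise does not count as support for the claim.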

Figures

Figures reproduced from arXiv: 2406.11354 by Jianwei Yin, Kyusong Lee, Leigang Sha, Ruochen Xu, Tiancheng Zhao, Yutao Sun, Zilun Zhang.

Figure 1. The motivation of our work. Shadow represents the error bar. The SFT of an MLLM harms the language ability of its LLM backbone (the MLLM begins to forget its general language ability as training proceeds). We choose the LLaMA2-7B-chat model as the LLM backbone for the experiments. Details of this experiment can be found in Appendix A.1. The first data point is evaluated from the checkpoint of 3000 ste…
Figure 2. TG-SFT structure overview, illustrating a three-layer complete tree structure. In practice, the depth of …
Figure 3. 𝑆𝐷 [INST] <<SYS>> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. The topic of this conversation will be focused on all kinds of world knowledge. <</SYS>> What is type 2 diabetes? What are its causes? [/INST] Type 2 diabetes is a prevalent and complex metabolic disorder that affects how our body regu…
Figure 5. Number of turns in TG-SFT decompressed data. … the 2-turn corpus achieves the best performance in LLM benchmarks compared to the other two configurations. This could be attributed to the G-turn corpus being too diverse in context length and the 1-turn corpus being too short, which harms the model during SFT. 5 Conclusion & Future Work. To address the problem of catastrophic forgetting in LLMs and MLLMs, w…
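
The Figure 3 caption shows a prompt wrapped in the LLaMA-2 chat markers (`[INST]`, `<<SYS>>`, `<</SYS>>`, `[/INST]`). A minimal sketch of how such a single-turn prompt could be assembled is given below. The helper `build_llama2_prompt` is a hypothetical name, and the newline placement follows the commonly documented LLaMA-2 chat layout rather than anything recoverable from the flattened caption, so treat it as an assumption.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and one user turn in LLaMA-2 chat markers.

    Newline placement is an assumption based on the commonly documented
    LLaMA-2 chat layout; the caption above has lost its line breaks.
    """
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


system_message = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user’s "
    "questions. The topic of this conversation will be focused on all kinds "
    "of world knowledge."
)
prompt = build_llama2_prompt(
    system_message, "What is type 2 diabetes? What are its causes?"
)
print(prompt)
```

The model's completion after `[/INST]` would then serve as the response half of a synthetic instruction-response pair.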
Original abstract

Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a model-agnostic self-decompression method called Tree Generation (TG) that extracts knowledge from an LLM into a synthetic training corpus. It focuses on the TG-SFT variant, which generates instruction-tuning data; the central claim is that incorporating this corpus during supervised fine-tuning of MLLMs (e.g., LLaVA) significantly reduces catastrophic forgetting on language benchmarks relative to standard SFT.

Significance. If the empirical results hold, the approach would supply a parameter-free, model-internal mechanism for preserving base-LLM capabilities when extending to multimodal settings. This addresses a documented practical limitation of current MLLMs without requiring external data or architectural changes.

major comments (2)
  1. [Abstract] The claim that TG-SFT 'significantly reduce[s] the forgetting problem' is stated without any quantitative results, baselines, metrics, or experimental protocol. No numbers, tables, or figures are referenced to support the reduction.
  2. The method's validity rests on the unverified assumption that Tree Generation produces synthetic SFT examples whose factual and reasoning content matches the original LLM without systematic omission or hallucination. No fidelity checks (perplexity on held-out pre-training data, knowledge-probe accuracy, or generated-vs-human pair comparison) are described.
minor comments (1)
  1. [Abstract] The phrase 'dumped corpus' is used without a concise definition or high-level description of the decompression procedure.
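
One of the fidelity checks named in the second major comment, knowledge-probe accuracy, can be sketched as follows. This is an illustrative assumption about how such a check might be scored (normalized substring match against accepted answers); the function names and the toy probes are hypothetical and do not come from the paper or the report.

```python
from typing import Callable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is not format-sensitive."""
    return " ".join(text.lower().split())


def probe_accuracy(
    probes: Sequence[Tuple[str, Sequence[str]]],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of probes whose answer contains at least one accepted string."""
    if not probes:
        raise ValueError("need at least one probe")
    hits = 0
    for question, accepted in probes:
        prediction = normalize(answer_fn(question))
        if any(normalize(candidate) in prediction for candidate in accepted):
            hits += 1
    return hits / len(probes)


# Toy probes and a lookup-table "model" (placeholders for a real evaluation).
probes = [
    ("Capital of France?", ["Paris"]),
    ("Chemical symbol for water?", ["H2O"]),
    ("Largest planet in the solar system?", ["Jupiter"]),
    ("Author of Hamlet?", ["Shakespeare"]),
]
canned = {
    "Capital of France?": "The capital is  paris.",
    "Chemical symbol for water?": "It is H2O.",
    "Largest planet in the solar system?": "Saturn",
    "Author of Hamlet?": "William Shakespeare wrote it.",
}
print(probe_accuracy(probes, lambda q: canned[q]))
```

Running the same probes against the source LLM and against a model trained only on the generated corpus would give a rough estimate of how much knowledge survives decompression.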

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need to substantiate the core assumptions of Tree Generation. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that TG-SFT 'significantly reduce[s] the forgetting problem' is stated without any quantitative results, baselines, metrics, or experimental protocol. No numbers, tables, or figures are referenced to support the reduction.

    Authors: We agree the abstract would be strengthened by referencing the supporting evidence. In the revised manuscript we will update the abstract to cite the key quantitative findings (e.g., the measured reduction in language-benchmark degradation relative to standard SFT), the evaluation metrics, and the relevant tables/figures from the experimental section. revision: yes

  2. Referee: [—] The method's validity rests on the unverified assumption that Tree Generation produces synthetic SFT examples whose factual and reasoning content matches the original LLM without systematic omission or hallucination. No fidelity checks (perplexity on held-out pre-training data, knowledge-probe accuracy, or generated-vs-human pair comparison) are described.

    Authors: The recursive tree-expansion procedure is intended to elicit comprehensive knowledge from the source LLM, and the observed reduction in catastrophic forgetting supplies indirect support for the quality of the generated data. We acknowledge that explicit fidelity verification was omitted from the initial submission. In revision we will add a dedicated limitations paragraph discussing this assumption and will report basic fidelity metrics (e.g., knowledge-probe accuracy on held-out queries) using the generated corpus. revision: partial

Circularity Check

0 steps flagged

No circularity: method is a proposed generation procedure whose efficacy is claimed to be shown empirically

full rationale

The provided abstract and description outline a new Tree Generation procedure that produces synthetic SFT data from an LLM, followed by an empirical claim that including this data during MLLM fine-tuning reduces forgetting. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the given text. The derivation chain does not reduce any result to its own inputs by construction; the central claim remains an external empirical assertion about the generated corpus's effect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.


