
arxiv: 2511.08877 · v2 · submitted 2025-11-12 · 💻 cs.CL · cs.AI

Hierarchical Memorization in Large Language Models: Evidence from Citation Generation

Pith reviewed 2026-05-17 23:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: large language models · memorization · citation generation · hallucination · training data redundancy · hierarchical learning

The pith

Large language models memorize citation details in layers, recalling titles and first authors before venues and numeric fields as training repetitions increase, while publication years stay largely unlearned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how LLMs generate citations and treats the number of times a paper has been cited as a stand-in for how often its details appeared in the model's training data. It reports that factual accuracy rises log-linearly with this count, but that different pieces of information are acquired at different rates: titles and first authors emerge early, venues and numeric details come later, and publication years stay mostly unlearned. This layered pattern matters because it helps explain why LLMs keep producing plausible but invented citations, and it points to the uneven way training data gets stored inside the model. The work tests this by prompting GPT-4.1 to produce 100 citations across twenty computer-science domains and manually checking them against real metadata.

Core claim

Memorization of bibliographic records inside LLMs is not an all-or-nothing event but a graduated, hierarchically layered process. Factual accuracy scales log-linearly with citation count, with an inflection near 90 citations and near-verbatim reproduction after roughly 1,200 citations. Within any single record, titles and first authors are recovered first, venues and numeric fields demand substantially more redundancy, and publication years remain essentially unlearned. Records that share similar titles or authors can still interfere with one another even when each is individually well represented.
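The log-linear scaling described above can be illustrated with a small sketch. It assumes the semi-log reading of the claim, accuracy = intercept + slope * log10(citation count), which is only one possible interpretation. The function name `fit_semilog` and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_semilog(citations, accuracy):
    """Least-squares fit of accuracy = intercept + slope * log10(citations)."""
    x = np.log10(np.asarray(citations, dtype=float))
    y = np.asarray(accuracy, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

# Synthetic toy data (NOT from the paper): accuracy gains 0.2 per tenfold
# increase in citation count, with no noise.
toy_citations = np.array([10, 30, 100, 300, 1000, 3000])
toy_accuracy = 0.2 + 0.2 * np.log10(toy_citations)

slope, intercept = fit_semilog(toy_citations, toy_accuracy)
print(f"accuracy gain per decade of citations: {slope:.3f}")
print(f"intercept at 1 citation: {intercept:.3f}")
```

Under this reading, the slope is the accuracy gained for every tenfold increase in citations, so a straight line on a log-scaled x-axis is what "log-linear" would look like in a plot.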

What carries the argument

Citation count used as a proxy for training-data redundancy, which orders the acquisition thresholds for different metadata fields inside one bibliographic entry.

If this is right

  • Generated citations become measurably more accurate as the underlying paper's citation count rises in a log-linear pattern.
  • Near-verbatim reproduction of a full citation record occurs only after the record exceeds roughly 1,200 citations.
  • Overlapping titles or authors can still produce conflations even for records that individually exceed the saturation threshold.
  • Publication years stay poorly reproduced regardless of how high the citation count becomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered pattern may govern factual recall in domains other than citations whenever training data contains repeated but unevenly detailed statements.
  • Targeted increases in redundancy for numeric fields or dates during continued pretraining could reduce specific classes of hallucination without raising overall data volume.
  • Corpus curators could prioritize balanced coverage of weaker fields such as years rather than simply adding more high-count examples.

Load-bearing premise

Citation count serves as a reliable proxy for the redundancy of a bibliographic record in the pretraining corpus.

What would settle it

Direct measurement of actual training-data frequency for specific papers showing that accuracy no longer tracks citation count once true redundancy is controlled.

Figures

Figures reproduced from arXiv: 2511.08877 by Junichiro Niimi.

Figure 1: Prompt to generate bibliographic information
Figure 2: Relationship between citation frequency and generation fidelity. Each dot represents a factual record

Original abstract

Large language models (LLMs) generate fluent text across a wide range of tasks, but the fabrication of non-existent academic citations remains a critical and well-documented failure mode. Building on prior work that frames hallucination and verbatim memorization as outcomes of the same probabilistic process, this study uses citation count as a proxy for training data redundancy and asks how this redundancy is internally structured within a single bibliographic record. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, measuring factual fidelity via cosine similarity against authentic metadata. We find that (i) factual accuracy varies substantially across domains and scales log-linearly with citation count, (ii) the model crosses two empirically identifiable thresholds: an inflection around 90 citations and a saturation point near 1,200 citations beyond which records are reproduced nearly verbatim, (iii) memorization is hierarchical, with titles and first authors recalled earliest while venues and numeric fields require far greater redundancy and publication years remain essentially unlearned, and (iv) even highly cited records can be conflated when their titles and authors overlap, an effect interpretable as spurious-attractor interference. Memorization in LLMs is therefore not a binary on/off state but a graduated, hierarchically layered phenomenon shaped by the uneven distribution of knowledge in the pretraining corpus.
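To make the fidelity metric concrete, here is a minimal sketch of cosine similarity between a generated citation and its reference record. This page does not say which text representation the paper uses, so the bag-of-words vectors below are an assumption chosen to keep the example dependency-free. The helper names and the "generated" string with a wrong volume and year are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(generated, reference):
    """Cosine similarity between bag-of-words count vectors of two strings."""
    a, b = Counter(tokenize(generated)), Counter(tokenize(reference))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Empty strings have no direction, so define their similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

reference = ("Attention is all you need. Advances in Neural Information "
             "Processing Systems, 30, 2017")
# Hypothetical model output: correct title and venue, wrong volume and year.
generated = ("Attention is all you need. Advances in Neural Information "
             "Processing Systems, 31, 2018")

print(f"similarity = {cosine_similarity(generated, reference):.3f}")
```

A whole-record score like this stays high when only the numeric fields are wrong, which is one reason a per-field comparison is needed to expose the hierarchy the paper describes.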

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that memorization in LLMs is not binary but a graduated, hierarchically layered process. Using GPT-4.1 to generate 100 citations across 20 computer-science domains, manually verifying them, and measuring factual fidelity via cosine similarity to authentic metadata, the authors treat citation count as a proxy for pretraining-data redundancy. They report log-linear scaling of accuracy with citation count, two thresholds (inflection near 90 citations, saturation near 1,200), a clear ordering in which titles and first authors are recalled earliest while venues, numeric fields, and especially publication years require greater redundancy, and spurious-attractor interference when titles and authors overlap.

Significance. If the central empirical patterns hold, the work supplies concrete evidence that LLM knowledge acquisition tracks uneven redundancy gradients inside the training corpus, moving the field beyond binary hallucination/memorization dichotomies. The manual verification of 100 citations and the quantitative cosine-similarity metric against ground-truth metadata constitute a clear methodological strength and support falsifiable claims about field-specific thresholds.

major comments (3)
  1. [Methods] Methods section: the justification for treating raw citation count as a reliable proxy for token-level redundancy of specific bibliographic fields (exact title strings versus year tokens, for example) is not provided; paywalled sources, citation contexts that omit full metadata, and domain-specific scraping rates could decouple citation count from the relevant duplication statistics, directly undermining the interpretation of the observed hierarchy and the two thresholds.
  2. [Results] Results section on scaling and thresholds: the identification of the inflection at ~90 citations and saturation at ~1,200 citations lacks reported statistical procedure (change-point analysis, piecewise regression, or bootstrap confidence intervals), so it is unclear whether these are robustly estimated or post-hoc visual choices; this choice is load-bearing for the log-linear scaling claim and the hierarchical interpretation.
  3. [Results] Results on field ordering: the claim that titles and first authors are recalled at lower redundancy while venues and numeric fields require higher counts rests on the untested assumption that each field appears with equal frequency in the contexts that actually reach the model; without auxiliary measurements of field-specific occurrence rates in the training distribution, the hierarchy could arise from prompt formatting or generation heuristics instead.
minor comments (2)
  1. [Abstract] Abstract and Results: clarify whether the reported log-linear relationship is obtained by regressing accuracy on log(citation count) or by fitting a linear model to log-transformed variables; the distinction affects how the thresholds are interpreted.
  2. [Figures] Figure captions and legends: ensure that any plots of accuracy versus citation count explicitly mark the 90- and 1,200-citation thresholds and report the number of points in each bin.
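The kind of statistical procedure requested in major comment 2 can be sketched as a two-segment piecewise regression with a grid-searched breakpoint. This is a simple stand-in for a change-point analysis, not the method used or planned by the authors. The function names, the `min_points` parameter, and the toy data are hypothetical.

```python
import numpy as np

def sse_of_line(x, y):
    """Sum of squared residuals of an ordinary least-squares line through (x, y)."""
    coeffs = np.polyfit(x, y, deg=1)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

def find_breakpoint(log_citations, accuracy, min_points=3):
    """Grid-search the split that minimizes the total SSE of two separate lines."""
    order = np.argsort(log_citations)
    x = np.asarray(log_citations, dtype=float)[order]
    y = np.asarray(accuracy, dtype=float)[order]
    best_sse, best_x = np.inf, None
    # Each candidate split leaves at least min_points records on both sides.
    for k in range(min_points, len(x) - min_points + 1):
        sse = sse_of_line(x[:k], y[:k]) + sse_of_line(x[k:], y[k:])
        if sse < best_sse:
            # Report the breakpoint midway between the two neighbouring records.
            best_sse, best_x = sse, 0.5 * (x[k - 1] + x[k])
    return best_x, best_sse

# Synthetic toy data (NOT from the paper): flat accuracy up to 100 citations
# (log10 = 2), then a linear rise in log10(citations).
toy_x = np.linspace(1.0, 4.0, 13)
toy_y = 0.3 + 0.25 * np.maximum(toy_x - 2.0, 0.0)

breakpoint_log10, total_sse = find_breakpoint(toy_x, toy_y)
print(f"estimated breakpoint near 10^{breakpoint_log10:.2f} citations")
```

With only 100 records, a search like this can place a breakpoint almost anywhere if the data are noisy, which is why the referee also asks for confidence intervals.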

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating revisions where the manuscript will be updated to improve rigor and transparency.

  1. Referee: [Methods] Methods section: the justification for treating raw citation count as a reliable proxy for token-level redundancy of specific bibliographic fields (exact title strings versus year tokens, for example) is not provided; paywalled sources, citation contexts that omit full metadata, and domain-specific scraping rates could decouple citation count from the relevant duplication statistics, directly undermining the interpretation of the observed hierarchy and the two thresholds.

    Authors: We agree that citation count is an indirect proxy and does not directly capture token-level redundancy for individual fields, with potential decoupling from paywalled sources or incomplete contexts. In the revised manuscript we will add an explicit limitations paragraph in the Methods section discussing these factors and their implications for the hierarchy and thresholds, while retaining citation count as a practical proxy given the inaccessibility of full pretraining corpora. revision: yes

  2. Referee: [Results] Results section on scaling and thresholds: the identification of the inflection at ~90 citations and saturation at ~1,200 citations lacks reported statistical procedure (change-point analysis, piecewise regression, or bootstrap confidence intervals), so it is unclear whether these are robustly estimated or post-hoc visual choices; this choice is load-bearing for the log-linear scaling claim and the hierarchical interpretation.

    Authors: The thresholds were identified via visual inspection of accuracy curves for slope changes and plateauing. We acknowledge the absence of formal statistical validation. The revised Results section will incorporate change-point detection analysis along with bootstrap confidence intervals to substantiate the reported inflection and saturation points. revision: yes

  3. Referee: [Results] Results on field ordering: the claim that titles and first authors are recalled at lower redundancy while venues and numeric fields require higher counts rests on the untested assumption that each field appears with equal frequency in the contexts that actually reach the model; without auxiliary measurements of field-specific occurrence rates in the training distribution, the hierarchy could arise from prompt formatting or generation heuristics instead.

    Authors: Direct auxiliary measurements of field-specific occurrence rates are not possible without access to the closed training corpus. Standardized prompts were used across all generations to reduce formatting confounds, and the hierarchy is consistent across twenty domains. We will add discussion of alternative explanations including generation heuristics in the revision while maintaining that the redundancy interpretation best accounts for the observed patterns. revision: partial
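The bootstrap confidence intervals promised in response 2 could look roughly like the sketch below. It makes two assumptions: a threshold is defined as the citation count at which a semi-log fit reaches a chosen accuracy level, and records are resampled with replacement using the percentile method. Neither is confirmed as the authors' procedure. The function names, the 0.5 level, and the synthetic data are hypothetical.

```python
import numpy as np

def crossing_count(citations, accuracy, level):
    """Citation count at which a semi-log fit reaches the given accuracy level."""
    slope, intercept = np.polyfit(np.log10(citations), accuracy, deg=1)
    return 10 ** ((level - intercept) / slope)

def bootstrap_ci(citations, accuracy, level, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for crossing_count over resampled records."""
    rng = np.random.default_rng(seed)
    citations = np.asarray(citations, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    n = len(citations)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample records with replacement
        if np.unique(citations[idx]).size < 2:
            continue  # a line cannot be fitted to a single distinct x value
        estimates.append(crossing_count(citations[idx], accuracy[idx], level))
    low, high = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

# Synthetic toy data (NOT from the paper): 100 records, noisy semi-log trend
# whose true crossing of accuracy 0.5 sits at 100 citations.
rng = np.random.default_rng(42)
toy_citations = 10 ** rng.uniform(1, 4, size=100)
toy_accuracy = 0.1 + 0.2 * np.log10(toy_citations) + rng.normal(0, 0.05, size=100)

point = crossing_count(toy_citations, toy_accuracy, level=0.5)
ci_low, ci_high = bootstrap_ci(toy_citations, toy_accuracy, level=0.5)
print(f"threshold estimate: {point:.0f} citations, 95% CI [{ci_low:.0f}, {ci_high:.0f}]")
```

A wide interval would indicate that a reported threshold such as 90 or 1,200 citations is only loosely pinned down by a sample of this size.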

Circularity Check

0 steps flagged

No significant circularity; purely observational empirical study


The paper conducts an observational study by prompting GPT-4.1 to generate citations, manually verifying them, and measuring factual fidelity via cosine similarity to authentic metadata while correlating results against external citation counts. No equations, derivations, or fitted parameters are present that reduce any claimed result to a quantity defined by the inputs themselves. The hierarchical memorization pattern is reported as an empirical observation across fields at different citation-count thresholds rather than a self-referential construction. The proxy assumption linking citation count to training-data redundancy is an interpretive framing, not a definitional or self-citation loop that forces the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating real-world citation counts as a proxy for pretraining exposure and on cosine similarity plus manual review as faithful measures of factual accuracy; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Citation count is a valid proxy for training-data redundancy of a bibliographic record
    Invoked to link observed accuracy gradients to internal memorization structure.
  • domain assumption Cosine similarity on metadata plus manual verification accurately captures factual fidelity
    Used to quantify how closely generated citations match authentic records.

pith-pipeline@v0.9.0 · 5527 in / 1315 out tokens · 37489 ms · 2026-05-17T23:08:48.370198+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

    cs.AI 2026-05 unverdicted novelty 5.0

    LLM-Metrics probes memory in 17 LLMs across 549 2023-2024 CS papers and finds a modest Spearman correlation (rho=0.1495) with citation counts, stronger for 2024 papers.
