
arxiv: 2307.06435 · v10 · pith:JVF5XXJW · submitted 2023-07-12 · 💻 cs.CL

A Comprehensive Overview of Large Language Models

Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models, survey, natural language processing, transformers, fine-tuning, multi-modal, benchmarking, efficiency
Add to your LaTeX paper (pith:JVF5XXJW):
\usepackage{pith}
\pithnumber{JVF5XXJW}

Prints a linked pith:JVF5XXJW badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files.

The pith

This review compiles background concepts and frontier advances in large language models into one accessible guide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to help the research community keep up with the fast pace of LLM developments by offering a self-contained overview. It covers everything from basic ideas to cutting-edge topics such as better training strategies, longer context lengths, multi-modal models, and efficiency improvements. A sympathetic reader would care because the sheer volume of new papers makes it hard to see how different advances fit together without a guide like this. If successful, the overview lets practitioners and researchers draw quick insights to push the field forward.

Core claim

The authors claim that by systematically surveying literature on architectural innovations, training strategies, context improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency, they can provide a concise yet comprehensive reference that benefits researchers and practitioners alike.

What carries the argument

The survey structure itself, which groups diverse LLM-related concepts and provides informative summaries of existing works.

If this is right

  • Researchers gain a quick reference to draw insights from summaries of existing works.
  • Practitioners can better understand advanced topics to apply LLMs effectively.
  • The overview highlights connections across topics like efficiency and multi-modal capabilities.
  • Future research can build on the identified frontier areas more systematically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such overviews may become essential tools as the field grows, potentially standardizing how new contributions are contextualized.
  • Connecting LLM advances to robotics could lead to more integrated systems where language models control physical actions.
  • Tracking efficiency improvements might reveal patterns in how model scale interacts with performance gains.

Load-bearing premise

That the literature reviewed is representative of the field, and that the summaries provided are accurate and unbiased representations of the original contributions.

What would settle it

Finding a significant recent LLM paper or technique omitted from the overview, or identifying a summary that misrepresents the findings of a cited work.

Original abstract

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a self-contained comprehensive overview of Large Language Models (LLMs), covering background concepts along with advanced topics including architectural innovations, training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency techniques. It positions the work as both a systematic survey and a quick reference for researchers and practitioners to synthesize insights from existing literature.

Significance. If the reviewed literature is representative and the summaries are faithful, the paper would provide a useful consolidation of the rapidly expanding LLM literature, helping the community navigate diverse contributions and draw cross-cutting insights.

major comments (2)
  1. [Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.
  2. [Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.
minor comments (2)
  1. [Abstract] The abstract could usefully state the approximate number of works reviewed and the literature cutoff date to clarify scope.
  2. [Throughout] Notation and terminology for model components (e.g., parameter counts, context lengths) should be standardized across sections for reader clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe the changes we will make to strengthen the manuscript as a survey.

Point-by-point responses
  1. Referee: [Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.

    Authors: We agree that explicitly describing the literature selection process would improve transparency and support the claims of a systematic survey. In the revised manuscript we will add a dedicated subsection in the Introduction outlining the methodology. This will specify the databases consulted (arXiv, Google Scholar, ACL Anthology), search keywords (e.g., 'large language models', 'LLM', 'transformer', 'foundation model'), the time window (primarily 2018–2023 with selective earlier foundational works), and inclusion criteria focused on influential, highly cited contributions that address the topics enumerated in the abstract. Exclusion criteria will note the omission of non-English works and very recent preprints not yet widely cited. We believe this addition directly addresses the concern about representativeness.

    Revision: yes

  2. Referee: [Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.

    Authors: We acknowledge the value of stronger traceability for the condensed summaries. In the revision we will add explicit cross-references and, where space permits, short direct quotations or key phrases from the cited papers in the fine-tuning, training strategies, and multi-modal sections. We will also insert a brief statement in the Introduction describing our summarization process: each summary was derived from the primary source and cross-checked against the original abstract and conclusions. While readers will still benefit most by consulting the cited works, these enhancements should reduce the risk of interpretive drift and improve the paper's utility as a reference.

    Revision: yes
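The selection protocol described in the first response can be read as a simple filter over candidate papers. The sketch below is illustrative only. The rebuttal names the criteria (keywords, a 2018–2023 window, selective earlier foundational works, English only, highly cited) but gives no numeric cutoffs, so `MIN_CITATIONS`, the `Paper` record, and the `foundational` flag are assumptions introduced here, not part of the authors' method.

```python
from dataclasses import dataclass

# Criteria taken from the rebuttal text.
WINDOW = (2018, 2023)
KEYWORDS = ("large language model", "llm", "transformer", "foundation model")

# Hypothetical stand-in for "highly cited"; the rebuttal gives no number.
MIN_CITATIONS = 100


@dataclass
class Paper:
    """Hypothetical candidate record; the field names are assumptions."""
    title: str
    year: int
    language: str               # language code, e.g. "en"
    citations: int
    foundational: bool = False  # hand-flagged earlier foundational work


def matches_keywords(paper: Paper) -> bool:
    """True if any search keyword appears in the lower-cased title."""
    title = paper.title.lower()
    return any(keyword in title for keyword in KEYWORDS)


def include(paper: Paper) -> bool:
    """Apply the inclusion and exclusion criteria in order."""
    if paper.language != "en":
        return False                # non-English works are excluded
    if paper.year < WINDOW[0]:
        return paper.foundational   # earlier works only when hand-flagged
    if paper.year > WINDOW[1]:
        return False                # outside the stated time window
    # Inside the window: must match a keyword and be widely cited, which
    # also drops very recent preprints that have few citations so far.
    return matches_keywords(paper) and paper.citations >= MIN_CITATIONS
```

Writing the protocol down in this form makes the referee's point concrete: the result depends heavily on the citation cutoff and on which earlier works are flagged as foundational, and neither is specified in the current manuscript.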

Circularity Check

0 steps flagged

No circularity: literature survey without derivations or predictions

full rationale

This paper is a review article that surveys existing LLM literature, providing background concepts and summaries of advanced topics drawn from external references. It contains no original derivations, equations, predictions, or first-principles results that could reduce to inputs by construction. The central claim of offering a 'self-contained comprehensive overview' rests on the selection and accuracy of cited works rather than any internal logical chain that loops back to its own fitted parameters or self-citations. No steps match the enumerated circularity patterns such as self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggled via self-citation. The structure is self-contained against external benchmarks precisely because it defers all substantive content to the referenced primary sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because this is a survey paper, it introduces no new free parameters, mathematical axioms, or invented entities; its content is drawn from cited prior works.

pith-pipeline@v0.9.0 · 5750 in / 1098 out tokens · 50957 ms · 2026-05-19T20:23:29.036336+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    This article provides an overview of the literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs.

  • IndisputableMonolith.Foundation.PhiForcing phi_equation unclear

    Relation between the paper passage and the cited Recognition theorem.

    Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.

  2. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  3. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  4. DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

    cs.CL 2026-04 unverdicted novelty 6.0

    DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.

  5. How Large Language Models Balance Internal Knowledge with User and Document Assertions

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.

  6. Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

    cs.CR 2026-04 unverdicted novelty 6.0

    BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models...

  7. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  8. Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

    cs.CL 2026-04 conditional novelty 6.0

    Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.

  9. Multi-LLM Token Filtering and Routing for Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.

  10. Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models

    cs.NE 2026-04 unverdicted novelty 6.0

    Agent-GWO uses collaborative grey-wolf-inspired agents to jointly optimize LLM prompts and decoding settings, yielding higher accuracy and stability than prior single-agent prompt optimization methods on math and hybr...

  11. Semantic Communication with an LLM-enabled Knowledge Base

    eess.SP 2026-04 unverdicted novelty 6.0

    SC-LMKB uses LLM-generated data with cross-domain fusion to cut hallucinations and delivers up to 72.6% gains on cross-modality retrieval tasks over standard semantic communication.

  12. Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs

    cs.CL 2026-01 unverdicted novelty 6.0

    New evidence-extraction metrics and a redact-and-retry framework with constrained filtering substantially improve LLM performance on document inconsistency detection, supported by experiments on a released semi-synthe...

  13. VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    cs.CV 2025-05 unverdicted novelty 6.0

    VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

  14. WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI

    cs.HC 2026-05 unverdicted novelty 5.0

    WhiteTesseract deploys XR-based diminished reality and LLM dialogue in a Monet exhibition, raising average viewing time from 35.3 to 98.3 seconds and shifting 60% of 529 interactions toward analytical and emotional queries.

  15. DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

    cs.LG 2026-02 unverdicted novelty 5.0

    DrugPlayGround is a new benchmark framework for evaluating LLMs on text-based descriptions of physiochemical drug characteristics, synergism, drug-protein interactions, and physiological responses.

  16. Small Language Models are the Future of Agentic AI

    cs.AI 2025-06 unverdicted novelty 5.0

    Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

  17. Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach

    cs.CR 2026-05 unverdicted novelty 4.0

    Large language models consistently underestimate cybersecurity risks compared to human experts in CIS Controls-based assessments, indicating they should serve as complementary rather than standalone tools.

  18. Exploiting Web Search Tools of AI Agents for Data Exfiltration

    cs.CR 2025-10 unverdicted novelty 4.0

    Indirect prompt injection attacks remain effective on LLMs using web search tools, allowing data exfiltration and exposing ongoing weaknesses in current model defenses.

  19. Large Language Model-Brained GUI Agents: A Survey

    cs.AI 2024-11 unverdicted novelty 4.0

    A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 19 Pith papers · 105 internal anchors

  1. [1]

    the end of his- tory

    A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers:“the end of his- tory” for natural language processing?, in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Con- ference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1

  2. [2]

    A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, Superglue: A stickier benchmark for general- purpose language understanding systems, Advances in neural informa- tion processing systems 32 (2019). 1, 26, 29

  3. [3]

    Adiwardana, M.-T

    D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y . Lu, et al., Towards a human- like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1

  4. [4]

    B. A. y Arcas, Do large language models understand us?, Daedalus 151 (2) (2022) 183–197. 2

  5. [5]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9. 2, 7

  6. [6]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing sys- tems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 18, 23, 24, 25, 34

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). 2, 18, 24 35

  8. [8]

    M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: NAACL- HLT, Association for Computational Linguistics, 2018, pp. 2227–2237. 2

  9. [9]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre- training for natural language generation, translation, and comprehen- sion, arXiv preprint arXiv:1910.13461 (2019). 2

  10. [10]

    Ra ffel, N

    C. Ra ffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Re- search 21 (1) (2020) 5485–5551. 2, 7, 8, 18, 19, 24, 25, 28, 30, 31

  11. [11]

    L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Ra ffel, mt5: A massively multilingual pre-trained text-to- text transformer, arXiv preprint arXiv:2010.11934 (2020). 2, 7, 8, 24, 25, 28, 30

  12. [12]

    Zhang, Y

    Z. Zhang, Y . Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y . Yao, F. Qi, J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-effective pre-trained lan- guage models, AI Open 2 (2021) 216–224. 2, 8, 25

  13. [13]

    T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b- parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100 (2022). 2, 4, 9, 11, 23, 24, 25, 30

  14. [14]

    OPT: Open Pre-trained Transformer Language Models

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Lin, et al., Opt: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 24, 25

  15. [15]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scal- ing language modeling with pathways, arXiv preprint arXiv:2204.02311 (2022). 2, 6, 9, 11, 23, 24, 25

  16. [16]

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 16, 17, 22, 24, 25, 28, 31

  17. [17]

    V . Sanh, A. Webson, C. Ra ffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Cha ffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask prompted training enables zero-shot task generalization, arXiv preprint arXiv:2110.08207 (2021). 2, 11, 16, 25, 28, 31

  18. [18]

    Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al., Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7, 11, 16, 17, ...

  19. [19]

    Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Ha- jishirzi, Self-instruct: Aligning language model with self generated in- structions, arXiv preprint arXiv:2212.10560 (2022). 2, 16, 19, 22, 28

  20. [20]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language mod- els to follow instructions with human feedback, Advances in Neural In- formation Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16, 22

  21. [21]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). 2, 7, 10, 16, 25, 34

  22. [22]

    J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yo- gatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682 (2022). 2

  23. [23]

    T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2

  24. [24]

    D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous sci- entific research capabilities of large language models, arXiv preprint arXiv:2304.05332 (2023). 2

  25. [25]

    Atlas: Few-shot Learning with Retrieval Augmented Language Models

    G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with retrieval augmented language models, arXiv preprint arXiv:2208.03299 (2022). 2, 18, 19, 34

  26. [26]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied multimodal language model, arXiv preprint arXiv:2303.03378 (2023). 2, 20, 22, 33

  27. [27]

    Parisi, Y

    A. Parisi, Y . Zhao, N. Fiedel, Talm: Tool augmented language models, arXiv preprint arXiv:2205.12255 (2022). 2, 19, 20

  28. [28]

    Zhang, H

    B. Zhang, H. Soh, Large language models as zero-shot human models for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2, 33

  29. [29]

    Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y . Zhou, J. Wang, A. Hu, P. Shi, Y . Shi, et al., mplug-owl: Modularization empowers large language models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2, 22

  30. [30]

    W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y . Qiao, et al., Visionllm: Large language model is also an open-ended decoder for vision-centric tasks, arXiv preprint arXiv:2305.11175 (2023). 2, 22

  31. [31]

    R. Yang, L. Song, Y . Li, S. Zhao, Y . Ge, X. Li, Y . Shan, Gpt4tools: Teaching large language model to use tools via self-instruction, arXiv preprint arXiv:2305.18752 (2023). 2, 19, 22, 23

  32. [32]

    Saravia, Prompt Engineering Guide, https: //github.com/dair- ai/Prompt-Engineering-Guide (12 2022)

    E. Saravia, Prompt Engineering Guide, https: //github.com/dair- ai/Prompt-Engineering-Guide (12 2022). 2, 7, 18, 34

  33. [33]

    A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y . Xu, W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 23, 24, 25

  34. [34]

    Y . Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5 +: Open code large language models for code understanding and genera- tion, arXiv preprint arXiv:2305.07922 (2023). 2, 11, 24, 25

  35. [35]

    S. Wang, Y . Sun, Y . Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y . Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowl- edge enhanced pre-training for language understanding and generation, arXiv preprint arXiv:2112.12731 (2021). 2, 8, 24, 25

  36. [36]

    Rasley, S

    J. Rasley, S. Rajbhandari, O. Ruwase, Y . He, Deepspeed: System op- timizations enable training deep learning models with over 100 billion parameters, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–

  37. [37]

    Rajbhandari, J

    S. Rajbhandari, J. Rasley, O. Ruwase, Y . He, Zero: Memory optimiza- tions toward training trillion parameter models, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2020, pp. 1–16. 2, 4, 24

  38. [38]

    J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards a unified view of parameter-e fficient transfer learning, arXiv preprint arXiv:2110.04366 (2021). 2, 20, 21

  39. [39]

    Z. Hu, Y . Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Po- ria, Llm-adapters: An adapter family for parameter-e fficient fine-tuning of large language models, arXiv preprint arXiv:2304.01933 (2023). 2, 20

  40. [40]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter- efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8, 20, 21

  41. [41]

    X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20, 21

  42. [42]

    X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 22

  43. [43]

    R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang, From dense to sparse: Contrastive pruning for better pre-trained lan- guage model compression, in: Proceedings of the AAAI Conference on Artificial Intelligence, V ol. 36, 2022, pp. 11547–11555. 2, 22

  44. [44]

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, Smoothquant: Accurate and e fficient post-training quantization for large language models, in: ICML, V ol. 202 of Proceedings of Machine Learning Re- search, PMLR, 2023, pp. 38087–38099. 2, 21

  45. [45]

    C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong, Compression of generative pre-trained language models via quantiza- tion, arXiv preprint arXiv:2203.10705 (2022). 2, 21

  46. [46]

    A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu, Giraffe: Adventures in expanding context lengths in llms, arXiv preprint arXiv:2308.10882 (2023). 2, 17

  47. [47]

    B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: E fficient con- text window extension of large language models, arXiv preprint arXiv:2309.00071 (2023). 2, 17

  48. [48]

    M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y .-H. Sung, Y . Yang, 36 Longt5: E fficient text-to-text transformer for long sequences, arXiv preprint arXiv:2112.07916 (2021). 2, 18

  49. [49]

    S. Chen, S. Wong, L. Chen, Y . Tian, Extending context window of large language models via positional interpolation, arXiv preprint arXiv:2306.15595 (2023). 2, 17

  50. [50]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023). 2, 3, 7

  51. [51]

    Naseem, I

    U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive sur- vey on word representation models: From classical to state-of-the-art word representation language models, Transactions on Asian and Low- Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3

  52. [52]

    B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth, Recent advances in natural language pro- cessing via large pre-trained language models: A survey, arXiv preprint arXiv:2111.01243 (2021). 2, 3

  53. [53]

    C. Zhou, Q. Li, C. Li, J. Yu, Y . Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, et al., A comprehensive survey on pretrained foundation models: A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023). 2, 3

  54. [54]

    Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint arXiv:2301.00234 (2022). 2, 7, 18

  55. [55]

    Towards Reasoning in Large Language Models: A Survey

    J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 18

  56. [56]

    Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, Q. Liu, Aligning large language models with human: A survey, arXiv preprint arXiv:2307.12966 (2023). 2

  57. [57]

    X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633 (2023). 2

  58. [58]

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multimodal large language models, arXiv preprint arXiv:2306.13549 (2023). 2, 22, 23

  59. [59]

    J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COLING 1992 volume 4: The 14th international conference on computational linguistics, 1992. 4

  60. [60]

    T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 66–75. 4

  61. [61]

    R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. 4

  62. [62]

    M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2012, pp. 5149–5152. 4

  63. [63]

    S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, arXiv preprint arXiv:2112.10508 (2021). 4

  64. [64]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). 4, 7

  65. [65]

    O. Press, N. Smith, M. Lewis, Train short, test long: Attention with linear biases enables input length extrapolation, in: International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17

  66. [66]

    J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, Roformer: Enhanced transformer with rotary position embedding, arXiv preprint arXiv:2104.09864 (2021). 4, 9, 17

  67. [67]

    R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7, 23

  68. [68]

    T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and memory-efficient exact attention with io-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359. 4

  69. [69]

    K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural networks 2 (5) (1989) 359–366. 4

  70. [70]

    V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814. 4

  71. [71]

    D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:1606.08415 (2016). 4

  72. [72]

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (1) (2014) 1929–1958. 4

  73. [73]

    D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal, Zoneout: Regularizing rnns by randomly preserving hidden activations, arXiv preprint arXiv:1606.01305 (2016). 4

  74. [74]

    N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). 4

  75. [75]

    Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, in: International conference on machine learning, PMLR, 2017, pp. 933–941. 4

  76. [76]

    J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016). 4

  77. [77]

    B. Zhang, R. Sennrich, Root mean square layer normalization, Advances in Neural Information Processing Systems 32 (2019). 4

  78. [78]

    A. Baevski, M. Auli, Adaptive input representations for neural language modeling, arXiv preprint arXiv:1809.10853 (2018). 4

  79. [79]

    H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4

  80. [80]

    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-lm: Training multi-billion parameter language models using model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5
