A Comprehensive Overview of Large Language Models

arxiv: 2307.06435 · v10 · pith:JVF5XXJWnew · submitted 2023-07-12 · 💻 cs.CL


Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Mian


Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models, survey, natural language processing, transformers, fine-tuning, multi-modal, benchmarking, efficiency



The pith

This review compiles background concepts and frontier advances in large language models into one accessible guide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to help the research community keep up with the fast pace of LLM developments by offering a self-contained overview. It covers everything from basic ideas to cutting-edge topics such as better training strategies, longer context lengths, multi-modal models, and efficiency improvements. A sympathetic reader would care because the sheer volume of new papers makes it hard to see how different advances fit together without a guide like this. If successful, the overview lets practitioners and researchers draw quick insights to push the field forward.

Core claim

The authors claim that by systematically surveying literature on architectural innovations, training strategies, context improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency, they can provide a concise yet comprehensive reference that benefits researchers and practitioners alike.

What carries the argument

The survey structure itself, which groups diverse LLM-related concepts and provides informative summaries of existing works.

If this is right

Researchers gain a quick reference to draw insights from summaries of existing works.
Practitioners can better understand advanced topics to apply LLMs effectively.
The overview highlights connections across topics like efficiency and multi-modal capabilities.
Future research can build on the identified frontier areas more systematically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such overviews may become essential tools as the field grows, potentially standardizing how new contributions are contextualized.
Connecting LLM advances to robotics could lead to more integrated systems where language models control physical actions.
Tracking efficiency improvements might reveal patterns in how model scale interacts with performance gains.

Load-bearing premise

That the literature reviewed is representative of the field, and that the summaries provided are accurate and unbiased representations of the original contributions.

What would settle it

Finding a significant recent LLM paper or technique omitted from the overview, or identifying a summary that misrepresents the findings of a cited work.

Original abstract

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a standard literature survey on LLMs that organizes existing work into sections but adds no new results and leaves its coverage claims hard to verify.


This paper rounds up a lot of LLM papers from mid-2023 and groups them under headings like architectures, training tricks, context length, fine-tuning, multi-modal extensions, robotics uses, datasets, benchmarks, and efficiency. It walks through background ideas first then hits frontier topics, which is the main service it offers. The summaries are written to be readable and self-contained, so a reader can get the gist of many contributions without chasing every original source right away. That structure is useful for someone trying to map the field quickly.

The authors do cite a wide range of work and try to hit both foundational and recent pieces. What is missing is any description of how they decided what to include or exclude. No search protocol, no explicit inclusion rules, and no discussion of balance across sub-areas appear in the text. That makes the “comprehensive” claim rest on the authors’ judgment alone, which is common in surveys but still leaves room for important lines of work to be under-represented or for summaries to drift from the original papers. The citations themselves look reasonable on a quick scan, but without independent checking it is impossible to know how faithfully each one is captured.

For a new researcher or practitioner who wants one document that touches most of the main threads, this could save time. Someone already deep in the literature will probably just use it as a reminder list rather than a primary source. It is the sort of survey that belongs in peer review so referees can flag gaps and suggest additions; the core idea of a consolidated reference is worth the effort even if the current version needs tightening on coverage.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a self-contained comprehensive overview of Large Language Models (LLMs), covering background concepts along with advanced topics including architectural innovations, training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency techniques. It positions the work as both a systematic survey and a quick reference for researchers and practitioners to synthesize insights from existing literature.

Significance. If the reviewed literature is representative and the summaries are faithful, the paper would provide a useful consolidation of the rapidly expanding LLM literature, helping the community navigate diverse contributions and draw cross-cutting insights.

major comments (2)

[Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.
[Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.

minor comments (2)

[Abstract] The abstract could usefully state the approximate number of works reviewed and the literature cutoff date to clarify scope.
[Throughout] Notation and terminology for model components (e.g., parameter counts, context lengths) should be standardized across sections for reader clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe the changes we will make to strengthen the manuscript as a survey.


Referee: [Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.

Authors: We agree that explicitly describing the literature selection process would improve transparency and support the claims of a systematic survey. In the revised manuscript we will add a dedicated subsection in the Introduction outlining the methodology. This will specify the databases consulted (arXiv, Google Scholar, ACL Anthology), search keywords (e.g., 'large language models', 'LLM', 'transformer', 'foundation model'), the time window (primarily 2018–2023 with selective earlier foundational works), and inclusion criteria focused on influential, highly cited contributions that address the topics enumerated in the abstract. Exclusion criteria will note the omission of non-English works and very recent preprints not yet widely cited. We believe this addition directly addresses the concern about representativeness. revision: yes
Referee: [Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.

Authors: We acknowledge the value of stronger traceability for the condensed summaries. In the revision we will add explicit cross-references and, where space permits, short direct quotations or key phrases from the cited papers in the fine-tuning, training strategies, and multi-modal sections. We will also insert a brief statement in the Introduction describing our summarization process: each summary was derived from the primary source and cross-checked against the original abstract and conclusions. While readers will still benefit most by consulting the cited works, these enhancements should reduce the risk of interpretive drift and improve the paper's utility as a reference. revision: yes
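
The selection protocol proposed in the first response (databases, keywords, time window, inclusion and exclusion criteria) could be written down as an explicit filter. The following is a minimal sketch only, not the authors' method: the `Paper` record, its field names, the title-only keyword match, and the `MIN_CITATIONS` threshold are all hypothetical, since the rebuttal says only that included works should be "influential, highly cited".

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one candidate work (field names are illustrative)."""
    title: str
    year: int
    language: str   # language code, e.g. "en"
    citations: int
    source: str     # database where the work was found


# Values taken from the simulated rebuttal.
SOURCES = {"arXiv", "Google Scholar", "ACL Anthology"}
KEYWORDS = ("large language model", "llm", "transformer", "foundation model")
WINDOW = (2018, 2023)

# Assumed cutoff: the rebuttal does not give a number for "highly cited".
MIN_CITATIONS = 50


def include(paper: Paper, foundational: bool = False) -> bool:
    """Return True if the paper passes every stated criterion.

    `foundational` marks a selectively admitted earlier work, which is
    exempt from the time window but from nothing else.
    """
    if paper.source not in SOURCES:
        return False
    if paper.language != "en":  # non-English works are excluded
        return False
    in_window = WINDOW[0] <= paper.year <= WINDOW[1]
    if not (in_window or foundational):
        return False
    title = paper.title.lower()
    if not any(keyword in title for keyword in KEYWORDS):
        return False
    # Proxy for "influential"; also drops recent, not yet cited preprints.
    return paper.citations >= MIN_CITATIONS
```

Writing the criteria as a single predicate makes the coverage claim checkable: a referee could run the same filter over a database export and compare the result with the survey's reference list.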

Circularity Check

0 steps flagged

No circularity: literature survey without derivations or predictions


This paper is a review article that surveys existing LLM literature, providing background concepts and summaries of advanced topics drawn from external references. It contains no original derivations, equations, predictions, or first-principles results that could reduce to inputs by construction. The central claim of offering a 'self-contained comprehensive overview' rests on the selection and accuracy of cited works rather than any internal logical chain that loops back to its own fitted parameters or self-citations. No steps match the enumerated circularity patterns such as self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggled via self-citation. The structure is self-contained against external benchmarks precisely because it defers all substantive content to the referenced primary sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This being a survey paper, it introduces no new free parameters, mathematical axioms, or invented entities; its content is drawn from cited prior works.

pith-pipeline@v0.9.0 · 5750 in / 1098 out tokens · 50957 ms · 2026-05-19T20:23:29.036336+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear


This article provides an overview of the literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs.
IndisputableMonolith.Foundation.PhiForcing phi_equation unclear


Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
cs.CV 2026-04 unverdicted novelty 7.0

EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
cs.CL 2026-04 unverdicted novelty 6.0

DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
How Large Language Models Balance Internal Knowledge with User and Document Assertions
cs.CL 2026-04 unverdicted novelty 6.0

LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
cs.CR 2026-04 unverdicted novelty 6.0

BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models...
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
cs.CL 2026-04 conditional novelty 6.0

Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
Multi-LLM Token Filtering and Routing for Sequential Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models
cs.NE 2026-04 unverdicted novelty 6.0

Agent-GWO uses collaborative grey-wolf-inspired agents to jointly optimize LLM prompts and decoding settings, yielding higher accuracy and stability than prior single-agent prompt optimization methods on math and hybr...
Semantic Communication with an LLM-enabled Knowledge Base
eess.SP 2026-04 unverdicted novelty 6.0

SC-LMKB uses LLM-generated data with cross-domain fusion to cut hallucinations and delivers up to 72.6% gains on cross-modality retrieval tasks over standard semantic communication.
Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs
cs.CL 2026-01 unverdicted novelty 6.0

New evidence-extraction metrics and a redact-and-retry framework with constrained filtering substantially improve LLM performance on document inconsistency detection, supported by experiments on a released semi-synthe...
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
cs.CV 2025-05 unverdicted novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
cs.HC 2026-05 unverdicted novelty 5.0

WhiteTesseract deploys XR-based diminished reality and LLM dialogue in a Monet exhibition, raising average viewing time from 35.3 to 98.3 seconds and shifting 60% of 529 interactions toward analytical and emotional queries.
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
cs.LG 2026-02 unverdicted novelty 5.0

DrugPlayGround is a new benchmark framework for evaluating LLMs on text-based descriptions of physiochemical drug characteristics, synergism, drug-protein interactions, and physiological responses.
Small Language Models are the Future of Agentic AI
cs.AI 2025-06 unverdicted novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach
cs.CR 2026-05 unverdicted novelty 4.0

Large language models consistently underestimate cybersecurity risks compared to human experts in CIS Controls-based assessments, indicating they should serve as complementary rather than standalone tools.
Exploiting Web Search Tools of AI Agents for Data Exfiltration
cs.CR 2025-10 unverdicted novelty 4.0

Indirect prompt injection attacks remain effective on LLMs using web search tools, allowing data exfiltration and exposing ongoing weaknesses in current model defenses.
Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 19 Pith papers · 105 internal anchors

[1]

the end of his- tory

A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers:“the end of his- tory” for natural language processing?, in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Con- ference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1

work page 2021
[2]

A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, Superglue: A stickier benchmark for general- purpose language understanding systems, Advances in neural informa- tion processing systems 32 (2019). 1, 26, 29

work page 2019
[3]

Adiwardana, M.-T

D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y . Lu, et al., Towards a human- like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1

work page arXiv 2001
[4]

B. A. y Arcas, Do large language models understand us?, Daedalus 151 (2) (2022) 183–197. 2

work page 2022
[5]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9. 2, 7

work page 2019
[6]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing sys- tems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 18, 23, 24, 25, 34

work page 2020
[7]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). 2, 18, 24 35

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: NAACL- HLT, Association for Computational Linguistics, 2018, pp. 2227–2237. 2

work page 2018
[9]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre- training for natural language generation, translation, and comprehen- sion, arXiv preprint arXiv:1910.13461 (2019). 2

work page internal anchor Pith review Pith/arXiv arXiv 1910
[10]

Ra ffel, N

C. Ra ffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Re- search 21 (1) (2020) 5485–5551. 2, 7, 8, 18, 19, 24, 25, 28, 30, 31

work page 2020
[11]

L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Ra ffel, mt5: A massively multilingual pre-trained text-to- text transformer, arXiv preprint arXiv:2010.11934 (2020). 2, 7, 8, 24, 25, 28, 30

work page arXiv 2010
[12]

Zhang, Y

Z. Zhang, Y . Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y . Yao, F. Qi, J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-effective pre-trained lan- guage models, AI Open 2 (2021) 216–224. 2, 8, 25

work page 2021
[13]

T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b- parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100 (2022). 2, 4, 9, 11, 23, 24, 25, 30

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Lin, et al., Opt: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 24, 25

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scal- ing language modeling with pathways, arXiv preprint arXiv:2204.02311 (2022). 2, 6, 9, 11, 23, 24, 25

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 16, 17, 22, 24, 25, 28, 31

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

V . Sanh, A. Webson, C. Ra ffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Cha ffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask prompted training enables zero-shot task generalization, arXiv preprint arXiv:2110.08207 (2021). 2, 11, 16, 25, 28, 31

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al., Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7, 11, 16, 17, ...

work page 2022
[19]

Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Ha- jishirzi, Self-instruct: Aligning language model with self generated in- structions, arXiv preprint arXiv:2212.10560 (2022). 2, 16, 19, 22, 28

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language mod- els to follow instructions with human feedback, Advances in Neural In- formation Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16, 22

work page 2022
[21]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). 2, 7, 10, 16, 25, 34

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yo- gatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682 (2022). 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2

work page 2023
[24]

D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous sci- entific research capabilities of large language models, arXiv preprint arXiv:2304.05332 (2023). 2

work page internal anchor Pith review arXiv 2023
[25]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with retrieval augmented language models, arXiv preprint arXiv:2208.03299 (2022). 2, 18, 19, 34

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied multimodal language model, arXiv preprint arXiv:2303.03378 (2023). 2, 20, 22, 33

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Parisi, Y

A. Parisi, Y . Zhao, N. Fiedel, Talm: Tool augmented language models, arXiv preprint arXiv:2205.12255 (2022). 2, 19, 20

work page arXiv 2022
[28]

Zhang, H

B. Zhang, H. Soh, Large language models as zero-shot human models for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2, 33

work page arXiv 2023
[29]

Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y . Zhou, J. Wang, A. Hu, P. Shi, Y . Shi, et al., mplug-owl: Modularization empowers large language models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2, 22

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y . Qiao, et al., Visionllm: Large language model is also an open-ended decoder for vision-centric tasks, arXiv preprint arXiv:2305.11175 (2023). 2, 22

work page arXiv 2023
[31]

R. Yang, L. Song, Y . Li, S. Zhao, Y . Ge, X. Li, Y . Shan, Gpt4tools: Teaching large language model to use tools via self-instruction, arXiv preprint arXiv:2305.18752 (2023). 2, 19, 22, 23

work page arXiv 2023
[32]

Saravia, Prompt Engineering Guide, https: //github.com/dair- ai/Prompt-Engineering-Guide (12 2022)

E. Saravia, Prompt Engineering Guide, https: //github.com/dair- ai/Prompt-Engineering-Guide (12 2022). 2, 7, 18, 34

work page 2022
[33]

A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y . Xu, W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 23, 24, 25

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Y . Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5 +: Open code large language models for code understanding and genera- tion, arXiv preprint arXiv:2305.07922 (2023). 2, 11, 24, 25

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

S. Wang, Y . Sun, Y . Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y . Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowl- edge enhanced pre-training for language understanding and generation, arXiv preprint arXiv:2112.12731 (2021). 2, 8, 24, 25

work page arXiv 2021
[36]

Rasley, S

J. Rasley, S. Rajbhandari, O. Ruwase, Y . He, Deepspeed: System op- timizations enable training deep learning models with over 100 billion parameters, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–

work page 2020
[37]

Rajbhandari, J

S. Rajbhandari, J. Rasley, O. Ruwase, Y . He, Zero: Memory optimiza- tions toward training trillion parameter models, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2020, pp. 1–16. 2, 4, 24

work page 2020
[38]

J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards a unified view of parameter-e fficient transfer learning, arXiv preprint arXiv:2110.04366 (2021). 2, 20, 21

work page arXiv 2021
[39]

Z. Hu, Y . Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Po- ria, Llm-adapters: An adapter family for parameter-e fficient fine-tuning of large language models, arXiv preprint arXiv:2304.01933 (2023). 2, 20

work page arXiv 2023
[40]

The Power of Scale for Parameter-Efficient Prompt Tuning

B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter- efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2021
[41]

X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2021
[42]

X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 22

work page arXiv 2023
[43]

R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang, From dense to sparse: Contrastive pruning for better pre-trained lan- guage model compression, in: Proceedings of the AAAI Conference on Artificial Intelligence, V ol. 36, 2022, pp. 11547–11555. 2, 22

[44]

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, Smoothquant: Accurate and efficient post-training quantization for large language models, in: ICML, Vol. 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 38087–38099. 2, 21

[45]

C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong, Compression of generative pre-trained language models via quantization, arXiv preprint arXiv:2203.10705 (2022). 2, 21

[46]

A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu, Giraffe: Adventures in expanding context lengths in llms, arXiv preprint arXiv:2308.10882 (2023). 2, 17

[47]

B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: Efficient context window extension of large language models, arXiv preprint arXiv:2309.00071 (2023). 2, 17

[48]

M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, Longt5: Efficient text-to-text transformer for long sequences, arXiv preprint arXiv:2112.07916 (2021). 2, 18

[49]

S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models via positional interpolation, arXiv preprint arXiv:2306.15595 (2023). 2, 17

[50]

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023). 2, 3, 7

[51]

U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive survey on word representation models: From classical to state-of-the-art word representation language models, Transactions on Asian and Low-Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3

[52]

B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, arXiv preprint arXiv:2111.01243 (2021). 2, 3

[53]

C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, et al., A comprehensive survey on pretrained foundation models: A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023). 2, 3

[54]

Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint arXiv:2301.00234 (2022). 2, 7, 18

[55]

J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 18

[56]

Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, Q. Liu, Aligning large language models with human: A survey, arXiv preprint arXiv:2307.12966 (2023). 2

[57]

X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633 (2023). 2

[58]

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multimodal large language models, arXiv preprint arXiv:2306.13549 (2023). 2, 22, 23

[59]

J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COLING 1992 volume 4: The 14th international conference on computational linguistics, 1992. 4

[60]

T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 66–75. 4

[61]

R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. 4

[62]

M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2012, pp. 5149–5152. 4

[63]

S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, arXiv preprint arXiv:2112.10508 (2021). 4

[64]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). 4, 7

[65]

O. Press, N. Smith, M. Lewis, Train short, test long: Attention with linear biases enables input length extrapolation, in: International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17

[66]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, Roformer: Enhanced transformer with rotary position embedding, arXiv preprint arXiv:2104.09864 (2021). 4, 9, 17

[67]

R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7, 23

[68]

T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and memory-efficient exact attention with io-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359. 4

[69]

K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural networks 2 (5) (1989) 359–366. 4

[70]

V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814. 4

[71]

D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:1606.08415 (2016). 4

[72]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (1) (2014) 1929–1958. 4

[73]

D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal, Zoneout: Regularizing rnns by randomly preserving hidden activations, arXiv preprint arXiv:1606.01305 (2016). 4

[74]

N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). 4

[75]

Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, in: International conference on machine learning, PMLR, 2017, pp. 933–941. 4

[76]

J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016). 4

[77]

B. Zhang, R. Sennrich, Root mean square layer normalization, Advances in Neural Information Processing Systems 32 (2019). 4

[78]

A. Baevski, M. Auli, Adaptive input representations for neural language modeling, arXiv preprint arXiv:1809.10853 (2018). 4

[79]

H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4

[80]

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-lm: Training multi-billion parameter language models using model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5

