A Comprehensive Overview of Large Language Models
Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3
pith:JVF5XXJW
The pith
This review compiles background concepts and frontier advances in large language models into one accessible guide.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that by systematically surveying literature on architectural innovations, training strategies, context improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency, they can provide a concise yet comprehensive reference that benefits researchers and practitioners alike.
What carries the argument
The survey structure itself, which groups diverse LLM-related concepts and provides informative summaries of existing works.
If this is right
- Researchers gain a quick reference to draw insights from summaries of existing works.
- Practitioners can better understand advanced topics to apply LLMs effectively.
- The overview highlights connections across topics like efficiency and multi-modal capabilities.
- Future research can build on the identified frontier areas more systematically.
Where Pith is reading between the lines
- Such overviews may become essential tools as the field grows, potentially standardizing how new contributions are contextualized.
- Connecting LLM advances to robotics could lead to more integrated systems where language models control physical actions.
- Tracking efficiency improvements might reveal patterns in how model scale interacts with performance gains.
Load-bearing premise
That the literature reviewed is representative of the field, and that the summaries provided are accurate and unbiased representations of the original contributions.
What would settle it
Finding a significant recent LLM paper or technique omitted from the overview, or identifying a summary that misrepresents the findings of a cited work.
Original abstract
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a self-contained comprehensive overview of Large Language Models (LLMs), covering background concepts along with advanced topics including architectural innovations, training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics applications, datasets, benchmarking, and efficiency techniques. It positions the work as both a systematic survey and a quick reference for researchers and practitioners to synthesize insights from existing literature.
Significance. If the reviewed literature is representative and the summaries are faithful, the paper would provide a useful consolidation of the rapidly expanding LLM literature, helping the community navigate diverse contributions and draw cross-cutting insights.
major comments (2)
- [Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.
- [Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.
minor comments (2)
- [Abstract] The abstract could usefully state the approximate number of works reviewed and the literature cutoff date to clarify scope.
- [Throughout] Notation and terminology for model components (e.g., parameter counts, context lengths) should be standardized across sections for reader clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and describe the changes we will make to strengthen the manuscript as a survey.
Point-by-point responses
- Referee: [Introduction / Abstract] The central claim of a 'comprehensive overview' and 'systematic survey' lacks any description of the literature selection process (search protocol, databases, keywords, inclusion/exclusion criteria, or time window). This directly affects the representativeness of the covered topics and is load-bearing for the abstract's assertions.
  Authors: We agree that explicitly describing the literature selection process would improve transparency and support the claims of a systematic survey. In the revised manuscript we will add a dedicated subsection in the Introduction outlining the methodology. This will specify the databases consulted (arXiv, Google Scholar, ACL Anthology), search keywords (e.g., 'large language models', 'LLM', 'transformer', 'foundation model'), the time window (primarily 2018–2023 with selective earlier foundational works), and inclusion criteria focused on influential, highly cited contributions that address the topics enumerated in the abstract. Exclusion criteria will note the omission of non-English works and very recent preprints not yet widely cited. We believe this addition directly addresses the concern about representativeness.
  revision: yes
- Referee: [Main survey sections (e.g., those detailing fine-tuning and multi-modal LLMs)] Fidelity of the condensed summaries to the original contributions is not verifiable from the provided structure; without explicit cross-referencing or error-checking mechanisms, interpretive drift in key areas (e.g., training strategies or multi-modal extensions) could undermine the reference value.
  Authors: We acknowledge the value of stronger traceability for the condensed summaries. In the revision we will add explicit cross-references and, where space permits, short direct quotations or key phrases from the cited papers in the fine-tuning, training strategies, and multi-modal sections. We will also insert a brief statement in the Introduction describing our summarization process: each summary was derived from the primary source and cross-checked against the original abstract and conclusions. While readers will still benefit most by consulting the cited works, these enhancements should reduce the risk of interpretive drift and improve the paper's utility as a reference.
  revision: yes
Circularity Check
No circularity: literature survey without derivations or predictions
Full rationale
This paper is a review article that surveys existing LLM literature, providing background concepts and summaries of advanced topics drawn from external references. It contains no original derivations, equations, predictions, or first-principles results that could reduce to inputs by construction. The central claim of offering a 'self-contained comprehensive overview' rests on the selection and accuracy of cited works rather than any internal logical chain that loops back to its own fitted parameters or self-citations. No steps match the enumerated circularity patterns such as self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggled via self-citation. The structure is self-contained against external benchmarks precisely because it defers all substantive content to the referenced primary sources.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  This article provides an overview of the literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs.
- IndisputableMonolith.Foundation.PhiForcing · phi_equation
  Relation between the paper passage and the cited Recognition theorem: unclear
  Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
  SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
- EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
  EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
  DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
- How Large Language Models Balance Internal Knowledge with User and Document Assertions
  LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
- Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
  BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models...
- ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
  ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
- Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
  Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
- Multi-LLM Token Filtering and Routing for Sequential Recommendation
  MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
- Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models
  Agent-GWO uses collaborative grey-wolf-inspired agents to jointly optimize LLM prompts and decoding settings, yielding higher accuracy and stability than prior single-agent prompt optimization methods on math and hybr...
- Semantic Communication with an LLM-enabled Knowledge Base
  SC-LMKB uses LLM-generated data with cross-domain fusion to cut hallucinations and delivers up to 72.6% gains on cross-modality retrieval tasks over standard semantic communication.
- Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs
  New evidence-extraction metrics and a redact-and-retry framework with constrained filtering substantially improve LLM performance on document inconsistency detection, supported by experiments on a released semi-synthe...
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
  VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
- WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
  WhiteTesseract deploys XR-based diminished reality and LLM dialogue in a Monet exhibition, raising average viewing time from 35.3 to 98.3 seconds and shifting 60% of 529 interactions toward analytical and emotional queries.
- DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
  DrugPlayGround is a new benchmark framework for evaluating LLMs on text-based descriptions of physiochemical drug characteristics, synergism, drug-protein interactions, and physiological responses.
- Small Language Models are the Future of Agentic AI
  Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
- Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach
  Large language models consistently underestimate cybersecurity risks compared to human experts in CIS Controls-based assessments, indicating they should serve as complementary rather than standalone tools.
- Exploiting Web Search Tools of AI Agents for Data Exfiltration
  Indirect prompt injection attacks remain effective on LLMs using web search tools, allowing data exfiltration and exposing ongoing weaknesses in current model defenses.
- Large Language Model-Brained GUI Agents: A Survey
  A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.