Recognition: 2 theorem links · Lean Theorem
A Survey of Large Language Models
Pith reviewed 2026-05-10 22:42 UTC · model grok-4.3
The pith
Large language models develop special abilities once their size exceeds a certain scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey claims that large language models achieve significant performance improvements and exhibit special abilities not present in small-scale models when their parameter scale exceeds a certain level. It reviews the background, key findings, and mainstream techniques, with a focus on four major aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. The paper also summarizes available resources for developing LLMs and discusses remaining issues for future directions.
What carries the argument
The four-aspect framework of pre-training, adaptation tuning, utilization, and capacity evaluation, which organizes the analysis of how scaling produces emergent abilities in LLMs.
If this is right
- The technical evolution of LLMs impacts the entire AI community.
- This evolution would revolutionize the development and use of AI algorithms.
- The launch of ChatGPT has drawn widespread societal attention.
- Summarized resources aid in further LLM development.
- Identified remaining issues guide future research directions.
Where Pith is reading between the lines
- The emergence of special abilities through scaling may apply to non-language AI systems, prompting tests in other modalities.
- The four-aspect structure provides a model for surveying progress in adjacent fields like multimodal learning.
- Developers could experiment with scale thresholds to predict when new capabilities appear.
- Incorporating more industry reports might refine the review's coverage of practical advances.
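The scale-threshold experiment suggested above can be sketched in a few lines. This is a hypothetical illustration: the accuracy numbers are invented, and the largest-jump rule is a simple heuristic for flagging an emergence-like transition, not a method from the survey.

```python
def emergence_threshold(scales, accuracies, min_jump=0.15):
    """Return the first scale at which accuracy jumps by at least
    `min_jump` over the previous scale, or None if no such jump occurs."""
    for prev, curr, scale in zip(accuracies, accuracies[1:], scales[1:]):
        if curr - prev >= min_jump:
            return scale
    return None

# Hypothetical accuracies on an "emergent" task at increasing parameter counts.
params = [1e8, 1e9, 1e10, 1e11, 1e12]
acc = [0.02, 0.03, 0.05, 0.41, 0.68]  # sharp jump between 1e10 and 1e11

print(emergence_threshold(params, acc))  # prints 1e+11
```

A real experiment would also need to control for metric choice, since smoothly improving metrics can make the same capability look gradual rather than emergent.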
Load-bearing premise
The reviewed works and four-aspect framework together capture the essential advances in large language models without significant omissions or bias.
What would settle it
Discovery of a major new technique in large language models that falls outside the categories of pre-training, adaptation tuning, utilization, or capacity evaluation would falsify the survey's claim to comprehensive coverage.
Original abstract
Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they further study the scaling effect by increasing the model size to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions.
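The scaling effect the abstract refers to is typically summarized by empirical power laws. As a point of reference (approximate values fitted by Kaplan et al., listed as [32] in the reference graph below), the test loss $L$ as a function of non-embedding parameter count $N$ follows:

```latex
% Power-law scaling of test loss with non-embedding parameters N,
% as reported by Kaplan et al. [32]; values are approximate fits.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}
```

Emergent abilities are distinct from this smooth trend: they refer to task-level capabilities that appear abruptly once $N$ crosses a threshold, rather than improving continuously with $N$.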
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey reviewing recent advances in large language models (LLMs). It traces the evolution of language modeling from statistical and neural approaches to pre-trained Transformer-based PLMs, notes that scaling parameters beyond a threshold yields both performance gains and emergent special abilities absent in smaller models, and organizes the review around four aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. The paper also summarizes resources for LLM development and outlines remaining issues and future directions, highlighting ChatGPT as a notable industry milestone with broad AI impact.
Significance. If the coverage is balanced and citations accurate, the survey would serve as a useful organizing reference for a fast-moving field. It synthesizes reported observations on scaling effects and emergent abilities, maps mainstream techniques across the four focal areas, and points to resources and open problems, thereby helping researchers navigate the literature on LLMs and their influence on AI algorithm development and usage.
Minor comments (2)
- The abstract introduces 'adaptation tuning' as one of the four core aspects; this phrasing is less common than 'fine-tuning' or 'instruction tuning' in the broader literature, so a brief definition or mapping to standard terminology in the corresponding section would aid clarity.
- The abstract states that the survey covers 'recent advances' and 'mainstream techniques' but does not indicate the temporal scope or approximate number of works reviewed; adding such a sentence would help readers gauge potential selection bias or completeness.
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation to accept the manuscript. The referee's summary accurately reflects the scope, structure, and contributions of our survey on large language models.
Circularity Check
No significant circularity
Full rationale
This is a literature survey paper with no original derivations, equations, fitted parameters, predictions, or self-referential claims. The central statements about scaling effects and emergent abilities are explicitly framed as summaries of prior work in the reviewed literature (pre-training, adaptation, utilization, evaluation). No load-bearing step reduces to a self-citation chain, ansatz, or input-by-construction; the text functions as an organizing map of external findings.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.PhiForcing.hierarchy_emergence_forces_phi · unclear · “when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models”
Forward citations
Cited by 60 Pith papers
-
Diffusion-CAM: Faithful Visual Explanations for dMLLMs
Diffusion-CAM is the first method for visual explanations in dMLLMs, using differentiable probing of intermediates plus four refinement modules to produce activation maps that outperform prior CAM approaches in locali...
-
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.
-
MLPs are Efficient Distilled Generative Recommenders
SID-MLP distills autoregressive generative recommenders into efficient position-specific MLP heads for Semantic ID tasks, achieving 8.74x faster inference with matching accuracy.
-
Variance-aware Reward Modeling with Anchor Guidance
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
NaiAD: Initiate Data-Driven Research for LLM Advertising
NaiAD is a new dataset and framework for LLM-native advertising that uses decoupled generation and calibrated scoring to identify four semantic strategies for balancing user and commercial utilities.
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
-
LLMorphism: When humans come to see themselves as language models
LLMorphism is a proposed bias where exposure to human-like AI language leads people to view their own thinking as similar to statistical next-token prediction, risking under-attribution of mind to humans.
-
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
-
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
-
Revisiting the Travel Planning Capabilities of Large Language Models
LLMs extract explicit constraints effectively but struggle with implicit open-world requirements, structural biases in plans, and ineffective self-correction during travel planning.
-
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that grading accuracy declines at different rates with response difficulty, with errors clustering on the partially correct label and difficult...
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
-
Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots
17 of 20 AI chatbots share conversation content or identifiers with third parties, including plaintext text sent to Microsoft Clarity via session replay in three cases.
-
Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots
17 of 20 AI chatbots share conversation content or identifiers with third parties, including plaintext prompt and response text with Microsoft Clarity in three cases.
-
ProMax: Exploring the Potential of LLM-derived Profiles with Distribution Shaping for Recommender Systems
ProMax uses dense retrieval and dual distribution reshaping on LLM-derived profiles to guide recommender models toward preferences for unseen items, substantially boosting base model performance on public datasets.
-
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
-
Participatory provenance as representational auditing for AI-mediated public consultation
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
-
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
-
Self-Improving Tabular Language Models via Iterative Group Alignment
TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
-
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
-
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
-
Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method
ConflictQA benchmark shows LLMs fail to resolve conflicts between text and KG evidence and often default to one source, motivating the XoT explanation-based reasoning method.
-
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contaminatio...
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation
FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.
-
Large Language Models Align with the Human Brain during Creative Thinking
LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
-
InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking
InfoSeeker is a new hierarchical parallel agent framework that delivers 3-5x speedups and benchmark gains on web search tasks by using context isolation and layered aggregation.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Conditional Memory Enhanced Item Representation for Generative Recommendation
ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
-
Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.
-
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
-
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Event Fields: Learning Latent Event Structure for Waveform Foundation Models
Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on phy...
-
Mechanism Design for Quality-Preserving LLM Advertising
A quality-preserving auction framework for LLM advertising uses RAG-based endogenous reserves and KL-regularized or screened VCG mechanisms to achieve DSIC, IR, higher revenue, and better semantic fidelity than baselines.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted reg...
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
-
Bridging Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation
BST-CDSR combines neural ODEs for continuous behavioral preference modeling with LLM-based temporal semantic generation and adaptive domain transfer to improve cross-domain sequential recommendations.
Reference graph
Works this paper leans on
-
[1]
A neural probabilistic language model,
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003
2003
-
[2]
Natural language processing (almost) from scratch,
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011
2011
-
[3]
The Language Instinct: How the Mind Creates Language,
S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014
2014
-
[4]
The faculty of language: what is it, who has it, and how did it evolve?
M. D. Hauser, N. Chomsky, and W. T. Fitch, “The faculty of language: what is it, who has it, and how did it evolve?” Science, vol. 298, no. 5598, pp. 1569–1579, 2002
2002
-
[5]
Computing machinery and intelligence,
A. M. Turing, “Computing machinery and intelligence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950
1950
-
[6]
Statistical Methods for Speech Recognition,
F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998
1998
-
[7]
Introduction to the special issue on statistical language modeling,
J. Gao and C. Lin, “Introduction to the special issue on statistical language modeling,” ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004
2004
-
[8]
Two decades of statistical language modeling: Where do we go from here?
R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000
2000
-
[9]
SRILM - an extensible language modeling toolkit,
A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Seventh International Conference on Spoken Language Processing, 2002
2002
-
[10]
Statistical language modeling for information retrieval,
X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005
2005
-
[11]
Statistical Language Models for Information Retrieval,
C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008
2008
-
[12]
A second-order hidden Markov model for part-of-speech tagging,
S. M. Thede and M. P. Harper, “A second-order hidden Markov model for part-of-speech tagging,” in 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182
1999
-
[13]
A tree-based statistical language model for natural language speech recognition,
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989
1989
-
[14]
Large language models in machine translation,
T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867
2007
-
[15]
Estimation of probabilities from sparse data for the language model component of a speech recognizer,
S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987
1987
-
[16]
Good-Turing frequency estimation without tears,
W. A. Gale and G. Sampson, “Good-Turing frequency estimation without tears,” J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995
1995
-
[17]
Recurrent neural network based language model,
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048
2010
-
[18]
Recurrent neural network based language modeling in meeting recognition,
S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880
2011
-
[19]
Distributed representations of words and phrases and their compositionality,
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119
2013
-
[21]
Efficient estimation of word representations in vector space,
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013
2013
-
[22]
Deep contextualized word representations,
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long P...
2018
-
[23]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008
2017
-
[24]
BERT: pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short...
2019
-
[25]
BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,
M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 7...
2020
-
[26]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,
W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., pp. 1–40, 2021
2021
-
[27]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, p. 9, 2019
2019
-
[28]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019
2019
-
[31] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel, "What language model architecture and pretraining objective works best for zero-shot generalization?" in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 16...
[32] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," CoRR, vol. abs/2001.08361, 2020.
[33] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, "Emergent abilities of large language models," CoRR, vol. abs/2206.07682, 2022.
[34] M. Shanahan, "Talking about large language models," CoRR, vol. abs/2212.03551, 2022.
[35] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," CoRR, vol. abs/2201.11903, 2022.
[36] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, "Training compute-optimal large language models," CoRR, vol. abs/2203.15556, 2022.
[37] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," CoRR, vol. abs/2211.09085, 2022.
[38] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Comput. Surv., pp. 195:1–195:35, 2023.
[39] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, "A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT," CoRR, vol. abs/2302.09419, 2023.
[40] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao, and J. Zhu, "Pre-trained models: Past, present and future," AI Open, vol. 2, pp. 225–250, 2021.
[41] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," CoRR, vol. abs/2003.08271, 2020.
[42] S. Altman, "Planning for AGI and beyond," OpenAI Blog, February 2023.
[43] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, "Sparks of artificial general intelligence: Early experiments with GPT-4," CoRR, vol. abs/2303.12712, 2023.
[44] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei, "Language is not all you need: Aligning perception with language models," CoRR, vol. abs/2302.14045, 2023.
[45] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, "A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT," arXiv preprint arXiv:2303.04226, 2023.
[46] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., "PaLM-E: An embodied multimodal language model," arXiv preprint arXiv:2303.03378, 2023.
[47] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, "Visual ChatGPT: Talking, drawing and editing with visual foundation models," arXiv preprint arXiv:2303.04671, 2023.
[48]
[49] Y. Fu, H. Peng, and T. Khot, "How does GPT obtain its ability? Tracing emergent abilities of language models to their sources," Yao Fu's Notion, Dec 2022.
[50] J. Li, T. Tang, W. X. Zhao, and J. Wen, "Pretrained language model for text generation: A survey," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. ijcai.org, 2021, pp. 4492–4499.
[51] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, "A survey of deep learning for mathematical reasoning," CoRR, vol. abs/2212.10535, 2022.
[52] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, "A survey on in-context learning," CoRR, vol. abs/2301.00234, 2023.
[53] J. Huang and K. C. Chang, "Towards reasoning in large language models: A survey," CoRR, vol. abs/2212.10403, 2022.
[54] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, "Reasoning with language model prompting: A survey," CoRR, vol. abs/2212.09597, 2022.
[55] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, "ChatGPT: potential, prospects, and limitations," in Frontiers of Information Technology & Electronic Engineering, 2023, pp. 1–6.
[56] W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, "Dense text retrieval based on pretrained language models: A survey," ACM Transactions on Information Systems, vol. 42, no. 4, pp. 1–60, 2024.
[57] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am..., "Language models are few-shot learners," 2020.
[58] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, ..., "PaLM: Scaling language modeling with pathways," 2022.
[59] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," CoRR, 2023.
[60] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray et al., "Scaling laws for autoregressive generative modeling," arXiv preprint arXiv:2010.14701, 2020.
[61] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu, "DoReMi: Optimizing data mixtures speeds up language model pretraining," arXiv preprint arXiv:2305.10429, 2023.
[62] P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho, "Will we run out of data? An analysis of the limits of scaling datasets in machine learning," CoRR, vol. abs/2211.04325, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.04325
[63] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, "Scaling data-constrained language models," arXiv preprint arXiv:2305.16264, 2023.
[64] I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu, A. Mueller, N. Kim, S. Bowman, and E. Perez, "The inverse scaling prize," 2022. [Online]. Available: https://github.com/inverse-scaling/prize
[65] B. A. Huberman and T. Hogg, "Phase transitions in artificial intelligence systems," Artificial Intelligence, vol. 33, no. 2, pp. 155–171, 1987.
[66] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E..., "Scaling language models: Methods, analysis & insights from training Gopher," 2021.
[67] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, "Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers," CoRR, vol. abs/2212.10559, 2022.
[68] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, "Training language models to follow instructions with human feedback," CoRR, vol. abs/2203.02155, 2022.
[69] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[70] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R..., "LaMDA: Language models for dialog applications," 2022.
[71] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, "Scaling instruction-finetuned language models," 2022.
[72] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmül..., "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models," 2022.
[73] R. Schaeffer, B. Miranda, and S. Koyejo, "Are emergent abilities of large language models a mirage?" arXiv preprint arXiv:2304.15004, 2023.
[74] S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao, Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun, "Unlock predictable scaling from emergent abilities," 2023.
[75] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, "Grokking: Generalization beyond overfitting on small algorithmic datasets," arXiv preprint arXiv:2201.02177, 2022.
[76] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters," in KDD, 2020, pp. 3505–3506.
[77] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," CoRR, vol. abs/1909.08053, 2019.
[78] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, "Efficient large-scale language model training on GPU clusters using Megatron-LM," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC..., 2021.
[79] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," CoRR, vol. abs/2205.05198, 2022.
[80] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, ..., "BLOOM: A 176B-parameter open-access multilingual language model," 2022.
[81] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergu..., 2017.
[82] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," CoRR, vol. abs/2302.04761, 2023.