pith. machine review for the scientific record.

arxiv: 2303.18223 · v19 · submitted 2023-03-31 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

A Survey of Large Language Models

Beichen Zhang, Chen Yang, Jian-Yun Nie, Jinhao Jiang, Ji-Rong Wen, Junjie Zhang, Junyi Li, Kun Zhou, Peiyu Liu, Ruiyang Ren, Tianyi Tang, Wayne Xin Zhao, Xiaolei Wang, Xinyu Tang, Yifan Du, Yifan Li, Yingqian Min, Yupeng Hou, Yushuo Chen, Zhipeng Chen, Zican Dong, Zikang Liu


Pith reviewed 2026-05-10 22:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · scaling laws · emergent abilities · pre-training · model adaptation · utilization · capacity evaluation · ChatGPT

The pith

Large language models develop special abilities once their parameter scale exceeds a certain threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys recent advances in large language models (LLMs): pre-trained Transformer models scaled up to very large sizes. It emphasizes that scaling parameters beyond a threshold yields not only better performance but also capabilities absent from smaller models. The review covers background, key findings, and mainstream techniques across four areas: pre-training, adaptation tuning, utilization, and capacity evaluation, and it also catalogs available resources and open issues. The survey matters because these models, exemplified by ChatGPT, are changing how AI systems are built and applied to language tasks.

Core claim

The survey claims that large language models achieve significant performance improvements and exhibit special abilities not present in small-scale models when their parameter scale exceeds a certain level. It reviews the background, key findings, and mainstream techniques, with a focus on four major aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. The paper also summarizes available resources for developing LLMs and discusses remaining issues for future directions.
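An illustrative sketch (an editorial addition, not from the paper): the scaling picture the survey summarizes combines two reported regularities. Cross-entropy loss falls smoothly as a power law in parameter count, in the form L(N) = (N_c / N)^alpha from Kaplan et al. [32], while some task metrics sit near floor and then climb sharply past a scale threshold, which is how "emergent abilities" are usually operationalized. The toy Python below shows how a smooth loss curve can still produce a threshold-like jump on an all-or-nothing task; the constants and the loss-to-accuracy mapping are illustrative assumptions, not values from the survey.

    import math

    # Power-law loss in parameter count, L(N) = (N_c / N) ** alpha, the
    # functional form reported by Kaplan et al. [32]; these constants are
    # illustrative stand-ins, not fitted values.
    N_C, ALPHA = 8.8e13, 0.076

    def loss(n_params):
        """Smooth cross-entropy loss as a function of parameter count."""
        return (N_C / n_params) ** ALPHA

    def task_success(n_params, k=20):
        """Toy all-or-nothing task needing k consecutive tokens right
        (an assumed metric): smooth per-token gains compound into a sharp,
        'emergent'-looking jump in end-task success."""
        per_token_acc = math.exp(-loss(n_params))  # assumed loss-to-accuracy map
        return per_token_acc ** k

    for n in (1e8, 1e9, 1e10, 1e11, 1e12, 1e13):
        print(f"{n:.0e} params | loss {loss(n):.2f} | success {task_success(n):.2e}")

On this toy model, loss falls only from about 2.8 to 1.2 across five orders of magnitude of scale, while task success rises by roughly fourteen orders of magnitude: one way smooth scaling and abrupt-looking emergence can coexist, echoing the metric-choice debate in Schaeffer et al. [73].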

What carries the argument

The four-aspect framework of pre-training, adaptation tuning, utilization, and capacity evaluation, which organizes the analysis of how scaling produces emergent abilities in LLMs.

If this is right

  • The technical evolution of LLMs impacts the entire AI community.
  • This evolution would revolutionize the development and use of AI algorithms.
  • The launch of ChatGPT has drawn widespread societal attention.
  • Summarized resources aid in further LLM development.
  • Identified remaining issues guide future research directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emergence of special abilities through scaling may apply to non-language AI systems, prompting tests in other modalities.
  • The four-aspect structure provides a model for surveying progress in adjacent fields like multimodal learning.
  • Developers could experiment with scale thresholds to predict when new capabilities appear.
  • Incorporating more industry reports might refine the review's coverage of practical advances.

Load-bearing premise

The reviewed works and four-aspect framework together capture the essential advances in large language models without significant omissions or bias.

What would settle it

Discovery of a major new technique in large language models that falls outside the categories of pre-training, adaptation tuning, utilization, or capacity evaluation would falsify the survey's claim to comprehensive coverage.

read the original abstract

Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they further study the scaling effect by increasing the model size to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript is a survey reviewing recent advances in large language models (LLMs). It traces the evolution of language modeling from statistical and neural approaches to pre-trained Transformer-based PLMs, notes that scaling parameters beyond a threshold yields both performance gains and emergent special abilities absent in smaller models, and organizes the review around four aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. The paper also summarizes resources for LLM development and outlines remaining issues and future directions, highlighting ChatGPT as a notable industry milestone with broad AI impact.

Significance. If the coverage is balanced and citations accurate, the survey would serve as a useful organizing reference for a fast-moving field. It synthesizes reported observations on scaling effects and emergent abilities, maps mainstream techniques across the four focal areas, and points to resources and open problems, thereby helping researchers navigate the literature on LLMs and their influence on AI algorithm development and usage.

minor comments (2)
  1. The abstract introduces 'adaptation tuning' as one of the four core aspects; this phrasing is less common than 'fine-tuning' or 'instruction tuning' in the broader literature, so a brief definition or mapping to standard terminology in the corresponding section would aid clarity.
  2. The abstract states that the survey covers 'recent advances' and 'mainstream techniques' but does not indicate the temporal scope or approximate number of works reviewed; adding such a sentence would help readers gauge potential selection bias or completeness.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept the manuscript. The referee's summary accurately reflects the scope, structure, and contributions of our survey on large language models.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a literature survey paper with no original derivations, equations, fitted parameters, predictions, or self-referential claims. The central statements about scaling effects and emergent abilities are explicitly framed as summaries of prior work in the reviewed literature (pre-training, adaptation, utilization, evaluation). No load-bearing step reduces to a self-citation chain, ansatz, or input-by-construction; the text functions as an organizing map of external findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no free parameters, axioms, or invented entities; it only summarizes existing research.

pith-pipeline@v0.9.0 · 5680 in / 1030 out tokens · 47317 ms · 2026-05-10T22:42:07.116866+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion-CAM: Faithful Visual Explanations for dMLLMs

    cs.AI 2026-04 unverdicted novelty 8.0

    Diffusion-CAM is the first method for visual explanations in dMLLMs, using differentiable probing of intermediates plus four refinement modules to produce activation maps that outperform prior CAM approaches in locali...

  2. TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

    cs.CR 2026-04 unverdicted novelty 8.0

    TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

  3. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  4. A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

    cs.CL 2026-05 unverdicted novelty 7.0

    IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.

  5. MLPs are Efficient Distilled Generative Recommenders

    cs.IR 2026-05 unverdicted novelty 7.0

    SID-MLP distills autoregressive generative recommenders into efficient position-specific MLP heads for Semantic ID tasks, achieving 8.74x faster inference with matching accuracy.

  6. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  7. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 accept novelty 7.0

    StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

  8. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 unverdicted novelty 7.0

    StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

  9. NaiAD: Initiate Data-Driven Research for LLM Advertising

    cs.LG 2026-05 unverdicted novelty 7.0

    NaiAD is a new dataset and framework for LLM-native advertising that uses decoupled generation and calibrated scoring to identify four semantic strategies for balancing user and commercial utilities.

  10. Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

  11. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  12. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.

  13. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.

  14. LLMorphism: When humans come to see themselves as language models

    cs.CY 2026-05 unverdicted novelty 7.0

    LLMorphism is a proposed bias where exposure to human-like AI language leads people to view their own thinking as similar to statistical next-token prediction, risking under-attribution of mind to humans.

  15. Anny-Fit: All-Age Human Mesh Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...

  16. Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.

  17. Revisiting the Travel Planning Capabilities of Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs extract explicit constraints effectively but struggle with implicit open-world requirements, structural biases in plans, and ineffective self-correction during travel planning.

  18. Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

  19. Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

    cs.CL 2026-04 unverdicted novelty 7.0

    Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that grading accuracy declines at different rates with response difficulty, with errors clustering on the partially correct label and difficult...

  20. Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

    cs.CL 2026-04 unverdicted novelty 7.0

    Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.

  21. Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots

    cs.CR 2026-04 unverdicted novelty 7.0

    17 of 20 AI chatbots share conversation content or identifiers with third parties, including plaintext text sent to Microsoft Clarity via session replay in three cases.

  22. Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots

    cs.CR 2026-04 accept novelty 7.0

    17 of 20 AI chatbots share conversation content or identifiers with third parties, including plaintext prompt and response text with Microsoft Clarity in three cases.

  23. ProMax: Exploring the Potential of LLM-derived Profiles with Distribution Shaping for Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    ProMax uses dense retrieval and dual distribution reshaping on LLM-derived profiles to guide recommender models toward preferences for unseen items, substantially boosting base model performance on public datasets.

  24. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  25. Participatory provenance as representational auditing for AI-mediated public consultation

    cs.AI 2026-04 unverdicted novelty 7.0

    Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

  26. A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

    cs.AI 2026-04 unverdicted novelty 7.0

    A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.

  27. Self-Improving Tabular Language Models via Iterative Group Alignment

    cs.LG 2026-04 unverdicted novelty 7.0

    TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.

  28. STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering

    cs.AI 2026-04 unverdicted novelty 7.0

    STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.

  29. NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

    cs.DB 2026-04 conditional novelty 7.0

    NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.

  30. Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method

    cs.CL 2026-04 unverdicted novelty 7.0

    ConflictQA benchmark shows LLMs fail to resolve conflicts between text and KG evidence and often default to one source, motivating the XoT explanation-based reasoning method.

  31. Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contaminatio...

  32. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  33. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  34. Large Language Models Align with the Human Brain during Creative Thinking

    q-bio.NC 2026-04 unverdicted novelty 7.0

    LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.

  35. InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking

    cs.AI 2026-04 unverdicted novelty 7.0

    InfoSeeker is a new hierarchical parallel agent framework that delivers 3-5x speedups and benchmark gains on web search tasks by using context isolation and layered aggregation.

  36. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  37. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  38. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  39. WizardLM: Empowering large pre-trained language models to follow complex instructions

    cs.CL 2023-04 conditional novelty 7.0

    WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

  40. CHAL: Council of Hierarchical Agentic Language

    cs.AI 2026-05 unverdicted novelty 6.0

    CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.

  41. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  42. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  43. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  44. Conditional Memory Enhanced Item Representation for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

  45. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  46. PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

    cs.CV 2026-05 unverdicted novelty 6.0

    PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.

  47. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  48. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  49. Evaluating the False Trust engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  50. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

    cs.CL 2026-05 unverdicted novelty 6.0

    Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...

  51. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  52. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  53. Event Fields: Learning Latent Event Structure for Waveform Foundation Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on phy...

  54. Mechanism Design for Quality-Preserving LLM Advertising

    cs.GT 2026-05 unverdicted novelty 6.0

    A quality-preserving auction framework for LLM advertising uses RAG-based endogenous reserves and KL-regularized or screened VCG mechanisms to achieve DSIC, IR, higher revenue, and better semantic fidelity than baselines.

  55. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  56. The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

    cs.AI 2026-05 unverdicted novelty 6.0

    Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted reg...

  57. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

  58. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.

  59. Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts

    cs.CR 2026-05 unverdicted novelty 6.0

    An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.

  60. Bridging Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    BST-CDSR combines neural ODEs for continuous behavioral preference modeling with LLM-based temporal semantic generation and adaptive domain transfer to improve cross-domain sequential recommendations.

Reference graph

Works this paper leans on

296 extracted references · 200 canonical work pages · cited by 121 Pith papers · 65 internal anchors

  1. [1]

    A neural probabilistic language model,

    Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003

  2. [2]

    Natural language processing (almost) from scratch,

    R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011

  3. [3]

    The Language Instinct: How the Mind Creates Language

    S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014

  4. [4]

    The faculty of language: what is it, who has it, and how did it evolve?

    M. D. Hauser, N. Chomsky, and W. T. Fitch, “The faculty of language: what is it, who has it, and how did it evolve?” Science, vol. 298, no. 5598, pp. 1569–1579, 2002

  5. [5]

    Computing machinery and intelligence,

    A. M. Turing, “Computing machinery and intelligence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950

  6. [6]

    Statistical Methods for Speech Recognition

    F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998

  7. [7]

    Introduction to the special issue on statistical language modeling,

    J. Gao and C. Lin, “Introduction to the special issue on statistical language modeling,” ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004

  8. [8]

    Two decades of statistical language modeling: Where do we go from here?

    R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000

  9. [9]

    SRILM - an extensible language modeling toolkit,

    A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Seventh International Conference on Spoken Language Processing, 2002

  10. [10]

    Statistical language modeling for information retrieval,

    X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005

  11. [11]

    Statistical Language Models for Information Retrieval

    C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008

  12. [12]

    A second-order hidden Markov model for part-of-speech tagging,

    S. M. Thede and M. P. Harper, “A second-order hidden Markov model for part-of-speech tagging,” in 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182

  13. [13]

    A tree-based statistical language model for natural language speech recognition,

    L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989

  14. [14]

    Large language models in machine translation,

    T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867

  15. [15]

    Estimation of probabilities from sparse data for the language model component of a speech recognizer,

    S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987

  16. [16]

    Good-Turing frequency estimation without tears,

    W. A. Gale and G. Sampson, “Good-Turing frequency estimation without tears,” J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995

  17. [17]

    Recurrent neural network based language model,

    T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048

  18. [18]

    Recurrent neural network based language modeling in meeting recognition,

    S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880

  19. [19]

    Distributed representations of words and phrases and their compositionality,

    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119

  20. [21]

    Efficient estimation of word representations in vector space,

    T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013

  21. [22]

    Deep contextualized word representations,

    M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long P...

  22. [23]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008

  23. [24]

    BERT: pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short...

  24. [25]

    BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,

    M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 7...

  25. [26]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., pp. 1–40, 2021

  26. [27]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, p. 9, 2019

  27. [28]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019

  28. [31]

    What language model architecture and pretraining objective works best for zero-shot generalization?

    T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel, “What language model architecture and pretraining objective works best for zero-shot generalization?” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 16...

  29. [32]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020

  30. [33]

    Emergent Abilities of Large Language Models

    J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” CoRR, vol. abs/2206.07682, 2022

  31. [34]

    Talking about large language models,

    M. Shanahan, “Talking about large language models,” CoRR, vol. abs/2212.03551, 2022

  32. [35]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,” CoRR, vol. abs/2201.11903, 2022

  33. [36]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” CoRR, vol. abs/2203.15556, 2022

  34. [37]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large language model for science,” CoRR, vol. abs/2211.09085, 2022

  35. [38]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,

    P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Comput. Surv., pp. 195:1–195:35, 2023

  36. [39]

    A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT,

    C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, “A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT,” CoRR, vol. abs/2302.09419, 2023

  37. [40]

    Pre-trained models: Past, present and future,

    X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao, and J. Zhu, “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021

  38. [41]

    Pre-trained models for natural language processing: A survey,

    X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” CoRR, vol. abs/2003.08271, 2020

  39. [42]

    Planning for AGI and beyond,

    S. Altman, “Planning for AGI and beyond,” OpenAI Blog, February 2023

  40. [43]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,” CoRR, vol. abs/2303.12712, 2023

  41. [44]

    Language is not all you need: Aligning perception with language models

    S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei, “Language is not all you need: Aligning perception with language models,” CoRR, vol. abs/2302.14045, 2023

  42. [45]

    A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT,

    Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT,” arXiv preprint arXiv:2303.04226, 2023

  43. [46]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

  44. [47]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual ChatGPT: Talking, drawing and editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023

  45. [48]

    GPT-4 technical report,

    OpenAI, “GPT-4 technical report,” OpenAI, 2023

  46. [49]

    How does GPT obtain its ability? Tracing emergent abilities of language models to their sources,

    Y. Fu, H. Peng, and T. Khot, “How does GPT obtain its ability? Tracing emergent abilities of language models to their sources,” Yao Fu’s Notion, Dec 2022

  47. [50]

    Pretrained language model for text generation: A survey,

    J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained language model for text generation: A survey,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. ijcai.org, 2021, pp. 4492–4499

  48. [51]

    A survey of deep learning for mathematical reasoning,

    P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A survey of deep learning for mathematical reasoning,” CoRR, vol. abs/2212.10535, 2022

  49. [52]

    A Survey on In-context Learning

    Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, “A survey on in-context learning,” CoRR, vol. abs/2301.00234, 2023

  50. [53]

    Towards reasoning in large language models: A survey,

    J. Huang and K. C. Chang, “Towards reasoning in large language models: A survey,” CoRR, vol. abs/2212.10403, 2022

  51. [54]

    Reasoning with language model prompting: A survey,

    S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, “Reasoning with language model prompting: A survey,” CoRR, vol. abs/2212.09597, 2022

  52. [55]

    ChatGPT: potential, prospects, and limitations,

    J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, “ChatGPT: potential, prospects, and limitations,” in Frontiers of Information Technology & Electronic Engineering, 2023, pp. 1–6

  53. [56]

    Dense text retrieval based on pretrained language models: A survey,

    W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense text retrieval based on pretrained language models: A survey,” ACM Transactions on Information Systems, vol. 42, no. 4, pp. 1–60, 2024

  54. [57]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  55. [58]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, ...

  56. [59]

    LLaMA: Open and efficient foundation language models,

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” CoRR, 2023

  57. [60]

    Scaling Laws for Autoregressive Generative Modeling

    T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray et al., “Scaling laws for autoregressive generative modeling,” arXiv preprint arXiv:2010.14701, 2020

  58. [61]

    DoReMi: Optimizing data mixtures speeds up language model pretraining,

    S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “DoReMi: Optimizing data mixtures speeds up language model pretraining,” arXiv preprint arXiv:2305.10429, 2023

  59. [62]

    Will we run out of data? An analysis of the limits of scaling datasets in machine learning,

    P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho, “Will we run out of data? An analysis of the limits of scaling datasets in machine learning,” CoRR, vol. abs/2211.04325, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.04325

  60. [63]

    Scaling data-constrained language models,

    N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, “Scaling data-constrained language models,” arXiv preprint arXiv:2305.16264, 2023

  61. [64]

    The inverse scaling prize,

    I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu, A. Mueller, N. Kim, S. Bowman, and E. Perez, “The inverse scaling prize,” 2022. [Online]. Available: https://github.com/inverse-scaling/prize

  62. [65]

    Phase transitions in artificial intelligence systems,

    B. A. Huberman and T. Hogg, “Phase transitions in artificial intelligence systems,” Artificial Intelligence, vol. 33, no. 2, pp. 155–171, 1987

  63. [66]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E....

  64. [67]

    Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers,

    D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, “Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers,” CoRR, vol. abs/2212.10559, 2022

  65. [68]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” CoRR, vol. abs/2203.02155, 2022

  66. [69]

    Finetuned language models are zero-shot learners,

    J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

  67. [70]

    LaMDA: Language Models for Dialog Applications

    R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R...

  68. [71]

    Scaling Instruction-Finetuned Language Models

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned languag...

  69. [72]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmül...

  70. [73]

    Are emergent abilities of large language models a mirage?

    R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” arXiv preprint arXiv:2304.15004, 2023

  71. [74]

    Unlock predictable scaling from emergent abilities,

    S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao, Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun, “Unlock predictable scaling from emergent abilities,” 2023

  72. [75]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, “Grokking: Generalization beyond overfitting on small algorithmic datasets,” arXiv preprint arXiv:2201.02177, 2022

  73. [76]

    DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in KDD, 2020, pp. 3505–3506

  74. [77]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” CoRR, vol. abs/1909.08053, 2019

  75. [78]

    Efficient large-scale language model training on GPU clusters using Megatron-LM,

    D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC...

  76. [79]

    Reducing activation recomputation in large transformer models,

    V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” CoRR, vol. abs/2205.05198, 2022

  77. [80]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,...

  78. [81]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergu...

  79. [82]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” CoRR, vol. abs/2302.04761, 2023

Showing the first 79 reference entries.