
arxiv: 2108.12409 · v2 · submitted 2021-08-27 · 💻 cs.CL

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Pith reviewed 2026-05-13 00:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords transformer · attention · position bias · length extrapolation · ALiBi · language modeling · perplexity

The pith

Attention with linear biases enables transformer models to extrapolate to input sequences twice as long as seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple change to position handling in transformers supports extrapolation beyond training lengths. ALiBi replaces positional embeddings with a penalty on attention scores that grows linearly with query-key distance. A 1.3 billion parameter ALiBi model trained on sequences of 1024 tokens, when evaluated on 2048-token inputs, matches the perplexity of a sinusoidal model trained on 2048 tokens, while training 11 percent faster and using 11 percent less memory. The method's built-in preference for recent tokens also lets it outperform several strong position methods on WikiText-103.

Core claim

By adding a fixed, negative bias proportional to token distance to query-key attention scores, ALiBi lets a model train on length 1024 and extrapolate to length 2048 while matching the perplexity of a sinusoidal model trained directly on the longer length.

What carries the argument

Attention with Linear Biases (ALiBi): a bias term subtracted from attention scores that grows linearly with the distance between each query and key position.
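
A minimal sketch of that bias for a single attention head, in plain Python. The function name `alibi_bias`, the toy sequence length, and the slope value 0.5 are illustrative assumptions, not taken from the paper's code; the causal masking with negative infinity is the usual language-modeling convention and is also an assumption here.

```python
import math

def alibi_bias(seq_len: int, slope: float) -> list:
    """Return a seq_len x seq_len causal bias matrix for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i),
    so the penalty grows linearly with distance. Future keys (j > i) get -inf,
    which removes them after the softmax.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Hypothetical toy values: 4 positions, slope 0.5.
bias = alibi_bias(seq_len=4, slope=0.5)
for row in bias:
    print(row)
```

The matrix depends only on the sequence length and the slope, so it can be rebuilt for any evaluation length without new parameters, which is what makes testing on longer inputs possible at all.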

Load-bearing premise

A single fixed linear bias slope applied to attention scores is sufficient to produce reliable extrapolation across model sizes and sequence lengths without further changes to the model or training.

What would settle it

Train an ALiBi model on length 1024 and evaluate on length 2048; if its perplexity exceeds that of a sinusoidal model trained and tested on length 2048, the extrapolation claim fails.
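
The test above reduces to comparing two perplexities. The sketch below shows one way to phrase it, assuming perplexity is computed as the exponential of the mean per-token negative log-likelihood in nats. The helper names and the per-token losses are hypothetical placeholders for illustration, not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def extrapolation_holds(alibi_ppl, baseline_ppl, tolerance=0.0):
    """True if the ALiBi model is no worse than the baseline, within tolerance."""
    return alibi_ppl <= baseline_ppl + tolerance

# Made-up losses for illustration only (NOT measurements from the paper).
alibi_nlls = [2.9, 3.1, 3.0, 3.0]      # hypothetical: trained at 1024, tested at 2048
baseline_nlls = [3.0, 3.1, 3.0, 3.1]   # hypothetical: trained and tested at 2048

alibi_ppl = perplexity(alibi_nlls)
baseline_ppl = perplexity(baseline_nlls)
print(extrapolation_holds(alibi_ppl, baseline_ppl))
```

The `tolerance` argument is a design choice added here: with a single training run per model, some allowance for run-to-run noise is needed before calling the claim confirmed or refuted.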

Original abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
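
The penalty the abstract describes can be written compactly. In the expression below, following the formulation in the ALiBi paper, q_i is the i-th query of one head, K holds the first i keys, and m is that head's fixed slope; the usual scaling factor is omitted for brevity.

```latex
\[
\operatorname{softmax}\!\left( q_i K^{\top} + m \cdot \bigl[-(i-1),\ \ldots,\ -2,\ -1,\ 0\bigr] \right)
\]
```

The bracketed vector is the negated distance from the query to each key, so the most recent key receives no penalty and the oldest key receives the largest.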

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Attention with Linear Biases (ALiBi), a position method that adds a fixed linear penalty to query-key attention scores proportional to their distance, rather than injecting positional embeddings into the input. It claims this enables length extrapolation: a 1.3B-parameter model trained on sequences of length 1024 with ALiBi achieves the same perplexity on length-2048 inputs as a sinusoidal-embedding model trained directly on 2048-length sequences, while training 11% faster and using 11% less memory. ALiBi is also reported to outperform several strong position baselines on WikiText-103 due to its recency bias.

Significance. If the empirical claims hold under broader conditions, the result is significant for practical scaling of language models: it offers a lightweight way to decouple training length from inference length, yielding concrete efficiency gains without architectural changes. The approach is simple to implement and the reported speed/memory savings plus benchmark improvements provide a falsifiable, reproducible contribution to the position-embedding literature.

major comments (3)
  1. [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n), for heads h = 1, ..., n, is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This schedule underpins the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.
  2. [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.
  3. [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.
minor comments (3)
  1. [Figure 2] Figure 2 (attention visualization): the color scale and axis labels are not defined in the caption, making it hard to interpret the claimed recency bias.
  2. [§2] Related-work section (§2): the discussion of prior linear-bias or distance-based attention methods omits several recent works on relative position representations that appeared after Vaswani et al. (2017).
  3. [§3] Notation: the symbol m_h is introduced without an explicit equation number; adding 'Eq. (3)' would improve readability when the slope formula is referenced later.
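
The slope schedule discussed in major comment 1 can be sketched in a few lines. This assumes the geometric schedule the ALiBi paper describes for a power-of-two head count, m_h = 2^(-8h/n) with heads indexed h = 1, ..., n. The function name `alibi_slopes` is illustrative, and head counts that are not powers of two are not covered by this sketch.

```python
def alibi_slopes(n_heads: int) -> list:
    """Geometric slope schedule: m_h = 2^(-8h/n) for h = 1..n.

    Assumes n_heads is a power of two; other head counts are out of scope here.
    """
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

slopes_8 = alibi_slopes(8)    # 1/2, 1/4, ..., 1/256
slopes_16 = alibi_slopes(16)  # starts at 2^-0.5, ends at 2^-8
print(slopes_8)
print(slopes_16[0], slopes_16[-1])
```

Because the schedule depends only on the head count, it gives a concrete handle for the sensitivity analysis the referee asks for: vary `n_heads` and the extrapolation ratio while leaving everything else fixed.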

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support and reproducibility of the work.

Point-by-point responses
  1. Referee: [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n), for heads h = 1, ..., n, is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This schedule underpins the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.

    Authors: We appreciate this observation. The slope schedule was derived from preliminary experiments on smaller models and then applied without modification to all larger models reported in the paper. To directly address the request for sensitivity analysis, the revised manuscript will include new results (in an expanded §3 or appendix) testing the same fixed schedule across varying head counts (8–32 heads), model dimensions, and extrapolation ratios up to 4× on models up to 350M parameters. These additions will provide concrete evidence that the heuristic generalizes without per-model retuning. revision: yes

  2. Referee: [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.

    Authors: We agree that greater experimental transparency is needed. The revision will add the precise training corpus composition and data splits, the full hyperparameter configuration for the 1.3B model, and an explicit statement that the large-model runs were performed once owing to compute cost. We will also note that smaller-scale ablations (reported in the appendix) were repeated with multiple seeds and exhibited the same qualitative trends. We cannot, however, supply error bars for the 1.3B setting itself. revision: partial

  3. Referee: [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.

    Authors: We thank the referee for catching this ambiguity. All models—including the sinusoidal, rotary, and learned-position baselines—were trained with a maximum sequence length of 1024 tokens. WikiText-103 evaluation used the test set’s native (sometimes longer) sequences to measure extrapolation, but training length was identical across methods. The revised §4.2 will state this explicitly, removing any possibility of misinterpretation. revision: yes

standing simulated objections not resolved
  • The 1.3B-parameter experiments were run with only a single random seed due to prohibitive computational cost; consequently we cannot supply error bars or quantify sensitivity to initialization for the headline result.

Circularity Check

0 steps flagged

No significant circularity in the empirical evaluation of ALiBi.

full rationale

The paper introduces ALiBi as a position method that adds a fixed linear bias to attention scores and demonstrates its effectiveness through direct empirical comparison: a 1.3B model trained at length 1024 achieves equivalent perplexity on length 2048 to a sinusoidal baseline trained at 2048, with reported efficiency gains. The slope schedule is a fixed, predetermined choice (geometric progression across heads) presented as part of the method definition rather than fitted to the extrapolation results themselves. No equation or claim reduces the reported perplexity values to a parameter or quantity defined by the same experiment, and the central result remains an independent experimental outcome rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The claim rests on the empirical effectiveness of a linear attention bias whose slope is a tunable hyperparameter; no new entities are postulated and the background assumptions are standard transformer components.

free parameters (1)
  • linear bias slope
    The rate at which the distance penalty increases must be chosen (the referee report describes a per-head schedule m_h) to achieve the reported extrapolation; this value is not derived from first principles.
axioms (1)
  • domain assumption Adding a fixed linear penalty to query-key dot products preserves the core attention mechanism and training dynamics of the transformer.
    Invoked when the authors replace positional embeddings with the bias without further architectural changes.
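
To make the effect of the penalty concrete, the sketch below applies it to the last query position and compares the softmax weights with and without the bias. The helper names, the identical content scores, and the slope of 0.5 are hypothetical values chosen only to isolate the bias term.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Subtract slope * distance from each score, then apply softmax.

    scores[j] is the raw query-key score for key j, and the query sits at the
    last position, so its distance to key j is (last - j).
    """
    last = len(scores) - 1
    return softmax([s - slope * (last - j) for j, s in enumerate(scores)])

flat_scores = [1.0, 1.0, 1.0, 1.0]   # hypothetical: identical content scores
plain = softmax(flat_scores)          # uniform weights without the bias
biased = biased_attention_weights(flat_scores, slope=0.5)
print(plain)
print(biased)
```

With equal content scores the unbiased weights are uniform, while the biased weights rise toward the most recent key, each step differing by a factor of exp(slope). The softmax itself is unchanged; only its inputs are shifted.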

pith-pipeline@v0.9.0 · 5495 in / 1345 out tokens · 57592 ms · 2026-05-13T00:55:55.919030+00:00 · methodology



Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Positional Encoding for Neural Vehicle Routing

    cs.AI 2026-05 unverdicted novelty 7.0

    A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based...

  2. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

    cs.LG 2026-05 unverdicted novelty 7.0

    ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

  3. PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  4. URoPE: Universal Relative Position Embedding across Geometric Spaces

    cs.CV 2026-04 unverdicted novelty 7.0

    URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...

  5. Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

    cs.AI 2026-04 unverdicted novelty 7.0

    Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.

  6. Group Representational Position Encoding

    cs.LG 2025-12 unverdicted novelty 7.0

    GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

  7. Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

    cs.LG 2025-09 unverdicted novelty 7.0

    Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...

  8. Exact Sequence Interpolation with Transformers

    cs.LG 2025-02 conditional novelty 7.0

    Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.

  9. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

    cs.CL 2024-04 conditional novelty 7.0

    Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

  10. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  11. Towards Understanding Self-Pretraining for Sequence Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

  12. Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

  13. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  14. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

    cs.CL 2026-05 unverdicted novelty 6.0

    Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...

  15. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  16. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  17. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  18. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  19. It Just Takes Two: Scaling Amortized Inference to Large Sets

    cs.LG 2026-05 unverdicted novelty 6.0

    A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...

  20. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

  21. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  22. The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...

  23. LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

    cs.AI 2026-04 unverdicted novelty 6.0

    LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.

  24. MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    cs.CL 2026-04 unverdicted novelty 6.0

    MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.

  25. Stacked from One: Multi-Scale Self-Injection for Context Window Extension

    cs.CL 2026-03 unverdicted novelty 6.0

    SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8...

  26. Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

    cs.LG 2025-11 unverdicted novelty 6.0

    Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.

  27. Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

    cs.CL 2025-06 conditional novelty 6.0

    MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

  28. Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    cs.CV 2025-03 unverdicted novelty 6.0

    FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.

  29. Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    cs.LG 2024-12 unverdicted novelty 6.0

    FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.

  30. Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

    cs.LG 2024-11 unverdicted novelty 6.0

    CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.

  31. When Attention Sink Emerges in Language Models: An Empirical View

    cs.CL 2024-10 accept novelty 6.0

    Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

  32. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  33. MemGPT: Towards LLMs as Operating Systems

    cs.AI 2023-10 unverdicted novelty 6.0

    MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

  34. A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

    cs.LG 2026-05 unverdicted novelty 5.0

    Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.

  35. Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems

    cs.SE 2026-05 conditional novelty 5.0

    A multi-agent LLM-based framework extracts knowledge graphs from 50 real Ethernet switch manuals with 0.97-0.99 correctness to enable downstream test case specification generation.

  36. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  37. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

  38. Decouple and Cache: KV Cache Construction for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.

  39. Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models

    eess.SP 2026-05 unverdicted novelty 5.0

    Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.

  40. Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

    cs.CL 2026-04 unverdicted novelty 5.0

    Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

  41. Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention

    eess.IV 2026-04 unverdicted novelty 5.0

    Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.

  42. Voxtral TTS

    cs.AI 2026-03 unverdicted novelty 5.0

    Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...

  43. Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

    cs.CV 2025-09 unverdicted novelty 5.0

    Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual ...

  44. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    eess.AS 2024-10 unverdicted novelty 5.0

    F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

  45. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  46. Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

    cs.CL 2025-10 unverdicted novelty 4.0

    This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...

  47. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  48. Baichuan 2: Open Large-scale Language Models

    cs.CL 2023-09 unverdicted novelty 4.0

    Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.

  49. Positional Encoding in Transformer-Based Time Series Models: A Survey

    cs.LG 2025-02 unverdicted novelty 3.0

    A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.

  50. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  51. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  52. Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project

    cs.DC 2025-04 unverdicted novelty 2.0

    Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 51 Pith papers · 1 internal anchor

  1. [1]

    Adaptive Input Representations for Neural Language Modeling

    Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018. URL http://arxiv.org/abs/1809.10853

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  4. [4]

    Unsupervised Cross-lingual Representation Learning at Scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.7...

  5. [5]

    Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653...

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapo...

  7. [7]

    Openwebtext corpus

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  8. [8]

    Music transformer: Generating music with long-term structure

    Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam M. Shazeer, Andrew M. Dai, M. Hoffman, M. Dinculescu, and D. Eck. Music transformer: Generating music with long-term structure. In ICLR, 2019

  9. [9]

    Compositionality decomposed: How do neural networks generalise?

    Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, April 2020. doi:10.1613/jair.1.11674. URL https://doi.org/10.1613/jair.1.11674

  10. [10]

    Tying word vectors and word classifiers: A loss framework for language modeling

    Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017. URL https://openreview.net/forum?id=r1aPbsFle

  11. [11]

    J. Jumper, Richard Evans, A. Pritzel, Tim Green, Michael Figurnov, O. Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, A. Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, A. Cowie, B. Romera-Paredes, Stanislav Nikolov, Rishub Jain, J. Adler, T. Back, Stig Petersen, D. Reiman, Ellen Clancy, Michal Zielinski, Mart...

  12. [12]

    Generalization through Memorization: Nearest Neighbor Language Models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations (ICLR), 2020

  13. [13]

    Shape: Shifted absolute position embedding for transformers

    Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. Shape: Shifted absolute position embedding for transformers. ArXiv, abs/2109.05644, 2021

  14. [14]

    Andrew Kyle Lampinen, Stephanie C. Y. Chan, Andrea Banino, and Felix Hill. Towards mental time travel: a hierarchical memory for reinforcement learning agents. CoRR, abs/2105.14039, 2021. URL https://arxiv.org/abs/2105.14039

  15. [15]

    Base layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models, 2021

  16. [16]

    Jurassic-1: Technical details and evaluation

    Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021

  17. [17]

    CAPE: encoding relative positions with continuous augmented positional embeddings

    Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, and Alex Rogozhnikov. CAPE: encoding relative positions with continuous augmented positional embeddings. CoRR, abs/2106.03143, 2021. URL https://arxiv.org/abs/2106.03143

  18. [18]

    Roberta: A robustly optimized bert pretraining approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

  19. [19]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  20. [20]

    Tomas Mikolov and G. Zweig. Context dependent recurrent neural network language model. 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239, 2012

  21. [21]

    Recurrent neural network based language model

    Tomas Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010

  22. [22]

    Sebastian Nagel. Cc-news. https://commoncrawl.org/2016/10/news-dataset-available/, 2016

  23. [23]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications?, 2021

  24. [24]

    On the relation between position information and sentence length in neural machine translation

    Masato Neishi and Naoki Yoshinaga. On the relation between position information and sentence length in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 328–338, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/K19-1031. URL https://aclanth...

  25. [25]

    Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The eos decision and length extrapolation. In BlackBoxNLP@EMNLP, 2020. URL https://nlp.stanford.edu/pubs/newman2020extrapolation.pdf

  26. [26]

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy J. Li. Investigating the limitations of the transformers with simple arithmetic tasks. ArXiv, abs/2102.13019, 2021

  27. [27]

    Scaling neural machine translation

    Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT), 2018

  28. [28]

    A decomposable attention model for natural language inference

    Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249–2255, Austin, Texas, November 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1244. URL https://a...

  29. [29]

    Using the output embedding to improve language models

    Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E17-2025

  30. [30]

    Improving transformer models by reordering their sublayers

    Ofir Press, Noah A. Smith, and Omer Levy. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2996–3005, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.270. URL https://www.aclweb.org/anthology/2020.acl-main.270

  31. [31]

    Shortformer: Better language modeling using shorter inputs

    Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493–5505, Online, August 2021. Association for Computati...

  32. [32]

    Compressive transformers for long-range sequence modelling

    Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH

  33. [33]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  34. [34]

    Analysis of positional encodings for neural machine translation

    Jan Rosendahl, Viet Anh Khoa Tran, Weiyue Wang, and Hermann Ney. Analysis of positional encodings for neural machine translation. In International Workshop on Spoken Language Translation, Hong Kong, China, November 2019

  35. [35]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020

  36. [36]

    Self-Attention with Relative Position Representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational...

  37. [37]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021

  38. [38]

    A simple method for commonsense reasoning

    Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2018

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:/...

  40. [40]

    GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

    Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  41. [41]

    The case for translation-invariant self-attention in transformer-based language models

    Ulme Wennberg and Gustav Eje Henter. The case for translation-invariant self-attention in transformer-based language models, 2021

  42. [42]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  43. [43]

    DA-transformer: Distance-aware transformer

    Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. DA-transformer: Distance-aware transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2059–2068, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.166. URL h...

  44. [44]

    Recurrent neural network regularization

    Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2014

  45. [45]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27, 2015