Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Mike Lewis; Noah A. Smith; Ofir Press

arxiv: 2108.12409 · v2 · submitted 2021-08-27 · 💻 cs.CL

Pith reviewed 2026-05-13 00:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords transformer · attention · position bias · length extrapolation · ALiBi · language modeling · perplexity

The pith

Attention with linear biases enables transformer models to extrapolate to input sequences twice as long as those seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple change to position handling in transformers supports extrapolation beyond training lengths. ALiBi replaces positional embeddings with a linear distance-based penalty on attention scores. A 1.3 billion parameter model trained on sequences of 1024 tokens, when evaluated on sequences of 2048 tokens, matches the perplexity of a sinusoidal model trained on 2048 tokens. Relative to that sinusoidal baseline, the ALiBi model trains 11% faster and uses 11% less memory. The built-in recency preference further improves results on WikiText-103.

Core claim

By adding a fixed negative slope bias to query-key attention scores based on token distance, ALiBi lets models train on length 1024 and extrapolate to length 2048 while matching the perplexity of models trained directly on the longer length.

What carries the argument

Attention with Linear Biases (ALiBi): a bias term subtracted from attention scores that grows linearly with the distance between each query and key position.
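The mechanism described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`alibi_bias`, `attention_weights`) and the example slope of 0.5 are assumptions made for this sketch, and a single attention head with causal masking is assumed.

```python
import math

def alibi_bias(seq_len, slope):
    """Causal ALiBi-style bias for one head (illustrative helper).

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i)
    and -inf for future keys, which the softmax then maps to weight 0.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, slope):
    """Add the distance bias to raw query-key scores, then softmax each row."""
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)]) for i in range(n)]

# With identical content scores, the bias alone decides the weights,
# so nearer keys receive more attention than distant ones.
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = attention_weights(uniform_scores, slope=0.5)
```

No positional embedding appears anywhere in the sketch: position enters only through the bias added to the scores, which is why the same function can be called with a longer `seq_len` at inference time.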

Load-bearing premise

A single fixed linear bias slope applied to attention scores is sufficient to produce reliable extrapolation across model sizes and sequence lengths without further changes to the model or training.

What would settle it

Train an ALiBi model on length 1024 and evaluate on length 2048; if its perplexity exceeds that of a sinusoidal model trained and tested on length 2048, the extrapolation claim fails.
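The test above reduces to comparing two perplexities. The sketch below shows one way to compute and compare them, assuming per-token natural-log probabilities are available from each model; the function names and the `tolerance` parameter are hypothetical and not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def extrapolation_claim_holds(alibi_ppl, sinusoidal_ppl, tolerance=0.0):
    """Hypothetical decision rule for the test described above.

    alibi_ppl:      ALiBi model trained at 1024, evaluated at 2048.
    sinusoidal_ppl: sinusoidal model trained and evaluated at 2048.
    The claim survives if ALiBi is no worse, up to the chosen tolerance.
    """
    return alibi_ppl <= sinusoidal_ppl + tolerance
```

A nonzero `tolerance` would be a judgment call about how much run-to-run noise to allow, and it matters here because the headline runs are reported for a single seed.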

Original abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Attention with Linear Biases (ALiBi), a position method that adds a fixed linear penalty to query-key attention scores proportional to their distance, rather than injecting positional embeddings into the input. It claims this enables length extrapolation: a 1.3B-parameter model trained on sequences of length 1024 with ALiBi achieves the same perplexity on length-2048 inputs as a sinusoidal-embedding model trained directly on 2048-length sequences, while training 11% faster and using 11% less memory. ALiBi is also reported to outperform several strong position baselines on WikiText-103 due to its recency bias.

Significance. If the empirical claims hold under broader conditions, the result is significant for practical scaling of language models: it offers a lightweight way to decouple training length from inference length, yielding concrete efficiency gains without architectural changes. The approach is simple to implement and the reported speed/memory savings plus benchmark improvements provide a falsifiable, reproducible contribution to the position-embedding literature.

major comments (3)

[§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n) for heads h = 1, …, n is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This bears directly on the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.
[§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.
[§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.
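For readers weighing the first major comment, the head-count dependence of the slope schedule can be made concrete. The sketch below assumes the schedule as the paper describes it in words: a geometric sequence whose first term is 2^(-8/n) and whose ratio is that same value. The function name `alibi_slopes` is chosen for this illustration, and the sketch does not cover any special handling an implementation might apply to particular head counts.

```python
def alibi_slopes(num_heads):
    """Geometric per-head slopes: first term 2**(-8/num_heads), ratio equal to it."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

# For 8 heads this gives 1/2, 1/4, ..., 1/256.
slopes_8 = alibi_slopes(8)

# Doubling the head count keeps the smallest slope at 2**-8 but packs
# the slopes more densely between it and 1.
slopes_16 = alibi_slopes(16)
```

The schedule depends only on the number of heads, not on model width or training length, which is the property the referee asks to see stress-tested.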

minor comments (3)

[Figure 2] Figure 2 (attention visualization): the color scale and axis labels are not defined in the caption, making it hard to interpret the claimed recency bias.
[§2] Related-work section (§2): the discussion of prior linear-bias or distance-based attention methods omits several recent works on relative position representations that appeared after Vaswani et al. (2017).
[§3] Notation: the symbol m_h is introduced without an explicit equation number; adding 'Eq. (3)' would improve readability when the slope formula is referenced later.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support and reproducibility of the work.

Referee: [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n) for heads h = 1, …, n is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This bears directly on the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.

Authors: We appreciate this observation. The slope schedule was derived from preliminary experiments on smaller models and then applied without modification to all larger models reported in the paper. To directly address the request for sensitivity analysis, the revised manuscript will include new results (in an expanded §3 or appendix) testing the same fixed schedule across varying head counts (8–32 heads), model dimensions, and extrapolation ratios up to 4× on models up to 350M parameters. These additions will provide concrete evidence that the heuristic generalizes without per-model retuning.
revision: yes
Referee: [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.

Authors: We agree that greater experimental transparency is needed. The revision will add the precise training corpus composition and data splits, the full hyperparameter configuration for the 1.3B model, and an explicit statement that the large-model runs were performed once owing to compute cost. We will also note that smaller-scale ablations (reported in the appendix) were repeated with multiple seeds and exhibited the same qualitative trends. We cannot, however, supply error bars for the 1.3B setting itself.
revision: partial
Referee: [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.

Authors: We thank the referee for catching this ambiguity. All models, including the sinusoidal, rotary, and learned-position baselines, were trained with a maximum sequence length of 1024 tokens. WikiText-103 evaluation used the test set's native (sometimes longer) sequences to measure extrapolation, but training length was identical across methods. The revised §4.2 will state this explicitly, removing any possibility of misinterpretation.
revision: yes

standing simulated objections not resolved

The 1.3B-parameter experiments were run with only a single random seed due to prohibitive computational cost; consequently we cannot supply error bars or quantify sensitivity to initialization for the headline result.

Circularity Check

0 steps flagged

No significant circularity in the empirical evaluation of ALiBi.

The paper introduces ALiBi as a position method that adds a fixed linear bias to attention scores and demonstrates its effectiveness through direct empirical comparison: a 1.3B model trained at length 1024 achieves equivalent perplexity on length 2048 to a sinusoidal baseline trained at 2048, with reported efficiency gains. The slope schedule is a fixed, predetermined choice (geometric progression across heads) presented as part of the method definition rather than fitted to the extrapolation results themselves. No equation or claim reduces the reported perplexity values to a parameter or quantity defined by the same experiment, and the central result remains an independent experimental outcome rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on the empirical effectiveness of a linear attention bias whose slope is a tunable hyperparameter; no new entities are postulated and the background assumptions are standard transformer components.

free parameters (1)

linear bias slope
The rate at which the distance penalty increases must be chosen (in ALiBi, one fixed slope per attention head) to achieve the reported extrapolation; this value is not derived from first principles.
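To make concrete what this parameter controls, here is a short worked case. The first line is the biased softmax for query position i with head slope m; the second assumes, purely for illustration, that every content score for that query equals the same constant c, so that only the bias shapes the attention weights a_{ij}.

```latex
\[
\mathrm{softmax}\bigl(q_i K^\top + m \cdot [-(i-1), \ldots, -2, -1, 0]\bigr)
\]
\[
q_i k_j^\top = c \ \text{for all } j \le i
\;\Longrightarrow\;
a_{ij} = \frac{e^{-m (i-j)}}{\sum_{t=1}^{i} e^{-m (i-t)}},
\qquad
\frac{a_{ij}}{a_{ii}} = e^{-m (i-j)} .
\]
```

Under that assumption the weight on a key decays exponentially with its distance from the query, at a rate set by m: a larger slope gives a head a shorter effective window, a smaller slope a longer one.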

axioms (1)

domain assumption: Adding a fixed linear penalty to query-key dot products preserves the core attention mechanism and training dynamics of the transformer.
Invoked when the authors replace positional embeddings with the bias without further architectural changes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3 · unclear

We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory.

IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Positional Encoding for Neural Vehicle Routing
cs.AI 2026-05 unverdicted novelty 7.0

A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based...
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
cs.LG 2026-05 unverdicted novelty 7.0

ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
cs.LG 2026-04 unverdicted novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
URoPE: Universal Relative Position Embedding across Geometric Spaces
cs.CV 2026-04 unverdicted novelty 7.0

URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...
Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
cs.AI 2026-04 unverdicted novelty 7.0

Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
Group Representational Position Encoding
cs.LG 2025-12 unverdicted novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
cs.LG 2025-09 unverdicted novelty 7.0

Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...
Exact Sequence Interpolation with Transformers
cs.LG 2025-02 conditional novelty 7.0

Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
cs.CL 2024-04 conditional novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Towards Understanding Self-Pretraining for Sequence Classification
cs.LG 2026-05 unverdicted novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
cs.LG 2026-05 unverdicted novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
cs.AI 2026-05 unverdicted novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
cs.CL 2026-05 unverdicted novelty 6.0

Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
cs.CL 2026-05 conditional novelty 6.0

EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
Remember to Forget: Gated Adaptive Positional Encoding
cs.LG 2026-05 unverdicted novelty 6.0

GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
cs.AI 2026-05 unverdicted novelty 6.0

HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
cs.CL 2026-05 unverdicted novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
It Just Takes Two: Scaling Amortized Inference to Large Sets
cs.LG 2026-05 unverdicted novelty 6.0

A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
cs.LG 2026-04 unverdicted novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
cs.AI 2026-04 unverdicted novelty 6.0

LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.
MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
cs.CL 2026-04 unverdicted novelty 6.0

MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
cs.CL 2026-03 unverdicted novelty 6.0

SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8...
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
cs.LG 2025-11 unverdicted novelty 6.0

Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
cs.CL 2025-06 conditional novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
cs.CV 2025-03 unverdicted novelty 6.0

FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
cs.LG 2024-12 unverdicted novelty 6.0

FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning
cs.LG 2024-11 unverdicted novelty 6.0

CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
When Attention Sink Emerges in Language Models: An Empirical View
cs.CL 2024-10 accept novelty 6.0

Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
Gated Linear Attention Transformers with Hardware-Efficient Training
cs.LG 2023-12 unverdicted novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
MemGPT: Towards LLMs as Operating Systems
cs.AI 2023-10 unverdicted novelty 6.0

MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits
cs.LG 2026-05 unverdicted novelty 5.0

Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.
Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems
cs.SE 2026-05 conditional novelty 5.0

A multi-agent LLM-based framework extracts knowledge graphs from 50 real Ethernet switch manuals with 0.97-0.99 correctness to enable downstream test case specification generation.
Kaczmarz Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
cs.LG 2026-05 unverdicted novelty 5.0

FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
Decouple and Cache: KV Cache Construction for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 5.0

DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.
Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models
eess.SP 2026-05 unverdicted novelty 5.0

Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
cs.CL 2026-04 unverdicted novelty 5.0

Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
eess.IV 2026-04 unverdicted novelty 5.0

Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
Voxtral TTS
cs.AI 2026-03 unverdicted novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
cs.CV 2025-09 unverdicted novelty 5.0

Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual ...
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
eess.AS 2024-10 unverdicted novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
cs.CL 2025-10 unverdicted novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Baichuan 2: Open Large-scale Language Models
cs.CL 2023-09 unverdicted novelty 4.0

Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
Positional Encoding in Transformer-Based Time Series Models: A Survey
cs.LG 2025-02 unverdicted novelty 3.0

A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
cs.DC 2025-04 unverdicted novelty 2.0

Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 51 Pith papers · 1 internal anchor

[1]

Adaptive Input Representations for Neural Language Modeling

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018. URL http://arxiv.org/abs/1809.10853

work page Pith review arXiv 2018
[2]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020
[4]

Unsupervised Cross-lingual Representation Learning at Scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.7...

work page doi:10.18653/v1/2020.acl-main.747 2020
[5]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653...

work page doi:10.18653/v1/p19-1285 2019
[6]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapo...

work page doi:10.18653/v1/n19-1423 2019
[7]

Openwebtext corpus

Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

[8]

Music transformer: Generating music with long-term structure

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam M. Shazeer, Andrew M. Dai, M. Hoffman, M. Dinculescu, and D. Eck. Music transformer: Generating music with long-term structure. In ICLR, 2019

[9]

Compositionality decomposed: How do neural networks generalise?

Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, April 2020. doi:10.1613/jair.1.11674. URL https://doi.org/10.1613/jair.1.11674

[10]

Tying word vectors and word classifiers: A loss framework for language modeling

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017. URL https://openreview.net/forum?id=r1aPbsFle

[11]

J. Jumper, Richard Evans, A. Pritzel, Tim Green, Michael Figurnov, O. Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, A. Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, A. Cowie, B. Romera-Paredes, Stanislav Nikolov, Rishub Jain, J. Adler, T. Back, Stig Petersen, D. Reiman, Ellen Clancy, Michal Zielinski, Mart...

[12]

Generalization through Memorization: Nearest Neighbor Language Models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations (ICLR), 2020

[13]

Shape: Shifted absolute position embedding for transformers

Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. Shape: Shifted absolute position embedding for transformers. ArXiv, abs/2109.05644, 2021

[14]

Andrew Kyle Lampinen, Stephanie C. Y. Chan, Andrea Banino, and Felix Hill. Towards mental time travel: a hierarchical memory for reinforcement learning agents. CoRR, abs/2105.14039, 2021. URL https://arxiv.org/abs/2105.14039

[15]

Base layers: Simplifying training of large, sparse models

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models, 2021

[16]

Jurassic-1: Technical details and evaluation

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021

[17]

CAPE: encoding relative positions with continuous augmented positional embeddings

Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, and Alex Rogozhnikov. CAPE: encoding relative positions with continuous augmented positional embeddings. CoRR, abs/2106.03143, 2021. URL https://arxiv.org/abs/2106.03143

[18]

Roberta: A robustly optimized bert pretraining approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

[19]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

[20]

Tomas Mikolov and G. Zweig. Context dependent recurrent neural network language model. 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239, 2012

[21]

Recurrent neural network based language model

Tomas Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010

[22]

Sebastian Nagel. Cc-news. https://commoncrawl.org/2016/10/news-dataset-available/, 2016

[23]

Do transformer modifications transfer across implementations and applications?

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications?, 2021

[24]

On the relation between position information and sentence length in neural machine translation

Masato Neishi and Naoki Yoshinaga. On the relation between position information and sentence length in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 328–338, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/K19-1031. URL https://aclanth...

[25]

Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The eos decision and length extrapolation. In BlackBoxNLP@EMNLP, 2020. URL https://nlp.stanford.edu/pubs/newman2020extrapolation.pdf

[26]

Rodrigo Nogueira, Zhiying Jiang, and Jimmy J. Li. Investigating the limitations of the transformers with simple arithmetic tasks. ArXiv, abs/2102.13019, 2021

[27]

Scaling neural machine translation

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT), 2018

[28]

A decomposable attention model for natural language inference

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249–2255, Austin, Texas, November 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1244. URL https://a...

[29]

Using the output embedding to improve language models

Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E17-2025

[30]

Improving transformer models by reordering their sublayers

Ofir Press, Noah A. Smith, and Omer Levy. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2996–3005, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.270. URL https://www.aclweb.org/anthology/2020.acl-main.270

[31]

Shortformer: Better language modeling using shorter inputs

Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493–5505, Online, August 2021. Association for Computati...

[32]

Compressive transformers for long-range sequence modelling

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH

[33]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html

[34]

Analysis of positional encodings for neural machine translation

Jan Rosendahl, Viet Anh Khoa Tran, Weiyue Wang, and Hermann Ney. Analysis of positional encodings for neural machine translation. In International Workshop on Spoken Language Translation, Hong Kong, China, November 2019

[35]

Efficient content-based sparse attention with routing transformers

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020

[36]

Self-Attention with Relative Position Representations

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational...

doi:10.18653/v1/n18-2074
[37]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021

[38]

A simple method for commonsense reasoning

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2018

[39]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:/...

[40]

GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

[41]

The case for translation-invariant self-attention in transformer-based language models

Ulme Wennberg and Gustav Eje Henter. The case for translation-invariant self-attention in transformer-based language models, 2021

[42]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

[43]

DA-transformer: Distance-aware transformer

Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. DA-transformer: Distance-aware transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2059–2068, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.166. URL h...

[44]

Recurrent neural network regularization

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2014

[45]

Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27, 2015

