hub

Transformer-xl: Attentive language models beyond a ﬁxed-length context

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, Ruslan Salakhutdinov · 2019 · cs.LG · arXiv 1901.02860

39 Pith papers cite this work. Polarity classification is still indexing.

39 Pith papers citing it

open full Pith review browse 39 citing papers arXiv PDF

abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

cs.CL · 2019-06-19 · accept · novelty 7.0

XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

cs.LG · 2026-05-01 · conditional · novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

cs.CL · 2019-09-17 · unverdicted · novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

cs.CL · 2026-05-18 · conditional · novelty 6.0

RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

cs.CL · 2026-03-12 · unverdicted · novelty 6.0

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

cs.CV · 2025-04-19 · unverdicted · novelty 6.0

LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.

Capabilities of Gemini Models in Medicine

cs.AI · 2024-04-29 · unverdicted · novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

cs.CV · 2023-10-03 · unverdicted · novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

cs.CV · 2023-08-16 · unverdicted · novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

Deduplicating Training Data Makes Language Models Better

cs.CL · 2021-07-14 · unverdicted · novelty 6.0

Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.

Compressive Transformers for Long-Range Sequence Modelling

cs.LG · 2019-11-13 · unverdicted · novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

R-Transformer: Recurrent Neural Network Enhanced Transformer

cs.LG · 2019-07-12 · unverdicted · novelty 6.0

R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

cs.SD · 2019-07-10 · unverdicted · novelty 6.0

Pre-training a Transformer on Lakh MIDI improves quantitative and qualitative performance when generating four-instrument NES music scores from NES-MDB.

citing papers explorer

Showing 39 of 39 citing papers.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 85 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling cs.CL · 2020-12-31 · conditional · none · ref 141
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 9 · internal anchor
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 103 · internal anchor
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Augmenting Self-attention with Persistent Memory cs.LG · 2019-07-02 · unverdicted · none · ref 8 · internal anchor
Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.
XLNet: Generalized Autoregressive Pretraining for Language Understanding cs.CL · 2019-06-19 · accept · none · ref 9 · internal anchor
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 245
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts cs.LG · 2026-05-01 · conditional · none · ref 25
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 160
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 119
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 6
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism cs.CL · 2019-09-17 · unverdicted · none · ref 5
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language cs.CL · 2026-05-18 · conditional · none · ref 14 · internal anchor
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling cs.CL · 2026-03-12 · unverdicted · none · ref 14 · internal anchor
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 245 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers cs.CV · 2025-04-19 · unverdicted · none · ref 7 · internal anchor
LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 22 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment cs.CV · 2023-10-03 · unverdicted · none · ref 70 · internal anchor
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory cs.CV · 2023-08-16 · unverdicted · none · ref 172 · internal anchor
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
Deduplicating Training Data Makes Language Models Better cs.CL · 2021-07-14 · unverdicted · none · ref 17 · internal anchor
Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.
Compressive Transformers for Long-Range Sequence Modelling cs.LG · 2019-11-13 · unverdicted · none · ref 111 · internal anchor
Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
CTRL: A Conditional Transformer Language Model for Controllable Generation cs.CL · 2019-09-11 · unverdicted · none · ref 8 · internal anchor
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
R-Transformer: Recurrent Neural Network Enhanced Transformer cs.LG · 2019-07-12 · unverdicted · none · ref 7 · internal anchor
R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.
LakhNES: Improving multi-instrumental music generation with cross-domain pre-training cs.SD · 2019-07-10 · unverdicted · none · ref 42 · internal anchor
Pre-training a Transformer on Lakh MIDI improves quantitative and qualitative performance when generating four-instrument NES music scores from NES-MDB.
RT-Transformer: The Transformer Block as a Spherical State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 148
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference cs.CR · 2026-05-06 · conditional · none · ref 47
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
Learning to Theorize the World from Observation cs.LG · 2026-05-05 · unverdicted · none · ref 215
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding cs.LG · 2026-04-23 · unverdicted · none · ref 54
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms cs.RO · 2026-04-15 · unverdicted · none · ref 21
A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.
MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 5
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
HuggingFace's Transformers: State-of-the-art Natural Language Processing cs.CL · 2019-10-09 · accept · none · ref 151
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems cs.CR · 2025-07-05 · unverdicted · none · ref 58 · internal anchor
LLM-generated adversarial fake text can perform evasion, flooding, and poisoning attacks that mislead and degrade text-based CTI pipelines.
Agglomerative Attention cs.LG · 2019-07-15 · unverdicted · none · ref 4 · internal anchor
Presents agglomerative attention, a linear-complexity attention model that achieves comparable performance to full attention on language modeling tasks.
Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion cs.LG · 2026-04-19 · unverdicted · none · ref 35
A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 162
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 111 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
EmotionX-HSU: Adopting Pre-trained BERT for Emotion Classification cs.CL · 2019-07-23 · unverdicted · none · ref 2 · internal anchor
Fine-tuning BERT yields micro-F1 scores of 79.1% on Friends and 86.2% on EmotionPush test sets for four-class emotion classification in the EmotionX-2019 shared task.
Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba cs.LG · 2025-03-22 · unverdicted · none · ref 45 · internal anchor
A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference cs.LG · 2026-05-13 · unreviewed · ref 4 · internal anchor

Transformer-xl: Attentive language models beyond a ﬁxed-length context

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer