hub Mixed citations

Self-attention with relative position repre- sentations

Peter Shaw, Jakob Uszkoreit, Ashish Vaswani · 2018 · cs.CL · arXiv 1803.02155

Mixed citation behavior. Most common role is background (60%).

24 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 24 citing papers arXiv PDF

abstract

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 2

citation-polarity summary

background 3 use method 2

representative citing papers

Group Representational Position Encoding

cs.LG · 2025-12-08 · unverdicted · novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

Selective Rotary Position Embedding

cs.CL · 2025-11-21 · unverdicted · novelty 7.0

Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Video Diffusion Models

cs.CV · 2022-04-07 · unverdicted · novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers

cs.LG · 2025-09-18 · unverdicted · novelty 6.0

DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and biomedical signals.

Learning-Optimized Qubit Mapping and Reuse to Minimize Inter-Core Communication in Modular Quantum Architectures

quant-ph · 2025-06-11 · unverdicted · novelty 6.0

QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.

ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

cs.CV · 2025-05-26 · unverdicted · novelty 6.0

ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.

LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

cs.CV · 2025-04-19 · unverdicted · novelty 6.0

LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.

Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures

quant-ph · 2024-06-17 · unverdicted · novelty 6.0

An attention-based DRL agent with Transformer encoder and GNN learns heuristics for qubit-to-core allocation in multi-core quantum systems to minimize state transfers and online compilation time.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Efficient Training of Language Models to Fill in the Middle

cs.CL · 2022-07-28 · unverdicted · novelty 6.0

Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.

Deep Wave Network for Modeling Multi-Scale Physical Dynamics

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

DW-Net improves the accuracy versus computational cost Pareto front over standard U-Nets for 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Reward-Forcing: Autoregressive Video Generation with Reward Feedback

cs.CV · 2026-01-23 · unverdicted · novelty 5.0

Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

cs.SD · 2026-05-20 · unverdicted · novelty 4.0

The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.

Lightweight Spatio-Temporal Attention Network with Graph Embedding and Rotational Position Encoding for Traffic Forecasting

cs.AI · 2025-05-17 · unverdicted · novelty 3.0

LSTAN-GERPE uses spatio-temporal attention, graph embedding, and grid-searched rotational position encoding to achieve advanced accuracy on PeMS04 and PeMS08 traffic forecasting datasets without heavy feature engineering.

Positional Encoding in Transformer-Based Time Series Models: A Survey

cs.LG · 2025-02-17 · unverdicted · novelty 3.0

A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

Large Language Models: A Survey

cs.CL · 2024-02-09 · accept · novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

cs.LG · 2025-09-04

citing papers explorer

Showing 24 of 24 citing papers.

Group Representational Position Encoding cs.LG · 2025-12-08 · unverdicted · none · ref 20 · internal anchor
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Selective Rotary Position Embedding cs.CL · 2025-11-21 · unverdicted · none · ref 56 · internal anchor
Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 23
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
Video Diffusion Models cs.CV · 2022-04-07 · unverdicted · none · ref 45
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 66
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers cs.LG · 2025-09-18 · unverdicted · none · ref 9 · internal anchor
DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and biomedical signals.
Learning-Optimized Qubit Mapping and Reuse to Minimize Inter-Core Communication in Modular Quantum Architectures quant-ph · 2025-06-11 · unverdicted · none · ref 44 · internal anchor
QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers cs.CV · 2025-05-26 · unverdicted · none · ref 4 · internal anchor
ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.
LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers cs.CV · 2025-04-19 · unverdicted · none · ref 20 · internal anchor
LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.
Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures quant-ph · 2024-06-17 · unverdicted · none · ref 44 · internal anchor
An attention-based DRL agent with Transformer encoder and GNN learns heuristics for qubit-to-core allocation in multi-core quantum systems to minimize state transfers and online compilation time.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 146 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Efficient Training of Language Models to Fill in the Middle cs.CL · 2022-07-28 · unverdicted · none · ref 135 · internal anchor
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction cs.LG · 2026-05-06 · unverdicted · none · ref 20
HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.
Deep Wave Network for Modeling Multi-Scale Physical Dynamics cs.LG · 2026-05-05 · unverdicted · none · ref 65
DW-Net improves the accuracy versus computational cost Pareto front over standard U-Nets for 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.
RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation cs.CV · 2026-04-30 · unverdicted · none · ref 68
RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 89
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Reward-Forcing: Autoregressive Video Generation with Reward Feedback cs.CV · 2026-01-23 · unverdicted · none · ref 17 · internal anchor
Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity cs.CL · 2026-04-22 · unverdicted · none · ref 38
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model cs.SD · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
Lightweight Spatio-Temporal Attention Network with Graph Embedding and Rotational Position Encoding for Traffic Forecasting cs.AI · 2025-05-17 · unverdicted · none · ref 12 · internal anchor
LSTAN-GERPE uses spatio-temporal attention, graph embedding, and grid-searched rotational position encoding to achieve advanced accuracy on PeMS04 and PeMS08 traffic forecasting datasets without heavy feature engineering.
Positional Encoding in Transformer-Based Time Series Models: A Survey cs.LG · 2025-02-17 · unverdicted · none · ref 52 · internal anchor
A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 237
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 126
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation cs.LG · 2025-09-04 · unreviewed · ref 80 · internal anchor

Self-attention with relative position repre- sentations

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer