Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 2017

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

cs.LG · 2025-11-24 · unverdicted · novelty 7.0

MapFormers learn cognitive maps via input-dependent Lie-algebra positional encodings and achieve near-perfect OOD generalization on cognitive tasks where standard transformers fail.

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

cs.AI · 2026-05-17 · unverdicted · novelty 6.0

HyperPersona is a hypergraph framework that jointly models document, sentence, and word levels of text via hyperedges and nodes, then uses a transformer graph encoder to predict Big Five personality traits from text alone.

Transformer as an Euler Discretization of Score-based Variational Flow

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

The Transformer is recovered exactly as the forward Euler step of spherical SVFlow, with multi-head attention and MoE/FFN as approximations to its vector field.

Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms

cs.RO · 2026-04-15 · unverdicted · novelty 6.0

A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.

AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout error by roughly five times versus prior DL-ROMs and ViTs on advection-diffusion-re

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

cs.LG · 2026-03-05 · unverdicted · novelty 6.0

BASIS uses balanced hashing and invariant scalars to sketch activations, cutting memory to O(L*R*N) while matching exact backprop performance on GPT training at R=32.

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

cs.LG · 2025-06-10 · unverdicted · novelty 6.0

BSA-TNP is a new neural process model with KRBlocks and biased scan attention that claims to match top accuracy while scaling inference to over 1M points in under a minute on a single GPU and supporting translation invariance.

Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues reduces but does not eliminate prediction accuracy above chance.

Ordinary Least Squares is a Special Case of Transformer

cs.LG · 2026-04-15 · unverdicted · novelty 4.0

Ordinary least squares is a special case of the single-layer linear transformer when attention parameters are set via spectral decomposition of the empirical covariance matrix.

citing papers explorer

Showing 9 of 9 citing papers.

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

cs.LG · 2025-11-24 · unverdicted · none · ref 3

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

cs.AI · 2026-05-17 · unverdicted · none · ref 24

Transformer as an Euler Discretization of Score-based Variational Flow

cs.LG · 2026-04-26 · unverdicted · none · ref 33

Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms

cs.RO · 2026-04-15 · unverdicted · none · ref 17

AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling

cs.LG · 2026-04-07 · unverdicted · none · ref 11

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

cs.LG · 2026-03-05 · unverdicted · none · ref 11

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

cs.LG · 2025-06-10 · unverdicted · none · ref 37

Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

cs.LG · 2026-04-14 · unverdicted · none · ref 41

Ordinary Least Squares is a Special Case of Transformer

cs.LG · 2026-04-15 · unverdicted · none · ref 1
