Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, Illia Polosukhin · 2017

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

representative citing papers

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

cs.CV · 2025-09-28 · conditional · novelty 6.0

RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.

Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

cs.LG · 2025-09-25 · unverdicted · novelty 6.0

DINO decomposes turbulent evolution into parallel local differential and global integral operators to achieve stable autoregressive forecasting on 2D Kolmogorov flow.

EMMA: End-to-End Multimodal Model for Autonomous Driving

cs.CV · 2024-10-30 · unverdicted · novelty 6.0

EMMA is an end-to-end multimodal LLM that converts camera data into trajectories, objects, and road graphs via text prompts and reports state-of-the-art motion planning on nuScenes plus competitive detection results on Waymo.

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

cs.CV · 2023-09-30 · accept · novelty 6.0

PixArt-α matches commercial text-to-image quality with a diffusion transformer trained in 675 A100 GPU days through decomposed training stages, cross-attention text injection, and vision-language model dense captions.

VisualBERT: A Simple and Performant Baseline for Vision and Language

cs.CV · 2019-08-09 · conditional · novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

citing papers explorer

Showing 4 of 4 citing papers after filters.

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization cs.CV · 2025-09-28 · conditional · none · ref 56
RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.
EMMA: End-to-End Multimodal Model for Autonomous Driving cs.CV · 2024-10-30 · unverdicted · none · ref 78
EMMA is an end-to-end multimodal LLM that converts camera data into trajectories, objects, and road graphs via text prompts and reports state-of-the-art motion planning on nuScenes plus competitive detection results on Waymo.
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis cs.CV · 2023-09-30 · accept · none · ref 61
PixArt-α matches commercial text-to-image quality with a diffusion transformer trained in 675 A100 GPU days through decomposed training stages, cross-attention text injection, and vision-language model dense captions.
VisualBERT: A Simple and Performant Baseline for Vision and Language cs.CV · 2019-08-09 · conditional · none · ref 35
VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

Attention is all you need

fields

years

verdicts

representative citing papers

citing papers explorer