Efficient content-based sparse attention with routing Transformers. arXiv preprint arXiv:2003.05997, 2020

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Rethinking Attention with Performers

cs.LG · 2020-09-30 · unverdicted · novelty 7.0

Performers approximate full-rank softmax attention with FAVOR+ random features, reducing attention cost to linear in sequence length, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

Longformer: The Long-Document Transformer

cs.CL · 2020-04-10 · accept · novelty 7.0

Longformer combines local windowed attention with task-specific global attention so that attention cost scales linearly with sequence length, and after pretraining it reaches state-of-the-art results on long-document language modeling, QA, and summarization.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits, setting new few-shot state-of-the-art results on hundreds of benchmarks and outperforming average human performance on BIG-bench.

citing papers explorer

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
    cs.LG · 2024-07-11 · accept · none · ref 47

    FlashAttention-3 speeds up attention on H100 GPUs by 1.5-2x, reaching 740 TFLOPs/s (75% utilization) in FP16 and close to 1.2 PFLOPs/s in FP8, while cutting numerical error by 2.6x versus baseline FP8 attention.

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    cs.LG · 2024-01-19 · conditional · none · ref 241

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup across its two fine-tuning regimes.

  • Rethinking Attention with Performers
    cs.LG · 2020-09-30 · unverdicted · none · ref 147

    Performers approximate full-rank softmax attention with FAVOR+ random features, reducing attention cost to linear in sequence length, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

  • Longformer: The Long-Document Transformer
    cs.CL · 2020-04-10 · accept · none · ref 115

    Longformer combines local windowed attention with task-specific global attention so that attention cost scales linearly with sequence length, and after pretraining it reaches state-of-the-art results on long-document language modeling, QA, and summarization.

  • PaLM: Scaling Language Modeling with Pathways
    cs.CL · 2022-04-05 · accept · none · ref 129

    PaLM 540B demonstrates continued scaling benefits, setting new few-shot state-of-the-art results on hundreds of benchmarks and outperforming average human performance on BIG-bench.

  • Deformable DETR: Deformable Transformers for End-to-End Object Detection
    cs.CV · 2020-10-08 · accept · none · ref 9

    Deformable DETR uses sparse deformable attention on image features to achieve higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs.