SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen, et al · 2025 · arXiv 2410.02367

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HASTE delivers up to 1.93× speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves a 3.82× speedup and 74.19× higher energy efficiency than an A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48× while improving quality.

Wan: Open and Advanced Large-Scale Video Generative Models

cs.CV · 2025-03-26 · unverdicted · novelty 5.0

Wan releases open 1.3B and 14B video diffusion models and claims superior performance over open-source and commercial baselines across multiple tasks, with consumer-grade efficiency.

Efficient Matrix Implementation for Rotary Position Embedding

cs.LG · 2026-04-10 · unverdicted · novelty 4.0

RoME reformulates RoPE as matrix operations to eliminate dimension-specific vector overhead and enable fused execution on modern hardware while remaining mathematically equivalent.

citing papers explorer

Showing 8 of 8 citing papers.

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · none · ref 44

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · none · ref 177

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · none · ref 114

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · none · ref 170

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · none · ref 70

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

cs.CV · 2026-05-18 · unverdicted · none · ref 55

Wan: Open and Advanced Large-Scale Video Generative Models

cs.CV · 2025-03-26 · unverdicted · none · ref 51

Efficient Matrix Implementation for Rotary Position Embedding

cs.LG · 2026-04-10 · unverdicted · none · ref 29
