Efficient Context Scaling with LongCat Zigzag Attention
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2 · Fields: cs.LG · Years: 2026 · Representative citing papers: 2
Citing papers:
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
  Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks. A toy sketch of the per-layer routing idea appears after this list.
- SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
  SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline. A toy sketch of low-precision KV caching appears after this list.
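To make the per-layer routing idea concrete, here is a minimal, hypothetical sketch of dispatching each layer to either full or sparse (sliding-window) attention based on context length. The threshold, window size, and routing rule are illustrative assumptions, not the actual Layer Router from the Flux Attention paper, and the sparse path simply masks scores rather than using an efficient sparse kernel.

```python
# Toy sketch: per-layer choice between full and sliding-window attention.
# All routing parameters here are hypothetical, not taken from Flux Attention.
import torch

def full_attention(q, k, v):
    # Standard scaled dot-product attention over the whole context.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def sliding_window_attention(q, k, v, window=64):
    # Sparse variant: each query only attends to keys within `window` positions.
    # (A real implementation would avoid materializing the full score matrix.)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    t = q.shape[-2]
    idx = torch.arange(t)
    mask = (idx[:, None] - idx[None, :]).abs() > window  # outside the local window
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def route_layer(layer_idx, context_len, full_layers=(0, 1), max_dense_len=1024):
    # Hypothetical routing rule: keep a few early layers dense, and fall back
    # to dense attention for short contexts; otherwise take the sparse path.
    if layer_idx in full_layers or context_len <= max_dense_len:
        return full_attention
    return sliding_window_attention

# Usage with random tensors (batch=1, heads=2, seq=2048, head_dim=32).
q = k = v = torch.randn(1, 2, 2048, 32)
for layer_idx in range(4):
    attn = route_layer(layer_idx, context_len=q.shape[-2])
    out = attn(q, k, v)
    print(layer_idx, attn.__name__, tuple(out.shape))
```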
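In the same spirit, this sketch simulates storing a decoding KV cache at 8-bit precision to cut memory traffic. It uses symmetric int8 quantization as a stand-in; SnapMLA's actual FP8 formats, scaling scheme, and fused hardware-aware kernels are not reproduced here, and all shapes and helper names are illustrative assumptions.

```python
# Toy sketch: low-precision KV cache for decoding, simulated with int8.
# This is an illustration of the general idea, not SnapMLA's implementation.
import torch

def quantize_int8(x):
    # Per-tensor symmetric quantization: int8 values plus one float scale.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale

# Cache keys/values for 1024 previously decoded tokens at 8 bits instead of bf16.
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
k_q, k_s = quantize_int8(k)
v_q, v_s = quantize_int8(v)

# At decode time, dequantize (or compute directly in low precision on suitable
# hardware) before the attention matmul for the newest query token.
q_tok = torch.randn(1, 64)
scores = torch.softmax(q_tok @ dequantize_int8(k_q, k_s).T / 64 ** 0.5, dim=-1)
out = scores @ dequantize_int8(v_q, v_s)
print(out.shape, float((k - dequantize_int8(k_q, k_s)).abs().max()))
```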