Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
hub Canonical reference
Ring Attention with Blockwise Transformers for Near-Infinite Context
Canonical reference. 94% of citing Pith papers cite this work as background.
abstract
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
FedRAG uses a Scrambled Distributed Attention protocol with feature scrambling and token permutation to enable high-throughput, privacy-preserving federated RAG without special hardware or retraining.
A supervised fine-tuning approach using inverted multi-resolution planning scaffolds from public-domain novels trains models to generate book-length stories with more human-like literary qualities than standard instruction-tuned LLMs.
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
ChipLight is a multi-objective optimization framework that co-designs chiplet hardware, training parallelism, and optical networks to improve efficiency in distributed LLM training clusters.
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.
CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.