Transformers are Multi-State RNNs

Oren, M · 2024 · arXiv 2401.06104

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Long Context Pre-Training with Lighthouse Attention

cs.CL · 2026-05-07 · conditional · novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

Transactional Attention: Semantic Sponsorship for KV-Cache Retention

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

cs.CL · 2025-10-10 · conditional · novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

cs.CL · 2026-05-09 · conditional · novelty 6.0

ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61x at 128k context.

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

cs.AR · 2026-04-20 · unverdicted · novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

MoBA: Mixture of Block Attention for Long-Context LLMs

cs.LG · 2025-02-18 · unverdicted · novelty 6.0

MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

cs.CL · 2024-06-16 · unverdicted · novelty 6.0

Quest speeds up long-context LLM self-attention by up to 2.23x via query-dependent selection of top-K critical KV cache pages, cutting overall latency by 7.03x with negligible accuracy loss.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

The Pitfalls of KV Cache Compression

cs.LG · 2025-09-30 · conditional · novelty 5.0

KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple policy changes can mitigate this.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · 2 refs

citing papers explorer

Showing 10 of 10 citing papers.

Long Context Pre-Training with Lighthouse Attention · cs.CL · 2026-05-07 · conditional · none · ref 26
Transactional Attention: Semantic Sponsorship for KV-Cache Retention · cs.CL · 2026-04-13 · unverdicted · none · ref 18
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning · cs.CL · 2025-10-10 · conditional · none · ref 17
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing · cs.CL · 2026-05-09 · conditional · none · ref 17
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization · cs.AR · 2026-04-20 · unverdicted · none · ref 57
MoBA: Mixture of Block Attention for Long-Context LLMs · cs.LG · 2025-02-18 · unverdicted · none · ref 28
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · cs.CL · 2024-06-16 · unverdicted · none · ref 17
Gated Linear Attention Transformers with Hardware-Efficient Training · cs.LG · 2023-12-11 · unverdicted · none · ref 61
The Pitfalls of KV Cache Compression · cs.LG · 2025-09-30 · conditional · none · ref 9
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference · cs.LG · 2026-02-20 · unreviewed · ref 24 · 2 links
