Title resolution pending

Katharopoulos, Angelos, Vyas, Apoorv, Pappas, Nikolaos, Fleuret, Fran · 2020 · arXiv 2006.16236

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.

Rethinking Attention with Performers

cs.LG · 2020-09-30 · unverdicted · novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

RT-Transformer: The Transformer Block as a Spherical State Estimator

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Deformable DETR: Deformable Transformers for End-to-End Object Detection

cs.CV · 2020-10-08 · accept · novelty 6.0

Deformable DETR achieves higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs by using sparse deformable attention on image features.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

cs.LG · 2026-05-04 · accept · novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 5.0

Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

citing papers explorer

Showing 11 of 11 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 62 · 2 links
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction cs.LG · 2026-05-13 · unverdicted · none · ref 17
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
Rethinking Attention with Performers cs.LG · 2020-09-30 · unverdicted · none · ref 132
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
RT-Transformer: The Transformer Block as a Spherical State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 17
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL · 2026-05-09 · unverdicted · none · ref 58
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models cs.LG · 2026-05-07 · unverdicted · none · ref 17 · 2 links
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 14
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 34
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Deformable DETR: Deformable Transformers for End-to-End Object Detection cs.CV · 2020-10-08 · accept · none · ref 7
Deformable DETR achieves higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs by using sparse deformable attention on image features.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 16
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 49
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer