Title resolution pending

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y · 2023 · arXiv 2302.10866

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

cs.LG · 2025-12-14 · unverdicted · novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

cs.LG · 2024-02-29 · unverdicted · novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · conditional · novelty 6.0 · 2 refs

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

Positional Encoding via Token-Aware Phase Attention

cs.CL · 2025-09-16 · unverdicted · novelty 6.0

TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

cs.CL · 2024-01-31 · unverdicted · novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Retentive Network: A Successor to Transformer for Large Language Models

cs.CL · 2023-07-17 · unverdicted · novelty 6.0

RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

Parallel Recursive LSTM

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

PR-LSTM replaces linear recurrence with recursive gated merging over a balanced binary tree to achieve log-depth parallelism without restricting transitions to linear or associative forms.

mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

cs.LG · 2026-05-04 · accept · novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 5.0

Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

citing papers explorer

Showing 15 of 15 citing papers.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 41
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics cs.LG · 2025-12-14 · unverdicted · none · ref 20
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 39
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models cs.LG · 2024-02-29 · unverdicted · none · ref 24
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 230
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL · 2026-05-09 · conditional · none · ref 60 · 2 links
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models cs.LG · 2026-05-07 · unverdicted · none · ref 25 · 2 links
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Positional Encoding via Token-Aware Phase Attention cs.CL · 2025-09-16 · unverdicted · none · ref 12
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval cs.CL · 2024-01-31 · unverdicted · none · ref 157
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 136
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 15
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
Parallel Recursive LSTM cs.LG · 2026-05-16 · unverdicted · none · ref 9
PR-LSTM replaces linear recurrence with recursive gated merging over a balanced binary tree to achieve log-depth parallelism without restricting transitions to linear or associative forms.
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters cs.LG · 2026-05-08 · unverdicted · none · ref 22
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 24
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 50
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer