Gomez and Lukasz Kaiser and Illia Polosukhin , bibsource =

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N · 2017

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Instructions Shape Production of Language, not Processing

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

cs.LG · 2024-09-16 · conditional · novelty 6.0

RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

Average Attention Transformers and Arithmetic Circuits

cs.CC · 2026-05-06 · unverdicted · novelty 5.0

Average hard attention transformers simulate constant-depth arithmetic circuits using unbounded addition, binary multiplication, and sign gates when circuits are provided as input.

Gated Delta Networks: Improving Mamba2 with Delta Rule

cs.CL · 2024-12-09 · unverdicted · novelty 5.0

Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

citing papers explorer

Instructions Shape Production of Language, not Processing · cs.CL · 2026-05-11 · unverdicted · none · ref 255 · 2 links
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval · cs.LG · 2024-09-16 · conditional · none · ref 9
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations · cs.CL · 2023-05-23 · conditional · none · ref 34
Average Attention Transformers and Arithmetic Circuits · cs.CC · 2026-05-06 · unverdicted · none · ref 4
Gated Delta Networks: Improving Mamba2 with Delta Rule · cs.CL · 2024-12-09 · unverdicted · none · ref 115
