Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, November 2023

Bondarenko, Y · 2023 · arXiv 2306.12929

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01

citing papers explorer

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · none · ref 33

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · none · ref 108

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01 · unreviewed · ref 3
