Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, Steven Chu Hong Hoi · 2021

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

citing papers explorer

Showing 1 of 1 citing paper.

Vision Transformers Need More Than Registers · cs.CV · 2026-02-25 · unverdicted · none · ref 18
