Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.CV 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation: they rely on irrelevant background patches for global semantics. Selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

citing papers explorer

Showing 1 of 1 citing paper.

  • Vision Transformers Need More Than Registers · cs.CV · 2026-02-25 · unverdicted · none · ref 18
