OpenCLIP, July 2021
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
baseline 1
citation-polarity summary
verdicts
UNVERDICTED 2
roles
baseline 1
polarities
baseline 1
representative citing papers
JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.
citing papers explorer
- Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
- Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.