NEO-ov is a native one-vision model that learns cross-frame and pixel-word correspondence end-to-end and narrows the gap to modular VLMs on multi-image, video, and spatial tasks.
Langbridge: Interpreting image as a combination of language embeddings
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
other 1
citation-polarity summary
fields
cs.CV 3verdicts
UNVERDICTED 3roles
other 1polarities
unclear 1representative citing papers
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
citing papers explorer
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.