POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2
verdicts
UNVERDICTED 2
representative citing papers
A semantic feature optimization grounds disconnected partial 3D reconstructions to geospatially accurate reference models derived from Google Earth, improving global alignment across classical and learning-based pipelines.
citing papers explorer
- FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
- Scene Grounding In the Wild
A semantic feature optimization grounds disconnected partial 3D reconstructions to geospatially accurate reference models derived from Google Earth, improving global alignment across classical and learning-based pipelines.