hub

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi · 2023

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

browse 13 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

Learning to Track Instance from Single Nature Language Description

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Tracker is a self-supervised VL tracker that uses a Dynamic Token Aggregation Module to learn instance tracking from single language descriptions in unlabeled videos and outperforms prior self-supervised methods.

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

PLUME: Latent Reasoning Based Universal Multimodal Embedding

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.

Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigation and reasoning tasks.

CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

cs.AI · 2026-03-02 · unverdicted · novelty 6.0

Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

cs.CV · 2025-05-27 · unverdicted · novelty 6.0

MedBridge adapts pretrained VLMs to multi-label medical diagnosis via query tokens for non-destructive alignment and expert routing, reporting 6-15% AUC gains on chest radiograph benchmarks across eight models.

A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction

cs.NI · 2026-05-18

citing papers explorer

Showing 13 of 13 citing papers.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors cs.CV · 2025-12-09 · unverdicted · none · ref 25
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
Learning to Track Instance from Single Nature Language Description cs.CV · 2026-05-08 · unverdicted · none · ref 19
Tracker is a self-supervised VL tracker that uses a Dynamic Token Aggregation Module to learn instance tracking from single language descriptions in unlabeled videos and outperforms prior self-supervised methods.
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction cs.CV · 2026-04-09 · unverdicted · none · ref 38
PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
PLUME: Latent Reasoning Based Universal Multimodal Embedding cs.CV · 2026-04-02 · unverdicted · none · ref 27
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models cs.CV · 2025-12-01 · unverdicted · none · ref 31
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models cs.CV · 2026-05-18 · unverdicted · none · ref 17
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning cs.CV · 2026-05-03 · unverdicted · none · ref 18
GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigation and reasoning tasks.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging cs.CV · 2026-04-24 · unverdicted · none · ref 19
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval cs.CV · 2026-04-16 · unverdicted · none · ref 18
G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CV · 2026-03-28 · unverdicted · none · ref 26
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI · 2026-03-02 · unverdicted · none · ref 28
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging cs.CV · 2025-05-27 · unverdicted · none · ref 15
MedBridge adapts pretrained VLMs to multi-label medical diagnosis via query tokens for non-destructive alignment and expert routing, reporting 6-15% AUC gains on chest radiograph benchmarks across eight models.
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction cs.NI · 2026-05-18 · unreviewed · ref 19

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer