BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi · 2022

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CEDAR learns an invertible rotation of vision-language embeddings to concentrate semantics into sparse, axis-aligned coordinates for improved interpretability.

The Indra Representation Hypothesis for Multimodal Alignment

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A new post-hoc alignment technique uses learnable anchors to capture token-level relative similarities between modalities, outperforming global alignment baselines on zero-shot classification, retrieval, and segmentation with scarce paired examples.

ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

cs.IR · 2026-05-11 · unverdicted · novelty 6.0

ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

cs.CV · 2026-04-15 · conditional · novelty 6.0 · 2 refs

SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

cs.CV · 2025-05-23 · unverdicted · novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.

Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

cs.CR · 2026-05-20 · unverdicted · novelty 5.0

FRA-Attack uses high-pass DCT feature alignment and frequency-domain gradient regularization to boost adversarial transferability across 15 MLLMs from 7 vendors.

citing papers explorer

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
cs.CV · 2026-05-21 · unverdicted · none · ref 2

The Indra Representation Hypothesis for Multimodal Alignment
cs.CV · 2026-04-06 · unverdicted · none · ref 42

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
cs.CV · 2025-11-27 · unverdicted · none · ref 31

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data
cs.CV · 2026-05-16 · unverdicted · none · ref 15

ReCoVR: Closing the Loop in Interactive Composed Video Retrieval
cs.IR · 2026-05-11 · unverdicted · none · ref 54

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
cs.CV · 2026-04-15 · conditional · none · ref 19 · 2 links

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
cs.CV · 2025-05-23 · unverdicted · none · ref 36

Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs
cs.CR · 2026-05-20 · unverdicted · none · ref 23
