BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background (1)

citation-polarity summary
(still indexing)
citing papers explorer
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
  The WE-MATH benchmark reveals that most LMMs rely on rote memorization for visual math, while GPT-4o has shifted toward knowledge generalization.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, covering foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.