Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. · 2022

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Thinking with Drafting: Optical Decompression via Logical Reconstruction

cs.CL · 2026-02-12 · unverdicted · novelty 6.0

Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

cs.CV · 2025-09-09 · unverdicted · novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

cs.CV · 2026-04-13 · unverdicted · novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.

Let ViT Speak: Generative Language-Image Pre-training

cs.CV · 2026-05-01

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

cs.CV · 2026-02-02

citing papers explorer

Showing 7 of 7 citing papers.

Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 2 · 2 links
Thinking with Drafting: Optical Decompression via Logical Reconstruction cs.CL · 2026-02-12 · unverdicted · none · ref 2
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 2
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 2
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV · 2026-04-13 · unverdicted · none · ref 3
Let ViT Speak: Generative Language-Image Pre-training cs.CV · 2026-05-01 · unreviewed · ref 3
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models cs.CV · 2026-02-02 · unreviewed · ref 1
