pith. sign in

hub Mixed citations

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Mixed citation behavior. Most common role is background (64%).

35 Pith papers citing it
Background 64% of classified citations
abstract

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the \textbf{Otter} model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the \textbf{MIMIC-IT} (\textbf{M}ult\textbf{I}-\textbf{M}odal \textbf{I}n-\textbf{C}ontext \textbf{I}nstruction \textbf{T}uning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

hub tools

citation-role summary

background 10 baseline 3 method 1

citation-polarity summary

representative citing papers

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

citing papers explorer

Showing 35 of 35 citing papers.