Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin; Bin Zhu; Jiaxi Cui; Li Yuan; Munan Ning; Peng Jin; Yang Ye

arxiv: 2311.10122 · v3 · pith:UMTFWY7Fnew · submitted 2023-11-16 · 💻 cs.CV

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin , Yang Ye , Bin Zhu , Jiaxi Cui , Munan Ning , Peng Jin , Li Yuan This is my paper

Pith reviewed 2026-05-14 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified visual representationlarge vision-language modelimage video alignmentmulti-modal LLMvideo understandingmutual enhancement

0 comments

The pith

By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies misalignment of image and video features before projection as the core obstacle preventing an LLM from learning joint multi-modal interactions. It shows that first mapping both into the same language feature space removes this barrier and allows training on a combined image-video dataset. The resulting Video-LLaVA model then exhibits mutual gains: image data helps video understanding and video data helps image understanding. This produces a simple baseline that beats prior specialized systems on nine image benchmarks and on four video datasets.

Core claim

We unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits, and outperforms Video-ChatGPT by 5.8 percent, 9.9 percent, 18.6 percent, and 10.1 percent on MSRVTT, MSVD, TGIF, and ActivityNet respectively.

What carries the argument

Alignment before projection, the step that places image and video features into a common language feature space prior to the LLM projection layers so that a single model can learn from mixed data.

If this is right

A single model trained on mixed image-video data outperforms models built specifically for images on nine image benchmarks.
The same model outperforms Video-ChatGPT by 5.8 to 18.6 percent on four standard video datasets.
Images and videos improve each other's performance when processed inside one unified representation.
A straightforward alignment step before projection is sufficient to create a working unified LVLM baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-projection alignment idea could be tested with additional modalities such as audio or depth maps.
If alignment before projection is the decisive factor, then future work could reduce emphasis on ever-more-complex projection layers.
Scaling the mixed dataset size while keeping the unified representation fixed would test whether the mutual-benefit effect grows or saturates.

Load-bearing premise

The main difficulty for an LLM with multi-modal inputs is the absence of unified tokenization for images and videos before the projection layers are applied.

What would settle it

Train a non-unified model that still uses separate image and video encoders but receives the same mixed dataset and check whether it matches or exceeds Video-LLaVA on both image and video benchmarks.

read the original abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: \href{https://github.com/PKU-YuanGroup/Video-LLaVA}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Video-LLaVA shows that aligning image and video features before the projection layer lets one model train on mixed data and pick up gains on both modalities.

read the letter

Video-LLaVA's core move is to align image and video features in a shared space before they reach the projection layer into the LLM. This lets the model train on a combined image-video dataset and report that each modality helps the other. The abstract claims this produces a simple baseline that beats Video-ChatGPT by 5.8-18.6% on four video datasets and holds its own or better across nine image benchmarks. The code release is a plus for anyone who wants to test the setup themselves. What the work does cleanly is demonstrate that separate encoders are not required once early alignment removes the token mismatch. The mutual-benefit result is presented as an empirical outcome rather than a deep theoretical claim, and the numbers line up with that framing. The soft spot is that the abstract gives little detail on how the alignment is actually done or on ablations that isolate its contribution from extra data or training tweaks. Without those tables it is hard to judge how load-bearing the pre-projection step really is. The benchmarks themselves are standard, so there is no circularity in the evaluation. This paper is aimed at groups building unified vision-language models who need a single checkpoint that handles both static images and short videos without switching architectures. It is the kind of incremental but practical baseline that deserves a serious referee to check the methods and run the numbers. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes Video-LLaVA, an LVLM that aligns image and video features into a shared language feature space prior to projection into the LLM. This unified representation enables joint training on mixed image-video datasets, yielding mutual performance gains. The model reports state-of-the-art results on 9 image benchmarks (across 5 QA datasets and 4 toolkits) and outperforms Video-ChatGPT by 5.8–18.6% on four video datasets (MSRVTT, MSVD, TGIF, ActivityNet).

Significance. If the empirical link between pre-projection alignment and the observed mutual enhancement holds, the work supplies a simple, reproducible baseline for unified LVLMs. The public code release strengthens the contribution by enabling direct verification of the mixed-training protocol and benchmark numbers.

major comments (3)

[§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
[§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
[Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.

minor comments (2)

[Abstract] The abstract states '9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits'; the exact mapping between these counts and the tables in §4.1 should be clarified for reproducibility.
[§3.1] Notation for the unified visual token space (e.g., the symbol used for the aligned feature before the LLM projector) is introduced inconsistently between §3.1 and Figure 2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major point below and will update the manuscript accordingly.

read point-by-point responses

Referee: [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.

Authors: We thank the referee for highlighting this omission. The alignment is performed with a contrastive loss between the visual features and language embeddings while jointly optimizing the encoders; the encoders are not frozen. We will revise Section 3 to include the precise loss formulation, optimization schedule, and training details so that the contribution of the alignment step can be more clearly isolated. revision: yes
Referee: [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.

Authors: We agree that a controlled ablation would strengthen the central claim. In the revised manuscript we will add an ablation study in Section 4.2 that directly compares the pre-projection unified alignment against (i) post-projection fusion and (ii) separate image/video projectors while keeping all other factors fixed. revision: yes
Referee: [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.

Authors: We acknowledge that variance statistics would be preferable. Due to the high computational cost of LVLM training we report single-run results, which is standard practice in the field. We will add a clarifying note in the revised paper stating this limitation and pointing out that the observed gains are consistent across four distinct video benchmarks and are accompanied by mutual improvements on image tasks, supporting attribution to the unified representation rather than hyper-parameter differences alone. revision: partial

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of early alignment rather than on new theoretical axioms or invented entities.

free parameters (1)

projection and training hyperparameters
Standard deep-learning hyperparameters required to train the model; not enumerated in the abstract.

axioms (1)

domain assumption Transformer-based LLMs can integrate aligned visual tokens effectively
Background assumption inherited from prior LVLM work.

pith-pipeline@v0.9.0 · 5584 in / 1101 out tokens · 57514 ms · 2026-05-14T18:00:44.719539+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
cs.CV 2025-12 conditional novelty 8.0

RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when wat...
AffectVerse: Emotional World Models for Multimodal Affective Computing
cs.CV 2026-05 unverdicted novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language
cs.NI 2026-05 unverdicted novelty 7.0

WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
Grounding Video Reasoning in Physical Signals
cs.CV 2026-04 unverdicted novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
cs.AI 2026-04 unverdicted novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
cs.CV 2026-04 unverdicted novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
cs.CV 2026-01 conditional novelty 7.0

LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
cs.CV 2026-01 unverdicted novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
cs.CV 2025-12 unverdicted novelty 7.0

StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
cs.CV 2025-11 unverdicted novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
cs.CV 2025-09 unverdicted novelty 7.0

Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
cs.CV 2026-05 unverdicted novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV 2026-05 conditional novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Dynamic Model Merging Made Slim
cs.LG 2026-05 unverdicted novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL 2026-05 unverdicted novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
cs.CV 2026-05 unverdicted novelty 6.0

WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
cs.LG 2026-04 unverdicted novelty 6.0

UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
Spatio-Temporal Grounding of Large Language Models from Perception Streams
cs.RO 2026-04 unverdicted novelty 6.0

FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
cs.AI 2026-03 unverdicted novelty 6.0

Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video
cs.CV 2026-01 unverdicted novelty 6.0

Fine-tuning VLMs on 10K QA pairs from pedagogical children's videos produces consistent gains on NExT-QA, Video-MME, and MotionBench, indicating that explicit structure can substitute for data scale.
Streaming Video Instruction Tuning
cs.CV 2025-12 unverdicted novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
cs.CV 2025-12 unverdicted novelty 6.0

OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models
cs.CV 2025-11 unverdicted novelty 6.0

GA2-CLIP uses generic attribute anchors and coupled hard-soft prompts to preserve generalization in prompt-tuned video-language models on base-to-new class tasks.
Boosting Reasoning in Large Multimodal Models via Activation Replay
cs.CV 2025-11 unverdicted novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0

Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0

Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
ImgEdit: A Unified Image Editing Dataset and Benchmark
cs.CV 2025-05 conditional novelty 6.0

ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
cs.CV 2025-05 unverdicted novelty 6.0

LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO
cs.CV 2025-03 unverdicted novelty 6.0

FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
cs.RO 2024-12 unverdicted novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
cs.CV 2024-10 unverdicted novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
cs.CV 2024-08 unverdicted novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
cs.CV 2024-07 unverdicted novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
TempCompass: Do Video LLMs Really Understand Videos?
cs.CV 2024-03 unverdicted novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
cs.CV 2024-01 conditional novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Gemini: A Family of Highly Capable Multimodal Models
cs.CL 2023-12 conditional novelty 6.0

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
cs.CV 2026-05 conditional novelty 5.0

MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
cs.CV 2026-03 unverdicted novelty 5.0

AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment
cs.CL 2025-08 unverdicted novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken ...
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
cs.CV 2025-06 unverdicted novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
cs.CV 2025-03 unverdicted novelty 5.0

TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
cs.CV 2025-01 unverdicted novelty 5.0

LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
cs.CV 2024-08 unverdicted novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 68 Pith papers · 28 internal anchors

[1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716--23736

work page 2022
[3]

Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728--1738

work page 2021
[5]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

work page 2020
[6]

David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190--200

work page 2011
[8]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023)

work page 2023
[9]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. https://arxiv.org/abs/2305.06500 Instructblip: Towards general-purpose vision-language models with instruction tuning . Preprint, arXiv:2305.06500

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180--15190

work page 2023
[14]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913

work page 2017
[15]

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617

work page 2018
[17]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000--16009

work page 2022
[19]

Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

work page 2019
[21]

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758--2766

work page 2017
[23]

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583--5594. PMLR

work page 2021
[24]

Obelisc: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2023. https://arxiv.org/abs/2306.16527 Obelics: An open web-scale filtered dataset of interleaved image-text documents . Preprint, arXiv:2306.16527

work page arXiv 2023
[27]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR

work page 2022
[28]

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694--9705

work page 2021
[35]

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521

work page 2022
[39]

OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . Preprint, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744

work page 2022
[42]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556--2565

work page 2018
[44]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326

work page 2019
[46]

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model

work page 2023
[50]

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288--5296

work page 2016
[55]

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127--9134

work page 2019
[60]

Improved Baselines with Visual Instruction Tuning

Improved Baselines with Visual Instruction Tuning , author=. arXiv preprint arXiv:2310.03744 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Visual Instruction Tuning

Visual instruction tuning , author=. arXiv preprint arXiv:2304.08485 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. arXiv preprint arXiv:2306.05424 , year=

work page internal anchor Pith review arXiv
[63]

VideoChat: Chat-Centric Video Understanding

Videochat: Chat-centric video understanding , author=. arXiv preprint arXiv:2305.06355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[64]

Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

Valley: Video Assistant with Large Language model Enhanced abilitY , author=. arXiv preprint arXiv:2306.07207 , year=

work page arXiv
[65]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[66]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023
[67]

Stanford alpaca: An instruction-following llama model , author=

work page
[68]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

See https://vicuna

Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , year=

work page 2023
[71]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[72]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[73]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Visual chatgpt: Talking, drawing and editing with visual foundation models , author=. arXiv preprint arXiv:2303.04671 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface , author=. arXiv preprint arXiv:2303.17580 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[75]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Mm-react: Prompting chatgpt for multimodal reasoning and action , author=. arXiv preprint arXiv:2303.11381 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

ViperGPT: Visual Inference via Python Execution for Reasoning

Vipergpt: Visual inference via python execution for reasoning , author=. arXiv preprint arXiv:2303.08128 , year=

work page internal anchor Pith review arXiv
[77]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

work page 2023
[78]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[79]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

mplug-owl: Modularization empowers large language models with multimodality , author=. arXiv preprint arXiv:2304.14178 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Video-llama: An instruction-tuned audio-visual language model for video understanding , author=. arXiv preprint arXiv:2306.02858 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[81]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Llama-adapter: Efficient fine-tuning of language models with zero-init attention , author=. arXiv preprint arXiv:2303.16199 , year=

work page Pith review arXiv
[82]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Llama-adapter v2: Parameter-efficient visual instruction model , author=. arXiv preprint arXiv:2304.15010 , year=

work page internal anchor Pith review arXiv
[83]

Imagebind-llm: Multi-modality instruction tun- ing

Imagebind-llm: Multi-modality instruction tuning , author=. arXiv preprint arXiv:2309.03905 , year=

work page arXiv
[84]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Imagebind: One embedding space to bind them all , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[85]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[86]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[87]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[88]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[89]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[90]

Advances in Neural Information Processing Systems , volume=

Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , volume=

work page
[91]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[92]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[93]

MMBench: Is Your Multi-modal Model an All-around Player?

MMBench: Is Your Multi-modal Model an All-around Player? , author=. arXiv preprint arXiv:2307.06281 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[94]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[95]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Msr-vtt: A large video description dataset for bridging video and language , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[96]

Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

Collecting highly parallel data for paraphrase evaluation , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

work page
[97]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Tgif-qa: Toward spatio-temporal reasoning in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[98]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[99]

PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =

Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation , author=. arXiv preprint arXiv:2004.07159 , year=

work page arXiv 2004
[100]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[101]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[102]

Advances in Neural Information Processing Systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[103]

Advances in neural information processing systems , volume=

Align before fuse: Vision and language representation learning with momentum distillation , author=. Advances in neural information processing systems , volume=

work page
[104]

International Conference on Machine Learning , pages=

Vilt: Vision-and-language transformer without convolution or region supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[105]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[106]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Otter: A multi-modal model with in-context instruction tuning , author=. arXiv preprint arXiv:2305.03726 , year=

work page internal anchor Pith review arXiv
[107]

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment , author=. arXiv preprint arXiv:2310.01852 , year=

work page internal anchor Pith review arXiv
[108]

Multimodal-gpt: A vision and language model for dialogue with humans

Multimodal-gpt: A vision and language model for dialogue with humans , author=. arXiv preprint arXiv:2305.04790 , year=

work page arXiv
[109]

Chat-univi: Unified vi- sual representation empowers large language models with image and video understanding

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding , author=. arXiv preprint arXiv:2311.08046 , year=

work page arXiv
[110]

2023 , eprint=

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. 2023 , eprint=

work page 2023
[111]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[112]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[113]

X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages.arXiv preprint arXiv:2305.04160,

X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages , author=. arXiv preprint arXiv:2305.04160 , year=

work page arXiv
[114]

Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration , author=. arXiv preprint arXiv:2306.09093 , year=

work page arXiv
[115]

Ilharco, M

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773

Showing first 80 references.

[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716--23736

work page 2022

[2] [3]

Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728--1738

work page 2021

[3] [5]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

work page 2020

[4] [6]

David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190--200

work page 2011

[5] [8]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023)

work page 2023

[6] [9]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. https://arxiv.org/abs/2305.06500 Instructblip: Towards general-purpose vision-language models with instruction tuning . Preprint, arXiv:2305.06500

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [12]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180--15190

work page 2023

[8] [14]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913

work page 2017

[9] [15]

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617

work page 2018

[10] [17]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000--16009

work page 2022

[11] [19]

Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

work page 2019

[12] [21]

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758--2766

work page 2017

[13] [23]

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583--5594. PMLR

work page 2021

[14] [24]

Obelisc: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2023. https://arxiv.org/abs/2306.16527 Obelics: An open web-scale filtered dataset of interleaved image-text documents . Preprint, arXiv:2306.16527

work page arXiv 2023

[15] [27]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR

work page 2022

[16] [28]

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694--9705

work page 2021

[17] [35]

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521

work page 2022

[18] [39]

OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . Preprint, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744

work page 2022

[20] [42]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556--2565

work page 2018

[21] [44]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326

work page 2019

[22] [46]

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model

work page 2023

[23] [50]

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288--5296

work page 2016

[24] [55]

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127--9134

work page 2019

[25] [60]

Improved Baselines with Visual Instruction Tuning

Improved Baselines with Visual Instruction Tuning , author=. arXiv preprint arXiv:2310.03744 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [61]

Visual Instruction Tuning

Visual instruction tuning , author=. arXiv preprint arXiv:2304.08485 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [62]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. arXiv preprint arXiv:2306.05424 , year=

work page internal anchor Pith review arXiv

[28] [63]

VideoChat: Chat-Centric Video Understanding

Videochat: Chat-centric video understanding , author=. arXiv preprint arXiv:2305.06355 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [64]

Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

Valley: Video Assistant with Large Language model Enhanced abilitY , author=. arXiv preprint arXiv:2306.07207 , year=

work page arXiv

[30] [65]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page

[31] [66]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023

[32] [67]

Stanford alpaca: An instruction-following llama model , author=

work page

[33] [68]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [69]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[35] [70]

See https://vicuna

Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , year=

work page 2023

[36] [71]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

work page

[37] [72]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page

[38] [73]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Visual chatgpt: Talking, drawing and editing with visual foundation models , author=. arXiv preprint arXiv:2303.04671 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [74]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface , author=. arXiv preprint arXiv:2303.17580 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [75]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Mm-react: Prompting chatgpt for multimodal reasoning and action , author=. arXiv preprint arXiv:2303.11381 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [76]

ViperGPT: Visual Inference via Python Execution for Reasoning

Vipergpt: Visual inference via python execution for reasoning , author=. arXiv preprint arXiv:2303.08128 , year=

work page internal anchor Pith review arXiv

[42] [77]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

work page 2023

[43] [78]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [79]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

mplug-owl: Modularization empowers large language models with multimodality , author=. arXiv preprint arXiv:2304.14178 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [80]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Video-llama: An instruction-tuned audio-visual language model for video understanding , author=. arXiv preprint arXiv:2306.02858 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [81]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Llama-adapter: Efficient fine-tuning of language models with zero-init attention , author=. arXiv preprint arXiv:2303.16199 , year=

work page Pith review arXiv

[47] [82]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Llama-adapter v2: Parameter-efficient visual instruction model , author=. arXiv preprint arXiv:2304.15010 , year=

work page internal anchor Pith review arXiv

[48] [83]

Imagebind-llm: Multi-modality instruction tun- ing

Imagebind-llm: Multi-modality instruction tuning , author=. arXiv preprint arXiv:2309.03905 , year=

work page arXiv

[49] [84]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Imagebind: One embedding space to bind them all , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[50] [85]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[51] [86]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022

[52] [87]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[53] [88]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[54] [89]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[55] [90]

Advances in Neural Information Processing Systems , volume=

Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , volume=

work page

[56] [91]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[57] [92]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [93]

MMBench: Is Your Multi-modal Model an All-around Player?

MMBench: Is Your Multi-modal Model an All-around Player? , author=. arXiv preprint arXiv:2307.06281 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[59] [94]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [95]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Msr-vtt: A large video description dataset for bridging video and language , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[61] [96]

Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

Collecting highly parallel data for paraphrase evaluation , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

work page

[62] [97]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Tgif-qa: Toward spatio-temporal reasoning in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[63] [98]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[64] [99]

PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =

Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation , author=. arXiv preprint arXiv:2004.07159 , year=

work page arXiv 2004

[65] [100]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[66] [101]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[67] [102]

Advances in Neural Information Processing Systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[68] [103]

Advances in neural information processing systems , volume=

Align before fuse: Vision and language representation learning with momentum distillation , author=. Advances in neural information processing systems , volume=

work page

[69] [104]

International Conference on Machine Learning , pages=

Vilt: Vision-and-language transformer without convolution or region supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[70] [105]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[71] [106]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Otter: A multi-modal model with in-context instruction tuning , author=. arXiv preprint arXiv:2305.03726 , year=

work page internal anchor Pith review arXiv

[72] [107]

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment , author=. arXiv preprint arXiv:2310.01852 , year=

work page internal anchor Pith review arXiv

[73] [108]

Multimodal-gpt: A vision and language model for dialogue with humans

Multimodal-gpt: A vision and language model for dialogue with humans , author=. arXiv preprint arXiv:2305.04790 , year=

work page arXiv

[74] [109]

Chat-univi: Unified vi- sual representation empowers large language models with image and video understanding

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding , author=. arXiv preprint arXiv:2311.08046 , year=

work page arXiv

[75] [110]

2023 , eprint=

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. 2023 , eprint=

work page 2023

[76] [111]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[77] [112]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page

[78] [113]

X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages.arXiv preprint arXiv:2305.04160,

X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages , author=. arXiv preprint arXiv:2305.04160 , year=

work page arXiv

[79] [114]

Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration , author=. arXiv preprint arXiv:2306.09093 , year=

work page arXiv

[80] [115]

Ilharco, M

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773