PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Andrea Madotto; Andrew Westbury; Babak Damavandi; Christoph Feichtenhofer; Daniel Bolya; Effrosyni Mavroudi; Hanoona Rasheed; Huiyu Wang; Jang Hyun Cho; Kristen Grauman

arxiv: 2504.13180 · v3 · pith:AILP7NVMnew · submitted 2025-04-17 · 💻 cs.CV · cs.AI · cs.LG

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer

classification 💻 cs.CV cs.AI cs.LG

keywords data, models, video, training, understanding, research, detailed, distillation

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models

This paper has not been read by Pith yet.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 conditional novelty 8.0

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV 2026-01 unverdicted novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models
cs.CV 2026-07 unverdicted novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity
cs.CV 2026-06 unverdicted novelty 7.0

SSMNBench shows that MLLMs suffer distraction degradation on single-view-sufficient tasks and fail to integrate geometric evidence across views, instead relying on semantic averaging and view preference.
AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
cs.CV 2026-06 unverdicted novelty 7.0

AMALIA-VL introduces the first open-source instruction-tuned LVLM natively optimized for European Portuguese via vision-language alignment, instruction tuning, preference optimization, and a pt-PT-centric data mix.
Don't Pause! Every prediction matters in a streaming video
cs.CV 2026-04 unverdicted novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
InstrAct: Towards Action-Centric Understanding in Instructional Videos
cs.CV 2026-04 unverdicted novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...
SAM 3: Segment Anything with Concepts
cs.CV 2025-11 unverdicted novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context
cs.CV 2026-06 unverdicted novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inferenc...
DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

DataComp-VLM benchmark shows instruction-heavy data mixtures outperform caption-heavy ones for VLM training, with DCVLM-Baseline reaching 63.6% on 33 tasks using 200B tokens, +5.4pp over FineVision.
Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs
cs.CV 2026-06 unverdicted novelty 6.0

Egocentric Scene Graphs convert long videos into short structured text so MLLMs can answer questions about entire sequences, achieving SOTA on HD-EPIC VQA.
Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs
cs.CV 2026-06 unverdicted novelty 6.0

Introduces temporally grounded EgoSGs to convert long egocentric videos into compact symbolic text for MLLM-based VQA, claiming SOTA results on HD-EPIC without subsampling.
DiffusionBench: On Holistic Evaluation of Diffusion Transformers
cs.CV 2026-06 conditional novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.
AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
cs.CV 2026-06 unverdicted novelty 6.0

AMALIA-VL is the first open-source LVLM natively optimized for European Portuguese via three-stage training on a pt-PT-centric data mix combining curated, translated, and novel datasets.
AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
cs.CV 2026-06 unverdicted novelty 6.0

Introduces AMALIA-VL, the first open-source instruction-tuned LVLM for European Portuguese, using a high-resolution vision encoder, pt-PT language model, learned connector, and three-stage training on a custom data mix.
AdaCodec: A Predictive Visual Code for Video MLLMs
cs.CV 2026-06 unverdicted novelty 6.0

AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchm...
Zamba2-VL Technical Report
cs.CV 2026-05 unverdicted novelty 6.0

Zamba2-VL is a family of 1.2B–7B hybrid Mamba2-transformer vision-language models that match leading transformer VLMs on image, reasoning, OCR, grounding and counting benchmarks while delivering roughly 10x lower time...
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
cs.CV 2026-05 unverdicted novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
Building a Precise Video Language with Human-AI Oversight
cs.CV 2026-04 unverdicted novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
cs.CV 2026-04 unverdicted novelty 6.0

Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
ZAYA1-VL-8B Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...