Video models are zero-shot learners and reasoners

2025 · cs.LG · arXiv 2509.20328

Canonical reference. 77% of citing Pith papers cite this work as background.

56 Pith papers citing it

abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

citation-role summary

background: 11 · baseline: 2

citation-polarity summary

background: 10 · baseline: 2 · support: 1

representative citing papers

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.

OmniTryOn: Video Try-On Anything at Once!

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

EduVideoBench is a new KSA-grounded benchmark that evaluates five frontier video generation models and finds substantial gaps in educational validity across knowledge, skills, and attitudes.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.

Progressive Photorealistic Simplification

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot generalization.

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

cs.RO · 2026-03-18 · conditional · novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

cs.CV · 2026-02-04 · unverdicted · novelty 7.0

PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.

Rewriting Video: Text-Driven Reauthoring of Video Footage

cs.HC · 2026-01-13 · unverdicted · novelty 7.0

A generative reconstruction algorithm turns video into editable text prompts, enabling text-driven reauthoring as shown in a creator study that identified use cases such as virtual reshooting and tensions around coherence and creative alignment.

VideoCoF: Unified Video Editing with Temporal Reasoner

cs.CV · 2025-12-08 · unverdicted · novelty 7.0

VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.

Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

cs.CV · 2025-11-21 · unverdicted · novelty 7.0

Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.

A Good Talk Does not Look Like a Summary, It Teaches You! Measuring Takeaways from Paper-to-Video Talks

cs.MM · 2026-06-26 · unverdicted · novelty 6.0

EffectivePresentationScorer evaluates paper-to-video talks for instructional quality by checking clear explanation of ideas, prerequisite concepts, and links to contributions, finding that current systems cover topics but fail to teach.

PointAction: 3D Points as Universal Action Representations for Robot Control

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

PointAction uses predicted dynamic 3D pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.

Cosmos 3: Omnimodal World Models for Physical AI

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.

AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

cs.GR · 2026-05-31 · unverdicted · novelty 6.0

AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.

citing papers explorer

Showing 47 of 47 citing papers after filters.

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks cs.CV · 2026-06-09 · unverdicted · none · ref 99 · internal anchor
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
OmniTryOn: Video Try-On Anything at Once! cs.CV · 2026-06-07 · unverdicted · none · ref 49 · internal anchor
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization cs.CV · 2026-06-01 · unverdicted · none · ref 42 · 2 links · internal anchor
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation cs.CL · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
EduVideoBench is a new KSA-grounded benchmark that evaluates five frontier video generation models and finds substantial gaps in educational validity across knowledge, skills, and attitudes.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 93 · internal anchor
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis cs.CV · 2026-05-21 · unverdicted · none · ref 78 · internal anchor
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 29 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration cs.CV · 2026-05-17 · unverdicted · none · ref 43 · internal anchor
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
Progressive Photorealistic Simplification cs.CV · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 30 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models cs.CV · 2026-05-09 · unverdicted · none · ref 35 · internal anchor
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
Grokking of Diffusion Models: Case Study on Modular Addition cs.LG · 2026-04-20 · unverdicted · none · ref 30 · internal anchor
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes cs.CV · 2026-04-19 · unverdicted · none · ref 48 · 2 links · internal anchor
ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot generalization.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 48 · internal anchor
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 100 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation cs.CV · 2026-02-04 · unverdicted · none · ref 45 · internal anchor
PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.
Rewriting Video: Text-Driven Reauthoring of Video Footage cs.HC · 2026-01-13 · unverdicted · none · ref 24 · internal anchor
A generative reconstruction algorithm turns video into editable text prompts, enabling text-driven reauthoring as shown in a creator study that identified use cases such as virtual reshooting and tensions around coherence and creative alignment.
VideoCoF: Unified Video Editing with Temporal Reasoner cs.CV · 2025-12-08 · unverdicted · none · ref 36 · internal anchor
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets? cs.CV · 2025-11-21 · unverdicted · none · ref 35 · internal anchor
Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.
A Good Talk Does not Look Like a Summary, It Teaches You! Measuring Takeaways from Paper-to-Video Talks cs.MM · 2026-06-26 · unverdicted · none · ref 91 · internal anchor
EffectivePresentationScorer evaluates paper-to-video talks for instructional quality by checking clear explanation of ideas, prerequisite concepts, and links to contributions, finding that current systems cover topics but fail to teach.
PointAction: 3D Points as Universal Action Representations for Robot Control cs.RO · 2026-06-02 · unverdicted · none · ref 59 · internal anchor
PointAction uses predicted dynamic 3D pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.
Cosmos 3: Omnimodal World Models for Physical AI cs.CV · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.
AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance cs.GR · 2026-05-31 · unverdicted · none · ref 46 · internal anchor
AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement cs.CV · 2026-05-29 · unverdicted · none · ref 110 · internal anchor
StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.
Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players cs.CV · 2026-05-27 · unverdicted · none · ref 55 · internal anchor
A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 74 · internal anchor
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Video Models Can Reason with Verifiable Rewards cs.CV · 2026-05-14 · unverdicted · none · ref 40 · internal anchor
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors cs.CV · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency cs.CV · 2026-05-07 · unverdicted · none · ref 30 · 4 links · internal anchor
Introduces Eulerian motion guidance with bidirectional geometric consistency to improve training speed and temporal quality in diffusion-based image animation.
Open-Source Image Editing Models Are Zero-Shot Vision Learners cs.CV · 2026-05-06 · unverdicted · none · ref 27 · internal anchor
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
How Far Are Video Models from True Multimodal Reasoning? cs.CV · 2026-04-21 · unverdicted · none · ref 76 · internal anchor
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 50 · internal anchor
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning cs.CV · 2026-04-15 · unverdicted · none · ref 36 · internal anchor
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and consistency regularization.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 70 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas cs.CV · 2026-03-30 · unverdicted · none · ref 59 · internal anchor
Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World cs.CV · 2025-12-29 · unverdicted · none · ref 72 · internal anchor
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
Kling-Omni Technical Report cs.CV · 2025-12-18 · unverdicted · none · ref 33 · internal anchor
Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs cs.RO · 2025-12-17 · unverdicted · none · ref 57 · internal anchor
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? cs.CV · 2025-10-11 · unverdicted · none · ref 36 · internal anchor
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baselines on patient data from 122 cases.
Bridging Video Understanding and Generation in a Unified Framework cs.CV · 2026-06-30 · unverdicted · none · ref 65 · internal anchor
Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.
Data-Driven Automation econ.TH · 2026-06-08 · unverdicted · none · ref 52 · internal anchor
Dynamic model of data-driven automation with heterogeneous accumulating data and spillovers derives conditions for partial versus full automation, shows asymptotic power-law decay in labor share, generic inefficiency, and with endogenous capital, explosive growth but stagnant long-run wages.
OptiWorld: Optimal Control for Video World Generation under Physical Constraints cs.CV · 2026-05-30 · unverdicted · none · ref 19 · internal anchor
OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
PhyWorld: Physics-Faithful World Model for Video Generation cs.CV · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.
Neural Computers cs.LG · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 39 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
Motif-Video 2B: Technical Report cs.CV · 2026-04-14 · unverdicted · none · ref 43 · 2 links · internal anchor
Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV · 2026-04-13 · unverdicted · none · ref 185 · internal anchor
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
