Video models are zero-shot learners and reasoners

Canonical reference. 77% of citing Pith papers cite this work as background.

43 Pith papers citing it
Background 77% of classified citations
abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

citation-role summary

background: 11 · baseline: 2

years

2026: 34 · 2025: 9

representative citing papers

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

Progressive Photorealistic Simplification

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot generalization.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Rewriting Video: Text-Driven Reauthoring of Video Footage

cs.HC · 2026-01-13 · unverdicted · novelty 7.0

A generative reconstruction algorithm turns video into editable text prompts, enabling text-driven reauthoring as shown in a creator study that identified use cases such as virtual reshooting and tensions around coherence and creative alignment.

VideoCoF: Unified Video Editing with Temporal Reasoner

cs.CV · 2025-12-08 · unverdicted · novelty 7.0

VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.

Video Models Can Reason with Verifiable Rewards

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

Do multimodal models imagine electric sheep?

cs.CV · 2026-05-10 · conditional · novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · novelty 6.0 · 2 refs

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
