The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
hub Canonical reference
Video models are zero-shot learners and reasoners
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
EduVideoBench is a new KSA-grounded benchmark that evaluates five frontier video generation models and finds substantial gaps in educational validity across knowledge, skills, and attitudes.
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot generalization.
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.
A generative reconstruction algorithm turns video into editable text prompts, enabling text-driven reauthoring as shown in a creator study that identified use cases such as virtual reshooting and tensions around coherence and creative alignment.
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.
EffectivePresentationScorer evaluates paper-to-video talks for instructional quality by checking clear explanation of ideas, prerequisite concepts, and links to contributions, finding that current systems cover topics but fail to teach.
PointAction uses predicted dynamic 3D pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.
Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.
AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.
citing papers explorer
- Image Generators are Generalist Vision Learners
- Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
- Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
- WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling