Learning to see before seeing: Demystifying LLM visual priors from language pre-training
6 Pith papers cite this work.
citing papers explorer
- Large Video Planner Enables Generalizable Robot Control
  A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
- Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation
  OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.
- MIRROR: Aligning Semantic Relations from Language to Image via Gromov-Wasserstein
  MIRROR derives a closed-form Semi-Inverse Gromov-Wasserstein loss to align language-derived relational priors with visual representations inside decoder-only Transformers.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Learning to Draw ASCII Improves Spatial Reasoning in Language Models
  Training LLMs on text-to-ASCII spatial layout construction improves text-only spatial reasoning and transfers to external benchmarks.
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs