Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos · 2025 · arXiv 2509.26625

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Large Video Planner Enables Generalizable Robot Control

cs.RO · 2025-12-17 · conditional · novelty 7.0

A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.

Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.

MIRROR: Aligning Semantic Relations from Language to Image via Gromov-Wasserstein

cs.CV · 2026-06-28 · unverdicted · novelty 6.0

MIRROR derives a closed-form Semi-Inverse Gromov-Wasserstein loss to align language-derived relational priors with visual representations inside decoder-only Transformers.

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

Learning to Draw ASCII Improves Spatial Reasoning in Language Models

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

Training LLMs on text-to-ASCII spatial layout construction improves text-only spatial reasoning and transfers to external benchmarks.

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18

citing papers explorer

Showing 6 of 6 citing papers.

Large Video Planner Enables Generalizable Robot Control
cs.RO · 2025-12-17 · conditional · none · ref 35
Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation
cs.CV · 2026-06-29 · unverdicted · none · ref 18 · 2 links
MIRROR: Aligning Semantic Relations from Language to Image via Gromov-Wasserstein
cs.CV · 2026-06-28 · unverdicted · none · ref 19
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV · 2026-05-13 · unverdicted · none · ref 110
Learning to Draw ASCII Improves Spatial Reasoning in Language Models
cs.AI · 2026-04-16 · unverdicted · none · ref 2
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV · 2025-11-18 · unreviewed · ref 22
