Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024

Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang · 2024 · arXiv 2410.09575

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

VGR: Visual Grounded Reasoning

cs.CV · 2025-06-13 · unverdicted · novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

cs.CV · 2025-10-14 · unverdicted · novelty 6.0

DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

citing papers explorer

Showing 6 of 6 citing papers.

VGR: Visual Grounded Reasoning cs.CV · 2025-06-13 · unverdicted · none · ref 44
VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 61
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models cs.CV · 2026-05-18 · unverdicted · none · ref 40
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 68
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
Latent Denoising Improves Visual Alignment in Large Multimodal Models cs.CV · 2026-04-23 · unverdicted · none · ref 86
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving cs.CV · 2025-10-14 · unverdicted · none · ref 29
DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer