Visual embodied brain: Let multimodal large language models see, think, and control in spaces

· 2025 · arXiv 2506.00123

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

cs.CV · 2026-04-09 · unverdicted · novelty 8.0

3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves robustness and can exceed human levels under degradation.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs

cs.RO · 2025-08-16 · unverdicted · novelty 5.0

LLM-based autonomous semantic compression in four 2D UAV swarm simulations shows potential for efficient collaborative communication under bandwidth constraints.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

cs.CV · 2026-05-15

citing papers explorer

Showing 7 of 7 citing papers.

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding cs.CV · 2026-04-09 · unverdicted · none · ref 29
3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation cs.CV · 2026-05-21 · unverdicted · none · ref 5
SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves robustness and can exceed human levels under degradation.
Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 58
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO · 2025-11-20 · unverdicted · none · ref 35
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 25
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs cs.RO · 2025-08-16 · unverdicted · none · ref 3
LLM-based autonomous semantic compression in four 2D UAV swarm simulations shows potential for efficient collaborative communication under bandwidth constraints.
GeoWorld-VLM: Geometry from World Models for Vision-Language Models cs.CV · 2026-05-15 · unreviewed · ref 30

Visual embodied brain: Let multimodal large language models see, think, and control in spaces

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer