3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.
hub
Visual embodied brain: Let multimodal large language models see, think, and control in spaces
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
Introduces PinCoT paradigm with visual reasoning anchors, builds PIN-170K dataset via automated pipeline, and trains 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.
RoboProcessBench is a new benchmark decomposing process-aware understanding into static monitoring and dynamic reasoning across 12 question families, with evaluations showing VLM limitations but post-training gains on the provided data.
GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
SPARC generates reliable spatial annotations for robot demonstrations by leveraging spatio-temporal task structure, outperforming detection baselines on localization accuracy while retaining more samples and enabling competitive model performance without manual annotations.
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.
LLM-based autonomous semantic compression in four 2D UAV swarm simulations shows potential for efficient collaborative communication under bandwidth constraints.
citing papers explorer
-
RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought
Introduces PinCoT paradigm with visual reasoning anchors, builds PIN-170K dataset via automated pipeline, and trains 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.