Towards visuospatial cognition via hierarchical fusion of visual experts

Qi Feng · 2025 · arXiv 2505.12363

3 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1 · dataset 1

citation-polarity summary

background 1 · use dataset 1

representative citing papers

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

Rethinking VLM Representation for VLA Initialization

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.

citing papers explorer

Showing 3 of 3 citing papers.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · none · ref 16

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

cs.CV · 2026-05-10 · unverdicted · none · ref 15

Rethinking VLM Representation for VLA Initialization

cs.CV · 2026-05-25 · unverdicted · none · ref 15
