Recognition: no theorem link
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Pith reviewed 2026-05-12 02:37 UTC · model grok-4.3
The pith
A 2.29B vision-language model distilled from a 7B teacher retains 54-72% of 3D spatial reasoning performance while running 8.7 times faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that spatial reasoning from large 3D VLMs can be compressed into smaller models through knowledge distillation augmented by latent scratchpad tokens. Using a 7B teacher and a 2.29B student built on the VGGT vision encoder, the method achieves 8.7x lower latency and a 3x size reduction while preserving 54-72% of teacher performance on proximity and contact reasoning on ScanNet and 3D-FRONT. The student jointly handles spatial description, depth estimation, and object detection.
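The abstract does not state the distillation objective, so the following is background rather than the paper's method: pipelines of this kind typically build on the soft-target loss of Hinton et al. [12]. A minimal PyTorch sketch, with the temperature and all names as illustrative assumptions:

```python
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Classic soft-target distillation loss (Hinton et al. [12]).

    KL divergence between temperature-softened teacher and student output
    distributions; the T**2 factor keeps gradient magnitudes comparable
    across choices of T.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```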
What carries the argument
Hidden CoT latent tokens that act as a learnable internal scratchpad for reasoning before answer generation, allowing the model to simulate chain-of-thought internally without external chain-of-thought data.
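The abstract does not specify how the scratchpad is realized. One plausible minimal reading, sketched in PyTorch: a block of trainable embeddings concatenated to the prompt so that subsequent answer tokens can attend to them, with no chain-of-thought text ever supervised. The class name, token count, and initialization scale are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class HiddenCoTPrefix(nn.Module):
    """Learnable latent 'scratchpad' tokens inserted before answer generation.

    A guess at the mechanism from the abstract's description: the tokens are
    ordinary trainable embeddings, so the model can route intermediate
    computation through them without any chain-of-thought supervision.
    """

    def __init__(self, num_latent_tokens: int, hidden_dim: int):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latent_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim) -- vision + question embeddings.
        batch = input_embeds.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Append latents after the prompt so answer tokens can attend to them.
        return torch.cat([input_embeds, latents], dim=1)
```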
Load-bearing premise
That the performance retained on the chosen proximity and contact tasks reflects genuine spatial reasoning transfer rather than fitting to benchmark-specific features.
What would settle it
Testing the student model on entirely new 3D datasets with different scene structures and checking if spatial task accuracy falls below 50% of the teacher's level.
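The settling test reduces to a single retention ratio; a tiny sketch of the criterion (all numbers below are illustrative, not from the paper):

```python
def retention(student_acc: float, teacher_acc: float) -> float:
    """Fraction of teacher performance retained by the student."""
    return student_acc / teacher_acc

# Falsification criterion from above: on a genuinely new 3D dataset,
# retention below 0.5 would undercut the transfer claim.
print(retention(0.68, 0.94) >= 0.5)  # True for these made-up accuracies
```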
Original abstract
Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a knowledge distillation framework to compress spatial reasoning from a 7B-parameter 3D VLM teacher (LLaVA-3D) into a 2.29B student model. It uses the VGGT vision encoder, a multi-task distillation pipeline with uncertainty-aware loss weighting across spatial description, depth estimation, and object detection, and introduces learnable 'Hidden CoT' latent tokens as an internal scratchpad to enable reasoning without explicit chain-of-thought data. On ScanNet and 3D-FRONT, the student is reported to achieve 8.7x lower inference latency, 3x model-size reduction, and 54-72% retention of teacher performance (68-72% accuracy on proximity and contact tasks).
Significance. If the retention figures are shown to stem from the Hidden CoT mechanism rather than the multi-task setup alone, the work would be significant for practical deployment of 3D spatial reasoning on resource-limited platforms. The efficiency gains and the first reported use of latent scratchpad tokens in distilled 3D VLMs would constitute a useful engineering contribution, provided the empirical isolation of components is supplied.
major comments (2)
- [Abstract] The central claim that Hidden CoT latent tokens transfer spatial reasoning and account for the 54-72% performance retention is load-bearing, yet no ablation is reported that compares the full pipeline against an identical student trained without the latent tokens. The 68-72% accuracies on proximity/contact tasks could therefore be produced by the VGGT encoder and uncertainty-weighted multi-task losses alone; without this isolation, the attribution to the proposed reasoning mechanism cannot be verified.
- [Abstract] No training details, loss formulation for the latent tokens, hyperparameters, or optimization schedule are supplied. This absence prevents assessment of whether the reported 8.7x latency and 3x size reductions are reproducible or whether the uncertainty-aware weighting is defined in a manner that could be fitted to the same benchmarks.
minor comments (2)
- [Abstract] The exact baselines, number of evaluation samples, and any error bars or variance across runs for the 68-72% accuracy figures are not stated, making the retention percentages difficult to interpret in context.
- [Abstract] The VGGT encoder is referenced without a citation or a brief description of its architecture; adding either would improve accessibility for readers unfamiliar with the component.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will incorporate the suggested revisions to improve clarity and verifiability.
Point-by-point responses
- Referee: [Abstract] The central claim that Hidden CoT latent tokens transfer spatial reasoning and account for the 54-72% performance retention is load-bearing, yet no ablation is reported that compares the full pipeline against an identical student trained without the latent tokens. The 68-72% accuracies on proximity/contact tasks could therefore be produced by the VGGT encoder and uncertainty-weighted multi-task losses alone; without this isolation, the attribution to the proposed reasoning mechanism cannot be verified.
  Authors: We agree that an explicit ablation isolating the Hidden CoT latent tokens is necessary to substantiate the attribution of the reported performance retention. The current results reflect the complete framework (VGGT encoder + multi-task distillation + Hidden CoT), and we did not include a direct comparison against an otherwise identical student without the latent tokens. In the revised manuscript we will add this ablation: we will train and evaluate an ablated student that replaces the learnable Hidden CoT tokens with standard positional embeddings while retaining the VGGT encoder, uncertainty-weighted losses, and all other training settings (a hypothetical sketch of such a toggle follows these responses). We will report the resulting accuracies on the proximity and contact tasks from ScanNet and 3D-FRONT so that readers can directly assess the incremental contribution of the latent scratchpad mechanism. revision: yes
- Referee: [Abstract] No training details, loss formulation for the latent tokens, hyperparameters, or optimization schedule are supplied. This absence prevents assessment of whether the reported 8.7x latency and 3x size reductions are reproducible or whether the uncertainty-aware weighting is defined in a manner that could be fitted to the same benchmarks.
  Authors: We acknowledge that the main text omitted the necessary implementation details. The loss formulation for the Hidden CoT tokens (treated as an auxiliary task within the uncertainty-weighted multi-task objective), the full set of hyperparameters, and the optimization schedule are documented in the supplementary material. In the revised version we will add a dedicated "Implementation Details" subsection to the main text that includes the loss equations, a hyperparameter table, the optimizer and learning-rate schedule, and the precise definition of the uncertainty-aware weighting, following the standard learnable-uncertainty formulation applied to the three tasks (a sketch of that formulation also follows below). This will make the 8.7x latency and performance numbers fully reproducible from the manuscript alone. revision: yes
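To make the ablation promised in the first response concrete, a hypothetical sketch of the toggle: the same scratchpad slots hold either trainable Hidden CoT latents or frozen sinusoidal positional embeddings of identical shape, so capacity and sequence length stay matched. The function and all names are ours, not the authors'.

```python
import math
import torch
import torch.nn as nn

def scratchpad_tokens(n: int, d: int, use_hidden_cot: bool) -> nn.Parameter:
    """Return n scratchpad token embeddings of width d (d assumed even).

    use_hidden_cot=True : trainable latents (the full model).
    use_hidden_cot=False: frozen sinusoidal positional embeddings of the
    same shape (the ablated student).
    """
    if use_hidden_cot:
        return nn.Parameter(torch.randn(n, d) * 0.02)  # learnable
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return nn.Parameter(pe, requires_grad=False)  # frozen
```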
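The "standard learnable-uncertainty formulation" named in the second response is presumably that of Kendall et al. [29]; a minimal sketch under that assumption, using the common log-variance parameterization (exact per-task regression/classification variants may differ):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learnable-uncertainty weighting over tasks (Kendall et al. [29]).

    total = sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2) is a
    learned per-task log-variance; noisier tasks are down-weighted while
    the +s_i term keeps the variances from growing without bound.
    """

    def __init__(self, num_tasks: int = 3):
        # Three tasks here: spatial description, depth estimation, detection.
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for loss, s in zip(task_losses, self.log_vars):
            total = total + torch.exp(-s) * loss + s
        return total
```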
Circularity Check
No circularity; claims rest on external benchmarks and architectural comparisons
Full rationale
The paper describes a distillation pipeline transferring capabilities from a 7B teacher to a 2.29B student via VGGT encoder and multi-task losses, with performance measured on independent external datasets (ScanNet, 3D-FRONT). No equations, fitted parameters, or self-referential definitions appear that would make the reported 54-72% retention or latency reductions equivalent to quantities defined inside the same experiment. The Hidden CoT mechanism is introduced descriptively without a loss formulation or derivation that reduces to the input data by construction. All quantitative claims are therefore falsifiable against held-out benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- uncertainty-aware loss weights
axioms (1)
- domain assumption: the VGGT encoder provides suitable 3D visual features for the distillation target
invented entities (1)
- Hidden CoT latent tokens (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
- [2] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., "Flamingo: a visual language model for few-shot learning," Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022.
- [3] X. Chen, J. Liu, Y. Wang, P. Wang, M. Brand, G. Wang, and T. Koike-Akino, "SuperLoRA: Parameter-efficient unified adaptation for large vision models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8050–8055, 2024.
- [4] C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, "LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness," arXiv preprint arXiv:2409.18125, 2024.
- [5] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, et al., "PaLM-E: An embodied multimodal language model," 2023.
- [6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., "RT-1: Robotics Transformer for real-world control at scale," arXiv preprint arXiv:2212.06817, 2022.
- [7] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, 2019.
- [8] X. Chen, R. Zhen, S. Li, X. Li, and G. Wang, "MoFA: A model simplification roadmap for image restoration on mobile devices," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1322–1332, 2023.
- [9] Z. Hao, J. Guo, D. Jia, K. Han, Y. Tang, C. Zhang, H. Hu, and Y. Wang, "Learning efficient vision transformers via fine-grained manifold distillation," Advances in Neural Information Processing Systems, vol. 35, pp. 9164–9175, 2022.
- [10] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
- [11] G. Habib, T. J. Saleem, and B. Lall, "Knowledge distillation in vision transformers: A critical review," arXiv preprint arXiv:2302.02108, 2023.
- [12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
- [13] J. Gou, B. Yu, S. J. Maybank, and D. Tao, "Knowledge distillation: A survey," International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
- [14] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, "VGGT: Visual geometry grounded transformer," in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306, 2025.
- [15] K. Li, T. Zhang, K.-C. Peng, and G. Wang, "PF3Det: A prompted foundation feature assisted visual LiDAR 3D detector," in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 3778–3787, 2025.
- [16] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023.
- [17] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in International Conference on Machine Learning, pp. 12888–12900, PMLR, 2022.
- [18] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "VQA: Visual question answering," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.
- [19] D. A. Hudson and C. D. Manning, "GQA: A new dataset for real-world visual reasoning and compositional question answering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6700–6709, 2019.
- [20] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, "3D-LLM: Injecting the 3D world into large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023.
- [21] Y. Zhao, L. Zhao, X. Zhou, J. Wu, C.-T. Chu, H. Miao, F. Schroff, H. Adam, T. Liu, B. Gong, et al., "Distilling vision-language models on millions of videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13106–13116, 2024.
- [22] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [23] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022.
- [24] D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, "ScanQA: 3D question answering for spatial scene understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19129–19139, 2022.
- [25] J. Luo, J. Fu, X. Kong, C. Gao, H. Ren, H. Shen, H. Xia, and S. Liu, "3D-SPS: Single-stage 3D visual grounding via referred point progressive selection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16454–16463, 2022.
- [26] T. Wang, X. Mao, C. Zhu, R. Xu, R. Lyu, P. Li, X. Chen, W. Zhang, K. Chen, T. Xue, et al., "EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19757–19767, 2024.
- [27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [28] E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman, "Quiet-STaR: Language models can teach themselves to think before speaking," arXiv preprint arXiv:2403.09629, 2024.
- [29] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491, 2018.
- [30] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839, 2017.
- [31] Z. Huang, Y.-C. Guo, X. An, Y. Yang, Y. Li, Z.-X. Zou, D. Liang, X. Liu, Y.-P. Cao, and L. Sheng, "MIDI: Multi-instance diffusion for single image to 3D scene generation," in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23646–23657, 2025.
- [32] P. Xu, S. Wang, Y. Zhu, J. Li, and Y. Zhang, "SpatialBench: Benchmarking multimodal large language models for spatial cognition," arXiv preprint arXiv:2511.21471, 2025.
- [33] W. Ma, H. Chen, G. Zhang, Y.-C. Chou, J. Chen, C. de Melo, and A. Yuille, "3DSRBench: A comprehensive 3D spatial reasoning benchmark," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6924–6934, 2025.
- [34] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al., "MobileVLM: A fast, strong and open vision language assistant for mobile devices," arXiv preprint arXiv:2312.16886, 2023.
- [35] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al., "PaliGemma: A versatile 3B VLM for transfer," arXiv preprint arXiv:2407.07726, 2024.