pith. sign in

arxiv: 2605.31460 · v1 · pith:EUBXDN46new · submitted 2026-05-29 · 💻 cs.RO · cs.SY· eess.SY

On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making

Pith reviewed 2026-06-28 22:23 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY
keywords robotic planninginference optimizationtemporal redundancyon-device deploymentvision-language modelsALFRED benchmarkscene gating
0
0 comments X

The pith

Robotic reasoning can skip most inferences when consecutive observations repeat the same actions and subgoals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large-model robotic planners waste compute on repeated decisions across time steps. It introduces REIS to detect and bypass this redundancy with lightweight checks before full reasoning. The method keeps task success rates competitive on ALFRED and real robots while cutting inference load. A sympathetic reader would see this as a practical route to real-time on-device control without retraining the underlying models.

Core claim

Robotic reasoning workloads contain substantial temporal redundancy where consecutive observations frequently produce identical actions and subgoals; REIS exploits this by combining lightweight scene gating, KV-steered affordance routing, and deliberative reasoning to minimize unnecessary inference while preserving semantic adaptability and competitive performance on ALFRED and real-world tasks.

What carries the argument

REIS framework, which uses lightweight scene gating to detect redundancy, KV-steered affordance routing to select routes, and deliberative reasoning only when needed.

If this is right

  • On-device robotic policies become feasible with lower latency under embodied constraints.
  • Task completion rates on ALFRED and physical robots stay close to full-reasoning baselines.
  • The same redundancy pattern can be exploited in other sequential embodied tasks without changing the base model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If temporal redundancy holds across a wider range of vision-language models, the approach could generalize beyond the tested planners.
  • Real-world deployment would still require verifying that the gating mechanism itself stays cheap on target hardware.
  • The same observation of repeated decisions might apply to non-robotic sequential planning problems such as game agents or autonomous vehicles.

Load-bearing premise

Robotic tasks produce enough repeated actions and subgoals across consecutive observations that skipping them does not hurt overall performance.

What would settle it

Measure the fraction of consecutive time steps that produce identical actions and subgoals in a new robotic environment or model; if the fraction is near zero, the claimed overhead reduction disappears.

Figures

Figures reproduced from arXiv: 2605.31460 by Hyunmi Kim, Hyunseung Shin, JeongGil Ko, Joonhee Lee, Pei Zhang.

Figure 1
Figure 1. Figure 1: Overview of REIS, illustrating the interaction between System One and System Two. Detailed illustration of System One and Two on the bottom row substantial structural scene changes are detected. Rather than using all visual features, EMA-HSVS selects transformer heads sensitive to geometric variation through a greedy calibration process on datasets such as ALFRED and LIBERO. For incoming frames, EMA-HSVS f… view at source ↗
Figure 2
Figure 2. Figure 2: Example of Datasets used for Optimization and Evaluation [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A Sample of Navigation Task (DNF: Did Not Finish) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Confusion Matrix of Vision Similarity Measure Methods [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pipeline of Head Selection (Figure 3 from Main Text) [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visual pipelines for (Left) Contrastive Pair Generation (Figure 4) and (Right) Calculation [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Positive and Negative Prompt for KV Steering Pair Generation [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Sweep Result of EMA-HSVS Greedy Search [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Hallway with Bright Color and Complex Objects on The Wall [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Hallway with Monotone Color and Dynamic Pathway Appearing [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of Other Transformer Based Vision Encoder Performance [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
read the original abstract

Reasoning-based robotic policies using large language and vision-language models achieve strong semantic planning capabilities but mostly suffer from a high inference latency that limits practical real-time deployment. In this work, we observe that robotic reasoning workloads contain substantial temporal redundancy, where consecutive observations frequently produce identical actions and subgoals. Based on this insight, we present REIS, a human cognition inspired robotic decision-making framework that minimizes unnecessary reasoning while preserving semantic adaptability. REIS combines lightweight scene gating, KV-steered affordance routing, and deliberative reasoning to accelerate robotic control under embodied constraints. Experiments on ALFRED, and real-world robotic tasks demonstrate that REIS significantly suppresses reasoning overhead while maintaining competitive task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that robotic reasoning workloads exhibit substantial temporal redundancy (consecutive observations frequently yielding identical actions and subgoals), and introduces the REIS framework—combining lightweight scene gating, KV-steered affordance routing, and deliberative reasoning—to suppress inference overhead while preserving semantic adaptability and competitive task performance on ALFRED and real-world robotic tasks.

Significance. If the redundancy observation holds with independent quantification and the components deliver the claimed latency reductions without hidden costs to adaptability, the work would address a key barrier to real-time on-device deployment of VLM/LLM-based robotic planners, offering a practical path to lower-latency embodied decision-making.

major comments (2)
  1. [Abstract] Abstract: the central claim that REIS 'significantly suppresses reasoning overhead' rests on the unquantified observation of 'substantial temporal redundancy'; no metrics (e.g., fraction of identical consecutive actions/subgoals across ALFRED trajectories or robot logs, or ablation across other VLMs) are reported, so it is impossible to verify whether the property is general or setup-specific.
  2. [Abstract] Abstract: without baseline comparisons, error bars, or explicit measurement of redundancy and overhead (e.g., tokens or latency before/after each REIS component), the performance claims cannot be assessed for soundness or generality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the need to strengthen the abstract with explicit quantification. We address each major comment below and will revise the manuscript to incorporate the requested metrics and clarifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that REIS 'significantly suppresses reasoning overhead' rests on the unquantified observation of 'substantial temporal redundancy'; no metrics (e.g., fraction of identical consecutive actions/subgoals across ALFRED trajectories or robot logs, or ablation across other VLMs) are reported, so it is impossible to verify whether the property is general or setup-specific.

    Authors: We agree that the abstract would benefit from explicit quantification of the temporal redundancy observation. The full manuscript provides supporting experimental analysis on ALFRED and real-robot trajectories, but we will revise the abstract (and add a dedicated quantification subsection) to report concrete metrics such as the fraction of identical consecutive actions and subgoals. Ablations across additional VLMs will be included where existing data permits; new experiments on further models are not feasible within the revision timeline but can be noted as future work. revision: yes

  2. Referee: [Abstract] Abstract: without baseline comparisons, error bars, or explicit measurement of redundancy and overhead (e.g., tokens or latency before/after each REIS component), the performance claims cannot be assessed for soundness or generality.

    Authors: The experimental section already contains baseline comparisons, latency/token measurements, and component-wise breakdowns. To directly address the abstract-level concern, we will revise the abstract to include key quantitative results (with error bars) and explicit before/after overhead numbers for each REIS component (scene gating, KV-steered routing, deliberative reasoning). This will make the performance claims immediately verifiable from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; framework rests on empirical observation without self-referential derivations

full rationale

The paper states an observation about temporal redundancy in robotic reasoning workloads and introduces REIS components (scene gating, KV-steered routing, deliberative reasoning) as engineering responses, validated empirically on ALFRED and real-robot tasks. No equations, parameter-fitting steps, self-citations, or uniqueness theorems appear in the provided text. The central claim of overhead suppression follows from experimental results rather than any reduction of outputs to inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or explicit assumptions beyond the stated observation of temporal redundancy; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5658 in / 1105 out tokens · 22461 ms · 2026-06-28T22:23:42.272400+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025. URLhttps://openaccess.thecvf.com/ conten...

  2. [2]

    J. Wen, Y . Zhu, M. Zhu, Z. Tang, J. Li, Z. Zhou, X. Liu, C. Shen, Y . Peng, and F. Feng. DiffusionVLA: Scaling robot foundation models via unified diffusion and autoregression. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 66558–66574. PMLR, 2025. URL https://proceedin...

  3. [3]

    Jiang, J

    W. Jiang, J. Clemons, K. Sankaralingam, and C. Kozyrakis. How fast can i run my vla? demystifying vla inference performance with vla-perf, 2026. URLhttps://arxiv.org/ abs/2602.18397

  4. [4]

    Y . Dai, H. Gu, T. Wang, Q. Cheng, Y . Zheng, Z. Qiu, L. Gong, W. Lou, and X. Zhou. Ac- tionflow: A pipelined action acceleration for vision language models on edge.arXiv preprint arXiv:2512.20276, 2025. URLhttps://arxiv.org/abs/2512.20276

  5. [5]

    Scarciglia, A

    L. Scarciglia, A. Paolillo, and D. Palossi. A map-free deep learning-based framework for gate-to-gate monocular visual navigation aboard miniaturized aerial vehicles. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025. URLhttps://arxiv. org/abs/2503.05251

  6. [6]

    Y . Qian, Y . Zhu, J. Luo, L. Liu, Y . Yuan, G. Ning, and H. Liao. Breaking the latency barrier: Synergistic perception and control for high-frequency 3D ultra- sound servoing. In2026 IEEE International Conference on Robotics and Automa- tion (ICRA), 2026. URLhttps://ras.papercept.net/conferences/conferences/ ICRA26/program/ICRA26_ContentListWeb_4.html

  7. [7]

    Z. Duan, Y . Zhang, S. Geng, G. Liu, J. Boedecker, and C. X. Lu. Fast ecot: Efficient embodied chain-of-thought via thoughts reuse. In2026 IEEE International Conference on Robotics and Automation (ICRA), 2026. URLhttps://arxiv.org/abs/2506.07639

  8. [8]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Dif- fusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024. doi:10.1177/02783649241273668. URLhttps://doi.org/10. 1177/02783649241273668

  9. [9]

    Z. Wang, Z. Li, A. Mandlekar, Z. Xu, J. Fan, Y . Narang, L. Fan, Y . Zhu, Y . Balaji, M. Zhou, M.-Y . Liu, and Y . Zeng. One-step diffusion policy: Fast visuomotor policies via diffusion distillation. InInternational Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=E2VsqgKNlr

  10. [10]

    Zhong, J

    Z. Zhong, J. Li, J. He, H. Yan, X. Gong, G. Zhao, Y . Cai, J. Gao, X. Yan, B. Liu, Y . Chen, L. Yang, and H. Li. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models.arXiv preprint arXiv:2603.22280, 2026. URLhttps: //arxiv.org/abs/2603.22280

  11. [11]

    Y . Duan, H. Yin, and D. Kragic. Real-time iteration scheme for diffusion policy. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. URL https://arxiv.org/abs/2508.05396. 9

  12. [12]

    Zhang, Y

    J. Zhang, Y . Guo, X. Chen, Y .-J. Wang, Y . Hu, C. Shi, and J. Chen. Hirt: Enhancing robotic control with hierarchical robot transformers. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Proceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 933–946. PMLR, 06–09 Nov 2025. URLhttps:// proceedin...

  13. [13]

    Q. Bu, H. Li, L. Chen, J. Cai, J. Zeng, H. Cui, M. Yao, and Y . Qiao. Towards synergistic, generalized and efficient dual-system for robotic manipulation, 2024. URLhttps://arxiv. org/abs/2410.08001

  14. [14]

    B. Han, J. Kim, and J. Jang. A dual process vla: Efficient robotic manipulation leveraging vlm. InConference on Robot Learning Workshop on Language and Robot Learning, pages 1–10,

  15. [15]

    URLhttps://arxiv.org/abs/2410.15549

  16. [16]

    J. Liu, P. Zhao, Z. Kong, X. Shen, P. Dong, F. Yang, L. Cui, H. Tang, G. Yuan, W. Niu, W. Zhang, X. Lin, G. Liu, Y . Wang, and D. Huang. When should a robot think? resource-aware reasoning via reinforcement learning for embodied robotic decision-making.arXiv preprint arXiv:2603.16673, 2026. URLhttps://arxiv.org/abs/2603.16673

  17. [17]

    Hu, J.-N

    Y . Hu, J.-N. Zaech, N. Nikolov, Y . Yao, S. Dey, G. Albanese, R. Detry, L. Van Gool, and D. Paudel. AR-VLA: True autoregressive action expert for vision-language-action models. In Robotics: Science and Systems, 2026. URLhttps://arxiv.org/abs/2603.10126

  18. [18]

    J. Tang, Y . Sun, Y . Zhao, S. Yang, Y . Lin, Z. Zhang, J. Hou, Y . Lu, Z. Liu, and S. Han. VLASH: Real-time VLAs via future-state-aware asynchronous inference.arXiv preprint arXiv:2512.01031, 2025. URLhttps://arxiv.org/abs/2512.01031

  19. [19]

    Huang, Y

    Z. Huang, Y . Zhang, J. Liu, R. Song, C. Tang, and J. Ma. TIC-VLA: A think-in-control vision-language-action model for robot navigation in dynamic environments. InInternational Conference on Machine Learning, 2026. URLhttps://arxiv.org/abs/2602.02459

  20. [21]

    URLhttps://arxiv.org/abs/2603.06480

  21. [22]

    H. Xu, Z. Liu, Y . Luomei, and F. Xu. Aerial vision-language navigation with a unified frame- work for spatial, temporal and embodied reasoning.arXiv preprint arXiv:2512.08639, 2025. URLhttps://arxiv.org/abs/2512.08639

  22. [23]

    S. Wang, Y . Luo, X. Chen, A. Luo, D. Li, C. Liu, S. Chen, Y . Zhang, and J. Yu. VLingNav: Embodied navigation with adaptive reasoning and visual-assisted linguistic memory.arXiv preprint arXiv:2601.08665, 2026. URLhttps://arxiv.org/abs/2601.08665

  23. [24]

    Luo and X

    K. Luo and X. Ma. EmergeNav: Structured embodied inference for zero-shot vision-and- language navigation in continuous environments.arXiv preprint arXiv:2603.16947, 2026. URLhttps://arxiv.org/abs/2603.16947

  24. [25]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In P. Agrawal, O. Kroe- mer, and W. Burgard, editors,Proceedings of The 8th Conference on Robot L...

  25. [26]

    GeForce RTX 4090 graphics cards for gaming.https://www.nvidia.com/ en-us/geforce/graphics-cards/40-series/rtx-4090/, 2026

    NVIDIA. GeForce RTX 4090 graphics cards for gaming.https://www.nvidia.com/ en-us/geforce/graphics-cards/40-series/rtx-4090/, 2026. Accessed: 2026-05-15. 10

  26. [27]

    Jetson AGX Orin for next-gen robotics.https://www.nvidia.com/en-us/ autonomous-machines/embedded-systems/jetson-orin/, 2026

    NVIDIA. Jetson AGX Orin for next-gen robotics.https://www.nvidia.com/en-us/ autonomous-machines/embedded-systems/jetson-orin/, 2026. Accessed: 2026-05- 15

  27. [28]

    D. Shah, B. Osinski, B. Ichter, and S. Levine. LM-nav: Robotic navigation with large pre- trained models of language, vision, and action. In6th Annual Conference on Robot Learning,

  28. [29]

    URLhttps://openreview.net/forum?id=UW5A3SweAH

  29. [30]

    RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,

    N. Yokoyama, S. Qian, M. Chang, T.-Y . Yang, and W. Wang. Vlfm: Vision-language fron- tier maps for zero-shot semantic navigation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 111–117. IEEE, 2024. doi:10.1109/ICRA57147.2024. 10610287. URLhttps://ieeexplore.ieee.org/document/10610287

  30. [31]

    Z. Yin, C. Cheng, , Y . Guo, and Z. Li. Navigation with vlm framework: Towards going to any language, 2025. URLhttps://arxiv.org/abs/2410.02787

  31. [32]

    Blukis, C

    V . Blukis, C. Paxton, D. Fox, A. Garg, and Y . Artzi. A persistent spatial semantic representation for high-level natural language instruction execution. InProceedings of The 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 706–717. PMLR, 2022. URLhttps://proceedings.mlr.press/v164/blukis22a.html

  32. [33]

    de Heuvel, W

    J. de Heuvel, W. Shi, X. Zeng, and M. Bennewitz. Subgoal-driven navigation in dynamic en- vironments using attention-based deep reinforcement learning.Field Robotics, 4:182–210,

  33. [34]

    URLhttps://fieldrobotics.net/Field_Robotics/Volume_4_files/Vol4_ 07.pdf

  34. [35]

    W. Hu, Y . Zhou, and H. W. Ho. Hierarchical deep deterministic policy gradient for autonomous maze navigation of mobile robots.arXiv preprint arXiv:2508.04994, 2025. URLhttps: //arxiv.org/abs/2508.04994

  35. [36]

    C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y . Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023. URL https://openaccess.thecvf.com/content/ICCV2023/html/Song_LLM-Planner_ Few-Shot_Grounded_Plannin...

  36. [37]

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on Robot Learning (CoRL), 2022

  37. [38]

    V2xp-asg: Generating adversarial scenes for vehicle-to-everything perception

    I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models. In2023 IEEE International Conference on Robotics and Automation (ICRA), 2023. doi:10. 1109/ICRA48891.2023.10161317. URLhttps://doi.org/10.1109/ICRA48891.2023. 10161317

  38. [39]

    Dalal, T

    M. Dalal, T. Chiruvolu, D. S. Chaplot, and R. Salakhutdinov. Plan-seq-learn: Language model guided reinforcement learning for solving long horizon robotics tasks.arXiv preprint arXiv:2405.01534, 2024. URLhttps://arxiv.org/abs/2405.01534

  39. [40]

    M. F. Ginting, S.-K. Kim, D. D. Fan, M. Palieri, M. J. Kochenderfer, and A.-a. Agha- Mohammadi. Seek: Semantic reasoning for object goal navigation in real world inspection tasks. InRobotics: Science and Systems, 2024. URLhttps://arxiv.org/abs/2405. 09822

  40. [41]

    X. Zhou, T. Xiao, L. Liu, Y . Wang, M. Chen, X. Meng, X. Wang, W. Feng, W. Sui, and Z. Su. FSR-VLN: Fast and slow reasoning for vision-language navigation with hierarchical multi- modal scene graph.arXiv preprint arXiv:2509.13733, 2025. URLhttps://arxiv.org/ abs/2509.13733. 11

  41. [42]

    G. Li, N. Tsagkas, J. Song, R. Mon-Williams, S. Vijayakumar, K. Shao, and L. Sevilla- Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. URLhttps://openaccess.thecvf.com/content/ICCV2025/html/Li_Learning_ Precise_Affordances_from_Egocentric_Vi...

  42. [43]

    M. Pan, J. Zhang, T. Wu, Y . Zhao, W. Gao, and H. Dong. Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. URLhttps://openaccess.thecvf.com/content/CVPR2025/html/Pan_OmniManip_ Towards_General_Robotic_M...

  43. [44]

    Shridhar, J

    M. Shridhar, J. Thomason, D. Gordon, Y . Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10740–10749, 2020

  44. [45]

    C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Mart ´ın-Mart´ın, C. Wang, G. Levine, W. Ai, B. Martinez, H. Yin, M. Lingelbach, M. Hwang, A. Hiranaka, S. Garlanka, A. Ay- din, S. Lee, J. Sun, M. Anvari, M. Sharma, D. Bansal, S. Hunter, K.-Y . Kim, A. Lou, C. R. Matthews, I. Villa-Renteria, J. H. Tang, C. Tang, F. Xia, Y . Li, S. Savarese, H. Gweon...

  45. [46]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Fu, S. Guadarrama, et al. Rt-2: Vision-language-action models transfer knowledge to robotics.Conference on Robot Learning (CoRL), 2023

  46. [48]

    URLhttps://arxiv.org/abs/2410.24164

  47. [49]

    Belitsky, D

    M. Belitsky, D. J. Kopiczko, M. Dorkenwald, M. J. Mirza, J. R. Glass, C. G. M. Snoek, and Y . M. Asano. Kv cache steering for controlling frozen llms.arXiv preprint arXiv:2507.08799,

  48. [50]

    doi:10.48550/arXiv.2507.08799

  49. [51]

    Budzianowski, W

    P. Budzianowski, W. Maa, M. Freed, J. Mo, W. Hsiao, A. Xie, T. Mloduchowski, V . Tip- nis, and B. Bolte. Edgevla: Efficient vision-language-action models. InIROS 2024 Workshop on Mobile Manipulation and Embodied Intelligence, 2024. URLhttps:// mobile-manipulation.net/events/moma-iros24/

  50. [52]

    M. Xu, D. Niyato, H. Zhang, J. Kang, Z. Xiong, S. Mao, and Z. Han. Joint foundation model caching and inference of generative AI services for edge intelligence. In2023 IEEE Global Communications Conference (GLOBECOM), pages 3548–3553, 2023. doi:10.1109/ GLOBECOM54140.2023.10436771. URLhttps://doi.org/10.1109/GLOBECOM54140. 2023.10436771

  51. [53]

    Huang, I

    H. Huang, I. Fedorov, A. Gromov, B. Beckerman, N. Suda, D. Eriksson, M. Balandat, R. Con- way, P. Huber, C. Sankar, A. Dalmia, Z. Liu, L. Wu, T. Elgamal, A. Sagar, V . Chandra, and R. Krishnamoorthi. MobileLLM-flash: Latency-guided on-device LLM design for industry scale. InProceedings of the Annual Meeting of the Association for Computational Linguistics...

  52. [55]

    URLhttps://arxiv.org/abs/2510.14624

  53. [56]

    J. Li, M. Wu, J. Cao, A. Tiulpin, and M. B. Blaschko. EchoPrune: Interpreting redundancy as temporal echoes for efficient VideoLLMs.arXiv preprint arXiv:2605.10050, 2026. URL https://arxiv.org/abs/2605.10050

  54. [57]

    H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y . Guo, C.-W. Fu, S. Zhang, and P.-A. Heng. Fast-in-slow: A dual-system vla model unifying fast manipulation within slow reasoning. InAdvances in Neural Information Processing Systems, 2025. URLhttps:// openreview.net/forum?id=4asFznbzJg

  55. [58]

    Sakurada, M

    K. Sakurada, M. Shibuya, and W. Wang. Weakly Supervised Silhouette-based Semantic Scene Change Detection. InIEEE International Conference on Robotics and Automation (ICRA), pages 6861–6867. IEEE, 2020. doi:10.1109/ICRA40945.2020.9196985

  56. [59]

    B. Liu, Y . Liu, Z. Cui, S. Huang, and Y . Zhu. Libero: Benchmarking knowledge transfer in lifelong robot learning. InThirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023

  57. [60]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. doi:10.48550/arXiv.2511. 21631. URLhttps://arxiv.org/abs/2511.21631

  58. [61]

    Codepy-Deepthink-3B.https://huggingface.co/prithivMLmods/ Codepy-Deepthink-3B, 2024

    prithivMLmods. Codepy-Deepthink-3B.https://huggingface.co/prithivMLmods/ Codepy-Deepthink-3B, 2024. Hugging Face model card, accessed 2026-05-26

  59. [62]

    Llama-3.2-3B-Instruct.https://huggingface.co/meta-llama/Llama-3

    Meta. Llama-3.2-3B-Instruct.https://huggingface.co/meta-llama/Llama-3. 2-3B-Instruct, 2024. Hugging Face model card, accessed 2026-05-26

  60. [63]

    T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. Transformers: State-of-the-art natural lan- guage processing. In Q. Liu and D. Schlangen, editors,Proceedings of the 2...

  61. [64]

    Instruction

    S.-A. Yu, F. Gao, Y . Wu, C. Yu, and Y . Wang. D3P: Dynamic denoising diffusion policy via reinforcement learning.arXiv preprint arXiv:2508.06804, 2025. URLhttps://arxiv.org/ abs/2508.06804. 13 A Appendix A.1 Affordance Router Details: Prompts and Conditions This section provides the comprehensive prompt templates and the complete mapping matrix utilized ...