pith. machine review for the scientific record.

arxiv: 2604.13959 · v1 · submitted 2026-04-15 · 💻 cs.AI

Recognition: unknown

[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords physical AI · sensor-first architecture · bio-inspired design · adaptive sensing · tripartite intelligence · edge-cloud execution · mobile camera · embodied AI

The pith

A bio-inspired tripartite architecture for physical AI keeps adaptive sensing on-device and invokes higher inference only when needed, raising accuracy while cutting remote calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Physical AI must operate under strict limits on latency, energy, privacy, and reliability that cannot be solved by larger models alone. The paper proposes Artificial Tripartite Intelligence as a sensor-first contract that splits responsibilities into three layers: reflexive safety and signal control at the base, continuous calibration in the middle, and selective inference at the top. This organization lets sensing adapt in real time while reserving edge-cloud reasoning for when it is truly required. In a mobile camera prototype facing changing light and motion, the approach produced clear gains over standard auto-exposure. A sympathetic reader would care because such co-design could make embodied systems practical where scaling compute fails.

Core claim

The paper establishes that structuring physical AI into a Brainstem layer (L1) for reflexive safety and signal-integrity control, a Cerebellum layer (L2) for continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3 and L4 for skill selection, coordination, and deep reasoning lets sensor control, adaptive sensing, edge-cloud execution, and foundation-model reasoning co-evolve inside one closed-loop architecture, with time-critical operations kept local and higher inference invoked only as needed.

What carries the argument

Artificial Tripartite Intelligence (ATI), a three-layer bio-inspired division in which L1 provides reflexive sensing control, L2 performs ongoing calibration, and L3/L4 handle inference, enabling closed-loop adaptation.
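
To make the division of labor concrete, here is a minimal sketch of one tick of the closed loop as the paper describes it. The layer interfaces, the luminance statistic, and the calibration update rule are illustrative assumptions, not the authors' implementation; only the layer roles and the escalation threshold τ_conf = 0.5 (Figure 7's best value) come from the paper.

```python
from dataclasses import dataclass

TAU_CONF = 0.5  # L3 escalation threshold; 0.5 is the best value in the paper's Fig. 7 ablation


@dataclass
class Frame:
    pixels: object      # raw sensor readout (placeholder)
    mean_luma: float    # cheap signal-integrity statistic in [0, 1] (assumed)


def l1_reflex_ok(frame: Frame) -> bool:
    """L1 Brainstem: reflexively reject frames whose signal integrity is
    unusable (e.g., fully saturated or black) before anything else runs."""
    return 0.05 < frame.mean_luma < 0.95


def l2_calibrate(frame: Frame, params: dict) -> dict:
    """L2 Cerebellum: continuously nudge sensor parameters toward a usable
    operating point (placeholder proportional update, not the paper's rule)."""
    target = 0.5
    params["exposure"] *= 1.0 + 0.3 * (target - frame.mean_luma)
    return params


def step(frame: Frame, params: dict, l3_model, l4_remote):
    """One closed-loop tick: reflex gate, calibrate, infer locally, and
    escalate to the remote L4 model only when L3 confidence is low."""
    if not l1_reflex_ok(frame):           # L1: reflexive gate, always local
        return None, params
    params = l2_calibrate(frame, params)  # L2: on-device calibration
    label, conf = l3_model(frame)         # L3: lightweight local inference
    if conf < TAU_CONF:                   # escalate only when needed
        label = l4_remote(frame)          # L4: remote foundation model
    return label, params
```

The point of the sketch is the control flow: L1 and L2 run on every frame on-device, while L4 appears only behind the confidence gate, which is how the architecture cuts remote invocations.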

If this is right

  • Time-critical sensing and reflexive control remain on-device, lowering latency and energy use.
  • Higher-level foundation model reasoning is invoked selectively rather than continuously.
  • Sensor control and inference components can co-evolve within the same modular architecture.
  • The design meets combined constraints on latency, energy, privacy, and reliability more effectively than model scaling alone.
  • Performance gains appear in dynamic environments such as varying lighting and motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered split could be tested in other sensor-rich domains such as autonomous navigation or wearable health monitoring to check whether the three-part structure generalizes.
  • Future prototypes might vary the boundaries between L1, L2, and L3/L4 to determine which division of labor yields the largest efficiency gains.
  • The results suggest that hardware-software co-design starting from controllable sensors may be more important for physical AI than further increases in model size.
  • Keeping more processing local could reduce data transmission and thereby improve privacy, an implication the prototype does not directly measure.

Load-bearing premise

That the tripartite bio-inspired split of sensing, calibration, and inference responsibilities will deliver benefits for physical AI systems beyond the single tested mobile camera prototype.

What would settle it

Running the same routed evaluation on a different physical AI embodiment such as a robot arm or drone and finding that end-to-end accuracy stays near 53.8 percent with no reduction in remote L4 invocations would falsify the claim of general value.

Figures

Figures reproduced from arXiv: 2604.13959 by Hyung-Sin Kim, Subeom Park, You Rim Choi.

Figure 1: Limitations of prevailing paradigms for physical AI.
Figure 2: ATI architecture. (a) ATI role mapping for a camera-based system. (b) General ATI interfaces, from physical-world …
Figure 3: Experimental setup for the prototype. A Galaxy S25 …
Figure 4: L2 sensor calibration via contextual bandits: learn …
Figure 5: Evolution of camera parameters and confidence …
Figure 6: Sample images of the eight object classes used in …
Figure 7: Ablation of the L3 escalation threshold τ_conf (EfficientNet-Lite0): (Left) Accuracy. (Middle) Escalation rate. (Right) Accuracy–escalation trade-off; best at τ_conf = 0.5. Within ATI, the split L1/L2–L3–L4 pipeline outperforms both L1/L2–L3 and L1/L2–L4, indicating that neither L3-only nor L4-only inference is sufficient by itself; the gain comes from selective routing and …
Figure 8: Performance under dynamic lighting on the Teddy …
Figure 9: Qualitative comparison of captured frames under …
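
Figure 4 frames L2 calibration as a contextual bandit. As a minimal illustration of that framing, the sketch below runs an ε-greedy bandit over discrete exposure steps; the arm set, the bucketed context, and the use of downstream classifier confidence as reward are all assumptions for exposition, not the paper's formulation.

```python
import random
from collections import defaultdict

ARMS = [-2, -1, 0, 1, 2]   # discrete exposure-compensation steps (assumed)
EPSILON = 0.1              # exploration rate (assumed)

q = defaultdict(float)     # running value estimate per (context bucket, arm)
n = defaultdict(int)       # pull counts per (context bucket, arm)


def discretize(mean_luma: float) -> int:
    """Map a cheap frame statistic (mean luma in [0, 1]) to a coarse bucket."""
    return min(int(mean_luma * 4), 3)


def choose(mean_luma: float) -> int:
    """Pick an exposure step: explore with probability EPSILON, else exploit."""
    bucket = discretize(mean_luma)
    if random.random() < EPSILON:
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: q[(bucket, a)])


def update(mean_luma: float, arm: int, reward: float) -> None:
    """Incremental-mean update of the chosen arm's value estimate."""
    key = (discretize(mean_luma), arm)
    n[key] += 1
    q[key] += (reward - q[key]) / n[key]

# Per frame: choose() an exposure step from the context, apply it to the
# sensor, then update() the bandit with a downstream reward such as the
# L3 classifier's confidence on the resulting frame.
```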
Original abstract

As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results show the value of co-designing sensing and inference for embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. It divides the system into a Brainstem (L1) for reflexive safety and signal-integrity control, a Cerebellum (L2) for continuous sensor calibration, and a Cerebral Inference Subsystem (L3/L4) for skill selection, execution, coordination, and deep reasoning. The architecture is instantiated in a mobile camera prototype under dynamic lighting and motion, where a routed L3-L4 evaluation reports that ATI adaptive sensing raises end-to-end accuracy from 53.8% to 88% while cutting remote L4 invocations by 43.3% relative to default auto-exposure.

Significance. If the empirical claims can be substantiated with proper controls and baselines, the work would offer a concrete modular framework for co-designing controllable sensors with edge-cloud inference in embodied systems. The emphasis on keeping time-critical sensing on-device while invoking higher-level reasoning only when needed addresses real constraints in latency, energy, and privacy for physical AI. The tripartite division provides a reusable contract that could guide future sensor-inference co-evolution, though its generality beyond the single prototype remains to be demonstrated.

major comments (2)
  1. [Abstract] The central quantitative claim states that ATI (L1/L2 adaptive sensing) improves accuracy from 53.8% to 88% and reduces L4 invocations by 43.3% versus default auto-exposure in a routed L3-L4 evaluation. No task definition, dataset, trial count, variance, statistical tests, or implementation details of the L1 reflexive or L2 calibration modules are supplied. Without these, the result cannot be assessed for support of the architectural claim.
  2. [Prototype evaluation] The reported gains are compared solely against default auto-exposure. To show that the tripartite structure (rather than generic adaptive sensing) is load-bearing, the evaluation must include at least one non-ATI adaptive baseline (e.g., histogram equalization, PID exposure control, or a learned policy). Absent such controls, the 34.2 pp accuracy lift and 43.3% invocation reduction cannot be attributed specifically to the L1/L2/L3-L4 division.
minor comments (2)
  1. [Introduction] The bio-inspired terminology (Brainstem, Cerebellum) is used consistently but would benefit from a brief table or paragraph explicitly mapping each level's functions to the corresponding biological roles to clarify the analogy's scope.
  2. [Abstract] Clarify the precise meaning of 'end-to-end accuracy' and the mobile-camera task (e.g., object classification, detection, or tracking under motion) so readers can judge the practical relevance of the 88% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on clarity and experimental rigor. We address each major point below and outline targeted revisions.

Point-by-point responses
  1. Referee: The abstract's central quantitative claim lacks task definition, dataset, trial count, variance, statistical tests, or implementation details of the L1 reflexive or L2 calibration modules, preventing assessment of support for the architectural claim.

    Authors: The abstract is kept concise per the Emerging Ideas format. The full manuscript defines the task as image classification under dynamic lighting and motion, describes the custom dataset collected with the mobile prototype, reports trial counts and variance in Section 4, and details L1 (threshold-based signal integrity) and L2 (continuous calibration loop) implementations in Section 3. We will revise the abstract to briefly reference the evaluation protocol, dataset size, and key L1/L2 mechanisms, and will add explicit statistical test results if not already highlighted. revision: yes

  2. Referee: Gains are compared only against default auto-exposure; to show the tripartite structure is load-bearing rather than generic adaptive sensing, at least one non-ATI adaptive baseline (e.g., histogram equalization or PID control) is required.

    Authors: We agree that sole comparison to the camera's default auto-exposure limits isolation of the L1/L2/L3-L4 contribution. The default represents the production baseline, but to demonstrate the specific value of the tripartite contract we will add a non-ATI adaptive baseline (PID-based exposure control) and report comparative accuracy and invocation metrics in the revised evaluation section. revision: yes
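
For concreteness, the PID exposure baseline the rebuttal commits to adding could look like the sketch below. The gains, setpoint, and multiplicative correction are illustrative assumptions, not tuned values from the paper; the point is only that such a controller adapts exposure without any tripartite structure, which is what makes it a useful control condition.

```python
class PIDExposure:
    """Classic PID controller driving exposure toward a target mean luminance."""

    def __init__(self, kp=0.8, ki=0.1, kd=0.05, setpoint=0.5):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target mean luma in [0, 1] (assumed)
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, mean_luma: float, dt: float = 1.0) -> float:
        """Return a multiplicative exposure correction for the next frame."""
        error = self.setpoint - mean_luma
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return 1.0 + (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)

# Usage: exposure_next = exposure_prev * pid.update(frame_mean_luma)
```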

Circularity Check

0 steps flagged

No circularity: architectural proposal plus empirical prototype measurement

full rationale

The paper advances a tripartite bio-inspired architecture (L1 brainstem reflexive control, L2 cerebellum calibration, L3/L4 cerebral inference) and reports a single prototype experiment on a mobile camera under dynamic lighting. The headline result (53.8% to 88% accuracy, 43.3% fewer L4 calls) is presented as a direct empirical comparison against default auto-exposure, not as a mathematical prediction or derivation. No equations, fitted parameters, ansatzes, or derivation chains appear in the manuscript. Consequently there are no opportunities for self-definition, fitted-input-as-prediction, or self-citation load-bearing reductions. The work is self-contained as an architectural contract plus measured outcome; the tripartite division is offered as a design choice whose value is tested experimentally rather than deduced from prior self-referential premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the bio-inspiration mapping and the unreplicated prototype result; the three subsystems are defined within the paper without external grounding.

axioms (1)
  • domain assumption Bio-inspired brain structures can be directly mapped to effective layered architectures for physical AI.
    Invoked to justify the Brainstem-Cerebellum-Cerebral division.
invented entities (3)
  • Brainstem (L1) no independent evidence
    purpose: Reflexive safety and signal-integrity control
    New subsystem introduced by the architecture.
  • Cerebellum (L2) no independent evidence
    purpose: Continuous sensor calibration
    New subsystem introduced by the architecture.
  • Cerebral Inference Subsystem (L3/L4) no independent evidence
    purpose: Skill selection, coordination, and deep reasoning
    New subsystem introduced by the architecture.

pith-pipeline@v0.9.0 · 5555 in / 1488 out tokens · 43405 ms · 2026-05-10T12:50:22.995975+00:00 · methodology

