Semantic-based Internet of Embodied Intelligence: Visions and Frontiers

Feiliang Song; Huishi Song; Lexi Xu; Linyuan Hu; Ping Zhang; Rui Meng; Tony Q. S. Quek; Xiaodong Xu; Yaheng Wang; Yiming Liu

arxiv: 2607.00342 · v1 · pith:I7KUGIYJnew · submitted 2026-07-01 · 📡 eess.SP

Pith reviewed 2026-07-02 00:39 UTC · model grok-4.3

classification 📡 eess.SP

keywords: semantic IoEI, embodied intelligence, semantic communication, multi-agent systems, perception and control, networking, latency reduction, channel robustness

The pith

Semantic information serves as a unified metric integrating perception, intelligence, control, and communication for networks of embodied agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Semantic-based Internet of Embodied Intelligence to handle massive multimodal data overhead and the split between logical reasoning and physical constraints when scaling embodied agents into networks. It defines four dimensions of embodied intelligence and describes how semantic information can transform environmental perception, cognition and task planning, action generation and robust control, plus communication and networking. A case study verifies gains in channel robustness and lower end-to-end latency. This approach matters if it allows multi-agent physical systems to operate with compact meaning exchanges instead of raw data volumes.

Core claim

The paper claims that semantic information leveraged as a unified metric throughout the agent lifecycle revolutionizes environmental perception, cognition and task planning, action generation and robust control, and communication and networking, with a case study verifying significant improvements in channel robustness and reduced end-to-end latency for EI.

What carries the argument

The SIoEI paradigm, which applies semantic information as a unified metric across the four dimensions of perception, intelligence, control, and communication.

If this is right

Semantic processing enhances environmental perception for embodied agents.
Cognition and task planning align more closely with physical constraints.
Action generation and control gain robustness against uncertainties.
Communication and networking achieve lower latency and higher robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Multi-agent embodied systems could scale with far lower bandwidth demands if meanings replace raw sensor streams.
The unified metric may reduce mismatches between AI planning outputs and real-world actuator limits.
Standard ways to extract and share semantics across heterogeneous agents would need development for broad adoption.
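
As a rough illustration of the bandwidth point above, the sketch below compares the payload of one uncompressed camera frame with that of one semantic feature vector. The frame size (640x480 RGB), the embedding width (512 values), and the 16-bit precision are all assumptions chosen for illustration; none of them comes from the paper, and the function names are hypothetical.

```python
# Back-of-envelope payload comparison.
# All sizes below are illustrative assumptions, not values from the paper.


def raw_frame_bytes(width: int, height: int, channels: int = 3,
                    bytes_per_sample: int = 1) -> int:
    """Size in bytes of one uncompressed camera frame."""
    return width * height * channels * bytes_per_sample


def embedding_bytes(dim: int, bytes_per_value: int = 2) -> int:
    """Size in bytes of one semantic feature vector (default: 16-bit values)."""
    return dim * bytes_per_value


def reduction_factor(raw: int, semantic: int) -> float:
    """How many times smaller the semantic payload is than the raw one."""
    return raw / semantic


raw = raw_frame_bytes(640, 480)   # hypothetical VGA RGB frame
sem = embedding_bytes(512)        # hypothetical 512-value, 16-bit embedding
factor = reduction_factor(raw, sem)
print(raw, sem, factor)           # 921600 1024 900.0
```

A practical baseline would send a compressed frame (for example JPEG) rather than raw pixels, so the real gap would likely be smaller than this upper-bound style estimate.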

Load-bearing premise

Semantic information can be reliably extracted, represented, and applied as a single metric across perception, intelligence, control, and communication without losing critical physical details or introducing new errors in embodied agents.

What would settle it

A direct comparison experiment on multi-agent embodied systems showing that semantic processing fails to improve or worsens channel robustness and end-to-end latency relative to non-semantic baselines.

Figures

Figures reproduced from arXiv: 2607.00342 by Feiliang Song, Huishi Song, Lexi Xu, Linyuan Hu, Ping Zhang, Rui Meng, Tony Q. S. Quek, Xiaodong Xu, Yaheng Wang, Yiming Liu.

**Figure 1.** Representative semantic-empowered technologies across the four EI dimensions: including environmental perception, cognition and task planning, …

**Figure 2.** Block diagrams of the three embodied-agent communication pipelines. (a) Baseline (JPEG+LDPC+VGR). (b) SemComm (SwinJSCC+VGR). (c) …

**Figure 3.** Task success rate of the three schemes across SNR

Original abstract

Recent advances in generative artificial intelligence (AI) and embodied intelligence (EI) enable autonomous agents to interact with the physical world. However, scaling these systems into networks of multiple agents, namely the Internet of EI (IoEI), faces critical bottlenecks. These include the overhead of massive multimodal data transmission and the decoupling of logical reasoning from physical constraints. To address these challenges, we envision the Semantic-based IoEI (SIoEI), which leverages semantic information as a unified metric throughout the agent lifecycle. We systematically define four key dimensions of EI: perception, intelligence, control, and communication. We further elaborate how semantic empowerment revolutionizes environmental perception, cognition and task planning, action generation and robust control, and communication and networking. We also present a case study to verify that the semantic-empowered end-to-end process significantly improves channel robustness and reduces end-to-end latency for EI. Finally, we outline critical open research directions for the SIoEI paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Vision paper proposing semantic info as a unifier for embodied agent networks across four dimensions, but with thin evidence behind the claims.

The main takeaway is that this is a vision paper proposing Semantic-based IoEI, where semantic information acts as a unified metric across the agent lifecycle in four dimensions: perception, intelligence, control, and communication. It claims this approach addresses data overhead and decoupling issues in multi-agent systems.

The paper organizes existing concepts from semantic communications and embodied AI into a coherent structure. Mapping how semantics could revolutionize each dimension provides a useful way to think about the integration. The outline of critical open research directions is also helpful for guiding future work in this area.

On the downside, the supporting evidence is thin. The case study is referenced to show gains in channel robustness and reduced latency, but without any description of the setup or results, it's impossible to assess the strength of those claims. The key assumption that semantic information can be extracted and applied without losing critical physical details remains untested in the provided text.

This paper is aimed at researchers in semantic communications and embodied intelligence who are interested in scaling to networked systems. It offers a big-picture perspective rather than specific methods or data.

It deserves a serious referee because the framework is logically presented and the open questions are relevant, even though the work is conceptual.

I would recommend sending it to peer review to get feedback on fleshing out the ideas.

Referee Report

1 major / 1 minor

Summary. The manuscript envisions the Semantic-based Internet of Embodied Intelligence (SIoEI) paradigm, which leverages semantic information as a unified metric across the agent lifecycle to overcome bottlenecks in scaling embodied intelligence (EI) systems, such as massive multimodal data transmission and decoupling of reasoning from physical constraints. It systematically defines four EI dimensions (perception, intelligence, control, communication), elaborates semantic empowerment in environmental perception, cognition/task planning, action generation/robust control, and communication/networking, presents a case study verifying improvements in channel robustness and end-to-end latency, and outlines open research directions.

Significance. If the vision holds, SIoEI could provide a unifying framework for semantic integration in multi-agent EI systems, directing research toward more efficient perception-to-action pipelines. The manuscript's strength lies in its structured definition of the four dimensions and explicit outline of critical open research directions, which offers a clear roadmap without relying on fitted parameters or self-referential definitions.

major comments (1)

[Case Study] Case study section: the claim that the semantic-empowered end-to-end process 'significantly improves channel robustness and reduces end-to-end latency' is presented without any description of the experimental setup, metrics used, quantitative results, baselines, or error analysis. This detail is load-bearing for the central claim that semantics yield verifiable gains.

minor comments (1)

The transition between the four EI dimensions and the semantic empowerment subsections could include explicit cross-references to avoid repetition in how semantics address physical constraints.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the manuscript's structured definition of the four EI dimensions along with its outline of open research directions. We address the single major comment below.

Referee: [Case Study] Case study section: the claim that the semantic-empowered end-to-end process 'significantly improves channel robustness and reduces end-to-end latency' is presented without any description of the experimental setup, metrics used, quantitative results, baselines, or error analysis. This detail is load-bearing for the central claim that semantics yield verifiable gains.

Authors: We agree that the case study, as currently presented, does not supply the necessary experimental details to support the stated performance claims. In the revised manuscript we will expand the case study section to include: (i) a complete description of the simulation/experimental setup (network topology, channel models, agent configurations), (ii) the precise metrics employed (e.g., packet error rate or semantic similarity for robustness; end-to-end latency in milliseconds), (iii) quantitative results with numerical values, (iv) explicit baselines (traditional bit-level transmission and non-semantic EI pipelines), and (v) an error analysis or statistical significance assessment. These additions will make the verification reproducible and will directly address the load-bearing nature of the claim.

Revision: yes
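
The metrics named in the rebuttal can be made concrete with a minimal sketch. The function names and the choice of cosine similarity as the semantic-similarity measure are assumptions for illustration only; the paper, as summarized here, does not specify how any of these quantities are computed.

```python
import math
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def packet_error_rate(sent: Sequence[bytes],
                      received: Sequence[Optional[bytes]]) -> float:
    """Fraction of packets that were lost (None) or arrived altered."""
    if len(sent) != len(received) or not sent:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for s, r in zip(sent, received) if r is None or r != s)
    return errors / len(sent)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def latency_summary(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of end-to-end latency, in ms."""
    return mean(samples_ms), stdev(samples_ms)
```

Reporting the standard deviation alongside the mean latency is one simple way to approach item (v); a full statistical significance assessment would need repeated trials per SNR point and per baseline.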

Circularity Check

0 steps flagged

No significant circularity

The paper is a vision and frontiers piece that defines four EI dimensions (perception, intelligence, control, communication) and conceptually elaborates prospective benefits of semantic information as a unifying metric. No equations, derivations, fitted parameters, or technical protocols are present in the provided text. The case study is invoked only at a high level to support robustness and latency claims without any reduction to self-referential inputs or self-citation chains. The central framing remains independent of any internal construction that would force the claimed outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is a conceptual vision paper; the central proposal rests on domain assumptions about semantic unification rather than new parameters or entities with independent evidence.

axioms (1)

Domain assumption: Semantic information can serve as a unified metric across perception, intelligence, control, and communication without loss of critical physical constraints.
Invoked throughout the abstract as the basis for revolutionizing all four EI dimensions.

invented entities (1)

Semantic-based IoEI (SIoEI): no independent evidence
Purpose: New paradigm to address data transmission overhead and reasoning-physical decoupling in multi-agent EI.
Introduced in the abstract as the proposed solution; no independent evidence or falsifiable prediction provided.

