Sema: Semantic Transport for Real-Time Multimodal Agents
Pith reviewed 2026-05-09 22:16 UTC · model grok-4.3
The pith
Sema shifts multimodal agent transport from raw signal fidelity to semantic meaning preservation, cutting uplink bandwidth by 64x for audio and 130-210x for screenshots while holding task accuracy within 0.7 percentage points of the raw baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sema replaces conventional transport stacks with semantic encoding: discrete audio tokens replace waveform data, and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens) replaces pixel dumps. Combined with bursty token delivery that eliminates jitter buffers, this moves the transport goal from Shannon-Weaver Level A signal fidelity to Level B meaning preservation, yielding the stated bandwidth reductions under emulated WAN conditions while keeping task accuracy within 0.7 percentage points of the raw baseline.
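The audio figure follows from simple rate arithmetic. As an illustration, the sketch below compares telephone-quality PCM against a residual-vector-quantized token stream; the sample rate, token rate, and codebook parameters are representative of RVQ-style neural codecs (SoundStream/EnCodec-class models), not values taken from the paper, chosen so the ratio lands on the reported 64x:

```python
# Illustrative bandwidth arithmetic for semantic audio transport.
# All parameters are representative of RVQ-style neural codecs,
# not measurements from the paper.
import math

def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Raw waveform uplink rate in bits per second."""
    return sample_rate_hz * bits_per_sample

def token_bitrate(token_rate_hz: int, codebooks: int, codebook_size: int) -> float:
    """Discrete-token uplink rate: each codebook index costs log2(size) bits."""
    return token_rate_hz * codebooks * math.log2(codebook_size)

raw = pcm_bitrate(16_000, 16)          # 256 kbps mono speech
semantic = token_bitrate(50, 8, 1024)  # 8 codebooks at 50 Hz -> 4 kbps
print(f"raw: {raw/1000:.0f} kbps, semantic: {semantic/1000:.0f} kbps, "
      f"reduction: {raw/semantic:.0f}x")   # -> reduction: 64x
```

Under these assumptions the 64x reduction is exact; real codecs trade token rate and codebook depth against fidelity, so the achievable ratio depends on how much acoustic detail the agent's tasks actually require.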
What carries the argument
A hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens), paired with discrete audio tokenizers and bursty token delivery that removes jitter buffers.
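A back-of-envelope sketch of the screen side, with hypothetical payload sizes (a typical PNG screenshot, a few kilobytes of accessibility-tree text, a few hundred visual tokens) standing in for measurements the review does not report:

```python
# Hypothetical payload sizes for one screen update; none of these numbers
# come from the paper -- they only illustrate how a hybrid encoding could
# land inside the reported 130-210x range.
SCREENSHOT_BYTES = 400_000   # ~400 KB PNG at laptop resolution
A11Y_TEXT_BYTES = 2_000      # lossless accessibility-tree / OCR text
VISUAL_TOKENS = 256          # compact visual tokens covering layout/pixels
BITS_PER_TOKEN = 14          # log2 of an assumed 16k-entry visual codebook

hybrid_bytes = A11Y_TEXT_BYTES + VISUAL_TOKENS * BITS_PER_TOKEN // 8
print(f"hybrid payload: {hybrid_bytes} bytes, "
      f"reduction: {SCREENSHOT_BYTES / hybrid_bytes:.0f}x")
```

With these placeholder values the hybrid payload is about 2.4 KB against a 400 KB screenshot, a roughly 160x reduction; screens dominated by text (where the lossless accessibility-tree path carries most of the semantics) would sit at the high end of the range.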
If this is right
- Screenshot upload no longer dominates end-to-end action latency on constrained uplinks.
- Voice pipelines avoid sending 43-64x more audio data than the agent actually needs.
- Agents can maintain near-baseline accuracy while operating on links that would previously force heavy compression or dropped frames.
- Transport stacks can be redesigned around event-driven semantic consumption rather than continuous perceptual playout.
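The last point, event-driven consumption rather than continuous playout, can be sketched as a minimal receive loop. A human-facing stack holds packets in a jitter buffer and releases them on a playout clock; an agent-facing stack can hand each token burst to the model the moment it arrives. The queue-based shape below is an assumption about how such a receiver might look, not the paper's implementation:

```python
import queue
import threading

def agent_receiver(bursts, consume):
    """Hand each token burst to the model the moment it arrives: no jitter
    buffer, no playout clock. Arrival order is the only timing that matters."""
    while True:
        burst = bursts.get()      # blocks until a burst arrives
        if burst is None:         # sender signals end of stream
            break
        consume(burst)            # model ingests the tokens as one event

# Minimal usage: tokens arrive in uneven bursts; the consumer just collects them.
received = []
q = queue.Queue()
worker = threading.Thread(target=agent_receiver, args=(q, received.extend))
worker.start()
for burst in ([1, 2, 3], [4], [5, 6]):   # a bursty, jittery arrival pattern
    q.put(burst)
q.put(None)
worker.join()
print(received)   # tokens in arrival order: [1, 2, 3, 4, 5, 6]
```

Because the model has no playout deadline, burst-to-burst jitter costs nothing here; only total arrival time of the task-relevant tokens matters.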
Where Pith is reading between the lines
- The same semantic-reduction principle could extend to other agent inputs such as continuous video or sensor streams once suitable tokenizers exist.
- If the encodings prove robust across tasks, they suggest a path toward parameter-free or low-overhead semantic layers in future machine-to-machine protocols.
- Deployment on real mobile uplinks would test whether the emulated-WAN gains survive variable packet loss and changing network conditions not covered in simulation.
Load-bearing premise
The chosen discrete audio tokens and hybrid screen representations capture all task-relevant semantics, so downstream agent performance suffers no meaningful loss, including under real-world conditions the simulations did not cover.
What would settle it
A side-by-side run of the semantic encodings against the raw-data baseline on the same agent tasks under real WAN conditions: the claim fails if task accuracy drops more than 0.7 percentage points below the baseline.
Original abstract
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sema, a semantic transport system for real-time multimodal agents that replaces raw audio and screenshot transmission with discrete audio tokenizers and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens), combined with bursty token delivery to eliminate jitter buffers. It claims that this shifts transport from perceptual fidelity (Shannon-Weaver Level A) to semantic meaning preservation (Level B), yielding 64x uplink bandwidth reduction for audio and 130-210x for screenshots in emulated WAN simulations while keeping task accuracy within 0.7 percentage points of raw baselines.
Significance. If the results hold under more detailed scrutiny, the work could be significant for multimedia systems and AI agent networking by demonstrating that agent-specific semantic compression can achieve order-of-magnitude bandwidth savings without substantial performance degradation. It provides a concrete example of adapting transport stacks to event-driven, non-real-time agent processors rather than human receivers.
major comments (2)
- [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.
- [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.
minor comments (1)
- [Abstract] The abstract states that conventional transport 'sends 43-64x more data' yet reports a '64x' reduction; clarify the precise baseline comparison and whether the 64x figure is the upper or average case.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate revisions to improve transparency and rigor.
Point-by-point responses
Referee: [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.
Authors: The abstract summarizes the primary outcome; full details appear in Sections 4 and 5. Tokenization employs discrete audio tokens (e.g., from EnCodec-style models) and hybrid screen encoding (accessibility tree or OCR text plus compact visual tokens). Tasks cover multimodal web navigation and voice-command agent benchmarks. Agent models include representative vision-language models; baselines use raw audio and screenshot transmission. Results report means with error bars across repeated emulated WAN runs. We will revise to add explicit statistical significance tests (e.g., confidence intervals or paired tests) confirming the 0.7pp margin is not sensitive to simulation choices. revision: partial
Referee: [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.
Authors: We agree that component ablations would provide stronger support. The manuscript motivates the hybrid design by the semantic needs of event-driven agents and shows end-to-end accuracy preservation under WAN conditions. We will add ablations in revision, including variants omitting the accessibility tree/OCR or visual tokens, plus targeted analysis of prosody retention in audio and spatial detail retention in screens, to verify that task-critical semantics are preserved for the evaluated agent workloads. revision: yes
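The significance check the authors promise could take the shape of a paired bootstrap over per-task outcomes: resample the same task indices for both systems and see where the accuracy gap lands. The per-task success vectors below are synthetic placeholders, not the paper's data:

```python
# Paired bootstrap on per-task success indicators (1 = task solved).
# The data here is synthetic and only illustrates the shape of the test.
import random

def paired_bootstrap_gap(raw, sema, n_boot=10_000, seed=0):
    """Bootstrap distribution of (raw accuracy - Sema accuracy), in
    percentage points, resampling the SAME task indices for both systems."""
    assert len(raw) == len(sema)
    rng = random.Random(seed)
    n = len(raw)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = (sum(raw[i] for i in idx) - sum(sema[i] for i in idx)) / n
        gaps.append(100.0 * gap)
    return gaps

raw_runs  = [1] * 88 + [0] * 12   # 88% baseline accuracy (placeholder)
sema_runs = [1] * 87 + [0] * 13   # 87% with semantic transport (placeholder)
gaps = sorted(paired_bootstrap_gap(raw_runs, sema_runs))
lo, hi = gaps[249], gaps[9749]    # 95% percentile interval
print(f"accuracy gap 95% CI: [{lo:.1f}, {hi:.1f}] pp")
```

If the upper end of such an interval stays below the claimed 0.7pp margin across tasks and seeds, the preservation claim would be well supported; a wide or margin-crossing interval would indicate the headline number is sensitive to run-to-run variation.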
Circularity Check
No significant circularity: claims rest on direct empirical simulations
Full rationale
The paper reports bandwidth reductions (64x audio, 130-210x screenshots) and accuracy preservation (within 0.7pp) as outcomes of simulations under emulated WAN conditions, directly compared to raw baselines. No equations, fitted parameters, or self-citations are invoked to derive these results; the hybrid token representations and bursty delivery are presented as design choices whose efficacy is measured externally rather than defined into the metrics. The derivation chain is therefore self-contained against the simulation benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Discrete audio tokens and hybrid screen representations preserve all task-relevant semantics for the evaluated agent models.