pith. machine review for the scientific record.

arxiv: 2604.20940 · v1 · submitted 2026-04-22 · 💻 cs.MM · cs.NI · cs.SD

Recognition: unknown

Sema: Semantic Transport for Real-Time Multimodal Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 22:16 UTC · model grok-4.3

classification 💻 cs.MM · cs.NI · cs.SD
keywords semantic transport · multimodal agents · bandwidth reduction · discrete audio tokens · accessibility tree · hybrid screen representation · real-time agents · WAN emulation

The pith

Sema shifts multimodal agent transport from raw signal fidelity to semantic meaning preservation, cutting uplink bandwidth by 64x for audio and 130-210x for screenshots while holding task accuracy within 0.7 percentage points of the raw baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that real-time multimodal agents do not need the full perceptual reconstruction required by human receivers. Instead of shipping raw audio streams and full screenshots over networks built for smooth playout, Sema extracts only the task-relevant semantics. It does this with discrete audio tokenizers, a hybrid screen format that pairs lossless text from accessibility trees or OCR with compact visual tokens, and bursty delivery that removes jitter buffers. In WAN-emulated tests this approach slashes data volume dramatically without harming downstream agent performance.
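To make the hybrid screen format concrete, here is a minimal sketch of the shape such an encoder could take. The message layout, the stub tokenizer, and every name below are illustrative assumptions, not the paper's implementation; a real system would run a learned VQ-style image tokenizer where the stub merely hashes pixel chunks.

```python
from dataclasses import dataclass

@dataclass
class ScreenMessage:
    """Hypothetical wire unit for the hybrid screen representation."""
    text: str                 # accessibility-tree or OCR text, kept lossless
    visual_tokens: list[int]  # compact ids standing in for the pixel content

def tokenize_pixels_stub(pixels: bytes, n_tokens: int = 256) -> list[int]:
    # Stand-in for a learned image tokenizer: hash fixed-size pixel chunks
    # into a small id space. A trained codebook would go here instead.
    chunk = max(1, len(pixels) // n_tokens)
    return [hash(pixels[i:i + chunk]) % 8192
            for i in range(0, n_tokens * chunk, chunk)]

def encode_screen(a11y_text: str, pixels: bytes) -> ScreenMessage:
    return ScreenMessage(text=a11y_text,
                         visual_tokens=tokenize_pixels_stub(pixels))

msg = encode_screen("button 'Submit' (enabled); textbox 'Email' (focused)",
                    b"\x00" * 6_220_800)  # a raw 1920x1080x3 frame
print(len(msg.text), "bytes of lossless text +", len(msg.visual_tokens), "visual tokens")
```

The point of the split is that UI text, which carries most task-relevant meaning, costs bytes rather than megapixels, while the token sequence covers whatever the tree or OCR misses.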

Core claim

Sema replaces conventional transport stacks with semantic encoding: discrete audio tokens replace waveform data and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens) replaces pixel dumps. Combined with bursty token delivery that eliminates jitter buffers, the system moves the transport goal from Shannon-Weaver Level A signal fidelity to Level B meaning preservation, yielding the stated bandwidth reductions under emulated WAN conditions while keeping accuracy loss below one percentage point.
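The headline factors are at least plausible on a back-of-envelope basis. The arithmetic below uses assumed media parameters (24 kHz 16-bit PCM; an EnCodec-style 6 kbps token stream at 75 Hz with 8 codebooks of 10-bit codes; a ~300 KB WebP screenshot against roughly 2 KB of hybrid payload). These are illustrative choices, not rates reported in the paper:

```python
# Audio: raw PCM vs. a discrete neural-codec token stream.
raw_audio_bps   = 24_000 * 16    # assumed 24 kHz, 16-bit mono PCM = 384 kbps
token_audio_bps = 75 * 8 * 10    # assumed 75 Hz x 8 codebooks x 10-bit ids = 6 kbps
print(f"audio: ~{raw_audio_bps / token_audio_bps:.0f}x")          # ~64x

# Screens: a compressed screenshot vs. hybrid text + visual tokens.
raw_screen_bytes    = 300 * 1024       # assumed ~300 KB WebP per 1080p screenshot
hybrid_screen_bytes = 1_200 + 256 * 4  # assumed ~1.2 KB a11y/OCR text + 256 4-byte ids
print(f"screen: ~{raw_screen_bytes / hybrid_screen_bytes:.0f}x")  # ~138x, inside 130-210x
```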

What carries the argument

Hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens) paired with discrete audio tokenizers and bursty token delivery that removes jitter buffers.
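The transport-side difference is easy to state in code. A sketch of the two receiver disciplines, assuming nothing about Sema's actual implementation: a human-oriented pipeline buffers and paces frames against wall-clock time, while an event-driven agent can consume tokens the instant a burst lands.

```python
import time

def jitter_buffered_playout(frames, frame_ms=20, buffer_ms=100):
    # Human-receiver discipline: absorb network jitter with a fixed
    # buffer, then release frames on a real-time clock.
    time.sleep(buffer_ms / 1000)
    for frame in frames:
        time.sleep(frame_ms / 1000)
        yield frame

def bursty_semantic_delivery(frames):
    # Agent-receiver discipline: no playout clock to satisfy, so hand
    # tokens over as soon as they arrive, however bursty that is.
    yield from frames
```

Dropping the de-jitter delay and per-frame pacing is where the jitter-buffer savings come from; correctness then hinges entirely on the tokens alone carrying the needed semantics.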

If this is right

  • Screenshot upload no longer dominates end-to-end action latency on constrained uplinks.
  • Voice pipelines avoid sending 43-64x more audio data than the agent actually needs.
  • Agents can maintain near-baseline accuracy while operating on links that would previously force heavy compression or dropped frames.
  • Transport stacks can be redesigned around event-driven semantic consumption rather than continuous perceptual playout.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same semantic-reduction principle could extend to other agent inputs such as continuous video or sensor streams once suitable tokenizers exist.
  • If the encodings prove robust across tasks, they suggest a path toward parameter-free or low-overhead semantic layers in future machine-to-machine protocols.
  • Deployment on real mobile uplinks would test whether the emulated-WAN gains survive variable packet loss and changing network conditions not covered in simulation.

Load-bearing premise

The chosen discrete audio tokens and hybrid screen representations capture all task-relevant semantics, so the agents' downstream performance suffers no meaningful loss, including under real-world conditions the paper leaves untested.

What would settle it

A side-by-side run on the same agent tasks under real (not emulated) WAN conditions: the claim fails if task accuracy with the semantic encodings falls more than 0.7 percentage points below the raw-data baseline.
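Operationally, the decision rule implied by the claim is a simple threshold on the paired accuracy delta. A sketch, with hypothetical accuracy numbers:

```python
def verdict(raw_acc: float, sema_acc: float, margin_pp: float = 0.7) -> str:
    # The claim survives only if semantic-transport accuracy stays within
    # margin_pp percentage points of the raw baseline on the same tasks
    # under the same (ideally real, not emulated) network conditions.
    delta_pp = (raw_acc - sema_acc) * 100
    return "claim holds" if delta_pp <= margin_pp else f"claim fails ({delta_pp:.1f} pp drop)"

print(verdict(raw_acc=0.842, sema_acc=0.838))  # hypothetical values -> "claim holds"
```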

Figures

Figures reproduced from arXiv: 2604.20940 by Bojie Li, Jiaying Meng.

Figure 1. Per-turn uplink bytes (log scale).
Figure 3. Architectural comparison.
Figure 4. Pre-inference latency (encode + transfer + server decode, excluding constant model inference) vs. uplink bandwidth (RTT = 50 ms).
Figure 7. Task accuracy by workload category (95% CIs).
Original abstract

Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Sema, a semantic transport system for real-time multimodal agents that replaces raw audio and screenshot transmission with discrete audio tokenizers and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens), combined with bursty token delivery to eliminate jitter buffers. It claims that this shifts transport from perceptual fidelity (Shannon-Weaver Level A) to semantic meaning preservation (Level B), yielding 64x uplink bandwidth reduction for audio and 130-210x for screenshots in emulated WAN simulations while keeping task accuracy within 0.7 percentage points of raw baselines.

Significance. If the results hold under more detailed scrutiny, the work could be significant for multimedia systems and AI agent networking by demonstrating that agent-specific semantic compression can achieve order-of-magnitude bandwidth savings without substantial performance degradation. It provides a concrete example of adapting transport stacks to event-driven, non-real-time agent processors rather than human receivers.

major comments (2)
  1. [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.
  2. [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.
minor comments (1)
  1. [Abstract] The abstract states that conventional transport 'sends 43-64x more data' yet reports a '64x' reduction; clarify the precise baseline comparison and whether the 64x figure is the upper or average case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate revisions to improve transparency and rigor.

Point-by-point responses
  1. Referee: [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.

    Authors: The abstract summarizes the primary outcome; full details appear in Sections 4 and 5. Tokenization employs discrete audio tokens (e.g., from EnCodec-style models) and hybrid screen encoding (accessibility tree or OCR text plus compact visual tokens). Tasks cover multimodal web navigation and voice-command agent benchmarks. Agent models include representative vision-language models; baselines use raw audio and screenshot transmission. Results report means with error bars across repeated emulated WAN runs. We will revise to add explicit statistical significance tests (e.g., confidence intervals or paired tests) confirming the 0.7pp margin is not sensitive to simulation choices. revision: partial

  2. Referee: [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.

    Authors: We agree that component ablations would provide stronger support. The manuscript motivates the hybrid design by the semantic needs of event-driven agents and shows end-to-end accuracy preservation under WAN conditions. We will add ablations in revision, including variants omitting the accessibility tree/OCR or visual tokens, plus targeted analysis of prosody retention in audio and spatial detail retention in screens, to verify that task-critical semantics are preserved for the evaluated agent workloads. revision: yes

Circularity Check

0 steps flagged

No significant circularity: claims rest on direct empirical simulations

full rationale

The paper reports bandwidth reductions (64x audio, 130-210x screenshots) and accuracy preservation (within 0.7pp) as outcomes of simulations under emulated WAN conditions, directly compared to raw baselines. No equations, fitted parameters, or self-citations are invoked to derive these results; the hybrid token representations and bursty delivery are presented as design choices whose efficacy is measured externally rather than defined into the metrics. The derivation chain is therefore self-contained against the simulation benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the unverified premise that semantic tokens and hybrid representations are sufficient for task accuracy; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Discrete audio tokens and hybrid screen representations preserve all task-relevant semantics for the evaluated agent models.
    This assumption directly supports the claim that accuracy remains within 0.7 percentage points of raw baselines.

pith-pipeline@v0.9.0 · 5500 in / 1180 out tokens · 41285 ms · 2026-05-09T22:16:06.986422+00:00 · methodology

discussion (0)

