Sema: Semantic Transport for Real-Time Multimodal Agents
Pith reviewed 2026-05-09 22:16 UTC · model grok-4.3
The pith
Sema shifts multimodal agent transport from raw signal fidelity to semantic meaning preservation, cutting uplink bandwidth by 64x for audio and 130-210x for screenshots while holding task accuracy within 0.7 percentage points of the raw baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sema replaces conventional transport stacks with semantic encoding: discrete audio tokens replace waveform data, and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens) replaces pixel dumps. Combined with bursty token delivery that eliminates jitter buffers, this moves the transport goal from Shannon-Weaver Level A signal fidelity to Level B meaning preservation, yielding the stated bandwidth reductions under emulated WAN conditions while keeping task accuracy within 0.7 percentage points of the raw baseline.
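The audio figure follows from simple rate arithmetic. As an illustration, the sketch below compares telephone-quality PCM against a residual-vector-quantized token stream; the sample rate, token rate, and codebook parameters are representative of RVQ-style neural codecs (SoundStream/EnCodec-class models), not values taken from the paper, chosen so the ratio lands on the reported 64x:

```python
# Illustrative bandwidth arithmetic for semantic audio transport.
# All parameters are representative of RVQ-style neural codecs,
# not measurements from the paper.
import math

def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Raw waveform uplink rate in bits per second."""
    return sample_rate_hz * bits_per_sample

def token_bitrate(token_rate_hz: int, codebooks: int, codebook_size: int) -> float:
    """Discrete-token uplink rate: each codebook index costs log2(size) bits."""
    return token_rate_hz * codebooks * math.log2(codebook_size)

raw = pcm_bitrate(16_000, 16)          # 256 kbps mono speech
semantic = token_bitrate(50, 8, 1024)  # 8 codebooks at 50 Hz -> 4 kbps
print(f"raw: {raw/1000:.0f} kbps, semantic: {semantic/1000:.0f} kbps, "
      f"reduction: {raw/semantic:.0f}x")   # -> reduction: 64x
```

Under these assumptions the 64x reduction is exact; real codecs trade token rate and codebook depth against fidelity, so the achievable ratio depends on how much acoustic detail the agent's tasks actually require.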
What carries the argument
A hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens), paired with discrete audio tokenizers and bursty token delivery that removes jitter buffers.
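A back-of-envelope sketch of the screen side, with hypothetical payload sizes (a typical PNG screenshot, a few kilobytes of accessibility-tree text, a few hundred visual tokens) standing in for measurements the review does not report:

```python
# Hypothetical payload sizes for one screen update; none of these numbers
# come from the paper -- they only illustrate how a hybrid encoding could
# land inside the reported 130-210x range.
SCREENSHOT_BYTES = 400_000   # ~400 KB PNG at laptop resolution
A11Y_TEXT_BYTES = 2_000      # lossless accessibility-tree / OCR text
VISUAL_TOKENS = 256          # compact visual tokens covering layout/pixels
BITS_PER_TOKEN = 14          # log2 of an assumed 16k-entry visual codebook

hybrid_bytes = A11Y_TEXT_BYTES + VISUAL_TOKENS * BITS_PER_TOKEN // 8
print(f"hybrid payload: {hybrid_bytes} bytes, "
      f"reduction: {SCREENSHOT_BYTES / hybrid_bytes:.0f}x")
```

With these placeholder values the hybrid payload is about 2.4 KB against a 400 KB screenshot, a roughly 160x reduction; screens dominated by text (where the lossless accessibility-tree path carries most of the semantics) would sit at the high end of the range.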
If this is right
- Screenshot upload no longer dominates end-to-end action latency on constrained uplinks.
- Voice pipelines avoid sending 43-64x more audio data than the agent actually needs.
- Agents can maintain near-baseline accuracy while operating on links that would previously force heavy compression or dropped frames.
- Transport stacks can be redesigned around event-driven semantic consumption rather than continuous perceptual playout.
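The last point, event-driven consumption rather than continuous playout, can be sketched as a minimal receive loop. A human-facing stack holds packets in a jitter buffer and releases them on a playout clock; an agent-facing stack can hand each token burst to the model the moment it arrives. The queue-based shape below is an assumption about how such a receiver might look, not the paper's implementation:

```python
import queue
import threading

def agent_receiver(bursts, consume):
    """Hand each token burst to the model the moment it arrives: no jitter
    buffer, no playout clock. Arrival order is the only timing that matters."""
    while True:
        burst = bursts.get()      # blocks until a burst arrives
        if burst is None:         # sender signals end of stream
            break
        consume(burst)            # model ingests the tokens as one event

# Minimal usage: tokens arrive in uneven bursts; the consumer just collects them.
received = []
q = queue.Queue()
worker = threading.Thread(target=agent_receiver, args=(q, received.extend))
worker.start()
for burst in ([1, 2, 3], [4], [5, 6]):   # a bursty, jittery arrival pattern
    q.put(burst)
q.put(None)
worker.join()
print(received)   # tokens in arrival order: [1, 2, 3, 4, 5, 6]
```

Because the model has no playout deadline, burst-to-burst jitter costs nothing here; only total arrival time of the task-relevant tokens matters.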
Where Pith is reading between the lines
- The same semantic-reduction principle could extend to other agent inputs such as continuous video or sensor streams once suitable tokenizers exist.
- If the encodings prove robust across tasks, they suggest a path toward parameter-free or low-overhead semantic layers in future machine-to-machine protocols.
- Deployment on real mobile uplinks would test whether the emulated-WAN gains survive variable packet loss and changing network conditions not covered in simulation.
Load-bearing premise
The chosen discrete audio tokens and hybrid screen representations capture all task-relevant semantics, so downstream agent performance suffers no meaningful loss, including under real-world conditions the simulations did not cover.
What would settle it
A side-by-side run of the semantic encodings against the raw-data baseline on the same agent tasks under real WAN conditions: the claim fails if task accuracy drops more than 0.7 percentage points below the baseline.
Original abstract
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sema, a semantic transport system for real-time multimodal agents that replaces raw audio and screenshot transmission with discrete audio tokenizers and a hybrid screen representation (lossless accessibility-tree or OCR text plus compact visual tokens), combined with bursty token delivery to eliminate jitter buffers. It claims that this shifts transport from perceptual fidelity (Shannon-Weaver Level A) to semantic meaning preservation (Level B), yielding 64x uplink bandwidth reduction for audio and 130-210x for screenshots in emulated WAN simulations while keeping task accuracy within 0.7 percentage points of raw baselines.
Significance. If the results hold under more detailed scrutiny, the work could be significant for multimedia systems and AI agent networking by demonstrating that agent-specific semantic compression can achieve order-of-magnitude bandwidth savings without substantial performance degradation. It provides a concrete example of adapting transport stacks to event-driven, non-real-time agent processors rather than human receivers.
major comments (2)
- [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.
- [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.
minor comments (1)
- [Abstract] The abstract states that conventional transport 'sends 43-64x more data' yet reports a '64x' reduction; clarify the precise baseline comparison and whether the 64x figure is the upper or average case.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate revisions to improve transparency and rigor.
Point-by-point responses
Referee: [Evaluation results] Evaluation results (as stated in the abstract): The claim of accuracy preservation 'within 0.7 percentage points' provides no details on tokenization methods, exact tasks, agent models, baseline implementations, error bars, or statistical significance. Without these, it is impossible to determine whether the 0.7pp margin is robust or sensitive to post-hoc simulation choices.
Authors: The abstract summarizes the primary outcome; full details appear in Sections 4 and 5. Tokenization employs discrete audio tokens (e.g., from EnCodec-style models) and hybrid screen encoding (accessibility tree or OCR text plus compact visual tokens). Tasks cover multimodal web navigation and voice-command agent benchmarks. Agent models include representative vision-language models; baselines use raw audio and screenshot transmission. Results report means with error bars across repeated emulated WAN runs. We will revise to add explicit statistical significance tests (e.g., confidence intervals or paired tests) confirming the 0.7pp margin is not sensitive to simulation choices. revision: partial
Referee: [System design] Hybrid screen representation and discrete audio tokens (as described in the abstract): No ablations or analysis are provided to show that accessibility-tree/OCR text plus visual tokens and the chosen audio tokenizers transmit all task-relevant semantics. This assumption is load-bearing for the bandwidth-reduction claims, as any dropped information (e.g., fine-grained prosody or spatial details) would invalidate the 'within 0.7pp' guarantee.
Authors: We agree that component ablations would provide stronger support. The manuscript motivates the hybrid design by the semantic needs of event-driven agents and shows end-to-end accuracy preservation under WAN conditions. We will add ablations in revision, including variants omitting the accessibility tree/OCR or visual tokens, plus targeted analysis of prosody retention in audio and spatial detail retention in screens, to verify that task-critical semantics are preserved for the evaluated agent workloads. revision: yes
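The significance check the authors promise could take the shape of a paired bootstrap over per-task outcomes: resample the same task indices for both systems and see where the accuracy gap lands. The per-task success vectors below are synthetic placeholders, not the paper's data:

```python
# Paired bootstrap on per-task success indicators (1 = task solved).
# The data here is synthetic and only illustrates the shape of the test.
import random

def paired_bootstrap_gap(raw, sema, n_boot=10_000, seed=0):
    """Bootstrap distribution of (raw accuracy - Sema accuracy), in
    percentage points, resampling the SAME task indices for both systems."""
    assert len(raw) == len(sema)
    rng = random.Random(seed)
    n = len(raw)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = (sum(raw[i] for i in idx) - sum(sema[i] for i in idx)) / n
        gaps.append(100.0 * gap)
    return gaps

raw_runs  = [1] * 88 + [0] * 12   # 88% baseline accuracy (placeholder)
sema_runs = [1] * 87 + [0] * 13   # 87% with semantic transport (placeholder)
gaps = sorted(paired_bootstrap_gap(raw_runs, sema_runs))
lo, hi = gaps[249], gaps[9749]    # 95% percentile interval
print(f"accuracy gap 95% CI: [{lo:.1f}, {hi:.1f}] pp")
```

If the upper end of such an interval stays below the claimed 0.7pp margin across tasks and seeds, the preservation claim would be well supported; a wide or margin-crossing interval would indicate the headline number is sensitive to run-to-run variation.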
Circularity Check
No significant circularity: claims rest on direct empirical simulations
Full rationale
The paper reports bandwidth reductions (64x audio, 130-210x screenshots) and accuracy preservation (within 0.7pp) as outcomes of simulations under emulated WAN conditions, directly compared to raw baselines. No equations, fitted parameters, or self-citations are invoked to derive these results; the hybrid token representations and bursty delivery are presented as design choices whose efficacy is measured externally rather than defined into the metrics. The derivation chain is therefore self-contained against the simulation benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Discrete audio tokens and hybrid screen representations preserve all task-relevant semantics for the evaluated agent models.