
arxiv: 2606.09186 · v1 · pith:UKWVU77Nnew · submitted 2026-06-08 · 💻 cs.HC

DuplexOmni: Real-Time Listening, Seeing, Thinking, and Speaking for Full-Duplex Interaction

Pith reviewed 2026-06-27 15:20 UTC · model grok-4.3

classification 💻 cs.HC
keywords: full-duplex interaction · multimodal streaming · real-time response · asynchronous layers · omni models · Writer-Director pipeline · speech and vision input

The pith

DuplexOmni splits an AI into an interaction layer and a thinking layer that run asynchronously in parallel to enable real-time full-duplex multimodal conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that divides model functions so one layer handles streaming audio and video input and produces immediate text and speech output while a separate pluggable layer manages deeper reasoning and tool use. These layers operate in parallel without waiting for each other, supported by a Writer-Director pipeline that generates training data for continuous interactions. A sympathetic reader would care because this structure aims to combine natural, uninterrupted human-like dialogue with complex capabilities that current unified models struggle to maintain in real time. The approach is tested on public benchmarks where it shows strong results alongside natural full-duplex behavior.

Core claim

DuplexOmni separates model capability into an interaction layer and a thinking layer that collaborate asynchronously in parallel. The interaction layer is realized as an end-to-end DuplexOmni model that ingests streaming audio and video and emits text and speech responses in real time. The thinking layer acts as a detachable module supplying complex reasoning and tool-use functions. A Writer-Director pipeline supplies the continuous-interaction training data needed to learn stable full-duplex behavior.
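The asynchronous split described above can be illustrated with a minimal sketch. This is not the paper's implementation: `thinking_layer`, `interaction_layer`, the `?` prefix that marks a request needing deeper reasoning, and the sleep durations are all hypothetical stand-ins. The only point shown is that the interaction loop delegates slow work without awaiting it and keeps answering incoming chunks in the meantime.

```python
import asyncio


async def thinking_layer(question: str) -> str:
    """Hypothetical slow reasoner; stands in for deep reasoning or a tool call."""
    await asyncio.sleep(0.05)  # simulated reasoning latency (assumption)
    return f"result for {question!r}"


async def interaction_layer(chunks: list[str]) -> list[str]:
    """Hypothetical streaming loop: acknowledges every input chunk at once and
    surfaces the thinking result whenever it becomes available."""
    outputs: list[str] = []
    pending = None  # an asyncio.Task while a thinking request is in flight
    for chunk in chunks:
        if chunk.startswith("?") and pending is None:
            # Delegate without awaiting, so the loop keeps responding.
            pending = asyncio.create_task(thinking_layer(chunk[1:]))
            outputs.append("ack: let me think about that")
        else:
            outputs.append(f"ack: {chunk}")
        if pending is not None and pending.done():
            outputs.append(f"thought: {pending.result()}")
            pending = None
        await asyncio.sleep(0.02)  # simulated interval between input chunks
    if pending is not None:
        # The stream ended before the reasoner finished; wait for it once.
        outputs.append(f"thought: {await pending}")
    return outputs


def run_demo() -> list[str]:
    stream = ["hello", "?weather", "still here", "ok", "bye"]
    return asyncio.run(interaction_layer(stream))
```

In this toy version, the chunk that arrives right after the delegated question is still acknowledged immediately, and the reasoning result is spliced into the output later. A real system would also need to handle cancellation, stale results, and interruptions, none of which this sketch attempts.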

What carries the argument

The interaction layer (DuplexOmni model) paired with a pluggable thinking layer that collaborate asynchronously in parallel.

If this is right

  • Real-time speech and text generation can continue even while the thinking layer performs extended reasoning or calls external tools.
  • The thinking layer can be swapped or upgraded independently without retraining the entire interaction system.
  • Training data constructed via the Writer-Director pipeline enables the model to learn turn-taking and interruption handling directly from continuous streams.
  • The same architecture supports simultaneous processing of audio and video inputs while maintaining low-latency output.
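The second point, that the thinking layer can be swapped independently, amounts to programming the interaction side against an interface. A minimal sketch follows, assuming a hypothetical `ThinkingLayer` protocol with a single `think` method; the class names and the toy backends are illustrative and do not come from the paper. The call is synchronous here purely to keep the example short.

```python
from typing import Protocol


class ThinkingLayer(Protocol):
    """Hypothetical interface that any pluggable reasoner must satisfy."""

    def think(self, query: str) -> str: ...


class EchoThinker:
    """Trivial backend that returns the query unchanged."""

    def think(self, query: str) -> str:
        return f"echo: {query}"


class CalculatorThinker:
    """Toy tool-use backend: adds integers separated by '+'."""

    def think(self, query: str) -> str:
        return str(sum(int(part) for part in query.split("+")))


class InteractionLayer:
    """Depends only on the ThinkingLayer interface, not on a concrete backend."""

    def __init__(self, thinker: ThinkingLayer) -> None:
        self.thinker = thinker

    def respond(self, utterance: str) -> str:
        return f"I looked into it: {self.thinker.think(utterance)}"
```

Replacing `layer.thinker` with another object that has a `think` method changes the reasoning behaviour without touching `InteractionLayer`, which is the property the bullet describes.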

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the asynchronous split works reliably, similar layer separation could be applied to other multimodal systems that currently force all computation through a single forward pass.
  • The approach suggests a route to add tool-use or long-horizon planning to voice assistants without forcing users to wait for each step.
  • A natural next measurement would be how often users perceive the system as interrupting or drifting during extended reasoning episodes.

Load-bearing premise

The interaction and thinking layers can coordinate asynchronously without producing unacceptable latency, loss of coherence, or coordination failures, and the Writer-Director pipeline yields enough data to train stable full-duplex responses.

What would settle it

A live test in which the system must perform a multi-step reasoning task that requires several seconds while a user continues speaking and the model must respond without noticeable pauses, repetition, or context loss.
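One way to make "noticeable pauses" measurable in such a test is to log the time at which each output chunk is emitted and inspect the largest gap. The sketch below assumes timestamps in seconds and a 0.5 s threshold; both the function names and the threshold are illustrative choices, not values from the paper.

```python
def max_output_gap(timestamps: list[float]) -> float:
    """Largest interval, in seconds, between consecutive output chunks."""
    if len(timestamps) < 2:
        return 0.0
    return max(later - earlier for earlier, later in zip(timestamps, timestamps[1:]))


def has_noticeable_pause(timestamps: list[float], threshold_s: float = 0.5) -> bool:
    """True if any gap exceeds the (assumed) perceptibility threshold."""
    return max_output_gap(timestamps) > threshold_s
```

For example, output times of `[0.0, 0.25, 0.5, 1.5, 1.75]` contain a one-second gap and would be flagged, while evenly spaced chunks every 0.25 s would not. Repetition and context loss would need separate checks on the emitted text.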

Figures

Figures reproduced from arXiv: 2606.09186 by Jinghua Hao, Jiuchong Gao, Jun Liu, Jun Xu, Lei Shi, Lingling Zhang, Muye Huang, Renqing He, Xingyu Yu, Zhanyu Ma.

Figure 1: DuplexOmni keeps real-time interaction con...
Figure 2: Overview of DuplexOmni. (a) The interaction layer conducts real-time dialogue while asynchronously...
Figure 3: Data pipeline for DuplexOmni. Scenario seeds and raw content are converted into temporally annotated...
Abstract

Human interaction is continuous, multimodal, and full-duplex by nature. Although recent omni models have made substantial progress in unified speech, vision, and text modeling, combining seamless real-time interaction with complex reasoning and tool use remains challenging. We present DuplexOmni, a method for real-time multimodal full-duplex interaction. DuplexOmni separates model capability into an interaction layer and a thinking layer, which collaborate asynchronously in parallel. The interaction layer is implemented by the DuplexOmni model, an end-to-end system that processes streaming audio and video inputs while generating text and speech responses in real time. The thinking layer is a pluggable module that provides complex reasoning and tool-use capabilities. To support this method, we further develop a Writer-Director pipeline for constructing continuous-interaction training data. Experiments show that DuplexOmni achieves strong performance on multiple public benchmarks and exhibits natural full-duplex interaction ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents DuplexOmni, a method for real-time multimodal full-duplex interaction. It separates model capability into an interaction layer (implemented by the DuplexOmni end-to-end model that processes streaming audio/video inputs and generates text/speech responses) and a thinking layer (a pluggable module for complex reasoning and tool use). These layers collaborate asynchronously in parallel. A Writer-Director pipeline is introduced to construct continuous-interaction training data. The abstract states that experiments show strong performance on multiple public benchmarks and natural full-duplex interaction ability.

Significance. If the claims of effective asynchronous collaboration and benchmark performance hold with supporting evidence, the separation of interaction and thinking layers could enable more natural real-time multimodal AI systems that combine immediate responsiveness with deep reasoning. The Writer-Director data pipeline might address a key data scarcity issue for full-duplex training. However, the provided manuscript supplies no quantitative results, implementation details, or evaluations, so the significance cannot be assessed beyond the conceptual framing.

major comments (3)
  1. [Abstract] The central claim that 'DuplexOmni achieves strong performance on multiple public benchmarks' is unsupported by any metrics, baselines, ablation studies, or experimental details. This directly undermines evaluation of the method's effectiveness for full-duplex interaction.
  2. [Abstract] The assumption that the interaction layer and thinking layer 'collaborate asynchronously in parallel' without unacceptable latency, incoherence, or coordination failures is stated but receives no implementation description, latency measurements, or empirical test. This is load-bearing for the core architectural claim.
  3. [Abstract] No details are provided on the DuplexOmni model architecture, training procedure, or how the Writer-Director pipeline generates data sufficient for stable full-duplex behavior, preventing assessment of whether the invented entities deliver the claimed capabilities.
minor comments (1)
  1. [Abstract] The term 'DuplexOmni' is used both for the overall method and specifically for the interaction-layer model; clarifying this distinction in the title and text would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify that the current manuscript version lacks the quantitative results, implementation specifics, and empirical measurements needed to fully support the abstract claims. We address each point below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The central claim that 'DuplexOmni achieves strong performance on multiple public benchmarks' is unsupported by any metrics, baselines, ablation studies, or experimental details. This directly undermines evaluation of the method's effectiveness for full-duplex interaction.

    Authors: We agree that the provided manuscript contains no metrics, baselines, ablation studies, or experimental details to support the performance claim. The revised version will add a full Experiments section reporting quantitative results on the referenced public benchmarks, including baselines and ablations. revision: yes

  2. Referee: [Abstract] The assumption that the interaction layer and thinking layer 'collaborate asynchronously in parallel' without unacceptable latency, incoherence, or coordination failures is stated but receives no implementation description, latency measurements, or empirical test. This is load-bearing for the core architectural claim.

    Authors: This observation is accurate; the manuscript provides no implementation description, latency numbers, or tests for the asynchronous collaboration. In revision we will add a dedicated subsection describing the collaboration protocol together with latency measurements and empirical tests for coherence and coordination failures. revision: yes

  3. Referee: [Abstract] No details are provided on the DuplexOmni model architecture, training procedure, or how the Writer-Director pipeline generates data sufficient for stable full-duplex behavior, preventing assessment of whether the invented entities deliver the claimed capabilities.

    Authors: We concur that the manuscript supplies no architecture diagrams, training procedure, or data-generation details for the Writer-Director pipeline. The revision will expand the Method section with these specifics, including how the pipeline produces training data that supports stable full-duplex behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation-free descriptive architecture


The paper presents an architectural description of DuplexOmni (interaction layer + pluggable thinking layer + Writer-Director data pipeline) with no equations, no fitted parameters renamed as predictions, no self-citation chains, and no uniqueness theorems. The abstract and supplied text contain only high-level system claims and experimental statements; no derivation step reduces to its own inputs by construction. This is the expected outcome for a systems paper whose central contribution is an engineering decomposition rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the functional separation of layers and the effectiveness of the new data pipeline; no free parameters, standard mathematical axioms, or independently evidenced invented entities are stated in the abstract.

invented entities (2)
  • DuplexOmni model (no independent evidence)
    purpose: End-to-end implementation of the interaction layer, mapping streaming audio/video input to real-time text/speech output.
    Presented as the concrete realization of the interaction layer.
  • Writer-Director pipeline (no independent evidence)
    purpose: Generation of continuous-interaction training data.
    Developed specifically to support training of the DuplexOmni system.

pith-pipeline@v0.9.1-grok · 5716 in / 1257 out tokens · 28341 ms · 2026-06-27T15:20:36.926952+00:00 · methodology


A Full Control Token Set

Table 4 lists the complete set of Director annotation tokens inserted during the Writer-Director pipeline. These tokens encode the temporal relationship between user speech, assistant speech, and background thinking i...