pith. machine review for the scientific record.

arxiv: 2605.01506 · v1 · submitted 2026-05-02 · 💻 cs.CV

Recognition: unknown

OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords: unified encoder · omni-modal · vision-audio · continuous motion · transformer backbone · sign language recognition · action analysis

The pith

A single Transformer encoder processes video and audio together at 25 fps to capture continuous motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prevailing omni-modal models process video at 1–2 fps and audio at 25 fps, which forces frame-by-frame, modality-by-modality perception and weakens cross-modal interaction during encoding. Omni-Encoder replaces this with one unified backbone that co-embeds both signals symmetrically at 25 fps inside a shared latent space. Three components—the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting—handle modality separation and efficiency so the model can track fine visual motion without separate encoders. Gains appear on tasks that need continuous visual understanding while audio-visual benchmarks stay competitive, showing that symmetric high-rate encoding can produce more integrated perception.

Core claim

Omni-Encoder is a unified Transformer backbone that co-embeds visual and audio signals at a symmetrical 25 fps within a shared latent space. It uses the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting to reconcile modality disentanglement with computational efficiency, and it delivers gains on continuous visual tasks under the same input token budget to the LLM decoder as the modality-specific Qwen2.5-Omni baseline.

What carries the argument

A unified Transformer backbone that runs vision and audio at a matched 25 fps rate, using token templates, a rotary-embedding variant, and shifting temporal windows to keep the modalities separable inside one network.
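As an editorial sketch only, here is a minimal PyTorch rendering of what symmetric 25 fps co-embedding could look like. Every name, shape, and default below is an assumption for exposition, and the learned modality embedding merely stands in for the unspecified Omni-Encoder Token Template; this is not the paper's implementation.

import torch
import torch.nn as nn

class UnifiedEncoderSketch(nn.Module):
    """Hypothetical single backbone that co-embeds audio and visual tokens
    sampled at the same 25 fps rate. Editorial sketch, not the paper's code."""

    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        # A learned per-modality embedding stands in for the unspecified
        # Omni-Encoder Token Template: index 0 = audio, 1 = visual.
        self.modality_emb = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, audio_tokens, visual_tokens):
        # Both inputs: (batch, T, dim), one token per 25 fps frame, so the
        # two streams share a temporal axis by construction.
        b, t, d = audio_tokens.shape
        a = audio_tokens + self.modality_emb.weight[0]
        v = visual_tokens + self.modality_emb.weight[1]
        # Interleave per frame -> [a_0, v_0, a_1, v_1, ...]; self-attention
        # then mixes modalities at every layer instead of after two separate
        # encoders, as in the modality-specific baseline.
        x = torch.stack([a, v], dim=2).reshape(b, 2 * t, d)
        return self.backbone(x)

Run on dummy tensors of shape (1, 25, 512), one second of audio and video, the contrast with a modality-specific design is visible in the data flow: cross-modal interaction happens inside the encoder at full temporal rate, not in a later fusion step.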

If this is right

  • Substantial improvements on sign language recognition and fine-grained sports action analysis under fixed token budgets to the decoder.
  • No loss of performance on established audio-visual tasks such as AVQA and speaker identification and localization.
  • Models perceive motion holistically rather than frame by frame and modality by modality.
  • Unified high-rate encoding becomes a viable route for omni-modal systems that match integrated human perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same symmetric-rate principle could be tested on additional modalities such as depth or IMU streams to check whether the disentanglement techniques generalize.
  • Real-time applications in robotics or live video analysis may benefit if the 25 fps unified encoding reduces latency compared with separate high-rate audio pipelines.
  • Longer video sequences could expose whether the temporal window shifting maintains coherence beyond the lengths reported in the benchmarks.

Load-bearing premise

The three proposed components can achieve both modality separation and efficiency at 25 fps without lowering audio quality or creating failure modes missed by the current benchmarks.

What would settle it

An experiment that applies the same 25 fps visual sampling to the modality-specific baseline, without the new components, and finds equal gains on sign-language and sports-action tasks would indicate that the unified design is not required.
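One way to picture that control, as a hedged sketch: two arms with identical visual sampling and LLM token budget, differing only in the unified components. The task names echo the paper's benchmarks; every key and value here is invented for illustration.

# Hypothetical two-arm control. Equal gains for both arms would credit the
# denser 25 fps sampling rather than the unified design.
arms = {
    "baseline_25fps": dict(visual_fps=25, unified_backbone=False,
                           omni_rope=False, temporal_window_shifting=False),
    "omni_encoder":   dict(visual_fps=25, unified_backbone=True,
                           omni_rope=True, temporal_window_shifting=True),
}
tasks = ["sign_language_recognition", "fine_grained_sports_action"]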

Figures

Figures reproduced from arXiv: 2605.01506 by Chengen Lai, Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Yuanming Li, Zhiheng Ma.

Figure 1. (a) Modality-specific encoders process visual and audio information separately at mismatched frame rates (1–2 fps for video, 25 fps for audio). (b) Omni-Encoder: a single Transformer jointly encodes audio, visual base, and visual continuous tokens at 25 fps within a unified representation space. (c) With the same number of input tokens to the LLM, we compare Qwen2.5-Omni-3B with the original encoder and …
Figure 2. Architecture of Omni-Encoder. A 24-layer Transformer jointly encodes Audio, Visual Continuous, and Visual Base tokens from raw 25 fps video through unified self-attention, with each layer incorporating Omni-RoPE and Temporal Window Shifting. A Token Sparsifier then sparsifies Visual Base tokens to match the native input length of Qwen2.5-Omni. The encoded token sequence passes through a Token Sparsifier be…
Figure 3. Omni-RoPE and Temporal Window Shifting.
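Figure 2's Token Sparsifier is named but not specified in the text shown here. A hedged sketch of the role it plays, with an invented saliency rule (top-k by L2 norm) as a placeholder for whatever selection the paper actually uses:

import torch

def sparsify_tokens(visual_base: torch.Tensor, budget: int) -> torch.Tensor:
    # Shrink a dense 25 fps visual token sequence (batch, n, dim) to a fixed
    # LLM input budget. The selection criterion here (largest L2 norm) is an
    # assumption; the paper's actual rule is not given in this excerpt.
    scores = visual_base.norm(dim=-1)            # (batch, n)
    idx = scores.topk(budget, dim=1).indices     # (batch, budget)
    idx, _ = idx.sort(dim=1)                     # keep temporal order
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_base.size(-1))
    return torch.gather(visual_base, 1, idx)

Whatever the real rule, the constraint it enforces is the one the comparisons depend on: the LLM decoder sees the same number of input tokens as with the Qwen2.5-Omni baseline.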
Original abstract

Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design – sampling visual frames at 1–2 fps while processing audio waveforms at 25 fps – resulting in systems that perceive video frame by frame, modality by modality rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations – the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting – to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks – such as sign language recognition and fine-grained sports action analysis – while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Omni-Encoder, a unified Transformer backbone that co-embeds visual and audio signals symmetrically at 25 fps within a shared latent space. It introduces three components—the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting—to reconcile modality disentanglement with computational efficiency. The central claim is that, under an identical input token budget to the LLM decoder, this design yields substantial gains on continuous visual understanding tasks (sign language recognition, fine-grained sports action analysis) relative to the modality-specific Qwen2.5-Omni baseline while remaining competitive on established audio-visual benchmarks such as AVQA and speaker identification/localization.

Significance. If the empirical results hold after verification, the work would be significant for omni-modal LLMs by shifting from asymmetric video-coarse/audio-dense encoders toward integrated, high-temporal-resolution perception that better matches human sensory processing. The fixed-token-budget comparison and focus on fine-grained motion tasks provide a concrete test of whether unified encoding can improve cross-modal interaction without sacrificing audio performance.

major comments (3)
  1. [Abstract] The claim of 'substantial gains' on visual continuous understanding tasks is presented without any numerical metrics, effect sizes, tables, or statistical details. This absence is load-bearing for the central claim, as the reader cannot assess the magnitude of improvement or confirm it exceeds the baseline under the stated token constraint.
  2. [Method] No equations, pseudocode, or architectural specifications are supplied for the Omni-Encoder Token Template, Omni-RoPE, or Temporal Window Shifting. Without these details it is impossible to verify how the components simultaneously achieve modality disentanglement and 25 fps efficiency while preserving audio feature density under a fixed LLM token budget.
  3. [Experiments] The manuscript provides no ablation studies isolating each of the three components, no per-modality token allocation breakdowns, and no additional audio-quality or synchronization metrics beyond the listed benchmarks. Because efficiency mechanisms necessarily reallocate capacity between modalities, the lack of these controls leaves open the possibility that reported visual gains mask hidden audio degradation or token-allocation artifacts.
minor comments (2)
  1. [Abstract] The title and abstract use the phrase 'feel continuous motion' metaphorically; a brief clarification of the precise perceptual capabilities being modeled would improve precision.
  2. [Abstract] A citation to the Qwen2.5-Omni baseline paper should be added to allow readers to locate the exact experimental conditions being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and commit to a major revision that incorporates the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'substantial gains' on visual continuous understanding tasks is presented without any numerical metrics, effect sizes, tables, or statistical details. This absence is load-bearing for the central claim, as the reader cannot assess the magnitude of improvement or confirm it exceeds the baseline under the stated token constraint.

    Authors: We agree that the abstract would be more informative with concrete metrics. In the revised version, we will include specific numerical results (e.g., accuracy deltas on sign language recognition and fine-grained action tasks) and confirm the fixed-token-budget comparison to Qwen2.5-Omni, allowing readers to directly evaluate the effect sizes. revision: yes

  2. Referee: [Method] No equations, pseudocode, or architectural specifications are supplied for the Omni-Encoder Token Template, Omni-RoPE, or Temporal Window Shifting. Without these details it is impossible to verify how the components simultaneously achieve modality disentanglement and 25 fps efficiency while preserving audio feature density under a fixed LLM token budget.

    Authors: We acknowledge the need for formal specifications. The revision will add equations defining each component, pseudocode for the unified encoding pipeline, and additional architectural details showing how the token template, Omni-RoPE, and window shifting maintain 25 fps symmetry, modality disentanglement, and audio density under the fixed token budget. revision: yes

  3. Referee: [Experiments] The manuscript provides no ablation studies isolating each of the three components, no per-modality token allocation breakdowns, and no additional audio-quality or synchronization metrics beyond the listed benchmarks. Because efficiency mechanisms necessarily reallocate capacity between modalities, the lack of these controls leaves open the possibility that reported visual gains mask hidden audio degradation or token-allocation artifacts.

    Authors: We agree that these controls are essential. The revised manuscript will include ablations for the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting individually; explicit per-modality token allocation tables; and supplementary audio-quality and synchronization metrics. These additions will directly address concerns about potential hidden trade-offs. revision: yes

Circularity Check

0 steps flagged

Empirical architecture proposal with no self-referential derivations or fitted predictions

Full rationale

The paper presents Omni-Encoder as a new unified Transformer design using three named components (Omni-Encoder Token Template, Omni-RoPE, Temporal Window Shifting) to achieve 25 fps symmetric encoding. All performance claims are framed as experimental outcomes from comparisons against Qwen2.5-Omni under fixed LLM token budget on sign-language, sports-action, AVQA, and speaker tasks. No equations, derivations, or parameter-fitting steps appear in the provided text that would reduce any claimed gain to a self-definition, a renamed input, or a self-citation chain. The architecture is offered as an empirical engineering choice whose validity rests on benchmark numbers rather than on any closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Abstract-only view limits visibility; the central claim rests on the unproven effectiveness of the three named innovations and the assumption that symmetric 25 fps encoding is feasible without modality-specific losses.

axioms (1)
  • ad hoc to paper The Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting reconcile modality disentanglement and computational efficiency at 25 fps.
    Presented as the three core innovations whose success is required for the architecture to work.
invented entities (3)
  • Omni-Encoder Token Template no independent evidence
    purpose: Co-embed visual and audio signals at symmetrical frame rate
    New tokenization scheme introduced to handle dual-modality input in one backbone.
  • Omni-RoPE no independent evidence
    purpose: Adapted rotary position embedding for omni-modal signals
    Modified position encoding to support the unified high-rate input.
  • Temporal Window Shifting no independent evidence
    purpose: Maintain computational efficiency during encoding
    Mechanism to handle long sequences without quadratic cost explosion; one plausible reading of this and Omni-RoPE is sketched after this list.
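A hedged sketch of how the two mechanisms might be realized, reading Omni-RoPE as rotary embeddings keyed to a shared 25 fps frame index and Temporal Window Shifting as Swin-style shifted attention windows along time. Both readings are editorial assumptions; the paper's exact formulations are not in this excerpt.

import torch

def rope_angles(frame_idx: torch.Tensor, dim: int, base: float = 10000.0):
    # One plausible "Omni-RoPE": audio and visual tokens from the same
    # 25 fps frame get the same temporal rotation angle, so cross-modal
    # attention sees them as temporally aligned. frame_idx: (n,).
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    ang = frame_idx.float()[:, None] * freqs[None, :]          # (n, dim/2)
    return ang.cos(), ang.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # Standard rotary update on channel pairs of x: (n, dim) -> (n, dim).
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def temporal_window_ids(n_frames: int, window: int, shift: int):
    # Swin-style shifting along time: frames sharing a window id attend to
    # each other; alternating layers offset the partition by `shift` so
    # information crosses window boundaries without full quadratic attention.
    return (torch.arange(n_frames) + shift) // window

A layer would restrict attention to tokens whose temporal_window_ids match, alternating shift=0 and shift=window // 2 across layers, mirroring how Swin Transformer [23] trades global attention for linear cost.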

pith-pipeline@v0.9.0 · 5581 in / 1451 out tokens · 43694 ms · 2026-05-09T14:15:58.087177+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references

[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[2] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.
[3] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025.
[4] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), pages 28492–28518. PMLR, 2023.
[5] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023.
[6] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-Audio technical report, 2024.
[7] Qwen Team. Qwen 3.5 technical report. https://qwen.ai/blog?id=qwen3.5, 2025. Accessed: 2025-05-22.
[8] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.
[9] Qwen Team. Qwen3.5-Omni technical report, 2026.
[10] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ... Qwen3-Omni technical report, 2025.
[11] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report, 2025.
[12] Google DeepMind. https://deepmind.google/models/gemini/, 2025.
[13] Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model, 2025.
[14] Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction, 2025.
[15] Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Weixuan Chen, Xihan Wei, and Liefeng Bo. HumanOmni: A large vision-speech language model for human-centric video understanding, 2025.
[16] Anonymous. HumanOmni-Speaker: Efficient high-frequency video-audio understanding for omni-LLMs, 2026.
[17] M. Alex Meredith, James W. Nemitz, and Barry E. Stein. Determinants of multisensory integration in superior colliculus neurons. I. Temporal factors. Journal of Neuroscience, 7(10):3215–3229, 1987.
[18] Jyoti Mishra, Antigona Martinez, Terrence J. Sejnowski, and Steven A. Hillyard. Early cross-modal interactions in auditory and visual cortex underlie a sound-induced visual illusion. Journal of Neuroscience, 27(15):4120–4131, 2007.
[19] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021.
[20] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023.
[21] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[22] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia... V-JEPA 2: Self-supervised video models enable understanding, prediction and planning, 2025.
[23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows, 2021.
[24] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528, 2018.
[25] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense, 2017.
[26] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832, 2018.
[27] Siyuan Jing, Guangxue Wang, Haoyang Zhai, Qin Tao, Jun Yang, Bing Wang, and Peng Jin. Dual-view spatio-temporal feature fusion with CNN-Transformer hybrid network for Chinese isolated sign language recognition, 2025.
[28] Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. AVQA: A dataset for audio-visual question answering on videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3480–3491, 2022.
[29] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.
[30] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
[31] Hezhen Hu, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. SignBERT: Pre-training of hand-model-aware representation for sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11087–11096, 2021.
[32] Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios, 2024.
[33] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. InternVideo2: Scaling foundation models for multimodal video understanding, 2024.
[34] Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng. OneVision-Encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence, 2026.
[35] Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM video understanding with 16 frames per second, 2025.
[36] Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. VL-JEPA: Joint embedding predictive architecture for vision-language, 2026.
[37] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, and Maja Pantic. Audio-visual speech recognition with a hybrid CTC/attention architecture, 2018.