pith. machine review for the scientific record.

arxiv: 2605.01506 · v1 · submitted 2026-05-02 · 💻 cs.CV

Recognition: unknown

OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords: unified encoder · omni-modal · vision-audio · continuous motion · transformer backbone · sign language recognition · action analysis

The pith

A single Transformer encoder processes video and audio together at 25 fps to capture continuous motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prevailing omni-modal models process video at 1–2 fps and audio at 25 fps, which forces frame-by-frame, modality-by-modality perception and weakens cross-modal interaction during encoding. Omni-Encoder replaces this with one unified backbone that co-embeds both signals symmetrically at 25 fps inside a shared latent space. Three components—the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting—handle modality separation and efficiency so the model can track fine visual motion without separate encoders. Gains appear on tasks that need continuous visual understanding while audio-visual benchmarks stay competitive, showing that symmetric high-rate encoding can produce more integrated perception.

Core claim

Omni-Encoder is a unified Transformer backbone that co-embeds visual and audio signals at a symmetrical 25 fps within a shared latent space. It uses the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting to reconcile modality disentanglement with computational efficiency, and it delivers gains on continuous visual tasks under the same input token budget to the LLM decoder as the modality-specific Qwen2.5-Omni baseline.

What carries the argument

A unified Transformer backbone that runs vision and audio at a matched 25 fps rate, using token templates, a rotary-embedding variant, and shifting temporal windows to keep the modalities separable inside one network.
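As an editorial sketch only, here is a minimal PyTorch rendering of what symmetric 25 fps co-embedding could look like. Every name, shape, and default below is an assumption for exposition, and the learned modality embedding merely stands in for the unspecified Omni-Encoder Token Template; this is not the paper's implementation.

import torch
import torch.nn as nn

class UnifiedEncoderSketch(nn.Module):
    """Hypothetical single backbone that co-embeds audio and visual tokens
    sampled at the same 25 fps rate. Editorial sketch, not the paper's code."""

    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        # A learned per-modality embedding stands in for the unspecified
        # Omni-Encoder Token Template: index 0 = audio, 1 = visual.
        self.modality_emb = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, audio_tokens, visual_tokens):
        # Both inputs: (batch, T, dim), one token per 25 fps frame, so the
        # two streams share a temporal axis by construction.
        b, t, d = audio_tokens.shape
        a = audio_tokens + self.modality_emb.weight[0]
        v = visual_tokens + self.modality_emb.weight[1]
        # Interleave per frame -> [a_0, v_0, a_1, v_1, ...]; self-attention
        # then mixes modalities at every layer instead of after two separate
        # encoders, as in the modality-specific baseline.
        x = torch.stack([a, v], dim=2).reshape(b, 2 * t, d)
        return self.backbone(x)

Run on dummy tensors of shape (1, 25, 512), one second of audio and video, the contrast with a modality-specific design is visible in the data flow: cross-modal interaction happens inside the encoder at full temporal rate, not in a later fusion step.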

If this is right

  • Substantial improvements on sign language recognition and fine-grained sports action analysis under fixed token budgets to the decoder.
  • No loss of performance on established audio-visual tasks such as AVQA and speaker identification and localization.
  • Models perceive motion holistically rather than frame by frame and modality by modality.
  • Unified high-rate encoding becomes a viable route for omni-modal systems that match integrated human perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same symmetric-rate principle could be tested on additional modalities such as depth or IMU streams to check whether the disentanglement techniques generalize.
  • Real-time applications in robotics or live video analysis may benefit if the 25 fps unified encoding reduces latency compared with separate high-rate audio pipelines.
  • Longer video sequences could expose whether the temporal window shifting maintains coherence beyond the lengths reported in the benchmarks.

Load-bearing premise

The three proposed components can achieve both modality separation and efficiency at 25 fps without lowering audio quality or creating failure modes missed by the current benchmarks.

What would settle it

An experiment that applies the same 25 fps visual sampling to the modality-specific baseline, without the new components, and finds equal gains on sign-language and sports-action tasks would indicate that the unified design is not required.
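One way to picture that control, as a hedged sketch: two arms with identical visual sampling and LLM token budget, differing only in the unified components. The task names echo the paper's benchmarks; every key and value here is invented for illustration.

# Hypothetical two-arm control. Equal gains for both arms would credit the
# denser 25 fps sampling rather than the unified design.
arms = {
    "baseline_25fps": dict(visual_fps=25, unified_backbone=False,
                           omni_rope=False, temporal_window_shifting=False),
    "omni_encoder":   dict(visual_fps=25, unified_backbone=True,
                           omni_rope=True, temporal_window_shifting=True),
}
tasks = ["sign_language_recognition", "fine_grained_sports_action"]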

Figures

Figures reproduced from arXiv: 2605.01506 by Chengen Lai, Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Yuanming Li, Zhiheng Ma.

Figure 1. (a) Modality-specific encoders process visual and audio information separately at mismatched frame rates (1–2 fps for video, 25 fps for audio). (b) Omni-Encoder: a single Transformer jointly encodes audio, visual base, and visual continuous tokens at 25 fps within a unified representation space. (c) With the same number of input tokens to the LLM, we compare Qwen2.5-Omni-3B with the original encoder and …
Figure 2. Architecture of Omni-Encoder. A 24-layer Transformer jointly encodes Audio, Visual Continuous, and Visual Base tokens from raw 25 fps video through unified self-attention, with each layer incorporating Omni-RoPE and Temporal Window Shifting. A Token Sparsifier then sparsifies Visual Base tokens to match the native input length of Qwen2.5-Omni. The encoded token sequence passes through a Token Sparsifier be…
Figure 3. Omni-RoPE and Temporal Window Shifting.
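Figure 2's Token Sparsifier is named but not specified in the text shown here. A hedged sketch of the role it plays, with an invented saliency rule (top-k by L2 norm) as a placeholder for whatever selection the paper actually uses:

import torch

def sparsify_tokens(visual_base: torch.Tensor, budget: int) -> torch.Tensor:
    # Shrink a dense 25 fps visual token sequence (batch, n, dim) to a fixed
    # LLM input budget. The selection criterion here (largest L2 norm) is an
    # assumption; the paper's actual rule is not given in this excerpt.
    scores = visual_base.norm(dim=-1)            # (batch, n)
    idx = scores.topk(budget, dim=1).indices     # (batch, budget)
    idx, _ = idx.sort(dim=1)                     # keep temporal order
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_base.size(-1))
    return torch.gather(visual_base, 1, idx)

Whatever the real rule, the constraint it enforces is the one the comparisons depend on: the LLM decoder sees the same number of input tokens as with the Qwen2.5-Omni baseline.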
Original abstract

Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design – sampling visual frames at 1–2 fps while processing audio waveforms at 25 fps – resulting in systems that perceive video frame by frame, modality by modality rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations – the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting – to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks – such as sign language recognition and fine-grained sports action analysis – while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Omni-Encoder, a unified Transformer backbone that co-embeds visual and audio signals symmetrically at 25 fps within a shared latent space. It introduces three components—the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting—to reconcile modality disentanglement with computational efficiency. The central claim is that, under an identical input token budget to the LLM decoder, this design yields substantial gains on continuous visual understanding tasks (sign language recognition, fine-grained sports action analysis) relative to the modality-specific Qwen2.5-Omni baseline while remaining competitive on established audio-visual benchmarks such as AVQA and speaker identification/localization.

Significance. If the empirical results hold after verification, the work would be significant for omni-modal LLMs by shifting from asymmetric video-coarse/audio-dense encoders toward integrated, high-temporal-resolution perception that better matches human sensory processing. The fixed-token-budget comparison and focus on fine-grained motion tasks provide a concrete test of whether unified encoding can improve cross-modal interaction without sacrificing audio performance.

major comments (3)
  1. [Abstract] The claim of 'substantial gains' on visual continuous understanding tasks is presented without any numerical metrics, effect sizes, tables, or statistical details. This absence is load-bearing for the central claim, as the reader cannot assess the magnitude of improvement or confirm it exceeds the baseline under the stated token constraint.
  2. [Method] No equations, pseudocode, or architectural specifications are supplied for the Omni-Encoder Token Template, Omni-RoPE, or Temporal Window Shifting. Without these details it is impossible to verify how the components simultaneously achieve modality disentanglement and 25 fps efficiency while preserving audio feature density under a fixed LLM token budget.
  3. [Experiments] The manuscript provides no ablation studies isolating each of the three components, no per-modality token allocation breakdowns, and no additional audio-quality or synchronization metrics beyond the listed benchmarks. Because efficiency mechanisms necessarily reallocate capacity between modalities, the lack of these controls leaves open the possibility that reported visual gains mask hidden audio degradation or token-allocation artifacts.
minor comments (2)
  1. [Abstract] The title and abstract use the phrase 'feel continuous motion' metaphorically; a brief clarification of the precise perceptual capabilities being modeled would improve precision.
  2. [Abstract] A citation to the Qwen2.5-Omni baseline paper should be added to allow readers to locate the exact experimental conditions being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and commit to a major revision that incorporates the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'substantial gains' on visual continuous understanding tasks is presented without any numerical metrics, effect sizes, tables, or statistical details. This absence is load-bearing for the central claim, as the reader cannot assess the magnitude of improvement or confirm it exceeds the baseline under the stated token constraint.

    Authors: We agree that the abstract would be more informative with concrete metrics. In the revised version, we will include specific numerical results (e.g., accuracy deltas on sign language recognition and fine-grained action tasks) and confirm the fixed-token-budget comparison to Qwen2.5-Omni, allowing readers to directly evaluate the effect sizes. revision: yes

  2. Referee: [Method] No equations, pseudocode, or architectural specifications are supplied for the Omni-Encoder Token Template, Omni-RoPE, or Temporal Window Shifting. Without these details it is impossible to verify how the components simultaneously achieve modality disentanglement and 25 fps efficiency while preserving audio feature density under a fixed LLM token budget.

    Authors: We acknowledge the need for formal specifications. The revision will add equations defining each component, pseudocode for the unified encoding pipeline, and additional architectural details showing how the token template, Omni-RoPE, and window shifting maintain 25 fps symmetry, modality disentanglement, and audio density under the fixed token budget. revision: yes

  3. Referee: [Experiments] The manuscript provides no ablation studies isolating each of the three components, no per-modality token allocation breakdowns, and no additional audio-quality or synchronization metrics beyond the listed benchmarks. Because efficiency mechanisms necessarily reallocate capacity between modalities, the lack of these controls leaves open the possibility that reported visual gains mask hidden audio degradation or token-allocation artifacts.

    Authors: We agree that these controls are essential. The revised manuscript will include ablations for the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting individually; explicit per-modality token allocation tables; and supplementary audio-quality and synchronization metrics. These additions will directly address concerns about potential hidden trade-offs. revision: yes

Circularity Check

0 steps flagged

Empirical architecture proposal with no self-referential derivations or fitted predictions

Full rationale

The paper presents Omni-Encoder as a new unified Transformer design using three named components (Omni-Encoder Token Template, Omni-RoPE, Temporal Window Shifting) to achieve 25 fps symmetric encoding. All performance claims are framed as experimental outcomes from comparisons against Qwen2.5-Omni under fixed LLM token budget on sign-language, sports-action, AVQA, and speaker tasks. No equations, derivations, or parameter-fitting steps appear in the provided text that would reduce any claimed gain to a self-definition, a renamed input, or a self-citation chain. The architecture is offered as an empirical engineering choice whose validity rests on benchmark numbers rather than on any closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Abstract-only view limits visibility; the central claim rests on the unproven effectiveness of the three named innovations and the assumption that symmetric 25 fps encoding is feasible without modality-specific losses.

axioms (1)
  • ad hoc to paper The Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting reconcile modality disentanglement and computational efficiency at 25 fps.
    Presented as the three core innovations whose success is required for the architecture to work.
invented entities (3)
  • Omni-Encoder Token Template no independent evidence
    purpose: Co-embed visual and audio signals at symmetrical frame rate
    New tokenization scheme introduced to handle dual-modality input in one backbone.
  • Omni-RoPE no independent evidence
    purpose: Adapted rotary position embedding for omni-modal signals
    Modified position encoding to support the unified high-rate input.
  • Temporal Window Shifting no independent evidence
    purpose: Maintain computational efficiency during encoding
    Mechanism to handle long sequences without quadratic cost explosion; one plausible reading of this and Omni-RoPE is sketched after this list.
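A hedged sketch of how the two mechanisms might be realized, reading Omni-RoPE as rotary embeddings keyed to a shared 25 fps frame index and Temporal Window Shifting as Swin-style shifted attention windows along time. Both readings are editorial assumptions; the paper's exact formulations are not in this excerpt.

import torch

def rope_angles(frame_idx: torch.Tensor, dim: int, base: float = 10000.0):
    # One plausible "Omni-RoPE": audio and visual tokens from the same
    # 25 fps frame get the same temporal rotation angle, so cross-modal
    # attention sees them as temporally aligned. frame_idx: (n,).
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    ang = frame_idx.float()[:, None] * freqs[None, :]          # (n, dim/2)
    return ang.cos(), ang.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # Standard rotary update on channel pairs of x: (n, dim) -> (n, dim).
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def temporal_window_ids(n_frames: int, window: int, shift: int):
    # Swin-style shifting along time: frames sharing a window id attend to
    # each other; alternating layers offset the partition by `shift` so
    # information crosses window boundaries without full quadratic attention.
    return (torch.arange(n_frames) + shift) // window

A layer would restrict attention to tokens whose temporal_window_ids match, alternating shift=0 and shift=window // 2 across layers, mirroring how Swin Transformer [23] trades global attention for linear cost.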

pith-pipeline@v0.9.0 · 5581 in / 1451 out tokens · 43694 ms · 2026-05-09T14:15:58.087177+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references

[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[2] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.
[3] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025.
[4] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), pages 28492–28518. PMLR, 2023.
[5] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023.
[6] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-Audio technical report, 2024.
[7] Qwen Team. Qwen 3.5 technical report. https://qwen.ai/blog?id=qwen3.5, 2025. Accessed: 2025-05-22.
[8] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.
[9] Qwen Team. Qwen3.5-Omni technical report, 2026.
[10] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ... Qwen3-Omni technical report, 2025.
[11] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report, 2025.
[12] Google DeepMind. https://deepmind.google/models/gemini/, 2025.
[13] Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model, 2025.
[14] Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction, 2025.
[15] Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Weixuan Chen, Xihan Wei, and Liefeng Bo. HumanOmni: A large vision-speech language model for human-centric video understanding, 2025.
[16] Anonymous. HumanOmni-Speaker: Efficient high-frequency video-audio understanding for omni-LLMs, 2026.
[17] M. Alex Meredith, James W. Nemitz, and Barry E. Stein. Determinants of multisensory integration in superior colliculus neurons. I. Temporal factors. Journal of Neuroscience, 7(10):3215–3229, 1987.
[18] Jyoti Mishra, Antigona Martinez, Terrence J. Sejnowski, and Steven A. Hillyard. Early cross-modal interactions in auditory and visual cortex underlie a sound-induced visual illusion. Journal of Neuroscience, 27(15):4120–4131, 2007.
[19] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021.
[20] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023.
[21] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[22] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia... V-JEPA 2: Self-supervised video models enable understanding, prediction and planning, 2025.
[23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows, 2021.
[24] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528, 2018.
[25] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense, 2017.
[26] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832, 2018.
[27] Siyuan Jing, Guangxue Wang, Haoyang Zhai, Qin Tao, Jun Yang, Bing Wang, and Peng Jin. Dual-view spatio-temporal feature fusion with CNN-Transformer hybrid network for Chinese isolated sign language recognition, 2025.
[28] Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. AVQA: A dataset for audio-visual question answering on videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3480–3491, 2022.
[29] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.
[30] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
[31] Hezhen Hu, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. SignBERT: Pre-training of hand-model-aware representation for sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11087–11096, 2021.
[32] Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios, 2024.
[33] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. InternVideo2: Scaling foundation models for multimodal video understanding, 2024.
[34] Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng. OneVision-Encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence, 2026.
[35] Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM video understanding with 16 frames per second, 2025.
[36] Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. VL-JEPA: Joint embedding predictive architecture for vision-language, 2026.
[37] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, and Maja Pantic. Audio-visual speech recognition with a hybrid CTC/attention architecture, 2018.