OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
Recognition: 2 theorem links · Lean theorems
Pith reviewed 2026-05-15 08:03 UTC · model grok-4.3
The pith
OmniTrace converts token signals into span-level cross-modal explanations by tracing the causal decoding process in omni-modal LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniTrace formalizes attribution as a generation-time tracing problem over the causal decoding process and supplies a unified protocol that converts arbitrary token-level signals into coherent span-level, cross-modal explanations: it traces tokens to multimodal inputs, then aggregates them via confidence-weighted and temporally coherent rules.
What carries the argument
The unified generation-time tracing protocol that converts token-level signals into span-level cross-modal explanations through aggregation during decoding.
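The page gives no pseudocode for this protocol, so here is a minimal sketch of what confidence-weighted, temporally coherent span aggregation could look like. Everything below is hypothetical: the function name, the `threshold` and `max_gap` parameters, and the data layout are illustrative assumptions, not the paper's actual method.

```python
def aggregate_spans(token_scores, confidences, threshold=0.5, max_gap=1):
    """Greedy sketch: merge consecutive generated tokens whose
    confidence-weighted attribution points at the same input source
    into one (source_id, start_token, end_token) span.

    token_scores: one dict per generated token, mapping source_id to a
                  raw token-level signal (e.g. an attention weight).
    confidences:  per-token model confidence in [0, 1].
    """
    spans, current = [], None  # current = (source_id, start, end)
    for t, (scores, conf) in enumerate(zip(token_scores, confidences)):
        if not scores:
            continue
        src, score = max(scores.items(), key=lambda kv: kv[1])
        if score * conf < threshold:  # confidence weighting
            continue
        # Temporal coherence: extend the open span when the same source
        # recurs within max_gap decoding steps; otherwise start anew.
        if current and current[0] == src and t - current[2] <= max_gap:
            current = (src, current[1], t)
        else:
            if current:
                spans.append(current)
            current = (src, t, t)
    if current:
        spans.append(current)
    return spans
```

For instance, token signals `[{"img": 0.9}, {"img": 0.8}, {"aud": 0.2}]` with unit confidences would collapse the first two tokens into a single span attributed to the image source, while the weak audio signal is dropped.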
If this is right
- Span-level attribution produces more stable and interpretable explanations than naive self-attribution on visual, audio, and video tasks.
- Results remain robust across different underlying signals such as attention weights and gradient-based scores.
- The approach applies directly to existing decoder-only omni-modal models like Qwen2.5-Omni without any model changes.
- Treating attribution as structured tracing over decoding supplies a scalable route to transparency in open-ended multimodal generation.
Where Pith is reading between the lines
- The same tracing structure could be applied to detect when a model relies on one modality while ignoring contradictory evidence in another.
- Integrating these spans into user interfaces might let people click on an explanation to see the exact input segment that supported it.
- Extending the aggregation rules to include temporal alignment across video frames could improve attributions for longer sequences.
Load-bearing premise
That arbitrary token-level signals such as attention weights or gradient-based scores can be reliably converted into coherent span-level cross-modal explanations without retraining or supervision.
What would settle it
A direct comparison on a benchmark where human annotators mark supporting input spans for generated statements and the OmniTrace spans show no higher overlap or stability than simple self-attribution baselines.
Original abstract
Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniTrace, a lightweight model-agnostic framework that formalizes attribution in omni-modal LLMs as a generation-time tracing problem. It converts arbitrary token-level signals (attention weights, gradients) into span-level cross-modal explanations via confidence-weighted and temporally coherent aggregation during autoregressive decoding, without retraining or supervision. Experiments on Qwen2.5-Omni and MiniCPM-o-4.5 across visual/audio/video tasks report more stable and interpretable attributions than naive self-attribution and embedding baselines.
Significance. If the aggregation protocol reliably produces semantically coherent cross-modal spans, the work offers a practical, scalable route to post-hoc transparency for open-ended multimodal generation that reuses existing model signals rather than requiring new training or supervision.
Major comments (3)
- [§3.2] Aggregation protocol: the claim that temporally coherent aggregation converts arbitrary token signals into coherent spans rests on unvalidated heuristics; no ablation is reported that removes or perturbs the temporal-coherence rule while holding the base signals fixed, so the contribution of the rule versus the underlying signals cannot be isolated.
- [§4] Experiments: stability is asserted over baselines, yet the manuscript provides neither the precise stability metric (e.g., variance across runs, human coherence ratings) nor controls for post-hoc aggregation choices, leaving the quantitative support for the central claim under-specified.
- [§4.3] Cross-modal results: the evaluation does not include a proxy alignment metric (e.g., overlap with human-annotated supporting spans or known causal inputs) that would confirm the aggregated spans are semantically meaningful rather than artifacts of the chosen aggregation policy.
Minor comments (2)
- [Abstract, §3.1] The abstract and §3.1 should explicitly list the exact aggregation equations and hyper-parameters (confidence threshold, temporal window size) so readers can reproduce the protocol.
- [Figure 2] Figure 2 caption should clarify the color mapping for cross-modal spans and include a legend for the confidence weighting.
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we commit to incorporating them in the next version of the paper to address the concerns raised.
Point-by-point responses
-
Referee: [§3.2] Aggregation protocol: the claim that temporally coherent aggregation converts arbitrary token signals into coherent spans rests on unvalidated heuristics; no ablation is reported that removes or perturbs the temporal-coherence rule while holding the base signals fixed, so the contribution of the rule versus the underlying signals cannot be isolated.
Authors: We appreciate this observation. The temporal coherence rule is designed to respect the autoregressive generation process, ensuring that attributions accumulate consistently over time steps rather than treating each token independently. While motivated by the sequential nature of decoding, we acknowledge that an explicit ablation isolating its effect would better quantify its contribution. In the revised manuscript, we will include an ablation study that compares the full aggregation protocol against a version without the temporal coherence component, using the same base signals. revision: yes
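As a toy illustration of what such an ablation would measure (a hypothetical sketch, not the paper's protocol), compare span counts with the temporal-coherence rule switched on and off over the same fixed per-token source assignments:

```python
def merge_adjacent(token_sources, temporal=True):
    """With temporal=True, consecutive tokens attributed to the same
    source merge into one (source, start, end) span; with temporal=False
    every token stays its own single-token span. Both variants see the
    same base signal, isolating the effect of the coherence rule."""
    spans = []
    for t, src in enumerate(token_sources):
        if temporal and spans and spans[-1][0] == src and spans[-1][2] == t - 1:
            spans[-1] = (src, spans[-1][1], t)  # extend the open span
        else:
            spans.append((src, t, t))
    return spans

sources = ["img", "img", "img", "aud", "aud"]
print(len(merge_adjacent(sources, temporal=True)))   # 2 coherent spans
print(len(merge_adjacent(sources, temporal=False)))  # 5 fragmented spans
```

The interesting quantity in a real ablation would be how much span fragmentation (and downstream interpretability) changes, with the token-level signals held constant.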
-
Referee: [§4] Experiments: stability is asserted over baselines, yet the manuscript provides neither the precise stability metric (e.g., variance across runs, human coherence ratings) nor controls for post-hoc aggregation choices, leaving the quantitative support for the central claim under-specified.
Authors: Thank you for highlighting this. Stability in our experiments refers to the consistency of the attributed spans across different underlying signals (attention, gradients) and across model variants. To make this precise, we will define the stability metric explicitly—such as the variance in selected span boundaries or overlap ratios across runs—and report it quantitatively. Additionally, we will include sensitivity analysis controlling for key aggregation hyperparameters to demonstrate robustness. revision: yes
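One concrete candidate for such a metric (a sketch under the assumption that spans are inclusive token-index intervals; the paper's actual definition is unspecified) is the mean best-match intersection-over-union between the span sets that two different underlying signals produce:

```python
def span_iou(a, b):
    """IoU of two inclusive (start, end) token-index spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def stability(spans_a, spans_b):
    """Mean best-match IoU: each span from signal A is matched with its
    best-overlapping span from signal B; 1.0 means identical span sets."""
    if not spans_a:
        return 0.0
    return sum(max(span_iou(a, b) for b in spans_b) if spans_b else 0.0
               for a in spans_a) / len(spans_a)
```

Reporting this number for, say, attention-derived versus gradient-derived spans would make the stability claim directly checkable.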
-
Referee: [§4.3] Cross-modal results: the evaluation does not include a proxy alignment metric (e.g., overlap with human-annotated supporting spans or known causal inputs) that would confirm the aggregated spans are semantically meaningful rather than artifacts of the chosen aggregation policy.
Authors: We agree that a direct alignment metric with human annotations would provide stronger evidence of semantic meaningfulness. However, creating such annotations for open-ended multimodal generation is resource-intensive and beyond the scope of the current work. Instead, our evaluation relies on indirect proxies: consistency across multiple attribution signals, improved performance in downstream tasks when using the attributions to select supporting inputs, and qualitative inspection of coherence. We will expand the discussion in §4.3 to explicitly address this limitation and suggest it as future work. revision: partial
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper defines OmniTrace as a protocol that aggregates standard token-level signals (attention weights, gradients) into span-level cross-modal attributions via confidence-weighted and temporally coherent rules. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains. The framework is presented as model-agnostic and supervision-free, with claims supported by evaluations on external models rather than internal redefinitions. This is a standard case of a new aggregation method built on independent base signals.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
aggregates signals into semantically meaningful spans... through confidence-weighted and temporally coherent aggregation
-
IndisputableMonolith/Foundation/Breath1024.lean · neutral8 (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
run-level coherence constraints... favoring temporally contiguous source assignments
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267, 2026
-
[2]
A new era of intelligence with gemini 3
Google Gemini Team. A new era of intelligence with gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3, 2025. Google AI Blog, Accessed Jan 16, 2026
-
[3]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xianzhong Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Ke Chen, Xue Lian Liu, Peng Wang, Ming Yang, Dayiheng Liu, Xingzhang Ren, Bo ...
-
[4]
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025
-
[5]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025
-
[6]
Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, et al. Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue. arXiv preprint arXiv:2510.13747, 2025
-
[7]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025
-
[8]
Omnivinci: Enhancing architecture and data for omni-modal understanding LLM
Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Zhen Wan, Jinchuan Tian, An-Chieh Cheng, Ligeng Zhu, Yuanhang Su, Yuming Lou, Yong-Xiang Lin, Dong Yang, Sreyan Ghosh, Zhijian Liu, Yukang Chen, Ehsan Jahangiri, Ambrish Dantrey, Daguang Xu, Ehsan Hosseini-Asl, Seyed Danial Mohseni Taheri, Vidya Nariyambut Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiy...
-
[9]
Quantifying attention flow in transformers
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. U...
-
[10]
Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, July 2015. doi: 10.1371/journal.pone.0130140
-
[11]
Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy, J...
-
[12]
Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers
Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 397–406, 2021
-
[13]
Learning deep features for discriminative localization
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016
-
[14]
Grad-sam: Explaining transformers via gradient self-attention maps
Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz, Itzik Malkiel, Omri Armstrong, and Noam Koenigstein. Grad-sam: Explaining transformers via gradient self-attention maps. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’2...
-
[15]
Grad-cam: visual explanations from deep networks via gradient-based localization
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: visual explanations from deep networks via gradient-based localization. International journal of computer vision, 128(2):336–359, 2020
-
[16]
Attcat: explaining transformers via attentive class activation tokens
Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. Attcat: explaining transformers via attentive class activation tokens. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088
-
[17]
Better explain transformers by illuminating important information
Linxin Song, Yan Cui, Ao Luo, Freddy Lecue, and Irene Li. Better explain transformers by illuminating important information. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 2048–2062, St. Julian’s, Malta, March 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-eacl.138. URL https://aclanthology.org/2024.findings-eacl.138/
-
[19]
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max W.F. Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=skLtdUVaJa
-
[20]
Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation
Jiazhan Feng, Qingfeng Sun, Can Xu, Pu Zhao, Yaming Yang, Chongyang Tao, Dongyan Zhao, and Qingwei Lin. Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7348–7363, 2023
-
[21]
Abhisek Tiwari, Anisha Saha, Sriparna Saha, Pushpak Bhattacharyya, and Minakshi Dhar. Experience and evidence are the eyes of an excellent summarizer! towards knowledge infused multi-modal clinical conversation summarization. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, pages 2452–2461, New York,...
-
[22]
MMAU: A massive multi-task audio understanding and reasoning benchmark
S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TeVAZXr3yv
-
[24]
Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sabato Marco Siniscalchi, and Odette Scharenborg. The multimodal information based speech processing (misp) 2025 challenge: Audio-visual diarization and recognition. arXiv preprint arXiv:2505.13971, 2025
-
[25]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025
-
[26]
Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang, and Min Yang. Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time emotional speech synthesis. In The Thirty-ninth Annual Conference on Neural ...
-
[27]
Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. arXiv preprint arXiv:2206.08317, 2022
-
[28]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023