pith. machine review for the scientific record. sign in

arxiv: 2604.13073 · v1 · submitted 2026-03-20 · 💻 cs.CL · cs.AI· cs.MM

Recognition: 2 theorem links

· Lean Theorem

OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.MM
keywords attributionmultimodal LLMsexplainabilitygeneration-time tracingomni-modal modelscross-modal explanationstransparency
0
0 comments X

The pith

OmniTrace converts token signals into span-level cross-modal explanations by tracing the causal decoding process in omni-modal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OmniTrace to solve attribution in multimodal LLMs that generate responses from mixed text, image, audio, and video inputs. Existing methods fail to handle open-ended autoregressive generation, so the authors formalize attribution as tracing each output token back to its supporting input sources during decoding. The framework aggregates arbitrary token-level signals like attention or gradients into semantically coherent spans using confidence-weighted and temporally coherent rules. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 show these generation-aware spans yield more stable explanations than naive baselines across visual, audio, and video tasks, all without retraining or extra supervision.

Core claim

OmniTrace formalizes attribution as a generation-time tracing problem over the causal decoding process and supplies a unified protocol that converts arbitrary token-level signals into coherent span-level, cross-modal explanations by tracing tokens to multimodal inputs then aggregating via confidence-weighted and temporally coherent rules.

What carries the argument

The unified generation-time tracing protocol that converts token-level signals into span-level cross-modal explanations through aggregation during decoding.

If this is right

  • Span-level attribution produces more stable and interpretable explanations than naive self-attribution on visual, audio, and video tasks.
  • Results remain robust across different underlying signals such as attention weights and gradient-based scores.
  • The approach applies directly to existing decoder-only omni-modal models like Qwen2.5-Omni without any model changes.
  • Treating attribution as structured tracing over decoding supplies a scalable route to transparency in open-ended multimodal generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tracing structure could be applied to detect when a model relies on one modality while ignoring contradictory evidence in another.
  • Integrating these spans into user interfaces might let people click on an explanation to see the exact input segment that supported it.
  • Extending the aggregation rules to include temporal alignment across video frames could improve attributions for longer sequences.

Load-bearing premise

That arbitrary token-level signals such as attention weights or gradient-based scores can be reliably converted into coherent span-level cross-modal explanations without retraining or supervision.

What would settle it

A direct comparison on a benchmark where human annotators mark supporting input spans for generated statements and the OmniTrace spans show no higher overlap or stability than simple self-attribution baselines.

Figures

Figures reproduced from arXiv: 2604.13073 by Ching-Chen Kuo, Hang Yin, Qianqi Yan, Shan Jiang, Xin Eric Wang, Yang Zhao, Yichen Guo.

Figure 1
Figure 1. Figure 1: OmniTrace performs generation-time attribution across modalities. Given interleaved multimodal inputs (text, images, audio, video), an omni-modal LLM generates output tokens autoregressively. OmniTrace traces each generated token to candidate source tokens across modalities and aggregates these signals into semantically coherent output spans with concise source explanations. The framework operates online d… view at source ↗
Figure 2
Figure 2. Figure 2: Effect of input modality segmentation and availability on attribution performance. (a) Impact of ASR segmentation quality on audio attribution. Time-F1 on audio summarization using different ASR systems. High-quality ASR segmentation (Paraformer, Scribe v2) substantially improves attribution accuracy, while raw token inputs without semantic segmentation lead to severe degradation. (b) Video input ablation.… view at source ↗
Figure 3
Figure 3. Figure 3: Positional and cross-modal attribution behavior. (a) Empirical CDF of normalized attribution positions. The curve above the diagonal indicates early-position bias. (b) Cross-modal calibration between predicted and ground-truth image mass. Deviations from the diagonal reflect regime-dependent calibration effects rather than a strong global modality bias. audio sources. To evaluate the contribution of each m… view at source ↗
Figure 4
Figure 4. Figure 4: Relationship between generation quality and attribution quality. (a) Attribution F1 for QA tasks grouped by whether the generated answer is correct or incorrect in QA tasks. (b) Attribution performance on visual summarization as a function of generation quality measured by ROUGE-L. While attribution quality generally improves with higher generation quality in some modalities (e.g., audio), the relationship… view at source ↗
read the original abstract

Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OmniTrace, a lightweight model-agnostic framework that formalizes attribution in omni-modal LLMs as a generation-time tracing problem. It converts arbitrary token-level signals (attention weights, gradients) into span-level cross-modal explanations via confidence-weighted and temporally coherent aggregation during autoregressive decoding, without retraining or supervision. Experiments on Qwen2.5-Omni and MiniCPM-o-4.5 across visual/audio/video tasks report more stable and interpretable attributions than naive self-attribution and embedding baselines.

Significance. If the aggregation protocol reliably produces semantically coherent cross-modal spans, the work offers a practical, scalable route to post-hoc transparency for open-ended multimodal generation that reuses existing model signals rather than requiring new training or supervision.

major comments (3)
  1. [§3.2] §3.2 (Aggregation protocol): the claim that temporally coherent aggregation converts arbitrary token signals into coherent spans rests on unvalidated heuristics; no ablation is reported that removes or perturbs the temporal-coherence rule while holding the base signals fixed, so the contribution of the rule versus the underlying signals cannot be isolated.
  2. [§4] §4 (Experiments): stability is asserted over baselines, yet the manuscript provides neither the precise stability metric (e.g., variance across runs, human coherence ratings) nor controls for post-hoc aggregation choices, leaving the quantitative support for the central claim under-specified.
  3. [§4.3] §4.3 (Cross-modal results): the evaluation does not include a proxy alignment metric (e.g., overlap with human-annotated supporting spans or known causal inputs) that would confirm the aggregated spans are semantically meaningful rather than artifacts of the chosen aggregation policy.
minor comments (2)
  1. [Abstract, §3.1] The abstract and §3.1 should explicitly list the exact aggregation equations and hyper-parameters (confidence threshold, temporal window size) so readers can reproduce the protocol.
  2. [Figure 2] Figure 2 caption should clarify the color mapping for cross-modal spans and include a legend for the confidence weighting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and insightful comments on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we commit to incorporating them in the next version of the paper to address the concerns raised.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Aggregation protocol): the claim that temporally coherent aggregation converts arbitrary token signals into coherent spans rests on unvalidated heuristics; no ablation is reported that removes or perturbs the temporal-coherence rule while holding the base signals fixed, so the contribution of the rule versus the underlying signals cannot be isolated.

    Authors: We appreciate this observation. The temporal coherence rule is designed to respect the autoregressive generation process, ensuring that attributions accumulate consistently over time steps rather than treating each token independently. While motivated by the sequential nature of decoding, we acknowledge that an explicit ablation isolating its effect would better quantify its contribution. In the revised manuscript, we will include an ablation study that compares the full aggregation protocol against a version without the temporal coherence component, using the same base signals. revision: yes

  2. Referee: [§4] §4 (Experiments): stability is asserted over baselines, yet the manuscript provides neither the precise stability metric (e.g., variance across runs, human coherence ratings) nor controls for post-hoc aggregation choices, leaving the quantitative support for the central claim under-specified.

    Authors: Thank you for highlighting this. Stability in our experiments refers to the consistency of the attributed spans across different underlying signals (attention, gradients) and across model variants. To make this precise, we will define the stability metric explicitly—such as the variance in selected span boundaries or overlap ratios across runs—and report it quantitatively. Additionally, we will include sensitivity analysis controlling for key aggregation hyperparameters to demonstrate robustness. revision: yes

  3. Referee: [§4.3] §4.3 (Cross-modal results): the evaluation does not include a proxy alignment metric (e.g., overlap with human-annotated supporting spans or known causal inputs) that would confirm the aggregated spans are semantically meaningful rather than artifacts of the chosen aggregation policy.

    Authors: We agree that a direct alignment metric with human annotations would provide stronger evidence of semantic meaningfulness. However, creating such annotations for open-ended multimodal generation is resource-intensive and beyond the scope of the current work. Instead, our evaluation relies on indirect proxies: consistency across multiple attribution signals, improved performance in downstream tasks when using the attributions to select supporting inputs, and qualitative inspection of coherence. We will expand the discussion in §4.3 to explicitly address this limitation and suggest it as future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper defines OmniTrace as a protocol that aggregates standard token-level signals (attention weights, gradients) into span-level cross-modal attributions via confidence-weighted and temporally coherent rules. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains. The framework is presented as model-agnostic and supervision-free, with claims supported by evaluations on external models rather than internal redefinitions. This is a standard case of a new aggregation method built on independent base signals.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identified; the method relies on standard internal model signals.

pith-pipeline@v0.9.0 · 5551 in / 1023 out tokens · 41877 ms · 2026-05-15T08:03:31.683760+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 5 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 System Card.arXiv preprint arXiv:2601.03267, 2026

  2. [2]

    A new era of intelligence with gemini 3

    Google Gemini Team. A new era of intelligence with gemini 3. https://blog.google/ products-and-platforms/products/gemini/gemini-3, 2025. Google AI Blog, Accessed Jan 16, 2026

  3. [3]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xianzhong Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Ke Chen, Xue Lian Liu, Peng Wang, Ming Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  4. [4]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

  5. [5]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

  6. [6]

    Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue.arXiv preprint arXiv:2510.13747, 2025

    Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, et al. Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue.arXiv preprint arXiv:2510.13747, 2025

  7. [7]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

  8. [8]

    Omnivinci: Enhancing architecture and data for omni-modal under- standing LLM

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Zhen Wan, Jinchuan Tian, An- Chieh Cheng, Ligeng Zhu, Yuanhang Su, Yuming Lou, Yong-Xiang Lin, Dong Yang, Sreyan Ghosh, Zhijian Liu, Yukang Chen, Ehsan Jahangiri, Ambrish Dantrey, Daguang Xu, Ehsan Hosseini-Asl, Seyed Danial Mohseni Taheri, Vidya Nariyambut Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiy...

  9. [9]

    Quantifying attention flow in transformers

    Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. U...

  10. [10]

    On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.PLoS ONE, 10(7):e0130140, July 2015

    Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.PLoS ONE, 10(7):e0130140, July 2015. doi: 10.1371/journal.pone.0130140

  11. [11]

    Analyzing multi-head self- attention: Specialized heads do the heavy lifting, the rest can be pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self- attention: Specialized heads do the heavy lifting, the rest can be pruned. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy, J...

  12. [12]

    Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

    Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 397–406, 2021

  13. [13]

    Learning deep features for discriminative localization

    Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016

  14. [14]

    Grad-sam: Explaining transformers via gradient self-attention maps

    Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz, Itzik Malkiel, Omri Armstrong, and Noam Koenigstein. Grad-sam: Explaining transformers via gradient self-attention maps. InProceedings 14 OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’2...

  15. [15]

    Grad-cam: visual explanations from deep networks via gradient-based localization

    Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: visual explanations from deep networks via gradient-based localization. International journal of computer vision, 128(2):336–359, 2020

  16. [16]

    Attcat: explaining transformers via attentive class activation tokens

    Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. Attcat: explaining transformers via attentive class activation tokens. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

  17. [17]

    Better explain transformers by illu- minating important information

    Linxin Song, Yan Cui, Ao Luo, Freddy Lecue, and Irene Li. Better explain transformers by illu- minating important information. In Yvette Graham and Matthew Purver, editors,Findings of the Association for Computational Linguistics: EACL 2024, pages 2048–2062, St. Julian’s, Malta, March

  18. [18]

    doi: 10.18653/v1/2024.findings-eacl.138

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-eacl.138. URL https://aclanthology.org/2024.findings-eacl.138/

  19. [19]

    Ku, Qian Liu, and Wenhu Chen

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max W.F. Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Research, 2024, 2024. URLhttps://openreview.net/forum?id=skLtdUVaJa

  20. [20]

    Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation

    Jiazhan Feng, Qingfeng Sun, Can Xu, Pu Zhao, Yaming Yang, Chongyang Tao, Dongyan Zhao, and Qingwei Lin. Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7348–7363, 2023

  21. [21]

    Experience and evidence are the eyes of an excellent summarizer! towards knowledge infused multi-modal clinical conversation summarization

    Abhisek Tiwari, Anisha Saha, Sriparna Saha, Pushpak Bhattacharyya, and Minakshi Dhar. Experience and evidence are the eyes of an excellent summarizer! towards knowledge infused multi-modal clinical conversation summarization. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, page 2452–2461, New York,...

  22. [22]

    MMAU: A massive multi-task audio understanding and reasoning benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. InThe Thirteenth International Conference on Learning Representations,

  23. [23]

    URLhttps://openreview.net/forum?id=TeVAZXr3yv

  24. [24]

    The multimodal information based speech processing (misp) 2025 challenge: Audio-visual diarization and recognition.arXiv preprint arXiv:2505.13971, 2025

    Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sinis- calchi Sabato Marco, and Odette Scharenborg. The multimodal information based speech processing (misp) 2025 challenge: Audio-visual diarization and recognition.arXiv preprint arXiv:2505.13971, 2025

  25. [25]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025. 15 OmniTrace: ...

  26. [26]

    Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time emotional speech synthesis

    Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang, and Min Yang. Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time emotional speech synthesis. InThe Thirty-ninth Annual Conference on Neural ...

  27. [27]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition.arXiv preprint arXiv:2206.08317, 2022

    Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition.arXiv preprint arXiv:2206.08317, 2022

  28. [28]

    source_ids , pos , conf must have the same length

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023. 16 OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs A. Source curation A.1. Detailed implemen...

  29. [29]

    An image that co nt ai ns the source c o n v e r s a t i o n with image IDs shown on the images

  30. [30]

    A model - g e n e r a t e d summary ( split into sentences , each s en te nc e has a se nt en ce index )

  31. [31]

    t e x t _ s o u r c e

    The source text chunks listed below , each with a numeric text ID . Task : For EACH g e n e r a t e d summary sentence , decide which source ev id en ce su pp or ts it . - " t e x t _ s o u r c e ": a list of TEXT IDs from the p ro vi de d text chunks that di re ct ly support that s en te nc e . - " i m a g e _ s o u r c e ": a list of IMAGE IDs visible i...

  32. [32]

    One AUDIO / VIDEO file that c on ta in s the source content

  33. [33]

    audio / v i d e o _ s o u r c e

    A model - g e n e r a t e d summary ( split into sentences , each s en te nc e has a se nt en ce index ) . 19 OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs Task : For EACH g e n e r a t e d summary sentence , decide which source ev id en ce su pp or ts it . - " audio / v i d e o _ s o u r c e ": a list of t i m e s t a ...