pith. sign in

arxiv: 2603.06683 · v2 · submitted 2026-03-04 · 💻 cs.CV

ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

Pith reviewed 2026-05-15 17:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimedia event extractionhypergraphmulti-agent collaborationevent mention detectionargument role labelingM2E2 benchmarkstructured prediction
0
0 comments X

The pith

ECHO reframes multimedia event extraction as explicit operations on a shared hypergraph using multi-agent collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ECHO to address brittleness in current LLM-based multimedia event extraction, where intermediate hypotheses stay hidden and linking stays coupled to role assignment. ECHO instead builds and refines an explicit Multimedia Event Hypergraph through auditable atomic updates performed by collaborating agents. A Link-then-Bind step separates argument attachment from semantic role assignment, leaving room to inspect and revise partial structures before final commitment. The approach is evaluated on the M2E2 benchmark and reports clear gains over prior state-of-the-art systems.

Core claim

ECHO treats multimedia event extraction as iterative refinement over an explicit Multimedia Event Hypergraph (MEHG) via multi-agent collaboration, replacing implicit linear generation with auditable atomic updates; a Link-then-Bind strategy decouples event-argument linking from role binding to avoid premature semantic commitments.

What carries the argument

The Multimedia Event Hypergraph (MEHG), an explicit shared structure that records events, arguments, and relations so agents can perform revisable atomic operations instead of generating opaque text sequences.

If this is right

  • Event mention detection improves by 7.3 F1 points over prior state-of-the-art.
  • Argument role labeling improves by 15.5 F1 points over prior state-of-the-art.
  • Intermediate event hypotheses become inspectable and correctable during inference.
  • Predictions remain schema-consistent while allowing revision before final output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hypergraph-plus-agent pattern could be tested on other structured prediction tasks that currently rely on end-to-end text generation.
  • If coordination overhead stays low, the framework might scale to larger numbers of modalities or longer documents without proportional increases in model size.
  • Decoupling linking from binding may generalize to other domains where premature commitment causes cascading errors.

Load-bearing premise

The explicit hypergraph representation and the Link-then-Bind decoupling will reduce error propagation from early mistakes without creating new failure modes from multi-agent coordination.

What would settle it

A controlled test that injects early-stage errors into the pipeline and measures whether ECHO's reported F1 gains on event mention and argument role disappear or reverse compared with baseline methods.

Figures

Figures reproduced from arXiv: 2603.06683 by Hailong Chu, Hongbing Li, Jinsong Zhang, Lei Li, Shuo Zhang, Shutai Huang, Tinghe Yan, Xingyue Zhang, Yunlong Chu.

Figure 1
Figure 1. Figure 1: Illustrative comparison of M2E2 paradigms on the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of ECHO. Given 𝐷 = (𝑇 , 𝐼), Stage I constructs the vertex inventory and initializes an edge-free MEHG; Stage II agents negotiate MEHG updates via auditable atomic operations; Stage III performs role binding and consolidation to produce schema-consistent event predictions. The purple hyperedge represents a concurrent event hypothesis generated via the same ecosystem. Inside the hyperedge, 𝑦 𝑒 , 𝑡𝑘 … view at source ↗
Figure 3
Figure 3. Figure 3: F1 comparison of Direct prompting, Dialogue-Mediated baseline, and ECHO on M2E2 across all settings. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation on the multimedia setting of M2E2. We [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Negotiation budget analysis for Stage II under early [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Multimedia event extraction (M2E2) aims to predict triggers, ground arguments across text and images, and then assemble them into schema-consistent event records. Recent LLM-based approaches have shown strong potential for M2E2, but their intermediate event hypotheses often remain implicit, and event-argument linking is still tightly coupled with role binding. This leaves little opportunity to inspect or revise intermediate event hypotheses and makes predictions brittle to early errors. To bridge this gap, we present ECHO, a multi-agent framework that reframes M2E2 as iterative refinement over an explicit Multimedia Event Hypergraph (MEHG). Instead of relying on implicit linear generation, ECHO performs auditable atomic updates over a shared hypergraph, making intermediate event structures explicit and revisable. Furthermore, we introduce a Link-then-Bind strategy that decouples event-argument linking from role binding, reducing premature semantic commitment during structured prediction. Extensive experiments on the M2E2 benchmark show that ECHO consistently outperforms prior state-of-the-art approaches, achieving gains of 7.3 and 15.5 F1 points on event mention and argument role, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ECHO, a multi-agent framework for multimedia event extraction (M2E2) that reframes the task as iterative refinement over an explicit Multimedia Event Hypergraph (MEHG). It proposes a Link-then-Bind strategy to decouple event-argument linking from role binding, aiming to make intermediate event structures explicit and revisable, thereby reducing brittleness to early errors in LLM-based approaches. The central claim is that ECHO outperforms prior state-of-the-art methods on the M2E2 benchmark, with reported gains of 7.3 F1 points on event mention detection and 15.5 F1 points on argument role labeling.

Significance. If the reported performance gains hold under rigorous evaluation, this work could significantly impact the field by providing a more auditable and modular approach to structured multimedia event extraction, potentially improving robustness and interpretability in multi-modal NLP tasks.

major comments (2)
  1. [Experimental Evaluation] The abstract reports benchmark gains of 7.3 and 15.5 F1 points but supplies no experimental details, baselines, statistical tests, or error analysis. This makes it impossible to assess support for the central claim of consistent outperformance.
  2. [Link-then-Bind Strategy] The Link-then-Bind decoupling is claimed to reduce premature semantic commitment, but no ablation or analysis is provided to demonstrate that the multi-agent coordination does not introduce new failure modes, which is load-bearing for the robustness argument.
minor comments (1)
  1. [Method] The definition and operations on the Multimedia Event Hypergraph (MEHG) would benefit from a more formal mathematical specification, including explicit notation for hyperedges and update rules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of the potential impact of our work. We address each major comment below and plan to revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The abstract reports benchmark gains of 7.3 and 15.5 F1 points but supplies no experimental details, baselines, statistical tests, or error analysis. This makes it impossible to assess support for the central claim of consistent outperformance.

    Authors: We appreciate this observation. The detailed experimental setup, including all baselines (e.g., prior SOTA methods on M2E2), statistical significance tests, and error analysis, are provided in Section 4 of the manuscript. To make this more accessible, we will revise the abstract to briefly summarize the evaluation methodology and key results with references to the full details in the paper. revision: yes

  2. Referee: [Link-then-Bind Strategy] The Link-then-Bind decoupling is claimed to reduce premature semantic commitment, but no ablation or analysis is provided to demonstrate that the multi-agent coordination does not introduce new failure modes, which is load-bearing for the robustness argument.

    Authors: We agree that an ablation study is necessary to validate the Link-then-Bind strategy and to show that the multi-agent setup does not introduce additional failure modes. In the revised version, we will add comprehensive ablations comparing the decoupled approach to a joint Link-and-Bind baseline, along with an analysis of coordination failures and how the iterative refinement mitigates them. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces the ECHO multi-agent framework for M2E2, reframing the task as iterative refinement over an explicit Multimedia Event Hypergraph with a Link-then-Bind strategy. All central claims rest on empirical benchmark results (7.3 and 15.5 F1 gains on M2E2) rather than any derivation, equation, or prediction that reduces by construction to fitted inputs, self-citations, or renamed ansatzes. No load-bearing step matches the enumerated circularity patterns; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unverified assumption that explicit hypergraph structures and multi-agent coordination will improve structured prediction in M2E2 without new error sources.

axioms (1)
  • domain assumption LLM-based M2E2 methods suffer from implicit hypotheses and coupled linking-role binding that cause brittle predictions
    Stated as motivation in the abstract
invented entities (1)
  • Multimedia Event Hypergraph (MEHG) no independent evidence
    purpose: Explicit shared structure for auditable iterative event refinement
    Core new representation introduced by the framework

pith-pipeline@v0.9.0 · 5521 in / 1127 out tokens · 42606 ms · 2026-05-15T17:19:58.662410+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages

  1. [1]

    Mahdi Abavisani, Liwei Wu, Shengli Hu, Joel Tetreault, and Alejandro Jaimes. 2020. Multimodal Categorization of Crisis Events in Social Media. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, Washington, USA, 14679–14689. https: //openaccess.thecvf.com/content_CVPR_2020/html/Abavisani_Multimodal_ C...

  2. [2]

    Firoj Alam, Ferda Ofli, and Muhammad Imran. 2018. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. InProceedings of the Twelfth Interna- tional AAAI Conference on Web and Social Media. AAAI Press, Stanford, California, USA, 465–473. https://ojs.aaai.org/index.php/ICWSM/article/view/14983

  3. [3]

    Jianwei Cao, Yanli Hu, Zhen Tan, and Xiang Zhao. 2025. Cross-modal Multi-task Learning for Multimedia Event Extraction. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 11454–11462. doi:10.1609/aaai.v39i11.33246

  4. [4]

    Justin Chen, Swarnadeep Saha, and Mohit Bansal. 2024. ReConcile: Round- Table Conference Improves Reasoning via Consensus among Diverse LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 7066–7085. doi:10.18653/v1/2024.acl-long.381

  5. [5]

    Zilin Du, Yunxin Li, Xu Guo, Yidan Sun, and Boyang Li. 2023. Training Multime- dia Event Extraction With Generated Images and Captions. InProceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, Ottawa, ON, Canada, 5504–5513. doi:10.1145/3581783.3612526

  6. [6]

    Simon Gottschalk and Elena Demidova. 2018. EventKG: A Multilingual Event- Centric Temporal Knowledge Graph. InThe Semantic Web: 15th International Conference, ESWC 2018. Springer, Heraklion, Crete, Greece, 272–287. doi:10.1007/ 978-3-319-93417-4_18

  7. [7]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber

  8. [8]

    InThe Twelfth International Conference on Learning Representations (ICLR 2024)

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Frame- work. InThe Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net, Vienna, Austria. https://openreview.net/forum?id= VtmBAGCN7o

  9. [9]

    Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, Yang Song, Chen Zhu, Hengshu Zhu, and Ji-Rong Wen. 2025. KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, a...

  10. [10]

    doi:10.18653/v1/2025.acl-long.468

  11. [11]

    Shichao Jiao, Zonghan Wei, Xuzhen Lin, Yongsheng Yu, and Jing Jiang. 2024. Text2DB: Integrating Instruction Fine-Tuning and Database Updating for Text-to- Database Learning. InFindings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Bangkok, Thailand. doi:10. 18653/v1/2024.findings-acl.12

  12. [12]

    Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, Califor...

  13. [13]

    Hongbing Li, Bo Xiao, Linyi Yang, Xinran Wang, and Qi Li. 2025. Multi-Grained Alignment for Visual Grounding. In2025 IEEE International Conference on Multi- media and Expo (ICME). IEEE, 1–6

  14. [14]

    Hongbing Li, Linhui Xiao, Zihan Zhao, Qi Shen, Yixiang Huang, Bo Xiao, and Zhanyu Ma. 2026. BARE: Towards Bias-Aware and Reasoning-Enhanced One- Tower Visual Grounding.IEEE Transactions on Circuits and Systems for Video Technology(2026), 1–1. doi:10.1109/TCSVT.2026.3679114

  15. [15]

    Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chen- guang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022. CLIP-Event: Con- necting Text and Images with Event Structures. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, 16420– 16429. doi:10.1109/CVPR52688.2022.01593

  16. [16]

    Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020. Cross-media Structured Common Space for Multimedia Event Extraction. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2559–2570. doi:10.18653/v1/2020.acl-main.230

  17. [17]

    Weixin Liang, Wen Wang, Zhi Jin, Lizhou Wang, Xinyu Luo, and Percy Liang

  18. [18]

    In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

    Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, 17837–17857. doi:10.18653/v1/2024.emnlp-main.992

  19. [19]

    Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A Joint Neural Model for Information Extraction with Global Features. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7999–8009. doi:10.1865...

  20. [20]

    Jian Liu, Yufeng Chen, and Jinan Xu. 2022. Multimedia Event Extraction From News With a Unified Contrastive Learning Framework. InProceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Lisboa, Portugal, 1945–1953. doi:10.1145/3503161.3548132

  21. [21]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Bhargavi Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173. doi:10.1162/tacl_a_00638

  22. [22]

    Yang Liu, Fang Liu, Licheng Jiao, Qianyue Bao, Long Sun, Shuo Li, Lingling Li, and Xu Liu. 2024. Multi-Grained Gradual Inference Model for Multimedia Event Extraction.IEEE Transactions on Circuits and Systems for Video Technology34, 10 (2024), 10507–10520. doi:10.1109/TCSVT.2024.3402242

  23. [23]

    Meng Lu, Yuzhang Xie, Zhenyu Bi, Shuxiang Cao, and Xuan Wang. 2025. CROSSAGENTIE: Cross-Type and Cross-Task Multi-Agent LLM Collaboration for Zero-Shot Information Extraction. InFindings of the Association for Computa- tional Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computat...

  24. [24]

    Aman Madaan, Shuyuan Yu, Aidan O’Brien, Shaurya Garg, Suresh Kumar, Yun- feng Bai, Dan Friedman, Aaron Chan, Yuandong Tian, and Annie Zhang. 2023. Self-Refine: Iterative Refinement with Self-Feedback. InAdvances in Neural In- formation Processing Systems. https://proceedings.neurips.cc/paper_files/paper/ 2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract...

  25. [25]

    Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, Sherzod Hakimov, and Ralph Ewerth. 2021. Multimodal News Analytics Using Measures of Cross-Modal Entity and Context Consistency.International Journal of Multimedia Information Retrieval10, 2 (2021), 111–125. doi:10.1007/s13735-021-00207-4

  26. [26]

    Bangze Pan, Yang Li, Suge Wang, Xiaoli Li, Deyu Li, Jian Liao, and Jianxing Zheng. 2024. Document-Level Event Extraction via Information Interaction Based on Event Relation and Argument Correlation. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Nicoletta Calzola...

  27. [27]

    O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Association for Computing Machinery. doi:10.1145/3586183.3606763

  28. [28]

    Philipp Seeberger, Dominik Wagner, and Korbinian Riedhammer. 2024. MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Fill- ing. InFindings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 6539–65...

  29. [29]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Re- inforcement Learning. InAdvances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 1b44b878bb782e6954cd888628510e90-Abstract-Conference.html

  30. [30]

    Lin Sun, Kai Zhang, Qingyuan Li, and Renze Lou. 2024. UMIE: Unified Multi- modal Information Extraction with Instruction Tuning. InThirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024). AAAI Press, 19062–19070. doi:10.1609/AAAI.V38I17.29873

  31. [31]

    Nikos Voskarides, Edgar Meij, Sabrina Sauer, and Maarten de Rijke. 2022. News Article Retrieval in Context for Event-Centric Narrative Creation. InProceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022). Association for Computational Linguistics, Dublin, Ireland, 85–91. doi:10. 18653/v1/2022.in2writing-1.10

  32. [32]

    David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, Relation, and Event Extraction with Contextualized Span Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang,...

  33. [33]

    Xiaoyu Wang, Tao Sun, Gengchen Liu, Zhi Yang, Jiahui Liu, and Zimeng Xu. 2025. MGFSG-EE: A Method based on Multi-grained Fusion and Scene Graph Enhance- ment for Event Extraction(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 3103–3112. doi:10.1145/3746252.3761235

  34. [34]

    Zheng Wang, Zhongyang Li, Zeren Jiang, Dandan Tu, and Wei Shi. 2024. Craft- ing Personalized Agents through Retrieval-Augmented Generation on Editable Memory Graphs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, ...

  35. [35]

    Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. N24News: A New Dataset for Multimodal News Classification. InProceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 6793–6800. https://aclanthology.org/2022.lrec- 1.729/

  36. [36]

    Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thai...

  37. [37]

    White, Doug Burger, and Chi Wang

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. InICLR 2024 Workshop on LLM Agents (LLMAgents). https://openreview.net/forum?id=uAjxFFing2

  38. [38]

    Runxin Xu, Tianyu Liu, Lei Li, and Baobao Chang. 2021. Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Pro- cessing (Volume 1: Long Papers), Chengqing Zong...

  39. [39]

    Ting Xu, Haiqin Yang, Fei Zhao, Zhen Wu, and Xinyu Dai. 2024. A Two-Agent Game for Zero-shot Relation Triplet Extraction. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 7510–7527. doi:10.18653/v1/2024.findings-acl.446

  40. [40]

    Vikas Yadav and Steven Bethard. 2018. A Survey on Recent Advances in Named Entity Recognition from Deep Learning Models. InProceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2145–2158. https: //aclanthology.org/C18-1182/

  41. [41]

    Zhaohui Yan, Songlin Yang, Wei Liu, and Kewei Tu. 2023. Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 7512–7526. doi:10.18...

  42. [42]

    Griffiths, Yuan Cao, and Karthik Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: deliberate problem solving with large language models. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Artic...

  43. [43]

    Huiling You, Lilja Vrelid, and Samia Touileb. 2023. JSEEGraph: Joint Structured Event Extraction as Graph Parsing. InProceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Alexis Palmer and Jose Camacho-collados (Eds.). Association for Computational Linguistics, Toronto, Canada, 115–127. doi:10.18653/v1/2023.starsem-1.11

  44. [44]

    Jiaao Yu, Yijing Lin, Zhipeng Gao, Xuesong Qiu, and Lanlan Rui. 2025. Mul- timedia Event Extraction with LLM Knowledge Editing. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Chris- tos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou,...

  45. [45]

    Li Yuan, Yi Cai, Xudong Shen, Qing Li, Qingbao Huang, Zikun Deng, and Tao Wang. 2025. Collaborative Multi-LoRA Experts with Achievement-based Multi- Tasks Loss for Unified Multimodal Information Extraction. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI- 25, James Kwok (Ed.). International Joint Confere...

  46. [46]

    Xiang Yuan, Xinrong Chen, Haochen Li, Hang Yang, Guanyu Wang, Weiping Li, and Tong Mo. 2025. Stepwise Schema-Guided Prompting Framework with Parameter Efficient Instruction Tuning for Multimedia Event Extraction. In 2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6. doi:10.1109/icme59968.2025.11210082

  47. [47]

    Tengda Zhou, Shaoyang Men, Jingxian Liang, Baoxian Yu, Han Zhang, and Xiaomu Luo

    Shuo Zhang, Jinsong Zhang, Zhejun Zhang, and Lei Li. 2025. Multimodal Mixture of Low-Rank Experts for Sentiment Analysis and Emotion Recognition. In2025 IEEE International Conference on Multimedia and Expo (ICME). 1–6. doi:10.1109/ ICME59968.2025.11210197 ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extracti...

  48. [48]

    **JSON Format**: Output valid JSON only, matching exactly: {output_example}

  49. [49]

    * Image inputs: Trigger must be a **single-word** visual descriptor (or`""`if not applicable)

    **Trigger Constraints**: * Text inputs: Trigger must be **one token**, copied verbatim from the text. * Image inputs: Trigger must be a **single-word** visual descriptor (or`""`if not applicable). # Extraction Rules

  50. [50]

    **Evidence Discipline**: Output only events/ arguments directly supported by the provided evidence (no speculation)

  51. [51]

    **Role Fidelity**: Use the exact role names defined in the schema

  52. [52]

    # Visual Grounding & Bounding Box Discipline

    **Argument Salience**: Include only the most salient arguments for each event. # Visual Grounding & Bounding Box Discipline

  53. [53]

    **Format**: Bounding boxes are integers`[x_min, y_min, x_max, y_max]`

  54. [54]

    **Requirement**: Every entry in`image_arguments` must include a grounded bounding box

  55. [55]

    Direct Prompting (LVLM + Image + Visual Tool Outputs)

    **Instances/Groups**: Use separate boxes for distinct instances; otherwise use one tight box covering the group. Direct Prompting (LVLM + Image + Visual Tool Outputs). LVLM baselines process the raw image 𝐼 alongside the text 𝑇 . Cru- cially, to maintain a strictly controlled comparison, they are addi- tionally provided with the exact same vision tool out...

  56. [56]

    The output must exactly match: {output_example}

    **JSON Format**: Respond with valid JSON only. The output must exactly match: {output_example}

  57. [57]

    # Extraction Rules

    **Trigger Constraints**: * When text is provided: Trigger must be a **single token** copied verbatim from the text. # Extraction Rules

  58. [58]

    **Evidence Discipline**: Emit only events/arguments directly supported by text or pixels (no speculation)

  59. [59]

    # Visual Grounding Constraints

    **Role Fidelity**: Use the exact role names defined in the schema. # Visual Grounding Constraints

  60. [60]

    **Bounding Boxes**: Every entry in`image_arguments` must include a grounded integer box `[x_min, y_min, x_max, y_max]`

  61. [61]

    text_entities

    **Optional Object Name**:`object_name`may be included as an auxiliary descriptive field for readability. ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction ECHO.These prompts define each agent’s role, available multi- modal evidence, and the structured atomic operation format. Node Seeding # Task You ar...

  62. [62]

    * Assign unique IDs prefixed with`T`(e.g., T1, T2)

    **Text Entities** (`text_entities`): * Extract concrete entity mentions that may serve as event arguments. * Assign unique IDs prefixed with`T`(e.g., T1, T2)

  63. [63]

    * Assign unique IDs prefixed with`O`(e.g., O1, O2)

    **Visual Objects** (`image_objects`): * Extract concrete objects from the visual description that may serve as event arguments. * Assign unique IDs prefixed with`O`(e.g., O1, O2). Proposer # Task Act as an Event Hypothesis Proposer. Use the multimodal context and the current hypergraph state to propose new event hyperedges or signal convergence. # Context...

  64. [64]

    **Evidence Constraint**: Propose only events directly supported by the provided evidence

  65. [65]

    * Image-Only Mode: Use`""`or a concise visual descriptor

    **Trigger Selection**: * Text Mode: Choose a **single token** copied verbatim from the source text. * Image-Only Mode: Use`""`or a concise visual descriptor

  66. [66]

    operations

    **State/History Awareness**: Do not propose events already in`<Current hypergraph>`or rejected in `<History>`. # Operations Output a JSON object`{"operations": [...]}`containing one of:

  67. [67]

    operation

    **`propose_hyperedge`**: Introduce new event hyperedges. ```json {"operation": "propose_hyperedge", "event_type": "...", "trigger": "...", "rationale": "..."} ```

  68. [68]

    operation

    **`no_op`**: Return when no new valid event remains. ```json {"operation": "no_op", "reason": "Converged"} ``` Linker # Task Act as a Node-Hyperedge Linker. Link or unlink candidate nodes to each event hyperedge based on explicit evidence, without assigning roles. # Context Data <Source sentence> {text_input} </Source> <Current hypergraph> {hypergraph} </...

  69. [69]

    **Evidence-Based Linking**: Link nodes only when there is clear textual or visual support; avoid weak/background associations

  70. [70]

    operations

    **Cross-Modal Consistency**: Prefer links that are mutually supported or non-contradictory across modalities. # Operations Output a JSON object`{"operations": [...]}`containing:

  71. [71]

    operation

    **`link_node`**: Associate a node with an event. ```json {"operation": "link_node", "hyperedge_id": "HE1", " node_id": "T1", "rationale": "..."} ```

  72. [72]

    operation

    **`unlink_node`**: Remove an association. ```json {"operation": "unlink_node", "hyperedge_id": "HE1", " node_id": "O3", "rationale": "..."} ```

  73. [73]

    # Constraints * **History**: Do not repeat operations already recorded in`<History>`

    **`no_op`**: Return when links are sufficient. # Constraints * **History**: Do not repeat operations already recorded in`<History>`. Verifier # Task Act as a Cross-Modal Verifier. Verify coherence between text and image evidence, calibrate hyperedge confidence, and prune invalid or redundant hyperedges. # Context Data <Source sentence> {text_input} </Sour...

  74. [74]

    **Cross-Modal Consistency**: Identify direct contradictions between modalities; only prune when a contradiction or clear invalidity is observed

  75. [75]

    **Confidence Calibration**: Adjust confidence based on the strength of supporting evidence across modalities

  76. [76]

    operations

    **Pruning**: Drop hyperedges that are empty, duplicated, or unsupported/hallucinated. # Operations Output a JSON object`{"operations": [...]}`containing:

  77. [77]

    operation

    **`adjust_confidence`**: Update the confidence score. ```json {"operation": "adjust_confidence", "hyperedge_id": "HE1 ", "new_confidence": 0.85, "rationale": "..."} ```

  78. [78]

    operation

    **`drop_hyperedge`**: Remove an invalid or redundant hyperedge. ```json {"operation": "drop_hyperedge", "hyperedge_id": "HE2", "rationale": "..."} ```

  79. [79]

    # Constraints * **History**: Do not repeat operations already recorded in`<History>`

    **`no_op`**: Return if no changes are needed. # Constraints * **History**: Do not repeat operations already recorded in`<History>`. Role Bind (textual vertices).The prompt below is used for textual role binding. Visual role candidates are obtained from the vision tool conditioned on the event hypothesis and then aligned to linked visual vertices by locali...

  80. [80]

    Use`Entity`only as a fallback for ambiguous participants

    **Specificity Principle**: Always prioritize specific roles (e.g.,`Attacker`,`Target`) over the generic`Entity`role. Use`Entity`only as a fallback for ambiguous participants

Showing first 80 references.