pith. machine review for the scientific record.

arxiv: 2604.20136 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Recognition: unknown

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long-video understanding · multi-agent systems · semantic memory · claim dependency graph · human arbitration · video question answering · error correction · VidOR

The pith

A contract-based multi-agent system maintains a shared semantic memory of typed claims for long videos so that corrections remain proportional to error scope.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the high cost of fixing errors in long-video AI systems, where opaque outputs force humans to rewatch entire footage and rebuild understanding from scratch. It does so by recasting the task as ongoing maintenance of a structured semantic memory that holds individual typed claims, tracks their dependencies in a graph, and records how each claim arose. Role-specialized agents check these claims locally for object relations, across time for consistency, and globally for overall coherence, all while operating under explicit authority contracts that restrict their scope. Humans enter only when automated checks lack enough evidence, after which the system re-verifies only the affected dependent claims. Experiments on VidOR demonstrate clear gains in video question answering (accuracy 0.71 to 0.79) and much lower human workload (a 4.8× reduction in arbitration cost) than standard annotation pipelines.

Core claim

IMPACT-CYCLE reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory consisting of typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient the system escalates to human arbitration as the final authority, after which dependency-closure re-verification ensures that correction effort stays proportional to error scope. On the VidOR dataset the approach raises VQA accuracy from 0.71 to 0.79 while cutting human arbitration cost by 4.8×.

What carries the argument

The shared semantic memory that encodes typed claims about video content, linked by a claim dependency graph and accompanied by a provenance log, which agents consult to perform targeted local, temporal, and global verifications under authority contracts.
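The memory described above can be sketched as a small data structure. This is a hypothetical rendering, not the paper's actual schema: the field names, claim types, and provenance format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

# Illustrative claim taxonomy; the paper's exact type set is not given in this review.
ClaimType = Literal["object", "relation", "temporal", "global"]

@dataclass
class Claim:
    claim_id: str
    claim_type: ClaimType
    content: str                                          # e.g. "person#3 holds cup#1 in frames 120-180"
    depends_on: list[str] = field(default_factory=list)   # edges of the claim dependency graph

@dataclass
class SemanticMemory:
    claims: dict[str, Claim] = field(default_factory=dict)
    provenance: list[tuple[str, str]] = field(default_factory=list)  # (claim_id, event)

    def add(self, claim: Claim, source: str) -> None:
        # Record the claim and log how it arose, so every later
        # correction carries an auditable trail.
        self.claims[claim.claim_id] = claim
        self.provenance.append((claim.claim_id, f"added by {source}"))

    def dependents(self, claim_id: str) -> list[str]:
        # Claims that list `claim_id` among their dependencies; these are
        # the candidates for re-verification when `claim_id` is corrected.
        return [c.claim_id for c in self.claims.values() if claim_id in c.depends_on]
```

The point of the structure is that agents never touch raw video state directly; they read and write typed claims, which is what makes corrections localizable.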

If this is right

  • Human supervision effort scales directly with the size of each error rather than requiring re-examination of the entire video.
  • Downstream reasoning tasks such as video question answering receive measurable accuracy gains from the cleaned and versioned memory state.
  • Automated agents complete the majority of verification steps, invoking human arbitration only when evidence thresholds are not met.
  • Every correction carries an auditable provenance trail that records which agents and authorities contributed to the change.
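The escalation behavior in the third bullet can be made concrete with a minimal sketch. The threshold value, function name, and three-way outcome are assumptions; the review only states that arbitration is invoked "when evidence thresholds are not met."

```python
# Hypothetical evidence-threshold rule: an agent commits a verdict only when
# automated evidence is decisive in either direction; ambiguous cases are
# queued for human arbitration, the final authority.
EVIDENCE_THRESHOLD = 0.8  # illustrative value, not from the paper

def verify_claim(claim_id: str, evidence_score: float, human_queue: list[str]) -> str:
    if evidence_score >= EVIDENCE_THRESHOLD:
        return "verified"
    if evidence_score <= 1.0 - EVIDENCE_THRESHOLD:
        return "rejected"
    # Insufficient evidence either way: escalate rather than guess.
    human_queue.append(claim_id)
    return "escalated"
```

Under a rule of this shape, human workload scales with the fraction of ambiguous claims rather than with video length.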

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same claim-and-dependency structure could support incremental updates when new video segments arrive without reprocessing prior content.
  • Explicit dependency tracking may improve debugging of video interpretation errors by revealing how one mistaken claim influences others.
  • Similar contract-based maintenance of interconnected claims could be tested on long-document or multi-turn dialogue tasks to check for comparable efficiency.

Load-bearing premise

The claim dependency graph must accurately capture all relevant interactions among claims so that corrections stay strictly proportional to error scope and the three verification layers cover every important cross-cutting issue.
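The proportionality argument reduces to a reachability computation: after a correction, only claims in the transitive closure of its dependents are re-verified. A minimal sketch, assuming the graph is stored as a dependents adjacency map:

```python
from collections import deque

def dependency_closure(dependents: dict[str, list[str]], corrected: str) -> set[str]:
    """All claims reachable from `corrected` via dependency edges.

    Only these claims are re-verified after a correction. This is what
    keeps correction cost proportional to error scope -- but only if the
    graph captures every real dependency, which is the load-bearing premise.
    """
    closure: set[str] = set()
    frontier = deque([corrected])
    while frontier:
        node = frontier.popleft()
        for dep in dependents.get(node, []):
            if dep not in closure:
                closure.add(dep)
                frontier.append(dep)
    return closure
```

A missing edge in `dependents` silently exempts a stale claim from re-verification, which is exactly the failure mode named under "What would settle it" below.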

What would settle it

A concrete case in which an error correction affects claims outside the recorded dependency closure or an error spanning multiple aspects evades detection by all three verification stages until a full manual review is performed.

Figures

Figures reproduced from arXiv: 2604.20136 by Alexander Jaus, David Schneider, Di Wen, Jiale Wei, Junwei Zheng, Kunyu Peng, Lei Qi, Rainer Stiefelhagen, Ruiping Liu, Weitong Kong, Yufan Chen, Zdravko Marinov, Zeyun Zhong.

Figure 1. Overview of IMPACT-CYCLE. Starting from keyframe-grounded scene graphs, the system decomposes graph elements into typed atomic claims.
Figure 2. NASA-TLX workload comparison (lower is better).
original abstract

Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces IMPACT-CYCLE, a contract-based multi-agent system for iterative claim-level correction of long-video semantic memory. It structures video understanding outputs as a shared, versioned memory of typed claims linked by a dependency graph and provenance log; role-specialized agents perform local object-relation, cross-temporal, and global coherence verification under explicit authority contracts, escalating to human arbitration only when automated evidence is insufficient. Corrections are confined to the dependency closure of erroneous claims. On the VidOR benchmark the system is reported to raise VQA accuracy from 0.71 to 0.79 while reducing human arbitration cost by a factor of 4.8.

Significance. If the reported gains are reproducible and the dependency-graph mechanism is shown to keep corrections proportional to error scope, the work would offer a concrete supervisory interface that addresses a recognized bottleneck in long-video annotation pipelines. The contract-based decomposition of verification roles and the explicit escalation protocol constitute a novel organizational layer on top of existing multimodal models.

major comments (2)
  1. Abstract: the quantitative claims (VQA 0.71→0.79 and 4.8× human-cost reduction) are stated without any description of the experimental protocol, baseline systems, number of evaluated videos or claims, statistical tests, or ablation studies on the dependency-graph construction and the three verification stages. These omissions render the central performance claims unevaluable.
  2. Approach description: no algorithm, pseudocode, or formal definition is supplied for constructing the claim dependency graph from the set of video claims, nor is there an empirical test (ablation or error-coverage analysis) demonstrating that the local/temporal/global decomposition exhausts the space of cross-cutting inconsistencies. Because the proportionality of correction cost rests on the completeness and accuracy of this graph, the absence is load-bearing for the main thesis.
minor comments (1)
  1. The promise to release code at the cited GitHub URL is welcome; the manuscript should include a brief reproducibility checklist (random seeds, exact model versions, hyper-parameters for the agents) even if the full repository is external.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.

point-by-point responses
  1. Referee: Abstract: the quantitative claims (VQA 0.71→0.79 and 4.8× human-cost reduction) are stated without any description of the experimental protocol, baseline systems, number of evaluated videos or claims, statistical tests, or ablation studies on the dependency-graph construction and the three verification stages. These omissions render the central performance claims unevaluable.

    Authors: We agree that the abstract, constrained by length, omits these details and that this limits immediate evaluability. The full experimental protocol—including the VidOR subset (videos and claims evaluated), baselines (e.g., uncorrected VQA pipelines), statistical tests, and ablations on the dependency graph and verification stages—is provided in the Experiments section. We will revise the abstract to include a concise summary of the evaluation setup, dataset scale, and key controls. revision: yes

  2. Referee: Approach description: no algorithm, pseudocode, or formal definition is supplied for constructing the claim dependency graph from the set of video claims, nor is there an empirical test (ablation or error-coverage analysis) demonstrating that the local/temporal/global decomposition exhausts the space of cross-cutting inconsistencies. Because the proportionality of correction cost rests on the completeness and accuracy of this graph, the absence is load-bearing for the main thesis.

    Authors: We acknowledge that the manuscript currently offers only a textual description of dependency-graph construction without pseudocode or a formal definition, and lacks an explicit ablation or error-coverage analysis of the three-stage decomposition. We will add a formal definition and pseudocode for graph construction (based on shared entities, temporal overlap, and semantic links) to Section 3. We will also include an ablation study and error-coverage analysis in the revised Experiments section to quantify how completely the local/temporal/global stages capture inconsistencies and support cost proportionality. revision: yes
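The edge rule the rebuttal promises ("shared entities, temporal overlap, and semantic links") can be sketched in a few lines. This is a hedged reconstruction of that textual description; the dict keys, span representation, and pairwise scan are assumptions, and the paper's actual construction may differ.

```python
def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    # Two closed frame intervals overlap iff neither ends before the other begins.
    return a[0] <= b[1] and b[0] <= a[1]

def build_edges(claims: list[dict]) -> list[tuple[str, str]]:
    """Link two claims when they share an entity and their time spans overlap.

    Each claim is assumed to carry an "id", an "entities" list, and a
    "span" (start_frame, end_frame). Semantic-link edges, the third
    criterion named in the rebuttal, are omitted from this sketch.
    """
    edges = []
    for i, ci in enumerate(claims):
        for cj in claims[i + 1:]:
            shared = set(ci["entities"]) & set(cj["entities"])
            if shared and spans_overlap(ci["span"], cj["span"]):
                edges.append((ci["id"], cj["id"]))
    return edges
```

Even this toy version shows why the referee's completeness concern matters: any dependency not expressible as entity-sharing plus temporal overlap (or a semantic link) never becomes an edge, and corrections will not propagate across it.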

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark

full rationale

The paper describes a multi-agent supervisory system for long-video claim maintenance, with verification decomposed into local, temporal, and global checks under authority contracts. Performance is measured directly on the external VidOR benchmark (VQA 0.71→0.79, 4.8× arbitration cost reduction). No equations, parameter fits presented as predictions, self-citations, or ansatzes appear in the provided text. The dependency graph and escalation logic are architectural choices whose correctness is tested empirically rather than derived tautologically from inputs. This is the common case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Review is abstract-only; the ledger entries below are inferred directly from the components named in the abstract. No numerical free parameters are mentioned.

axioms (2)
  • domain assumption Errors in long-video understanding can be decomposed into local object-relation correctness, cross-temporal consistency, and global semantic coherence without missing critical interactions.
    This decomposition justifies the three specialized agent roles and the claim-level verification strategy.
  • domain assumption Human arbitration cost is the dominant expense and can be made proportional to error scope via dependency-closure re-verification.
    This premise underpins the claimed 4.8x reduction in human effort.
invented entities (3)
  • Shared semantic memory no independent evidence
    purpose: Versioned store of typed claims, dependency graph, and provenance log that serves as the single source of truth for verification and correction.
    New data structure introduced to expose intermediate state for inspection and targeted repair.
  • Authority contracts no independent evidence
    purpose: Explicit rules that assign verification authority and escalation rights to role-specialized agents.
    Invented mechanism to structure multi-agent collaboration and human override.
  • Claim dependency graph no independent evidence
    purpose: Graph that limits re-verification and correction to only the claims affected by an error.
    Core device for keeping human workload proportional to error scope.

pith-pipeline@v0.9.0 · 5579 in / 1815 out tokens · 44838 ms · 2026-05-10T01:23:14.807119+00:00 · methodology

