IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory
Pith reviewed 2026-05-10 01:23 UTC · model grok-4.3
The pith
A contract-based multi-agent system maintains a shared semantic memory of typed claims for long videos so that corrections remain proportional to error scope.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IMPACT-CYCLE reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory consisting of typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the final authority, after which dependency-closure re-verification ensures that correction effort stays proportional to error scope. On the VidOR dataset the approach raises downstream VQA accuracy from 0.71 to 0.79 while reducing human arbitration cost by a factor of 4.8.
What carries the argument
The shared semantic memory that encodes typed claims about video content, linked by a claim dependency graph and accompanied by a provenance log, which agents consult to perform targeted local, temporal, and global verifications under authority contracts.
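To make that structure concrete, here is a minimal sketch of what such a memory could look like. All type names, fields, and claim-type labels below are illustrative assumptions; the paper's actual schema is not given in the text reviewed here.

```python
from dataclasses import dataclass, field

# Hypothetical schema for the shared semantic memory; field names and
# claim-type labels are assumptions of this sketch, not the paper's.

@dataclass
class Claim:
    claim_id: str
    claim_type: str            # e.g. "object-relation", "temporal", "global"
    subject: str               # first entity the claim mentions
    predicate: str             # relation or property asserted
    obj: str                   # second entity or value
    span: tuple[float, float]  # video time interval the claim covers
    confidence: float          # agent-assigned evidence score

@dataclass
class ProvenanceEntry:
    claim_id: str
    version: int
    agent: str                 # agent or human arbiter responsible
    action: str                # "create" | "revise" | "retract" | "verify"

@dataclass
class SemanticMemory:
    claims: dict[str, Claim] = field(default_factory=dict)
    # Dependency edges: claim_id -> ids of claims that depend on it.
    dependents: dict[str, set[str]] = field(default_factory=dict)
    provenance: list[ProvenanceEntry] = field(default_factory=list)
```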
If this is right
- Human supervision effort scales directly with the size of each error rather than requiring re-examination of the entire video.
- Downstream reasoning tasks such as video question answering receive measurable accuracy gains from the cleaned and versioned memory state.
- Automated agents complete the majority of verification steps, invoking human arbitration only when evidence thresholds are not met (see the sketch after this list).
- Every correction carries an auditable provenance trail that records which agents and authorities contributed to the change.
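The third bullet implies a thresholded escalation rule. A minimal sketch, assuming each automated verdict reduces to a single scalar confidence and the threshold is fixed; neither assumption appears in the text reviewed here:

```python
ESCALATION_THRESHOLD = 0.8  # hypothetical; no threshold value appears in the text

def needs_arbitration(verdict_confidences):
    """Escalate when the weakest automated verdict falls below the
    evidence threshold; otherwise the claim is auto-accepted.
    A single scalar score per verdict is an assumption of this sketch."""
    return min(verdict_confidences) < ESCALATION_THRESHOLD

# Example: two agents agree strongly, one is unsure -> escalate to a human.
assert needs_arbitration([0.95, 0.9, 0.6]) is True
assert needs_arbitration([0.95, 0.9, 0.85]) is False
```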
Where Pith is reading between the lines
- The same claim-and-dependency structure could support incremental updates when new video segments arrive without reprocessing prior content.
- Explicit dependency tracking may improve debugging of video interpretation errors by revealing how one mistaken claim influences others.
- Similar contract-based maintenance of interconnected claims could be tested on long-document or multi-turn dialogue tasks to check for comparable efficiency.
Load-bearing premise
The claim dependency graph must accurately capture all relevant interactions among claims so that corrections stay strictly proportional to error scope and the three verification layers cover every important cross-cutting issue.
What would settle it
A concrete case in which an error correction affects claims outside the recorded dependency closure or an error spanning multiple aspects evades detection by all three verification stages until a full manual review is performed.
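The proportionality claim hinges on that closure being complete. A minimal sketch of dependency-closure computation over the kind of edge map sketched earlier (all names illustrative); the falsification test above asks whether a correction's real effects ever escape this set:

```python
from collections import deque

def dependency_closure(dependents, corrected_claim_id):
    """Collect every claim reachable from the corrected claim along
    dependency edges; under the paper's protocol only these claims are
    re-verified. `dependents` maps claim_id -> ids of dependent claims."""
    closure, frontier = set(), deque([corrected_claim_id])
    while frontier:
        cid = frontier.popleft()
        for dep in dependents.get(cid, ()):
            if dep not in closure:
                closure.add(dep)
                frontier.append(dep)
    return closure

# Example: correcting c1 forces re-verification of c2 and c3, but not c4.
edges = {"c1": {"c2"}, "c2": {"c3"}, "c4": set()}
assert dependency_closure(edges, "c1") == {"c2", "c3"}
```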
Original abstract
Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IMPACT-CYCLE, a contract-based multi-agent system for iterative claim-level correction of long-video semantic memory. It structures video understanding outputs as a shared, versioned memory of typed claims linked by a dependency graph and provenance log; role-specialized agents perform local object-relation, cross-temporal, and global coherence verification under explicit authority contracts, escalating to human arbitration only when automated evidence is insufficient. Corrections are confined to the dependency closure of erroneous claims. On the VidOR benchmark the system is reported to raise VQA accuracy from 0.71 to 0.79 while reducing human arbitration cost by a factor of 4.8.
Significance. If the reported gains are reproducible and the dependency-graph mechanism is shown to keep corrections proportional to error scope, the work would offer a concrete supervisory interface that addresses a recognized bottleneck in long-video annotation pipelines. The contract-based decomposition of verification roles and the explicit escalation protocol constitute a novel organizational layer on top of existing multimodal models.
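To make that organizational layer concrete, a minimal sketch of what an authority contract could encode, assuming one verification scope per role and a human arbiter that outranks all agents; the contract fields are assumptions of this sketch, not the paper's format:

```python
from dataclasses import dataclass

# Hypothetical contract encoding; the paper's actual contract format
# is not shown in the text reviewed here.

@dataclass(frozen=True)
class AuthorityContract:
    role: str       # which agent the contract binds
    scope: str      # the only claim type this role may edit ("*" = any)
    authority: int  # higher authority overrides lower on conflict

CONTRACTS = [
    AuthorityContract("local_verifier",    "object-relation", authority=1),
    AuthorityContract("temporal_verifier", "temporal",        authority=1),
    AuthorityContract("global_verifier",   "global",          authority=1),
    AuthorityContract("human_arbiter",     "*",               authority=10),
]

def may_edit(contract, claim_type):
    """A role may edit a claim only inside its contracted scope."""
    return contract.scope in ("*", claim_type)
```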
Major comments (2)
- Abstract: the quantitative claims (VQA 0.71→0.79 and 4.8× human-cost reduction) are stated without any description of the experimental protocol, baseline systems, number of evaluated videos or claims, statistical tests, or ablation studies on the dependency-graph construction and the three verification stages. These omissions render the central performance claims unevaluable.
- Approach description: no algorithm, pseudocode, or formal definition is supplied for constructing the claim dependency graph from the set of video claims, nor is there an empirical test (ablation or error-coverage analysis) demonstrating that the local/temporal/global decomposition exhausts the space of cross-cutting inconsistencies. Because the proportionality of correction cost rests on the completeness and accuracy of this graph, the absence is load-bearing for the main thesis.
Minor comments (1)
- The promise to release code at the cited GitHub URL is welcome; the manuscript should include a brief reproducibility checklist (random seeds, exact model versions, hyper-parameters for the agents) even if the full repository is external.
Simulated Author's Rebuttal
Thank you for your constructive review. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.
Point-by-point responses
- Referee: Abstract: the quantitative claims (VQA 0.71→0.79 and 4.8× human-cost reduction) are stated without any description of the experimental protocol, baseline systems, number of evaluated videos or claims, statistical tests, or ablation studies on the dependency-graph construction and the three verification stages. These omissions render the central performance claims unevaluable.
  Authors: We agree that the abstract, constrained by length, omits these details and that this limits immediate evaluability. The full experimental protocol, including the VidOR subset (videos and claims evaluated), baselines (e.g., uncorrected VQA pipelines), statistical tests, and ablations on the dependency graph and verification stages, is provided in the Experiments section. We will revise the abstract to include a concise summary of the evaluation setup, dataset scale, and key controls. Revision: yes.
- Referee: Approach description: no algorithm, pseudocode, or formal definition is supplied for constructing the claim dependency graph from the set of video claims, nor is there an empirical test (ablation or error-coverage analysis) demonstrating that the local/temporal/global decomposition exhausts the space of cross-cutting inconsistencies. Because the proportionality of correction cost rests on the completeness and accuracy of this graph, the absence is load-bearing for the main thesis.
  Authors: We acknowledge that the manuscript currently offers only a textual description of dependency-graph construction, without pseudocode or a formal definition, and lacks an explicit ablation or error-coverage analysis of the three-stage decomposition. We will add a formal definition and pseudocode for graph construction (based on shared entities, temporal overlap, and semantic links) to Section 3. We will also include an ablation study and error-coverage analysis in the revised Experiments section to quantify how completely the local/temporal/global stages capture inconsistencies and support cost proportionality. Revision: yes.
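For concreteness, a minimal sketch of the edge rule this response describes, covering the shared-entity and temporal-overlap criteria; the third criterion, semantic links, would need a relatedness model and is omitted. All field names are illustrative assumptions:

```python
def build_dependency_edges(claims):
    """Link two claims when they mention a shared entity and their video
    time spans overlap. Semantic links (the rebuttal's third criterion)
    are omitted from this sketch."""
    edges = set()
    for i, a in enumerate(claims):
        for b in claims[i + 1:]:
            shared = {a["subject"], a["object"]} & {b["subject"], b["object"]}
            overlap = a["span"][0] < b["span"][1] and b["span"][0] < a["span"][1]
            if shared and overlap:
                edges.add((a["id"], b["id"]))
    return edges

# Example: both claims mention "dog" in overlapping intervals -> one edge.
claims = [
    {"id": "c1", "subject": "dog", "object": "ball", "span": (0.0, 4.0)},
    {"id": "c2", "subject": "person", "object": "dog", "span": (3.0, 8.0)},
]
assert build_dependency_edges(claims) == {("c1", "c2")}
```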
Circularity Check
No significant circularity; empirical results on external benchmark
Full rationale
The paper describes a multi-agent supervisory system for long-video claim maintenance, with verification decomposed into local, temporal, and global checks under authority contracts. Performance is measured directly on the external VidOR benchmark (VQA 0.71→0.79, 4.8× arbitration cost reduction). No equations, parameter fits presented as predictions, self-citations, or ansatzes appear in the provided text. The dependency graph and escalation logic are architectural choices whose correctness is tested empirically rather than derived tautologically from inputs. This is the common case of a self-contained engineering contribution.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: errors in long-video understanding can be decomposed into local object-relation correctness, cross-temporal consistency, and global semantic coherence without missing critical interactions.
- Domain assumption: human arbitration cost is the dominant expense and can be made proportional to error scope via dependency-closure re-verification.
Invented entities (3)
- Shared semantic memory (no independent evidence)
- Authority contracts (no independent evidence)
- Claim dependency graph (no independent evidence)