ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
Pith reviewed 2026-05-17 03:02 UTC · model grok-4.3
The pith
Models give plausible answers about object changes in videos but fail to point to the visual evidence for those answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that current multimodal large language models (MLLMs) display a clear answering-grounding gap on object-centric procedural tasks. They generate plausible answers to questions about object state transitions, preconditions, counterfactuals, mistakes, and readiness, yet achieve a mean intersection-over-union (mIoU) below 45 percent when localizing the temporal and spatial evidence that supports those answers. The paper reads this gap as heavy dependence on linguistic priors instead of fine-grained visual object dynamics.
What carries the argument
The ProcObject-10K benchmark, which supplies open-ended VideoQA pairs together with spatial-temporal grounding annotations for object state changes across egocentric and exocentric views.
If this is right
- Fine-tuned models using pseudo object-level supervision and spatial-temporal constraints improve scores on the ProcObject-10K benchmark itself.
- The same fine-tuned models show better transfer performance on other grounded VideoQA datasets and embodied planning tasks.
- The benchmark jointly measures both answer correctness and evidence localization, unlike prior action-centric video benchmarks.
- Evaluation spans 137 tasks from 9 domains and includes both egocentric and exocentric video perspectives.
Where Pith is reading between the lines
- Future video models may need explicit training on localization objectives to reduce shortcut reliance on language patterns.
- Similar object-state benchmarks could be created for robotics or simulation environments to test physical reasoning.
- The observed gap suggests that simply increasing model size without grounding-focused data may not close the divide between answers and evidence.
Load-bearing premise
The 10,522 question-answer pairs and their grounding annotations correctly and without bias capture the object state transitions and temporal evidence needed for the five reasoning types.
What would settle it
Independent re-annotation of a random sample of the questions and groundings that finds frequent mismatches between the labeled video segments and the actual object changes described in the answers.
Original abstract
Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than fine-grained object dynamics. As a step toward closing this gap, we further provide an object-centric supervised fine-tuning baseline with pseudo object-level supervision and spatial-temporal constraints. Models fine-tuned on ProcObject-10K not only improve on the benchmark itself, but also transfer effectively to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and evaluation toolkit will be publicly released to support future research on object-centric procedural understanding.
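To make the headline metric concrete, here is a minimal sketch of temporal IoU between a predicted and an annotated evidence segment, averaged into an mIoU. It assumes each segment is a `(start, end)` pair in seconds and that every question has exactly one annotated segment. The evaluation toolkit has not been released yet, so the names `temporal_iou` and `mean_iou` and the single-segment assumption are illustrative and are not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical representation: (start_sec, end_sec) of an evidence segment.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], golds: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and annotations."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (0.0, 4.0)]
    golds = [(5.0, 15.0), (0.0, 4.0)]
    print(f"mIoU = {mean_iou(preds, golds):.3f}")
```

Under this definition, a model can name the correct object state change in its answer and still score near zero if the segment it points to does not overlap the annotated one, which is the kind of gap the abstract describes.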
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProcObject-10K, the first benchmark for joint object-centric reasoning and temporal evidence grounding in instructional videos. It comprises 10,522 open-ended VideoQA pairs from 1,799 clips across 137 tasks in 9 domains and five reasoning types (preconditions, state evolution, counterfactuals, mistakes, readiness). Benchmarking 13 MLLMs reveals a substantial answering-grounding gap with mIoU <45%, interpreted as evidence of reliance on linguistic priors over fine-grained object dynamics. An object-centric SFT baseline with pseudo-supervision and spatial-temporal constraints is shown to improve benchmark performance and transfer to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and toolkit will be released publicly.
Significance. If the grounding annotations are shown to be reliable, the work has high significance: it provides the first large-scale evidence that current MLLMs fail at localizing object state transitions in procedural videos despite plausible answers, and the transfer results indicate the benchmark can drive progress toward more grounded procedural understanding. The public release of data and evaluation code is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [Dataset construction / annotation section] No inter-annotator agreement statistics (Cohen's kappa, mean IoU, or per-reasoning-type agreement) and no adjudication procedure are reported for the spatial-temporal grounding labels on the 10,522 QA pairs. This directly affects the central claim: if multiple plausible evidence regions exist or label noise is high, the reported mIoU < 45% gap could reflect annotation variance rather than model reliance on linguistic priors.
- [Results / evaluation protocol] The paper reports aggregate mIoU < 45% but includes neither a human performance baseline on the same grounding task nor per-reasoning-type breakdowns with confidence intervals. Without these, it is difficult to calibrate whether the gap is diagnostic of model failure or partly an artifact of the annotation protocol.
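The agreement statistics requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: two annotators each mark one temporal segment per question, segments are discretized into one-second bins labeled evidence (1) or not (0), and Cohen's kappa is computed over those bins. The binning scheme and the helper names `frame_labels`, `cohens_kappa`, and `temporal_iou` are hypothetical; the paper does not describe its annotation protocol at this level.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Segment = Tuple[float, float]  # hypothetical: (start_sec, end_sec)


def frame_labels(seg: Segment, duration_sec: float, step: float = 1.0) -> List[int]:
    """Label each time bin 1 if its start falls inside the segment, else 0."""
    n_bins = int(duration_sec / step)
    return [1 if seg[0] <= i * step < seg[1] else 0 for i in range(n_bins)]


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def temporal_iou(x: Segment, y: Segment) -> float:
    """Intersection-over-union of two annotators' segments."""
    inter = max(0.0, min(x[1], y[1]) - max(x[0], y[0]))
    union = (x[1] - x[0]) + (y[1] - y[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    ann_a, ann_b = (2.0, 6.0), (4.0, 8.0)  # made-up segments on a 10 s clip
    kappa = cohens_kappa(frame_labels(ann_a, 10.0), frame_labels(ann_b, 10.0))
    print(f"kappa = {kappa:.3f}, inter-annotator IoU = {temporal_iou(ann_a, ann_b):.3f}")
```

Reporting the inter-annotator IoU next to model mIoU would put an empirical ceiling on the metric: a model cannot reasonably be expected to agree with the labels more closely than two humans agree with each other.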
minor comments (2)
- [Evaluation section] Clarify in §4 or the evaluation section whether the mIoU is computed with a fixed threshold or as mean IoU, and whether it is averaged over all QA pairs or per reasoning type.
- [Abstract] The abstract states 'mIoU < 45%' without specifying the exact aggregation; add a sentence in the main text that matches the abstract claim precisely.
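The aggregation question raised in the minor comments matters because the same per-question IoU scores can yield different headline numbers. The sketch below contrasts three common choices: a micro average over all QA pairs, a macro average over reasoning types, and recall at a fixed IoU threshold. Which of these the paper uses is not stated; the record layout, the function names, and the scores are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record: (reasoning_type, IoU of the predicted evidence segment).
Record = Tuple[str, float]


def micro_miou(records: List[Record]) -> float:
    """Mean IoU over all QA pairs, so frequent reasoning types weigh more."""
    return sum(iou for _, iou in records) / len(records)


def macro_miou(records: List[Record]) -> float:
    """Mean of per-type mean IoUs, so every reasoning type weighs equally."""
    by_type: Dict[str, List[float]] = defaultdict(list)
    for reasoning_type, iou in records:
        by_type[reasoning_type].append(iou)
    return sum(sum(v) / len(v) for v in by_type.values()) / len(by_type)


def recall_at_iou(records: List[Record], threshold: float = 0.5) -> float:
    """Fraction of QA pairs whose IoU meets or exceeds the threshold."""
    return sum(iou >= threshold for _, iou in records) / len(records)


if __name__ == "__main__":
    records = [
        ("preconditions", 0.8),
        ("preconditions", 0.6),
        ("preconditions", 0.4),
        ("mistakes", 0.2),
    ]
    print(f"micro mIoU = {micro_miou(records):.2f}")
    print(f"macro mIoU = {macro_miou(records):.2f}")
    print(f"R@0.5      = {recall_at_iou(records):.2f}")
```

With unbalanced reasoning types, the micro and macro averages diverge, so a claim such as "mIoU < 45%" is only reproducible once the aggregation is pinned down.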
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects for strengthening the presentation of our benchmark and results. We address each major comment in detail below.
- Referee: [Dataset construction / annotation section] No inter-annotator agreement statistics (Cohen's kappa, mean IoU, or per-reasoning-type agreement) and no adjudication procedure are reported for the spatial-temporal grounding labels on the 10,522 QA pairs. This directly affects the central claim: if multiple plausible evidence regions exist or label noise is high, the reported mIoU < 45% gap could reflect annotation variance rather than model reliance on linguistic priors.
Authors: We agree that providing inter-annotator agreement statistics is necessary to support the reliability of the annotations and the validity of our central claim. In the revised version of the manuscript, we will add a dedicated subsection in the dataset construction section detailing the annotation process, including the adjudication procedure used for the spatial-temporal grounding labels. We will also report Cohen's kappa, mean IoU between annotators, and agreement statistics broken down by reasoning type. These additions will help demonstrate that annotation variance is low and does not explain the observed performance gap.
Revision: yes
- Referee: [Results / evaluation protocol] The paper reports aggregate mIoU < 45% but includes neither a human performance baseline on the same grounding task nor per-reasoning-type breakdowns with confidence intervals. Without these, it is difficult to calibrate whether the gap is diagnostic of model failure or partly an artifact of the annotation protocol.
Authors: We concur that a human baseline and more detailed breakdowns would better contextualize the results. We will incorporate a human performance baseline for the grounding task, where human annotators localize the evidence segments for a sampled set of questions, and report the corresponding mIoU. Additionally, we will provide per-reasoning-type mIoU results accompanied by confidence intervals in the updated results section. This will allow readers to better assess the significance of the model-human gap.
Revision: yes
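One way the promised per-reasoning-type confidence intervals could be produced is a percentile bootstrap over per-question IoU scores. The sketch below is an assumption about method and is not taken from the paper: it resamples the IoU values of one reasoning type with replacement and reads the interval off the sorted resampled means. The function name `bootstrap_ci`, its parameters, and the example scores are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Mean of each resample drawn with replacement, sorted ascending.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, lower, upper


if __name__ == "__main__":
    # Made-up per-question IoU scores for two reasoning types.
    scores = {
        "counterfactuals": [0.10, 0.35, 0.42, 0.05, 0.60, 0.28],
        "readiness": [0.55, 0.40, 0.70, 0.45, 0.38, 0.62],
    }
    for reasoning_type, ious in scores.items():
        mean, lower, upper = bootstrap_ci(ious)
        print(f"{reasoning_type}: mIoU = {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The same routine applied to human localizations on the sampled questions would show whether the model and human intervals overlap for each reasoning type.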
Circularity Check
No significant circularity in this empirical benchmark paper
This is an empirical benchmark paper that introduces new annotated data (10,522 VideoQA pairs across 1,799 clips) and reports model performance metrics on 13 MLLMs without any mathematical derivations, equations, or first-principles claims. The central results consist of observed performance gaps (e.g., mIoU < 45%) and transfer improvements from fine-tuning, which are directly tied to the released dataset and external model evaluations rather than reducing to fitted parameters or self-referential definitions. No load-bearing steps invoke self-citations for uniqueness theorems, ansatzes, or renamings of known results; the work is self-contained against external benchmarks and falsifiable via the public annotations and toolkit.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.