Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Pith reviewed 2026-05-13 17:28 UTC · model grok-4.3
The pith
Multi-agent systems using vision language models outperform single models for coding on-screen collaborative learning behaviors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that VLM-based multi-agent systems—one a structured three-agent workflow for scene segmentation and cursor-informed behavior detection with evidence verification, the other an autonomous-decision MAS that interleaves reasoning, segmentation, classification, and self-correction—achieve viable performance and surpass single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) in detecting scenes and actions when labeling on-screen behaviors according to the ICAP framework.
What carries the argument
Multi-agent system frameworks that integrate scene segmentation, cursor-informed VLM prompting, evidence-based verification, and iterative self-correction to assign ICAP labels to on-screen collaborative behaviors.
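To make the workflow variant concrete, the following is a minimal sketch of a three-stage pipeline (segment, label with cursor context, verify). It is not the authors' implementation: `Segment`, `segment_agent`, `behavior_agent`, `verify_agent`, `run_workflow`, and `fake_vlm` are hypothetical names, the prompt wording is invented, and the model call is replaced by a toy rule so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# The four ICAP engagement modes (Chi & Wylie, 2014).
ICAP_LABELS = ("Interactive", "Constructive", "Active", "Passive")

# Hypothetical stand-in: a "VLM" is any callable from prompt text to reply text.
VLM = Callable[[str], str]


@dataclass
class Segment:
    start_s: float
    end_s: float
    label: str = ""
    evidence: str = ""
    verified: bool = False


def segment_agent(boundaries: List[float]) -> List[Segment]:
    """Agent 1: turn scene-change timestamps into contiguous segments."""
    return [Segment(a, b) for a, b in zip(boundaries, boundaries[1:])]


def behavior_agent(seg: Segment, cursor_events: List[dict], vlm: VLM) -> Segment:
    """Agent 2: request a label, passing the segment's cursor events in the prompt."""
    in_window = [e for e in cursor_events if seg.start_s <= e["t"] < seg.end_s]
    prompt = (
        f"Segment {seg.start_s}-{seg.end_s}s; cursor: {in_window}. "
        "Reply 'LABEL | evidence'."
    )
    label, _, evidence = vlm(prompt).partition("|")
    seg.label, seg.evidence = label.strip(), evidence.strip()
    return seg


def verify_agent(seg: Segment) -> Segment:
    """Agent 3: accept only known labels that come with non-empty evidence."""
    seg.verified = seg.label in ICAP_LABELS and bool(seg.evidence)
    return seg


def run_workflow(boundaries: List[float], cursor_events: List[dict], vlm: VLM) -> List[Segment]:
    segments = segment_agent(boundaries)
    return [verify_agent(behavior_agent(s, cursor_events, vlm)) for s in segments]


def fake_vlm(prompt: str) -> str:
    # Toy rule in place of a real model call (assumption for illustration only).
    return "Active | cursor click inside the segment" if "click" in prompt else "Passive |"


if __name__ == "__main__":
    for seg in run_workflow([0.0, 10.0, 20.0], [{"t": 3.0, "kind": "click"}], fake_vlm):
        print(seg)
```

In this sketch the verifier rejects the second segment because the label arrives without evidence, which is the kind of check an evidence-based verification step would be expected to perform.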
Load-bearing premise
Vision language models prompted with cursor data and scene segmentation can reliably assign ICAP framework labels to on-screen behaviors without substantial domain-specific training or human validation.
What would settle it
Human coders independently labeling the same set of screen recordings and finding MAS agreement rates no higher than those of single VLMs or below thresholds acceptable for educational research.
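One common way to quantify such agreement is Cohen's kappa, which corrects raw agreement for chance. The sketch below computes it for two label sequences over the same segments; the function name `cohens_kappa` and the toy labels are assumptions for illustration, and the paper does not state which agreement statistic or threshold it would use.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    human = ["Active", "Active", "Passive", "Passive"]
    system = ["Active", "Active", "Passive", "Active"]
    print(cohens_kappa(human, system))
```

Running the same function on human-versus-MAS and human-versus-single-VLM label pairs would give directly comparable agreement figures.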
Original abstract
On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations (segmentation, classification, validation), and observation-driven self-correction to produce interpretable on-screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. Notably, the workflow-based MAS performed best on scene detection, and the autonomous-decision MAS performed best on action detection. This study demonstrates the effectiveness of VLM-based multi-agent systems for video analysis and contributes a scalable framework for multimodal data analytics.
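The ReAct-inspired variant can be pictured as a loop in which a policy chooses the next tool-like operation, observes the result, and revises earlier decisions. The sketch below is a minimal illustration under strong assumptions: `react_loop`, `scripted_policy`, and the three tool functions are hypothetical, and the policy is a hand-written script standing in for the VLM's reasoning.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[dict], str]
Policy = Callable[[dict, list], Tuple[str, str]]


def react_loop(policy: Policy, tools: Dict[str, Tool], state: dict, max_steps: int = 8) -> List[tuple]:
    """Interleave thought -> action -> observation until the policy finishes."""
    trace = []
    for _ in range(max_steps):
        thought, action = policy(state, trace)
        if action == "finish":
            trace.append((thought, action, "done"))
            break
        observation = tools[action](state)
        trace.append((thought, action, observation))
    return trace


def tool_segment(state: dict) -> str:
    state["segments"] = [(0.0, 10.0)]  # toy segmentation result
    return "1 segment"


def tool_classify(state: dict) -> str:
    # The first attempt is deliberately wrong so the loop has something to correct.
    state["attempts"] = state.get("attempts", 0) + 1
    state["label"] = "Passive" if state["attempts"] == 1 else "Active"
    return f"label={state['label']}"


def tool_validate(state: dict) -> str:
    # Toy consistency rule: cursor activity contradicts a Passive label.
    state["valid"] = not (state["cursor_active"] and state["label"] == "Passive")
    return "ok" if state["valid"] else "conflict: cursor activity but label is Passive"


TOOLS = {"segment": tool_segment, "classify": tool_classify, "validate": tool_validate}


def scripted_policy(state: dict, trace: list) -> Tuple[str, str]:
    """Hand-written stand-in for the model's reasoning step."""
    if "segments" not in state:
        return "need segments first", "segment"
    if "label" not in state or state.get("valid") is False:
        state.pop("valid", None)  # discard the failed validation before retrying
        return "need a (new) label", "classify"
    if "valid" not in state:
        return "check the label against the evidence", "validate"
    return "label is validated", "finish"


if __name__ == "__main__":
    for step in react_loop(scripted_policy, TOOLS, {"cursor_active": True}):
        print(step)
```

The failed validation sends the loop back to classification, which is the self-correction behavior the abstract describes; a real system would replace the scripted policy and toy tools with model calls.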
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares single-agent and multi-agent Vision Language Model (VLM) systems for automated coding of on-screen collaborative learning behaviors according to the ICAP framework. It proposes two multi-agent systems—a three-agent workflow MAS using scene segmentation and cursor-informed prompting with verification, and a ReAct-inspired autonomous MAS with iterative reasoning and self-correction—and claims that both outperform single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) on scene and action detection tasks.
Significance. If the performance claims hold under rigorous evaluation, the work would demonstrate a scalable VLM-based approach to multimodal video analysis in education, reducing the cost of manual ICAP coding. The specific technical choices—cursor data integration, segmentation, and self-verification loops in the MAS designs—are concrete contributions that could be adopted or extended by others working on automated behavioral analysis.
major comments (2)
- [Abstract and experimental results] The central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without quantitative support: no sample sizes (number of videos or segments analyzed), no precision/recall/F1 scores, and no statistical tests. Without these, the outperformance and viability assertions cannot be verified.
- [Evaluation methodology] No human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, and the comparison between single-agent and multi-agent performance rests on that reliability.
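For reference, the per-class metrics requested in the first major comment can be computed from paired gold and predicted labels as sketched below. The function names `per_class_prf` and `macro_f1` and the toy labels are assumptions for illustration; the manuscript's actual evaluation protocol is not described in the material reviewed here.

```python
from typing import Dict, Sequence


def per_class_prf(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and F1 for each label, from paired gold/predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        # Undefined ratios (zero denominators) are reported as 0.0 in this sketch.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-class F1."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


if __name__ == "__main__":
    gold = ["Active", "Active", "Passive", "Passive"]
    pred = ["Active", "Passive", "Passive", "Passive"]
    result = per_class_prf(gold, pred)
    print(result, macro_f1(result))
```

Reporting these figures per condition (each single VLM and each MAS) alongside the number of segments would make the outperformance claim checkable.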
minor comments (1)
- [Methods] The distinction between 'scene detection' and 'action detection' tasks would benefit from explicit operational definitions and example outputs in the methods section to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
- Referee: [Abstract and experimental results] The central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without quantitative support: no sample sizes (number of videos or segments analyzed), no precision/recall/F1 scores, and no statistical tests. Without these, the outperformance and viability assertions cannot be verified.
Authors: We agree that the abstract and experimental results section would be strengthened by explicit quantitative support. In the revised manuscript we will expand both sections to report the number of videos and segments analyzed, along with precision, recall, and F1 scores for scene and action detection across all single-VLM and MAS conditions. We will also include appropriate statistical comparisons to substantiate the outperformance claims.
Revision: yes
- Referee: [Evaluation methodology] No human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, and the comparison between single-agent and multi-agent performance rests on that reliability.
Authors: We acknowledge that the current manuscript does not report inter-rater reliability or ablation studies. In the revision we will add a description of the human-coding process used to establish ground-truth labels, include Cohen’s kappa values for inter-rater agreement, and present ablation results that isolate the contributions of cursor-informed prompting and scene segmentation. These additions will provide a clearer basis for the reliability of the labels and the MAS versus single-VLM comparisons.
Revision: yes
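The rebuttal does not say which statistical comparison will be used. As one possible option, an exact McNemar test compares two systems on the same labeled segments by looking only at the segments where exactly one system is correct. The sketch below is an illustration under that assumption; `mcnemar_exact` and the toy data are hypothetical, and segments drawn from the same video may violate the test's independence assumption.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(gold: Sequence[str], pred_a: Sequence[str], pred_b: Sequence[str]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correctness of two systems.

    Returns (b, c, p_value), where b counts items only system A gets right
    and c counts items only system B gets right.
    """
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have the same length")
    b = sum(pa == g and pb != g for g, pa, pb in zip(gold, pred_a, pred_b))
    c = sum(pa != g and pb == g for g, pa, pb in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the systems never disagree in correctness
    # Under the null hypothesis, discordant items split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    gold = ["Active"] * 6
    mas = ["Active"] * 6
    single = ["Passive"] * 4 + ["Active"] * 2
    print(mcnemar_exact(gold, mas, single))
```

With only four discordant segments the toy example cannot reach conventional significance, which illustrates why the number of segments analyzed needs to be reported together with any such test.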
Circularity Check
No circularity: purely empirical head-to-head comparison of VLM agent variants
The manuscript reports an experimental comparison of single-VLM versus two multi-agent VLM frameworks (workflow MAS and ReAct-style autonomous MAS) on scene and action detection tasks for ICAP-coded screen recordings. No equations, fitted parameters, or derivation chains appear in the abstract or described methods. Performance claims rest on direct experimental outcomes rather than any reduction of outputs to prior self-generated values, self-citations that carry the central premise, or imported uniqueness theorems. The ICAP framework is an external reference, not a self-defined construct. Consequently the work contains no load-bearing circular steps of any enumerated kind.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: VLMs can interpret screen content and cursor movements to infer on-screen learning behaviors according to the ICAP framework.
- domain assumption: Scene segmentation and evidence-based verification improve VLM accuracy on video labeling tasks.