arxiv: 2604.03631 · v1 · submitted 2026-04-04 · 💻 cs.AI

Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors

Pith reviewed 2026-05-13 17:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent systems · vision language models · video analysis · collaborative learning · ICAP framework · on-screen behaviors · automated coding · screen recordings

The pith

Multi-agent systems using vision language models outperform single models for coding on-screen collaborative learning behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that multi-agent systems (MAS) built around vision language models (VLMs) can automate the analysis of screen recordings from collaborative learning more reliably than single VLMs alone. A reader would care because manual coding of such videos is slow and limits large-scale study of how students seek, use, and create information together. Two concrete MAS designs are tested: a three-agent workflow that segments scenes and then detects behaviors with cursor-informed prompts and verification, and an autonomous ReAct-style system that interleaves reasoning, tool calls, and self-correction. Experiments with closed and open VLMs are reported to show that both MAS versions outperform the single models on scene and action detection tasks under the ICAP framework.

Core claim

The central claim is that two VLM-based multi-agent systems achieve viable performance and surpass single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) in detecting scenes and actions when labeling on-screen behaviors according to the ICAP framework. The first is a structured three-agent workflow for scene segmentation and cursor-informed behavior detection with evidence verification; the second is an autonomous-decision MAS that interleaves reasoning, segmentation, classification, and self-correction.

What carries the argument

Multi-agent system frameworks that integrate scene segmentation, cursor-informed VLM prompting, evidence-based verification, and iterative self-correction to assign ICAP labels to on-screen collaborative behaviors.
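To make the workflow variant concrete, here is a minimal sketch of a segment, detect, verify pipeline. It is not the authors' implementation: the `Frame` and `Segment` types, the `scene_hint` field (a stand-in for real visual content), the prompt wording, the `stub_vlm` function, and the scene-to-label mapping inside it are all hypothetical placeholders chosen for illustration. Only the four ICAP categories come from the framework itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# The four ICAP engagement modes (Chi & Wylie, 2014).
ICAP_LABELS = ("interactive", "constructive", "active", "passive")


@dataclass
class Frame:
    t: float                  # timestamp in seconds
    scene_hint: str           # hypothetical stand-in for what is on screen
    cursor: Tuple[int, int]   # (x, y) cursor position


@dataclass
class Segment:
    start: float
    end: float
    scene: str
    frames: List[Frame]


def segment_scenes(frames: List[Frame]) -> List[Segment]:
    """Agent 1 (assumed behavior): open a new segment when the scene changes."""
    segments: List[Segment] = []
    for f in frames:
        if segments and segments[-1].scene == f.scene_hint:
            segments[-1].frames.append(f)
            segments[-1].end = f.t
        else:
            segments.append(Segment(f.t, f.t, f.scene_hint, [f]))
    return segments


def build_prompt(seg: Segment) -> str:
    """Agent 2 input: a prompt that carries the cursor path (wording is made up)."""
    path = " -> ".join(f"({x},{y})" for x, y in (f.cursor for f in seg.frames))
    return (f"Scene: {seg.scene}. Cursor path: {path}. "
            f"Choose one label from {ICAP_LABELS} and cite evidence.")


def detect_and_verify(seg: Segment,
                      vlm: Callable[[str], Dict[str, str]],
                      max_retries: int = 1) -> Optional[str]:
    """Agents 2 and 3: accept a label only if it is valid and has evidence."""
    for _ in range(max_retries + 1):
        reply = vlm(build_prompt(seg))
        label = reply.get("label")
        evidence = reply.get("evidence", "")
        if label in ICAP_LABELS and evidence.strip():
            return label
    return None  # unverified segments are left unlabeled


def stub_vlm(prompt: str) -> Dict[str, str]:
    """Hypothetical model stub; the mapping below is arbitrary, not a finding."""
    label = "active" if "search_engine" in prompt else "constructive"
    return {"label": label, "evidence": "cursor moved across the page"}


if __name__ == "__main__":
    demo = [Frame(0.0, "search_engine", (10, 20)),
            Frame(1.0, "search_engine", (12, 24)),
            Frame(2.0, "group_document", (40, 80))]
    for seg in segment_scenes(demo):
        print(seg.scene, seg.start, seg.end, detect_and_verify(seg, stub_vlm))
```

In a real system the stub would be replaced by a call to a VLM, and the verifier would presumably check the cited evidence against the frames rather than only testing that it is non-empty.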

Load-bearing premise

Vision language models prompted with cursor data and scene segmentation can reliably assign ICAP framework labels to on-screen behaviors without substantial domain-specific training or human validation.

What would settle it

Human coders independently labeling the same set of screen recordings and finding MAS agreement rates no higher than those of single VLMs or below thresholds acceptable for educational research.
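One way such a check could be run is to compute chance-corrected agreement (Cohen's kappa) between human labels and each system's labels on the same segments. The sketch below implements the standard kappa formula in plain Python. The function name and the label lists are illustrative assumptions and are not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's marginals.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    human = ["active", "active", "passive", "constructive", "interactive", "passive"]
    mas = ["active", "active", "passive", "constructive", "active", "passive"]
    single = ["active", "passive", "passive", "active", "active", "passive"]
    print("MAS vs human:   ", round(cohens_kappa(human, mas), 3))
    print("single vs human:", round(cohens_kappa(human, single), 3))
```

The claim would be undercut if, on real data, the kappa for the MAS were no higher than that for a single VLM, or if both fell below the agreement threshold the research community accepts for human coders.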

Figures

Figures reproduced from arXiv: 2604.03631 by Likai Peng, Shihui Feng.

Figure 1. Screenshot Data Examples
3.3 Data Collection
Screen Data Collection. Screen recordings were captured via Zoom to document students’ on-screen behaviors during collaborative tasks. The recordings captured typing, clicking, scrolling, page navigation, and contextual information about platforms (e.g., GAI interfaces, web search engines, group documents).
Coding Framework. We operationalized the ICAP framework[…
Figure 2. The System Architecture of Multi-agent for Video Analysis of On-screen […]
Figure 3. The System Architecture of ReAct-style Multi-agent System for Video […]
Figure 4. RQ1: Few-shot single-VLM performance on scene vs. action detection.
Original abstract

On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of both leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations (segmentation/ classification/ validation), and observation-driven self-correction to produce interpretable on-screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that the workflow-based agent achieved best on scene detection, and the autonomous-decision MAS achieved best on action detection. This study demonstrates the effectiveness of VLM-based Multi-agent System for video analysis and contributes a scalable framework for multimodal data analytics.
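The autonomous-decision variant can be pictured as a loop in which a policy alternates between a thought, a tool-like action, and an observation that feeds the next step. The sketch below is a generic ReAct-style loop under stated assumptions, not the paper's system: `scripted_policy` is a hard-coded stand-in for the VLM's reasoning, and the three tools (`segment`, `classify`, `validate`) are toy stubs whose return values are invented so that one self-correction step can be observed.

```python
from typing import Callable, Dict, List, Optional, Tuple

ICAP_LABELS = ("interactive", "constructive", "active", "passive")
Step = Tuple[str, str, Optional[str]]  # (thought, action, observation or result)


def run_react_loop(policy: Callable[[str, List[Step]], Step],
                   tools: Dict[str, Callable[[str], str]],
                   max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Alternate policy decisions and tool calls until 'finish' or the step cap."""
    trace: List[Step] = []
    observation = "start"
    for _ in range(max_steps):
        thought, action, arg = policy(observation, trace)
        if action == "finish":
            trace.append((thought, action, arg))
            return arg, trace
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"error: unknown tool {action!r}"
        trace.append((thought, action, observation))
    return None, trace  # step budget exhausted without a verified label


# Toy tools (hypothetical): classification succeeds only with cursor context.
TOOLS: Dict[str, Callable[[str], str]] = {
    "segment": lambda video: "2 segments",
    "classify": lambda seg: "constructive" if "cursor" in seg else "typing",
    "validate": lambda label: "ok" if label in ICAP_LABELS else "invalid",
}


def scripted_policy(observation: str, trace: List[Step]) -> Step:
    """Hard-coded stand-in for a VLM deciding the next action."""
    if observation == "start":
        return ("need scene boundaries first", "segment", "video")
    last_action = trace[-1][1]
    if last_action == "segment":
        return ("label the first segment", "classify", "segment-0")
    if last_action == "classify":
        return ("check the proposed label", "validate", observation)
    if last_action == "validate" and observation == "ok":
        label = next(obs for _, act, obs in reversed(trace) if act == "classify")
        return ("label verified", "finish", label)
    if last_action == "validate":
        # Self-correction: retry classification with extra context.
        return ("validation failed, retry with cursor context",
                "classify", "segment-0+cursor")
    return ("no further options", "finish", None)


if __name__ == "__main__":
    label, trace = run_react_loop(scripted_policy, TOOLS)
    for thought, action, result in trace:
        print(f"{action:9s} | {thought} | {result}")
    print("final label:", label)
```

The trace doubles as the interpretable record the abstract alludes to: each label can be traced back to the reasoning and observations that produced it.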

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript compares single-agent and multi-agent Vision Language Model (VLM) systems for automated coding of on-screen collaborative learning behaviors according to the ICAP framework. It proposes two multi-agent systems—a three-agent workflow MAS using scene segmentation and cursor-informed prompting with verification, and a ReAct-inspired autonomous MAS with iterative reasoning and self-correction—and claims that both outperform single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) on scene and action detection tasks.

Significance. If the performance claims hold under rigorous evaluation, the work would demonstrate a scalable VLM-based approach to multimodal video analysis in education, reducing the cost of manual ICAP coding. The specific technical choices—cursor data integration, segmentation, and self-verification loops in the MAS designs—are concrete contributions that could be adopted or extended by others working on automated behavioral analysis.

major comments (2)
  1. [Abstract and experimental results] Abstract and experimental results section: the central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without any quantitative metrics, sample sizes, number of videos or segments analyzed, precision/recall/F1 scores, or statistical tests. This absence prevents verification of the outperformance and viability assertions.
  2. [Evaluation methodology] Evaluation methodology: no human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, which is load-bearing for the comparison between single-agent and multi-agent performance.
minor comments (1)
  1. [Methods] The distinction between 'scene detection' and 'action detection' tasks would benefit from explicit operational definitions and example outputs in the methods section to improve reproducibility.
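For reference, the per-label precision, recall, and F1 scores that the first major comment asks for could be computed along the following lines, assuming evaluation is framed as segment-level classification against human-coded labels. That framing, the function names, and the example labels are assumptions for illustration and are not taken from the manuscript.

```python
from typing import Dict, Sequence, Tuple


def per_label_prf(gold: Sequence[str],
                  pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} over aligned gold/pred labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: Dict[str, Tuple[float, float, float]] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of per-label F1, so rare labels count equally."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    gold = ["active", "active", "passive", "passive"]
    pred = ["active", "passive", "passive", "passive"]
    scores = per_label_prf(gold, pred)
    for label, (p, r, f) in scores.items():
        print(f"{label:8s} P={p:.2f} R={r:.2f} F1={f:.2f}")
    print(f"macro-F1={macro_f1(scores):.2f}")
```

Reporting these per condition (each single VLM and each MAS), together with the number of videos and segments, would let readers verify the outperformance claim directly.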

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

  1. Referee: [Abstract and experimental results] Abstract and experimental results section: the central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without any quantitative metrics, sample sizes, number of videos or segments analyzed, precision/recall/F1 scores, or statistical tests. This absence prevents verification of the outperformance and viability assertions.

    Authors: We agree that the abstract and experimental results section would be strengthened by explicit quantitative support. In the revised manuscript we will expand both sections to report the number of videos and segments analyzed, along with precision, recall, and F1 scores for scene and action detection across all single-VLM and MAS conditions. We will also include appropriate statistical comparisons to substantiate the outperformance claims. revision: yes

  2. Referee: [Evaluation methodology] Evaluation methodology: no human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, which is load-bearing for the comparison between single-agent and multi-agent performance.

    Authors: We acknowledge that the current manuscript does not report inter-rater reliability or ablation studies. In the revision we will add a description of the human-coding process used to establish ground-truth labels, include Cohen’s kappa values for inter-rater agreement, and present ablation results that isolate the contributions of cursor-informed prompting and scene segmentation. These additions will provide a clearer basis for the reliability of the labels and the MAS versus single-VLM comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical head-to-head comparison of VLM agent variants

The manuscript reports an experimental comparison of single-VLM versus two multi-agent VLM frameworks (workflow MAS and ReAct-style autonomous MAS) on scene and action detection tasks for ICAP-coded screen recordings. No equations, fitted parameters, or derivation chains appear in the abstract or described methods. Performance claims rest on direct experimental outcomes rather than any reduction of outputs to prior self-generated values, self-citations that carry the central premise, or imported uniqueness theorems. The ICAP framework is an external reference, not a self-defined construct. Consequently the work contains no load-bearing circular steps of any enumerated kind.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about VLM visual reasoning capabilities and the applicability of the ICAP framework to screen data, with no free parameters or new entities introduced in the abstract.

axioms (2)
  • domain assumption: VLMs can interpret screen content and cursor movements to infer on-screen learning behaviors according to the ICAP framework.
    This underpins the prompting strategy in both multi-agent frameworks.
  • domain assumption: Scene segmentation and evidence-based verification improve VLM accuracy on video labeling tasks.
    Invoked in the design of the three-agent workflow MAS.

