Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors

Likai Peng; Shihui Feng

arXiv: 2604.03631 · v1 · submitted 2026-04-04 · 💻 cs.AI

Pith reviewed 2026-05-13 17:28 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · vision language models · video analysis · collaborative learning · ICAP framework · on-screen behaviors · automated coding · screen recordings

The pith

Multi-agent systems using vision language models outperform single models for coding on-screen collaborative learning behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that multi-agent systems (MAS) built around vision language models (VLMs) can automate the analysis of screen recordings from collaborative learning more reliably than single VLMs alone. A reader would care because manual coding of such videos is slow and limits large-scale study of how students seek, use, and create information together. Two concrete MAS designs are tested: a three-agent workflow that segments scenes and then detects behaviors with cursor-informed prompts and verification, and an autonomous ReAct-style system that interleaves reasoning, tool calls, and self-correction. According to the abstract, experiments with closed-source and open-source VLMs show both MAS versions outperforming single VLMs on scene and action detection under the ICAP framework, with the workflow MAS performing best on scene detection and the autonomous MAS performing best on action detection.

Core claim

The central claim is that VLM-based multi-agent systems—one a structured three-agent workflow for scene segmentation and cursor-informed behavior detection with evidence verification, the other an autonomous-decision MAS that interleaves reasoning, segmentation, classification, and self-correction—achieve viable performance and surpass single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) in detecting scenes and actions when labeling on-screen behaviors according to the ICAP framework.

What carries the argument

Multi-agent system frameworks that integrate scene segmentation, cursor-informed VLM prompting, evidence-based verification, and iterative self-correction to assign ICAP labels to on-screen collaborative behaviors.
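To make the workflow design concrete, here is a minimal Python sketch of a segment, detect, verify pipeline. It is an illustration under stated assumptions, not the authors' implementation: the mapping of the three agents to `segmenter`, `detector`, and `verifier`, the `Segment` and `Coding` types, the cursor sample format, and the four-label set are all hypothetical, and the stub agents stand in for VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed label set: the four ICAP modes. The paper's operationalized
# codebook for on-screen behaviors may differ.
ICAP_LABELS = ("interactive", "constructive", "active", "passive")

# Hypothetical cursor sample format: (time in seconds, x, y).
CursorSample = Tuple[float, int, int]


@dataclass
class Segment:
    start_s: float
    end_s: float
    scene: str


@dataclass
class Coding:
    segment: Segment
    label: str
    evidence: str
    verified: bool


def run_workflow(
    video_path: str,
    cursor_trace: List[CursorSample],
    segmenter: Callable[[str], List[Segment]],
    detector: Callable[[Segment, List[CursorSample]], Tuple[str, str]],
    verifier: Callable[[Segment, str, str], bool],
) -> List[Coding]:
    """Run segmentation, cursor-informed detection, then verification."""
    codings = []
    for seg in segmenter(video_path):
        # Give the detector only the cursor samples inside this segment.
        cursor = [c for c in cursor_trace if seg.start_s <= c[0] < seg.end_s]
        label, evidence = detector(seg, cursor)
        # A label outside the codebook is never marked as verified.
        ok = label in ICAP_LABELS and verifier(seg, label, evidence)
        codings.append(Coding(seg, label, evidence, ok))
    return codings


# Stub agents so the sketch runs without any model calls.
def stub_segmenter(video_path: str) -> List[Segment]:
    return [Segment(0.0, 10.0, "web_search"), Segment(10.0, 25.0, "group_document")]


def stub_detector(seg: Segment, cursor: List[CursorSample]) -> Tuple[str, str]:
    # Placeholder heuristic, not the paper's prompting strategy.
    label = "active" if cursor else "passive"
    return label, f"{len(cursor)} cursor samples in {seg.scene}"


def stub_verifier(seg: Segment, label: str, evidence: str) -> bool:
    return bool(evidence)


codings = run_workflow(
    "demo.mp4",
    [(1.0, 5, 5), (2.0, 6, 6)],
    stub_segmenter,
    stub_detector,
    stub_verifier,
)
for c in codings:
    print(c.segment.scene, c.label, c.verified)
```

Passing the agents in as callables keeps the control flow fixed, which is what distinguishes a workflow MAS from the autonomous variant: the order of steps never depends on model output.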

Load-bearing premise

Vision language models prompted with cursor data and scene segmentation can reliably assign ICAP framework labels to on-screen behaviors without substantial domain-specific training or human validation.

What would settle it

Human coders independently labeling the same set of screen recordings and finding MAS agreement rates no higher than those of single VLMs or below thresholds acceptable for educational research.
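The agreement check described above is usually reported as Cohen's kappa, which corrects raw agreement for chance. A minimal pure-Python sketch follows; the function name and the example labels are illustrative and are not data from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels for illustration only.
human = ["active", "active", "passive", "passive"]
system = ["active", "active", "passive", "active"]
print(cohens_kappa(human, system))
```

Running the same function on human versus single-VLM labels and human versus MAS labels for identical segments would give the two agreement figures this test calls for.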

Figures

Figures reproduced from arXiv: 2604.03631 by Likai Peng, Shihui Feng.

**Figure 1.** Screenshot Data Examples

3.3 Data Collection. Screen Data Collection: Screen recordings were captured via Zoom to document students’ on-screen behaviors during collaborative tasks. The recordings captured typing, clicking, scrolling, page navigation, and contextual information about platforms (e.g., GAI interfaces, web search engines, group documents). Coding Framework: We operationalized the ICAP framework[…

**Figure 2.** The System Architecture of Multi-agent for Video Analysis of On-screen …

**Figure 3.** The System Architecture of ReAct-style Multi-agent System for Video …

**Figure 4.** RQ1: Few-shot single-VLM performance on scene vs. action detection.

Original abstract

On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of both leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations (segmentation/classification/validation), and observation-driven self-correction to produce interpretable on-screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that the workflow-based agent achieved best on scene detection, and the autonomous-decision MAS achieved best on action detection. This study demonstrates the effectiveness of VLM-based Multi-agent System for video analysis and contributes a scalable framework for multimodal data analytics.
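The autonomous-decision design in the abstract follows the ReAct pattern of alternating a reasoning step, a tool call, and an observation. The Python sketch below shows that control loop only. It assumes a hypothetical `policy` callable in place of the VLM, and the tool names `segment`, `classify`, and `validate` simply mirror the tool-like operations the abstract lists; none of this is taken from the paper's code.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[dict], str]
# A policy reads the trace so far and returns (thought, action, arguments).
Policy = Callable[[List[str]], Tuple[str, str, dict]]


def react_loop(policy: Policy, tools: Dict[str, Tool], max_steps: int = 8) -> Tuple[str, List[str]]:
    """Alternate reasoning, tool calls, and observations until 'finish'."""
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, args = policy(trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return args["label"], trace
        if action not in tools:
            trace.append(f"Observation: unknown tool {action!r}")
            continue
        observation = tools[action](args)
        trace.append(f"Action: {action}")
        trace.append(f"Observation: {observation}")
    # Step budget exhausted without a final label.
    return "unresolved", trace


# Scripted stand-in for the VLM: it revises its label after a rejection.
def scripted_policy(trace: List[str]) -> Tuple[str, str, dict]:
    observations = [t for t in trace if t.startswith("Observation:")]
    if len(observations) == 0:
        return "Find scene boundaries first.", "segment", {}
    if len(observations) == 1:
        return "Classify the first scene.", "classify", {"scene": 0}
    if len(observations) == 2:
        return "Check the label against evidence.", "validate", {"label": "passive"}
    if observations[-1].endswith("rejected"):
        return "Validation failed; revise the label.", "validate", {"label": "active"}
    return "Label is supported.", "finish", {"label": "active"}


# Stub tools returning fixed strings instead of running any model.
stub_tools: Dict[str, Tool] = {
    "segment": lambda args: "2 scenes",
    "classify": lambda args: "passive",
    "validate": lambda args: "accepted" if args["label"] == "active" else "rejected",
}

label, trace = react_loop(scripted_policy, stub_tools)
print(label)
print("\n".join(trace))
```

The saved trace is what makes the resulting labels inspectable: each label can be traced back to the observations that preceded it, including the rejected first attempt.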

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper builds two practical multi-agent setups for VLM video coding in collaborative learning but states outperformance without any numbers or human validation.

The main point is that they took existing agent patterns and applied them to automate ICAP coding of on-screen behaviors in collaborative learning videos. They describe a three-agent workflow that segments scenes, uses cursor-informed prompts for detection, and adds verification, plus a ReAct-style autonomous version that loops through reasoning, tool calls, and self-correction. Both are tested against single VLMs like Claude-3.7-Sonnet, GPT-4.1, and Qwen2.5-VL-72B. That specific tailoring to cursor data and educational video is the concrete new piece, and it is a reasonable extension for this task. The framing around on-screen engagement and the ICAP framework is clear and practical. The designs themselves look workable for anyone trying to scale video analysis without full manual coding.

The soft spot is the evaluation. The abstract claims the multi-agent versions achieved viable performance and outperformed the single models on scene and action detection, yet it gives no dataset size, no accuracy or F1 numbers, no statistical tests, and no comparison to human-coded ground truth. Without inter-rater metrics or an ablation showing that the cursor and segmentation steps actually improve label quality, the outperformance claim cannot be checked against hallucination or prompt sensitivity. That gap is central rather than minor.

This is for people working in educational technology or learning analytics who need ideas for automated coding pipelines. A reader already building VLM agents would pick up the workflow details and the open-source model comparison. I would bring the agent architectures to a reading group for discussion but would want the full results table first. It deserves peer review if the complete paper supplies the missing metrics and validation steps, because the application area is narrow but the agent designs are reproducible enough to test.

Referee Report

2 major / 1 minor

Summary. The manuscript compares single-agent and multi-agent Vision Language Model (VLM) systems for automated coding of on-screen collaborative learning behaviors according to the ICAP framework. It proposes two multi-agent systems—a three-agent workflow MAS using scene segmentation and cursor-informed prompting with verification, and a ReAct-inspired autonomous MAS with iterative reasoning and self-correction—and claims that both outperform single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) on scene and action detection tasks.

Significance. If the performance claims hold under rigorous evaluation, the work would demonstrate a scalable VLM-based approach to multimodal video analysis in education, reducing the cost of manual ICAP coding. The specific technical choices—cursor data integration, segmentation, and self-verification loops in the MAS designs—are concrete contributions that could be adopted or extended by others working on automated behavioral analysis.

major comments (2)

[Abstract and experimental results] Abstract and experimental results section: the central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without any quantitative metrics, sample sizes, number of videos or segments analyzed, precision/recall/F1 scores, or statistical tests. This absence prevents verification of the outperformance and viability assertions.
[Evaluation methodology] Evaluation methodology: no human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, which is load-bearing for the comparison between single-agent and multi-agent performance.
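The precision, recall, and F1 figures requested in the first major comment can be computed from paired human and system labels. The following Python sketch shows per-class scores and a macro average; the function names and the example labels are illustrative assumptions, not results or code from the manuscript.

```python
from typing import Dict, Sequence, Tuple


def per_class_prf(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, F1)} for every label seen."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # Undefined ratios are reported as 0.0 in this sketch.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of per-class F1, so rare labels count equally."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up labels for illustration only.
gold = ["active", "active", "passive", "constructive"]
pred = ["active", "passive", "passive", "constructive"]
scores = per_class_prf(gold, pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f:.2f}")
print(f"macro F1={macro_f1(scores):.2f}")
```

Reporting per-class scores alongside the macro average matters here because some ICAP categories are likely to be much rarer than others in screen recordings, and an overall accuracy figure would hide weak performance on them.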

minor comments (1)

[Methods] The distinction between 'scene detection' and 'action detection' tasks would benefit from explicit operational definitions and example outputs in the methods section to improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Abstract and experimental results] Abstract and experimental results section: the central claim that 'the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks' is presented without any quantitative metrics, sample sizes, number of videos or segments analyzed, precision/recall/F1 scores, or statistical tests. This absence prevents verification of the outperformance and viability assertions.

Authors: We agree that the abstract and experimental results section would be strengthened by explicit quantitative support. In the revised manuscript we will expand both sections to report the number of videos and segments analyzed, along with precision, recall, and F1 scores for scene and action detection across all single-VLM and MAS conditions. We will also include appropriate statistical comparisons to substantiate the outperformance claims. revision: yes
Referee: [Evaluation methodology] Evaluation methodology: no human-coded ground-truth labels, inter-rater reliability statistics (e.g., Cohen’s kappa), or ablation results on the contribution of cursor-informed prompting and scene segmentation are reported. Without these, the reliability of the VLM-generated ICAP labels cannot be established, which is load-bearing for the comparison between single-agent and multi-agent performance.

Authors: We acknowledge that the current manuscript does not report inter-rater reliability or ablation studies. In the revision we will add a description of the human-coding process used to establish ground-truth labels, include Cohen’s kappa values for inter-rater agreement, and present ablation results that isolate the contributions of cursor-informed prompting and scene segmentation. These additions will provide a clearer basis for the reliability of the labels and the MAS versus single-VLM comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical head-to-head comparison of VLM agent variants

The manuscript reports an experimental comparison of single-VLM versus two multi-agent VLM frameworks (workflow MAS and ReAct-style autonomous MAS) on scene and action detection tasks for ICAP-coded screen recordings. No equations, fitted parameters, or derivation chains appear in the abstract or described methods. Performance claims rest on direct experimental outcomes rather than any reduction of outputs to prior self-generated values, self-citations that carry the central premise, or imported uniqueness theorems. The ICAP framework is an external reference, not a self-defined construct. Consequently the work contains no load-bearing circular steps of any enumerated kind.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about VLM visual reasoning capabilities and the applicability of the ICAP framework to screen data, with no free parameters or new entities introduced in the abstract.

axioms (2)

domain assumption: VLMs can interpret screen content and cursor movements to infer on-screen learning behaviors according to the ICAP framework.
This underpins the prompting strategy in both multi-agent frameworks.

domain assumption: Scene segmentation and evidence-based verification improve VLM accuracy on video labeling tasks.
Invoked in the design of the three-agent workflow MAS.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] Alahmadi, M.D., Alshangiti, M.: Optimizing ocr performance for programming videos: The role of image super-resolution and large language models. Mathematics 12(7), 1036 (2024)

[2] Anthropic: Claude 3.7 Sonnet and Claude Code (2025), https://www.anthropic.com/news/claude-3-7-sonnet, accessed: 23 September 2025

[3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025), https://arxiv.org/abs/2502.13923

[4] Bosetti, M., Zhang, S., Liberatori, B., Zara, G., Ricci, E., Rota, P.: Text-enhanced zero-shot action recognition: A training-free approach. In: International Conference on Pattern Recognition. pp. 327–342. Springer (2024)

[5] Castro-Alonso, J.C., Fiorella, L.: Interactive science multimedia and visuospatial processing. In: Visuospatial processing for education in health and natural sciences, pp. 145–173. Springer (2019)

[6] Cesa-Bianchi, N., Gentile, C., Tironi, A., Zaniboni, L.: Incremental algorithms for hierarchical classification. Advances in neural information processing systems 17 (2004)

[7] Chen, X., Lin, Y., Zhang, Y., Huang, W.: Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. In: European Conference on Computer Vision. pp. 179–195. Springer (2024)

[8] Cheung, A.: Verbal and on-screen peer interactions of efl learners during multimodal collaborative writing: A multiple case-study. Journal of second language writing 58, 100931 (2022)

[9] Chi, M.T., Wylie, R.: The icap framework: Linking cognitive engagement to active learning outcomes. Educational psychologist 49(4), 219–243 (2014)

[10] Erkens, G., Janssen, J.: Automatic coding of dialogue acts in collaboration protocols. International journal of computer-supported collaborative learning 3(4), 447–470 (2008)

[11] Falco, M., Robiolo, G.: Tendencies in multi-agents systems: A systematic literature review (2020)

[12] Feng, S., Yan, L., Zhao, L., Maldonado, R.M., Gašević, D.: Heterogenous network analytics of small group teamwork: Using multimodal data to uncover individual behavioral engagement strategies. In: Proceedings of the 14th learning analytics and knowledge conference. pp. 587–597 (2024)

[13] Gao, J., Choo, K.T.W., Cao, J., Lee, R.K.W., Perrault, S.: Coaicoder: Examining the effectiveness of ai-assisted human-to-human collaboration in qualitative analysis. ACM Transactions on Computer-Human Interaction 31(1), 1–38 (2023)

[14] Gao, Z., Zhang, B., Li, P., Ma, X., Yuan, T., Fan, Y., Wu, Y., Jia, Y., Zhu, S.C., Li, Q.: Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage (2025), https://arxiv.org/abs/2412.15606

[15] Gomes, S., Costa, L., Martinho, C., Dias, J., Xexéo, G., Moura Santos, A.: Modeling students’ behavioral engagement through different in-class behavior styles. International Journal of STEM Education 10(1), 21 (2023)

[16] Jiang, Y.H., Li, R., Zhou, Y., Qi, C., Hu, H., Wei, Y., Jiang, B., Wu, Y.: Ai agent for education: von neumann multi-agent system framework (2024), https://arxiv.org/abs/2501.00083

[17] Lee, H.Y., Cheng, Y.P., Wang, W.S., Lin, C.J., Huang, Y.M.: Exploring the learning process and effectiveness of stem education via learning behavior analysis and the interactive-constructive-active-passive framework. Journal of Educational Computing Research 61(5), 951–976 (2023)

[18] Li, X., Yan, L., Zhao, L., Martinez-Maldonado, R., Gasevic, D.: Cvpe: A computer vision approach for scalable and privacy-preserving socio-spatial, multimodal learning analytics. In: LAK23: 13th International Learning Analytics and Knowledge Conference. pp. 175–185 (2023)

[19] Liu, R., Liu, Z., Tang, J., Ma, Y., Pi, R., Zhang, J., Chen, Q.: Longvideoagent: Multi-agent reasoning with long videos. arXiv preprint arXiv:2512.20618 (2025)

[20] Oh, J., Sundar, S.S.: What happens when you click and drag: Unpacking the relationship between on-screen interaction and user engagement with an anti-smoking website. Health communication (2020)

[21] OpenAI: Introducing GPT-4.1 in the API (2025), https://openai.com/index/gpt-4-1/, accessed: 23 September 2025

[22] Schiller, R., Fleckenstein, J., Mertens, U., Horbach, A., Meyer, J.: Understanding the effectiveness of automated feedback: Using process data to uncover the role of behavioral engagement. Computers & Education 223, 105163 (2024)

[23] Sharma, K., Giannakos, M.: Multimodal data capabilities for learning: What can multimodal data tell us about learning? British Journal of Educational Technology 51(5), 1450–1484 (2020)

[24] Shriram, S., Perisetla, S., Keskar, A., Krishnaswamy, H., Bossen, T.E.W., Møgelmose, A., Greer, R.: Towards a multi-agent vision-language system for zero-shot novel hazardous object detection for autonomous driving safety. arXiv preprint arXiv:2504.13399 (2025)

[25] Sundar, S.S., Bellur, S., Oh, J., Xu, Q., Jia, H.: User experience of on-screen interaction techniques: An experimental investigation of clicking, sliding, zooming, hovering, dragging, and flipping. Human–Computer Interaction 29(2), 109–152 (2014)

[26] Teotia, J., Zhang, X., Mao, R., Cambria, E.: Evaluating vision language models in detecting learning engagement. In: 2024 IEEE International Conference on Data Mining Workshops (ICDMW). pp. 496–502. IEEE (2024)

[27] Van Rijsbergen, C.J.: The geometry of information retrieval. Cambridge University Press (2004)

[28] Wang, H., Xu, Z., Cheng, Y., Diao, S., Zhou, Y., Cao, Y., Wang, Q., Ge, W., Huang, L.: Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models. arXiv preprint arXiv:2410.03290 (2024)

[29] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al.: A survey on large language model based autonomous agents. Frontiers of Computer Science 18(6), 186345 (2024)

[30] Yang, Y., Zhou, T., Li, K., Tao, D., Li, L., Shen, L., He, X., Jiang, J., Shi, Y.: Embodied multi-modal agent trained by an llm from a parallel textworld. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26275–26285 (June 2024)

[31] Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 42–49 (1999)

[32] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022)

[33] Yu, C., Cheng, Z., Cui, H., Gao, Y., Luo, Z., Wang, Y., Zheng, H., Zhao, Y.: A survey on agent workflow–status and future. In: 2025 8th International Conference on Artificial Intelligence and Big Data (ICAIBD). pp. 770–781. IEEE (2025)

[34] Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence 46(8), 5625–5644 (2024)
