pith. sign in

arxiv: 2606.21861 · v1 · pith:HSDV7KFDnew · submitted 2026-06-20 · 💻 cs.CV

Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization

Pith reviewed 2026-06-26 12:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot vision-language modelsclassroom engagement recognitionprompt sensitivityclass collapseDAiSEE datasetbenchmark evaluationcross-dataset generalization
0
0 comments X

The pith

Zero-shot vision-language models show near-random accuracy, class collapse, and large prompt sensitivity on individual student engagement recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled benchmark of five VLMs on two classroom datasets, applying three prompt styles to test zero-shot performance at recognizing engagement levels. On individual-student video clips the models stay near random with Cohen's kappa never above 0.10, assign 85-100 percent of outputs to one class, and swing in accuracy by as much as 32 points when only the prompt wording changes. Scene-level images prove easier, reaching kappa around 0.60 with behaviorally specific prompts, yet even there GPT-4o blocks 98 percent of detailed requests through safety filters. These patterns indicate that current zero-shot VLMs cannot serve as drop-in tools for per-student learning analytics.

Core claim

Zero-shot VLMs exhibit three consistent failure modes on individual-student engagement: near-random performance (kappa ≤ 0.10 on DAiSEE), severe class collapse (85-100 percent of predictions to a single level), and extreme prompt sensitivity (accuracy changes up to 32 percentage points on identical images). Scene-level classification on the SCB dataset is more tractable, with CLIP and GPT-4o reaching kappa approximately 0.60 under rubric-anchored prompts.

What carries the argument

The benchmark that applies three prompt variants (minimal, rubric-anchored, chain-of-thought) to five VLMs across the DAiSEE individual-video clips and SCB scene-level images.

Load-bearing premise

The sampled test clips and the SCB labels form representative ground truth for real classroom engagement, and the three prompt styles adequately represent typical zero-shot use.

What would settle it

A fresh set of classroom videos with independently verified engagement labels on which any of the tested VLMs reaches kappa above 0.30 without collapsing to one class or showing large prompt-dependent swings.

Figures

Figures reproduced from arXiv: 2606.21861 by Aman Goyal, Kemmannu Vineet Venkatesh Rao, Kshama Nitin Shah.

Figure 1
Figure 1. Figure 1: Best zero-shot accuracy and Cohen’s κ per model on DAiSEE (individual student, video frames) versus SCB (classroom scene images). Scene-level classification is substantially more tractable, with CLIP and GPT-4o achieving κ≈0.60 on SCB while remaining near-random on DAiSEE. • Documentation of GPT-4o’s safety-guardrail refusals blocking chain-of-thought engagement analysis for indi￾vidual student faces, with… view at source ↗
Figure 4
Figure 4. Figure 4: Normalised confusion matrices (rows = true label, cols [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Fraction of the 300 DAiSEE samples predicted at each [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: t-SNE projection of CLIP ViT-B/32 image embeddings [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Text-image cosine similarity distributions across all [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
read the original abstract

Automated classroom engagement recognition holds substantial promise for scalable learning analytics, yet the suitability of modern Vision-Language Models (VLMs) for this task under zero-shot conditions remains largely unexplored. We present a systematic benchmark that evaluates five widely-used VLMs: CLIP, BLIP-VQA, GPT-4o, LLaVA-1.5-7B, and Qwen2.5VL-7B-Instruct across two complementary educational datasets: DAiSEE, an individual-student video dataset (300 sampled test clips), and the Student Classroom Behaviour dataset (SCB, 1,168 scene-level images). Each model is probed with three prompt variants spanning minimal, rubric-anchored, and chain-of-thought designs. Our experiments reveal three primary failure modes of zero-shot VLMs for engagement recognition: (1) near-random performance on individual students, with Cohen's kappa never exceeding 0.10 on DAiSEE; (2) severe class collapse, where models assign 85-100% of predictions to a single engagement level regardless of visual content; and (3) extreme prompt sensitivity, with accuracy swings of up to 32 percentage points on identical images depending solely on prompt phrasing. Remarkably, scene-level classification on SCB is substantially more tractable: CLIP and GPT-4o achieve kappa approximately 0.60 when prompted with behaviorally-grounded rubrics. We also document a practical barrier for deployment: GPT-4o's safety filters reject 98% of chain-of-thought requests involving individual student faces. Our findings provide a calibrated baseline and surface critical design considerations for the use of VLMs in educational observation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks five zero-shot VLMs (CLIP, BLIP-VQA, GPT-4o, LLaVA-1.5-7B, Qwen2.5VL-7B-Instruct) for classroom engagement recognition on DAiSEE (300 sampled individual-student video clips) and SCB (1,168 scene-level images). Using three prompt variants (minimal, rubric-anchored, chain-of-thought), it reports three failure modes: Cohen's kappa never exceeding 0.10 on DAiSEE, 85-100% class collapse to one level, and accuracy swings up to 32 points from prompt changes alone. Scene-level SCB yields better results (kappa ~0.60 for CLIP/GPT-4o with rubrics), while GPT-4o safety filters reject 98% of CoT requests on faces.

Significance. If the empirical results hold after addressing data-quality concerns, the work supplies a useful calibrated baseline showing that current zero-shot VLMs are unsuitable for reliable individual-student engagement detection without adaptation, while underscoring prompt sensitivity and deployment barriers such as safety filters. It contributes concrete metrics and failure-mode documentation to the educational-AI literature.

major comments (3)
  1. [Abstract] Abstract (DAiSEE test set): the central claim of near-random performance (kappa ≤ 0.10) and class collapse rests on 300 sampled clips, yet the manuscript supplies no sampling protocol, stratification by engagement level, or representativeness checks; without these, the reported failure modes cannot be distinguished from selection bias.
  2. [Abstract] Abstract (ground-truth labels): no inter-rater reliability statistics, label-validation studies, or agreement metrics are provided for DAiSEE or SCB engagement annotations; given that engagement is a low-agreement construct, this directly affects defensibility of all kappa and accuracy figures as measures of model behavior rather than label noise.
  3. [Abstract] Abstract (prompt variants): the three variants are asserted to cover typical zero-shot usage, but no justification, ablation against alternative rubric or CoT phrasings, or coverage argument is supplied; this weakens the 32-point sensitivity claim as a general property of zero-shot practice.
minor comments (2)
  1. [Abstract] Abstract: specify whether Cohen's kappa is weighted or unweighted and how the four engagement levels are treated in the multi-class setting.
  2. [Abstract] The safety-filter rejection rate (98%) is reported only for GPT-4o; clarify whether equivalent filtering behavior was observed or tested for the open-source models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the changes planned for the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract (DAiSEE test set): the central claim of near-random performance (kappa ≤ 0.10) and class collapse rests on 300 sampled clips, yet the manuscript supplies no sampling protocol, stratification by engagement level, or representativeness checks; without these, the reported failure modes cannot be distinguished from selection bias.

    Authors: We agree that the absence of an explicit sampling protocol leaves the results open to concerns about selection bias. In the revision we will add a dedicated paragraph describing the sampling procedure: the 300 clips were drawn randomly from the DAiSEE test partition using a fixed seed (42) chosen for reproducibility, with the sample size selected to balance computational cost and statistical power while preserving the original class proportions as closely as possible. We will also report the resulting class distribution and compare it to the full test set. revision: yes

  2. Referee: [Abstract] Abstract (ground-truth labels): no inter-rater reliability statistics, label-validation studies, or agreement metrics are provided for DAiSEE or SCB engagement annotations; given that engagement is a low-agreement construct, this directly affects defensibility of all kappa and accuracy figures as measures of model behavior rather than label noise.

    Authors: We acknowledge the importance of this point. DAiSEE labels originate from the dataset release and SCB labels were produced by two trained annotators following the rubric described in the paper. We will expand the manuscript to (i) cite any published validation or reliability information available for DAiSEE, (ii) report the inter-annotator agreement obtained during SCB labeling, and (iii) add an explicit limitations paragraph discussing how label noise may affect the interpretation of kappa and accuracy. Because the original annotations are fixed, we cannot retroactively compute new IRR on the full sets, but the added discussion will clarify the scope of our claims. revision: partial

  3. Referee: [Abstract] Abstract (prompt variants): the three variants are asserted to cover typical zero-shot usage, but no justification, ablation against alternative rubric or CoT phrasings, or coverage argument is supplied; this weakens the 32-point sensitivity claim as a general property of zero-shot practice.

    Authors: The three variants were chosen to span the most common practical approaches observed in the zero-shot VLM literature (minimal instruction, domain-specific rubric, and explicit reasoning steps). We will revise the prompt section to provide explicit justification with supporting citations, state the design rationale for each variant, and note that the observed sensitivity is intended to illustrate the phenomenon rather than to exhaustively map the entire prompt space. While a comprehensive ablation of every possible phrasing is outside the scope of the study, the added text will better contextualize the 32-point swings as evidence of prompt sensitivity in this domain. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external datasets and metrics

full rationale

This paper reports experimental results from probing five VLMs on two external datasets (DAiSEE with 300 sampled clips and SCB with 1,168 images) using three prompt variants and standard metrics (Cohen's kappa, accuracy). No derivations, equations, fitted parameters, or predictions are present; all claims are direct measurements against provided ground-truth labels. No self-citations are load-bearing, and the work contains no mathematical or definitional steps that could reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies entirely on existing pre-trained VLMs, publicly referenced datasets, and standard classification metrics; no new free parameters, axioms beyond domain-standard evaluation practices, or invented entities are introduced.

axioms (1)
  • domain assumption Cohen's kappa and accuracy are appropriate primary metrics for multi-class engagement level prediction.
    Invoked implicitly when reporting all quantitative results without alternative metrics or justification for label imbalance handling.

pith-pipeline@v0.9.1-grok · 5852 in / 1233 out tokens · 19906 ms · 2026-06-26T12:29:15.815583+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Ali Abedi and Shehroz S. Khan. Improving state-of-the-art in detecting student engagement with resnet and TCN hybrid network. InProceedings of the 18th Conference on Robots and Vision (CRV), pages 151–157, 2021

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Junyang Wang, Wenbin Ge, Zesen Song, et al. Qwen2.5-VL technical report. InarXiv preprint arXiv:2502.13923, 2025. 7

  3. [3]

    Fredricks, Phyllis C

    Jennifer A. Fredricks, Phyllis C. Blumenfeld, and Alison H. Paris. School engagement: Potential of the concept, state of the evidence.Review of Educational Research, 74(1):59–109, 2004

  4. [4]

    Balasubramanian

    Abhay Gupta, Arjun D’Cunha, Kamal Awasthi, and Vi- neeth N. Balasubramanian. DAiSEE: Towards user en- gagement recognition in the wild. InarXiv preprint arXiv:1609.01885, 2016

  5. [5]

    Fine-grained engagement recognition in online learn- ing environment

    Tengfei Huang, Yu Mei, Haijun Zhang, Shuai Liu, and Hong Yang. Fine-grained engagement recognition in online learn- ing environment. InProceedings of the IEEE International Conference on Electronics Information and Emergency Com- munication (ICEIEC), pages 338–341, 2019

  6. [6]

    Prediction and localisation of student engage- ment in the wild

    Amanjot Kaur, Arslan Mustafa, Lavisha Mehta, and Abhi- nav Dhall. Prediction and localisation of student engage- ment in the wild. InProceedings of the IEEE International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 637–644, 2018

  7. [7]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1): 159–174, 1977

  8. [8]

    BLIP: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InProceedings of the International Conference on Machine Learning (ICML), pages 12888–12900, 2022

  9. [9]

    Deep facial spa- tiotemporal network for engagement prediction in online learning

    Jiaming Liao, Yan Liang, and Jiahui Pan. Deep facial spa- tiotemporal network for engagement prediction in online learning. InApplied Intelligence, pages 6609–6621, 2021

  10. [10]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  11. [11]

    Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 8086–8098, 2022

  12. [12]

    Visual classification via description from large language models

    Sachit Menon and Carl V ondrick. Visual classification via description from large language models. InProceedings of the International Conference on Learning Representations (ICLR), 2023

  13. [13]

    GPT-4o system card

    OpenAI. GPT-4o system card. Technical report, OpenAI, 2024

  14. [14]

    Pianta, Karen M

    Robert C. Pianta, Karen M. La Paro, and Bridget K. Hamre. Classroom Assessment Scoring System (CLASS) Manual. Paul H. Brookes Publishing, 2008

  15. [15]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  16. [16]

    Evaluating vision language models in detecting learning en- gagement

    Jayant Teotia, Xulang Zhang, Rui Mao, and Erik Cambria. Evaluating vision language models in detecting learning en- gagement. InProc. IEEE International Conference on Data Mining Workshops (ICDMW), pages 496–502, Abu Dhabi, UAE, 2024

  17. [17]

    Predicting stu- dent engagement in classrooms using facial behavioral cues

    Clinton Thomas and Dinesh Babu Jayagopi. Predicting stu- dent engagement in classrooms using facial behavioral cues. InProceedings of the 1st ACM SIGCHI International Work- shop on Multimodal Interaction for Education (MIE), pages 33–40, 2017

  18. [18]

    Student classroom behaviour real-time detection based on YOLOv5 and identity recognition

    Junjie Wang, Sisi Jiang, Kedi Wang, and Mingjie Li. Student classroom behaviour real-time detection based on YOLOv5 and identity recognition. InSystems and Soft Computing, page 200058. Elsevier, 2023

  19. [19]

    Le, and Denny Zhou

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large lan- guage models. InAdvances in Neural Information Processing Systems (NeurIPS), pages 24824–24837, 2022

  20. [20]

    Movellan

    Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Fos- ter, and Javier R. Movellan. The faces of engagement: Auto- matic recognition of student engagementfrom facial expres- sions. InIEEE Transactions on Affective Computing, pages 86–98, 2014

  21. [21]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chau- mond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R ´emi Louf, Morgan Funtowicz, Joe Brew, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Trans- formers: State-of-the-art natural language processing. In Proceedings of the 2020 ...

  22. [22]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Mason, Trista Shi, Tristan Naumann, Cliff Bernard, Robin Gallo, Brian Piening, Frederick A. Matsen IV , Masoud Rouhizadeh, Wei Wei, Stephen Snider, and Hoifung Poon. Large-scale domain-specific pretraining for biomedical vision- language processing. InarXiv preprint arXiv:2303.00915, 2023

  23. [23]

    Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh

    Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. InProceedings of the International Con- ference on Machine Learning (ICML), pages 12697–12706, 2021. 8 Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and...

  24. [24]

    Full Prompt Texts This supplementary provides the exact prompt text used for each model and dataset in our zero-shot benchmark, enabling full reproducibility. 7.1. SCB Dataset | GPT-4o, LLaV A-1.5, and Qwen2.5-VL-7B These prompts were applied to the scene-level SCB images for GPT-4o, LLaV A-1.5, and Qwen2.5-VL-7B. Each image was submitted directly; the mo...