BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change
Pith reviewed 2026-05-19 13:04 UTC · model grok-4.3
The pith
This paper introduces the BAH dataset of 1,427 annotated videos to train machine learning models that detect ambivalence and hesitancy during digital health interventions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Behavioural Ambivalence/Hesitancy (BAH) dataset, comprising 1,427 videos totaling 10.60 hours from 300 participants responding to predefined elicitation questions, supplies the multimodal video material and expert annotations required to develop machine learning models for A/H recognition in digital behaviour change settings. The dataset supplies binary presence/absence labels, cue annotations at frame and video level, transcripts, cropped faces, and metadata. Reported baseline results on frame- and video-level tasks show modest performance and thereby indicate that existing approaches must be adapted for this subtle, cross-modal phenomenon.
What carries the argument
The BAH dataset itself, which records participants answering questions designed to provoke ambivalence or hesitancy and supplies expert-provided timestamps and labels for A/H presence and cues across video, audio, and text modalities.
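A minimal sketch of how expert timestamps could be turned into the frame- and video-level binary labels described above. The segment format (start and end in seconds), the frame rate, and both function names are assumptions made for illustration; the released BAH files may store annotations differently.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # hypothetical format: (start_sec, end_sec)


def segments_to_frame_labels(
    segments: List[Segment], num_frames: int, fps: float
) -> List[int]:
    """Label a frame 1 if its timestamp lies inside any [start, end) segment."""
    labels = []
    for i in range(num_frames):
        t = i / fps  # timestamp of frame i in seconds
        inside = any(start <= t < end for start, end in segments)
        labels.append(int(inside))
    return labels


def video_label(frame_labels: List[int]) -> int:
    """A video counts as A/H-positive if at least one frame is positive."""
    return int(any(frame_labels))


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    frames = segments_to_frame_labels([(0.5, 1.0)], num_frames=8, fps=4.0)
    print(frames, video_label(frames))
```

Deriving the video label from the frame labels keeps the two levels consistent by construction, though the paper may define the video-level label independently.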
If this is right
- Researchers can now train and benchmark multimodal models specifically for ambivalence and hesitancy recognition using a public resource.
- Digital health platforms could incorporate real-time A/H detection to adjust messaging or support when users show hesitation.
- Standardized evaluation of spatio-temporal and cross-modal architectures becomes possible for this class of subtle emotional states.
- The binary A/H label scheme offers a practical starting point for deployment even though ambivalence and hesitancy are closely related.
Where Pith is reading between the lines
- Integration of BAH-trained models into smartphone apps might increase completion rates for behavior change programs by catching hesitation early.
- The dataset's Canadian participant base leaves open the question of whether similar elicitation and annotation protocols would work across other cultural groups.
- Future work could test whether adding physiological signals from wearables improves recognition accuracy beyond the current video-plus-transcript setup.
Load-bearing premise
The predefined questions produce genuine ambivalence or hesitancy that experts can annotate consistently and that the resulting labels will transfer to real digital interventions without major domain shift.
What would settle it
Demonstrating low inter-annotator agreement among experts on the A/H labels, or showing that models trained on BAH perform no better than chance when tested on videos collected from actual deployed digital behaviour change programs.
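The first of these tests can be sketched with Fleiss' kappa, the agreement statistic the referee report names. The code below is an illustrative implementation of the standard formula, not the authors' analysis; the input layout (one row per video, one binary label per expert) and the toy ratings are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings[i][r] is the category rater r assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})

    # counts[i][j]: number of raters who put item i in category j.
    counts: List[List[int]] = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy ratings from three hypothetical experts on four videos.
    toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
    print(fleiss_kappa(toy))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. For the timestamped segments, the same function could be applied to per-frame labels, although neighbouring frames are not independent items.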
Original abstract
Ambivalence and hesitancy (A/H), closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. They are subtle and conflicting emotions that set a person in a state between positive and negative orientations, or between acceptance and refusal to do something. They manifest as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H, as is done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no datasets currently exist for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours, captured from 300 participants across Canada, answering predefined questions to elicit A/H. It is intended to mirror real-world digital behaviour change interventions delivered online. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participant metadata are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, and different learning setups. The limited performance highlights the need for adapted multimodal and spatio-temporal models for A/H recognition. The data and code are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset containing 1,427 videos (10.60 hours total) from 300 participants across Canada who responded to a fixed set of predefined questions intended to elicit ambivalence/hesitancy (A/H). The dataset supplies multimodal video data, three-expert annotations with timestamps and frame-/video-level binary A/H labels plus cues, transcripts, cropped faces, and participant metadata. Baseline benchmarking results for frame- and video-level recognition are reported under several learning setups, with limited performance noted, and the data and code are released publicly to support ML model development for digital behaviour-change interventions.
Significance. If the expert labels are shown to be reliable, the BAH dataset would constitute the first public resource for multimodal A/H recognition and could enable progress on personalised digital health interventions. The public release of data and code, together with the honest reporting of modest baseline performance, are clear strengths that support reproducibility and further model development in affective computing and computer vision.
major comments (2)
- [Dataset Collection and Annotation] No inter-rater reliability statistics (e.g., Fleiss' kappa or percentage agreement) are reported for the three experts' binary A/H annotations and timestamps. Because the dataset's utility for training ML models rests directly on label quality and consistency, this omission is load-bearing and must be addressed before the resource can be confidently used for model development.
- [Data Collection Procedure] The manuscript provides no pilot validation, operational definition of A/H cues, or evidence that the fixed set of questions reliably induces genuine ambivalence/hesitancy states rather than other affective responses or demand characteristics. This directly affects the claim that the dataset mirrors real-world digital interventions and transfers without substantial domain shift.
minor comments (2)
- [Benchmarking] The architectures, hyperparameters, and training protocols of the baseline models should be described in greater detail (including exact loss functions and data splits) to allow independent reproduction of the reported frame- and video-level results.
- [Figures and Supplementary Material] Video examples and annotation visualisations would benefit from more explicit descriptions of the depicted A/H cues so that readers can interpret them without direct access to the released data.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to the next version of the paper.
- Referee: [Dataset Collection and Annotation] No inter-rater reliability statistics (e.g., Fleiss' kappa or percentage agreement) are reported for the three experts' binary A/H annotations and timestamps. Because the dataset's utility for training ML models rests directly on label quality and consistency, this omission is load-bearing and must be addressed before the resource can be confidently used for model development.
  Authors: We agree that inter-rater reliability statistics are essential to demonstrate label quality and consistency. We have computed Fleiss' kappa and percentage agreement for the binary A/H annotations as well as for the timestamped segments. These metrics will be reported in a new subsection of the annotation protocol in the revised manuscript, confirming substantial agreement among the three experts.
  Revision: yes
- Referee: [Data Collection Procedure] The manuscript provides no pilot validation, operational definition of A/H cues, or evidence that the fixed set of questions reliably induces genuine ambivalence/hesitancy states rather than other affective responses or demand characteristics. This directly affects the claim that the dataset mirrors real-world digital interventions and transfers without substantial domain shift.
  Authors: We acknowledge that additional procedural details would strengthen the manuscript. In the revision we will expand the data collection section to include the operational definitions of A/H cues supplied to the annotators and the rationale for the question set, which was drawn from established behavioral-change instruments designed to surface mixed feelings. A formal pilot validation study with quantitative induction metrics was not performed prior to the main collection; we will therefore note this limitation and discuss potential domain-shift considerations when using the dataset for real-world interventions.
  Revision: partial
Circularity Check
No circularity: dataset collection and benchmarking paper with no derivations or predictions
The paper is a data collection and benchmarking effort that introduces the BAH dataset from 300 participants answering predefined questions, with expert annotations for A/H presence, timestamps, and cues, plus baseline model results. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claim (no prior datasets exist for A/H recognition, and BAH fills the gap) is a factual statement about data availability rather than a reduction to its own inputs. Annotations and benchmarks are presented as empirical outputs without circular self-definition or uniqueness theorems imported from prior author work.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos... annotated by three experts... binary annotation indicating the presence or absence of A/H.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat.induction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Baseline results... using ResNet101... multimodal fusion (LFAN, CAN, MT, JMT)... temporal modelling with TCN
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.