Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions
Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3
The pith
Standard deep learning models show limited success recognizing ambivalence and hesitancy in videos, indicating that better methods for handling multimodal conflicts are needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying standard deep learning pipelines to multimodal video for ambivalence and hesitancy recognition produces only limited accuracy, demonstrating that existing architectures are insufficient to exploit affective inconsistencies within and across modalities and that specialized spatio-temporal and multimodal fusion techniques will be required before such recognition can support personalized digital health interventions.
What carries the argument
Multimodal video analysis pipelines that combine facial, vocal, linguistic, and body cues to detect affective inconsistency, evaluated in supervised, domain-adaptation, and large-language-model zero-shot regimes on the BAH dataset.
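To make "affective inconsistency" concrete, the following is a minimal sketch, not the paper's method. It assumes each modality has already been reduced to a valence score in [-1, 1]; the modality names, the example scores, and both function names are hypothetical. It shows one simple way to quantify disagreement across modalities and sign flips within a single modality over time.

```python
from itertools import combinations
from statistics import mean


def cross_modal_inconsistency(valence):
    """Mean absolute gap between every pair of per-modality valence scores."""
    pairs = combinations(valence.values(), 2)
    return mean(abs(a - b) for a, b in pairs)


def within_modality_inconsistency(series):
    """Fraction of consecutive time steps where the valence changes sign."""
    flips = sum(1 for a, b in zip(series, series[1:]) if a * b < 0)
    return flips / max(len(series) - 1, 1)


# Hypothetical valence scores for one video segment (assumed, not from BAH).
segment = {"face": 0.6, "voice": -0.4, "text": 0.5, "body": -0.1}
print(round(cross_modal_inconsistency(segment), 3))

# Hypothetical valence trajectory for a single modality over four time steps.
print(round(within_modality_inconsistency([0.5, -0.2, 0.3, 0.4]), 3))
```

A hand-written score like this only illustrates the signal. The paper's position is that learned spatio-temporal and cross-modal fusion is needed to exploit it.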
If this is right
- Improved fusion methods would allow digital health systems to detect when a user is wavering between acceptance and refusal of a recommended behavior.
- Such detection would enable real-time personalization of interventions, for example by adjusting message framing or timing when ambivalence is flagged.
- Domain adaptation and zero-shot LLM routes both inherit the same fusion shortcomings, so gains would require changes to the underlying video representation rather than only the training regime.
- Accurate A/H recognition could reduce the cost and improve the scalability of behavior-change support in settings where in-person experts are unavailable.
Where Pith is reading between the lines
- Similar fusion limitations may appear in other video tasks that rely on detecting internal contradictions, such as multimodal deception detection or conflicting sentiment in conversation.
- If better models are built, they could be tested for transfer to related affective states like uncertainty or mixed emotions in clinical or educational video data.
- The finding suggests that progress on this task may depend more on new architectural primitives for inconsistency modeling than on simply scaling data or model size.
Load-bearing premise
That off-the-shelf deep learning video models can detect subtle emotional conflicts across and within modalities without new architectural adaptations for spatio-temporal fusion.
What would settle it
A new model that adds explicit spatio-temporal and cross-modal fusion layers and then achieves substantially higher accuracy on the same BAH test videos would show that the current limited performance is not an inherent limit of the task.
Original abstract
Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the application of deep learning to multimodal ambivalence/hesitancy (A/H) recognition in videos from the BAH dataset. It evaluates three setups—supervised learning, unsupervised domain adaptation for personalization, and LLM zero-shot inference—and reports limited performance, concluding that more adapted models are needed for spatio-temporal and cross-modal fusion to capture affective inconsistencies.
Significance. If the reported limited performance is substantiated with quantitative evidence, the work usefully identifies open challenges in affective computing for digital health interventions, particularly the difficulty of modeling subtle conflicts within and across modalities. The inclusion of multiple learning paradigms (supervised, domain adaptation, zero-shot) is a positive aspect that broadens the empirical scope.
major comments (2)
- [Abstract] Abstract: The central claim that 'our results show limited performance' is load-bearing for the recommendation of better spatio-temporal and multimodal fusion methods, yet the abstract supplies no accuracy, F1, or other quantitative metrics, no baseline comparisons, no dataset statistics (e.g., number of videos, class balance), and no error bars. Without these, the claim that standard architectures are insufficient cannot be evaluated.
- [Experiments] Results/Experiments section (inferred from the three learning setups described): The manuscript states that standard deep learning architectures yield limited performance on A/H recognition but provides no details on the specific video models used (e.g., which spatio-temporal backbones, fusion strategies, or loss functions), making it impossible to assess whether the 'limited performance' stems from architectural limitations or from implementation choices.
minor comments (2)
- [Abstract] The abstract and introduction use 'A/H' without an initial definition on first use; expand the acronym at first mention for clarity.
- [Introduction] The manuscript refers to the 'unique and recently published BAH video dataset' but does not cite its source or provide a reference; add the appropriate citation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the clarity and substantiation of our claims about the challenges in multimodal A/H recognition. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that 'our results show limited performance' is load-bearing for the recommendation of better spatio-temporal and multimodal fusion methods, yet the abstract supplies no accuracy, F1, or other quantitative metrics, no baseline comparisons, no dataset statistics (e.g., number of videos, class balance), and no error bars. Without these, the claim that standard architectures are insufficient cannot be evaluated.
  Authors: We agree that the abstract should be more self-contained to support the central claim. In the revised version, we will expand the abstract to report key quantitative results (accuracy, F1, and other relevant metrics for the supervised, domain adaptation, and zero-shot setups), include BAH dataset statistics (number of videos, class balance), reference baseline comparisons, and note error bars or variability from our runs. This will allow readers to directly evaluate the evidence for needing improved spatio-temporal and cross-modal fusion methods.
  Revision: yes
- Referee: [Experiments] Results/Experiments section (inferred from the three learning setups described): The manuscript states that standard deep learning architectures yield limited performance on A/H recognition but provides no details on the specific video models used (e.g., which spatio-temporal backbones, fusion strategies, or loss functions), making it impossible to assess whether the 'limited performance' stems from architectural limitations or from implementation choices.
  Authors: We acknowledge the need for greater specificity in the experimental description. We will revise the Experiments section to explicitly detail the spatio-temporal backbones (e.g., the video encoders used for visual features), cross-modal fusion strategies (e.g., late fusion, attention mechanisms, or other approaches), and loss functions applied in each of the three learning setups. These additions will enable assessment of whether the observed limited performance arises from the task's inherent difficulties (affective inconsistencies across modalities) or from the particular implementation choices, thereby strengthening the motivation for more adapted models.
  Revision: yes
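The quantities discussed in the first response (accuracy, F1, class balance, and run-to-run variability) can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' evaluation code: the labels and predictions are made up, the binary encoding (1 = A/H present) is assumed, and the function names and the bootstrap resampling scheme are hypothetical choices for estimating variability.

```python
import random
from statistics import stdev


def accuracy(y_true, y_pred):
    """Share of predictions that match the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_std(metric, y_true, y_pred, n_resamples=200, seed=0):
    """Standard deviation of a metric over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return stdev(scores)


# Made-up labels and predictions (assumed encoding: 1 = A/H present, 0 = absent).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

print("positive share:", sum(y_true) / len(y_true))
print("accuracy:", accuracy(y_true, y_pred))
print("F1:", round(f1_binary(y_true, y_pred), 3))
print("F1 bootstrap std:", round(bootstrap_std(f1_binary, y_true, y_pred), 3))
```

Reporting F1 next to the positive-class share matters here because, on an imbalanced test set, accuracy alone can look acceptable for a model that rarely predicts A/H.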
Circularity Check
No significant circularity
This is a purely empirical application paper that applies off-the-shelf supervised video models, domain-adaptation techniques, and LLM zero-shot inference to the BAH dataset and reports the resulting performance numbers. No derivations, equations, parameter fittings, or self-referential definitions appear in the work; the modest conclusion that current architectures yield limited performance and that better spatio-temporal fusion is needed follows directly from the tabulated experimental outcomes without any reduction to the paper's own inputs or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Deep learning models can learn representations of subtle emotional states from multimodal video data when trained on sufficient examples.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We explore the application of deep learning models for A/H recognition in videos... supervised learning, unsupervised domain adaptation... zero-shot inference via large language models... multimodal fusion... spatio-temporal modeling
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.