arxiv: 2508.20765 · v2 · submitted 2025-08-28 · 💻 cs.CV · cs.AI

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

Pith reviewed 2026-05-18 20:23 UTC · model grok-4.3

keywords abstract concept recognition · video understanding · foundation models · survey · multimodal models · computer vision

The pith

Foundation models offer an ideal opportunity to recognize abstract concepts like justice and freedom in videos by building on past research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys tasks and datasets developed over time for recognizing abstract concepts in video content. It claims that researchers have repeatedly attempted these problems using the best available tools at each stage. Recent progress in foundation models creates a fresh chance to make real headway on this challenge. Drawing from decades of prior experience can prevent repeating old mistakes when applying multimodal models. Success here would let video understanding systems reason at the high-level semantic layers that match human values and context.

Core claim

Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. The authors argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. The survey examines different tasks and datasets used to understand abstract concepts in video content and advocates that drawing on decades of community experience will help shed light on this important open grand challenge and avoid re-inventing the wheel.

What carries the argument

Survey of tasks and datasets for abstract concept recognition in videos, which shows periodic research efforts using tools available at the time and now positions foundation models as the next step.

If this is right

  • Video understanding systems would move beyond detecting visible objects and actions to reasoning about high-level ideas.
  • Models would align more closely with human reasoning and values through contextual multi-level analysis.
  • Prior research on abstract tasks can be reused rather than restarted when applying new multimodal foundation models.
  • The field gains a clearer path to solving an open grand challenge without repeating past cycles of effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video search and recommendation tools could retrieve content by themes such as togetherness or freedom instead of surface features alone.
  • This line of work might extend to ethical AI systems that interpret social implications in visual media.
  • Direct comparisons of foundation models against historical dataset results could test whether they truly close the gap in contextual reasoning.

Load-bearing premise

The tasks and datasets covered in the survey adequately represent the problem of abstract concept recognition, and foundation models can overcome the limits of prior approaches without running into new fundamental barriers.

What would settle it

An experiment in which foundation models show no improvement over earlier specialized methods when tested on the surveyed abstract concept video datasets would undermine the central claim.
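
The comparison described above could be scored along the following lines. This is a minimal sketch, not a protocol from the paper: the function name `compare_on_benchmarks`, the score dictionaries, and the dataset names are all hypothetical, and the numbers in the usage lines are placeholders rather than reported results.

```python
from statistics import mean


def compare_on_benchmarks(foundation_scores, baseline_scores):
    """Compare two methods on the datasets they were both evaluated on.

    Both arguments map a dataset name to a score where higher is better.
    The names and scores are hypothetical inputs supplied by the caller.
    """
    shared = sorted(set(foundation_scores) & set(baseline_scores))
    if not shared:
        raise ValueError("the two methods share no datasets")
    # Positive delta: the foundation model beats the earlier method.
    deltas = {name: foundation_scores[name] - baseline_scores[name]
              for name in shared}
    return {
        "deltas": deltas,
        "mean_delta": mean(list(deltas.values())),
        "improved_on": [name for name in shared if deltas[name] > 0],
    }


# Placeholder scores for illustration only, not measured results.
report = compare_on_benchmarks(
    {"dataset_a": 0.70, "dataset_b": 0.50},
    {"dataset_a": 0.60, "dataset_b": 0.50, "dataset_c": 0.90},
)
print(report["improved_on"], report["mean_delta"])
```

Under this sketch, an empty `improved_on` list and a non-positive `mean_delta` across the surveyed datasets would be the outcome that undermines the central claim. A real study would also need matched evaluation splits and a significance test.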

Original abstract

The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid "re-inventing the wheel" as we start revisiting it in the era of multi-modal foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a survey on abstract concept recognition in video understanding. It reviews tasks and datasets for recognizing high-level abstract concepts (e.g., justice, freedom, togetherness) that require contextual and multi-level reasoning beyond concrete visual entities. The central claim is that recent foundation models create an ideal setting for this challenge and that the community should draw on decades of prior research attempts to avoid reinventing the wheel.

Significance. If the survey provides a representative overview of historical tasks/datasets and identifies transferable lessons, it could usefully orient future work on aligning video models with human-like abstract reasoning. The directional argument linking foundation models to this open problem is timely and could help consolidate community efforts.

major comments (2)
  1. [Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.
  2. [Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.
minor comments (2)
  1. Clarify the taxonomy or categorization scheme used to group the surveyed tasks and datasets for easier navigation.
  2. Ensure all cited prior works include publication years and venues in the reference list for historical context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for minor revision. We address the major comments below and will update the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.

    Authors: We acknowledge that explicit documentation of the survey methodology is important for establishing the representativeness of the reviewed tasks and datasets. In the revised version, we will introduce a 'Survey Scope and Methodology' subsection in the Introduction. This will specify the inclusion criteria (focusing on tasks that involve reasoning about abstract concepts requiring contextual, multi-level understanding beyond concrete visual elements), the search and selection process, quantitative coverage (e.g., total papers, tasks, and datasets surveyed), and a balanced discussion of limitations and gaps, such as potential biases toward certain video domains or under-explored abstract concepts. This addition will provide a solid foundation for the claims and guide future research. revision: yes

  2. Referee: [Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.

    Authors: We agree that providing concrete linkages will make the argument more compelling. We will enhance the review section by adding targeted examples and a summary mapping. Specifically, we will highlight cases where earlier methods (e.g., in video emotion recognition or social scene understanding) were limited by insufficient modeling of long-term context or multimodal integration, and contrast this with foundation models' strengths in self-attention, large-scale pretraining, and zero-shot generalization. A new table or bullet-point summaries will map 4-5 representative limitations to corresponding FM capabilities, while maintaining a cautious tone that these are promising directions rather than guaranteed successes. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

This is a survey paper with no derivations, equations, fitted parameters, or new technical predictions. Its central claim is directional—that foundation models create an ideal setting for abstract concept recognition in video and that prior community experience should inform the effort—without reducing any result to quantities defined by its own choices or self-citations. All referenced tasks, datasets, and prior approaches are drawn from external literature rather than constructed internally, making the paper self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the domain assumption that abstract concepts are recognizable from video context and that historical attempts provide reusable insights; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Reasoning on multiple semantic levels based on contextual information is key to abstract concept recognition in video.
    Invoked in the abstract when stating that abstract concept recognition forms a crucial open challenge where such reasoning is key.




Reference graph

Works this paper leans on

233 extracted references · 233 canonical work pages · 10 internal anchors

  1. [1]

    https://www.red-dot.org/de/project/ freedom-of-speech-20384-20384

    Dot, R.: Red Dot Design Award: FREE- DOM OF SPEECH — red-dot.org. https://www.red-dot.org/de/project/ freedom-of-speech-20384-20384. [Accessed 17-02-2025]. https://doi.org/10.5553/ab/ 0165-13312018098005031

  2. [2]

    [Place of Publication Not Identified: Publisher Not Identified, to 1943] (1942)

    War Production Co-Ordinating Committee: We Can Do It! Rosie the Riveter. [Place of Publication Not Identified: Publisher Not Identified, to 1943] (1942). https://doi.org/ 10.3735/9781961844179.book-part-147

  3. [3]

    : Deep learning for generic object detection: A survey

    Liu, L., Ouyang, W., Wang, X., Fieguth, P.W., Chen, J., et al. : Deep learning for generic object detection: A survey. International Journal of Computer Vision 128, 261–318 (2018) https://doi.org/10. 1007/S11263-019-01247-4

  4. [4]

    : Deep learning for scene classification: A survey

    Zeng, D., Liao, M., Tavakolian, M., Guo, Y., Zhou, B., et al. : Deep learning for scene classification: A survey. arXiv preprint arXiv:2101.10531 (2021) https://doi.org/10. 48550/arXiv.2101.10531

  5. [5]

    Interna- tional Journal of Computer Vision 130, 1366–1401 (2018) https://doi.org/10.1007/ S11263-022-01594-9

    Kong, Y., Fu, Y.R.: Human action recog- nition and prediction: A survey. Interna- tional Journal of Computer Vision 130, 1366–1401 (2018) https://doi.org/10.1007/ S11263-022-01594-9

  6. [6]

    Human Perception of Visual Information, 85 (2022) https://doi

    Zhao, S., Huang, Q., Tang, Y., Yao, X., Yang, J., et al.: Computational emotion analysis from images: Recent advances and future directions. Human Perception of Visual Information, 85 (2022) https://doi. org/10.1007/978-3-030-81465-6 4

  7. [7]

    In: International Conference on Image Analysis and Processing (2019)

    Stefanini, M., Cornia, M., Baraldi, L., Corsini, M., Cucchiara, R.: Artpedia: A new visual-semantic dataset with visual and con- textual sentences in the artistic domain. In: International Conference on Image Analysis and Processing (2019). https://doi.org/10. 1007/978-3-030-30645-8 66

  8. [8]

    Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

    Pandiani, D.S.M., Presutti, V.: Seeing the intangible: Survey of image classification into high-level and abstract categories. ArXiv preprint abs/2308.10562 (2023) https://doi.org/10.48550/arXiv.2308.10562

  9. [9]

    Interna- tional Journal of Computer Vision 60, 91– 110 (2004) https://doi.org/10.1023/b:visi

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional Journal of Computer Vision 60, 91– 110 (2004) https://doi.org/10.1023/b:visi. 0000029664.99615.94

  10. [10]

    2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) 1, 886–8931 (2005) https://doi

    Dalal, N., Triggs, B.: Histograms of ori- ented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) 1, 886–8931 (2005) https://doi. org/10.1109/cvpr.2005.177 26

  11. [11]

    In: AAAI Conference on Artificial Intelligence (2014)

    Jiang, Y.-G., Xu, B., Xue, X.: Predicting emotions in user-generated videos. In: AAAI Conference on Artificial Intelligence (2014). https://doi.org/10.1609/aaai.v28i1.8724

  12. [12]

    Proceedings of the 22nd ACM international conference on Multimedia (2014) https://doi.org/10.1145/ 2647868.2654927

    Lu, X., Lin, Z.L., Jin, H., Yang, J., Wang, J.Z.: Rapid: Rating pictorial aes- thetics using deep learning. Proceedings of the 22nd ACM international conference on Multimedia (2014) https://doi.org/10.1145/ 2647868.2654927

  13. [13]

    : Quality assessment in the era of large models: A survey

    Zhang, Z., Zhou, Y., Li, C., Zhao, B., Liu, X., et al. : Quality assessment in the era of large models: A survey. ACM Transac- tions on Multimedia Computing, Communi- cations and Applications (2024) https://doi. org/10.1145/3722559

  14. [14]

    : Emotion-llama: Mul- timodal emotion recognition and reason- ing with instruction tuning

    Cheng, Z., Cheng, Z.-Q., He, J.-Y., Wang, K., Lin, Y., et al. : Emotion-llama: Mul- timodal emotion recognition and reason- ing with instruction tuning. Advances in Neural Information Processing Systems 37, 110805–110853 (2024)

  15. [15]

    doi: 10.1109/TPAMI.2024

    Zhang, J., Huang, J., Jin, S., Lu, S.: Vision- language models for vision tasks: A sur- vey. IEEE Transactions on Pattern Analy- sis and Machine Intelligence 46, 5625–5644 (2023) https://doi.org/10.1109/tpami.2024. 3369699

  16. [16]

    International Journal of Digital Humanities 5, 451–490 (2023) https://doi.org/10.1007/ s42803-023-00077-8

    Pandiani, D.S.M., Lazzari, N., Erp, M., Pre- sutti, V.: Hypericons for interpretability: decoding abstract concepts in visual data. International Journal of Digital Humanities 5, 451–490 (2023) https://doi.org/10.1007/ s42803-023-00077-8

  17. [17]

    In: Workshop on Cognitive Aspects of the Lexicon (2024)

    Cerini, L., Bondielli, A., Lenci, A.: Repre- senting abstract concepts with images: An investigation with large language models. In: Workshop on Cognitive Aspects of the Lexicon (2024)

  18. [18]

    IEEE Trans

    Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.C.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell.22, 1349–1380 (2000) https://doi.org/10.1109/ 34.895972

  19. [19]

    In: 28th International Joint Con- ference on Artificial Intelligence, IJCAI 2019, pp

    Aditya, S., Yang, Y., Baral, C.: Integrating knowledge and reasoning in image under- standing. In: 28th International Joint Con- ference on Artificial Intelligence, IJCAI 2019, pp. 6252–6259 (2019). https://doi. org/10.24963/ijcai.2019/873 . International Joint Conferences on Artificial Intelligence

  20. [20]

    IEEE Transactions on Multime- dia 13, 303–319 (2011) https://doi.org/10

    Fu, Z., Lu, G., Ting, K.M., Zhang, D.: A sur- vey of audio-based music classification and annotation. IEEE Transactions on Multime- dia 13, 303–319 (2011) https://doi.org/10. 1109/tmm.2010.2098858

  21. [21]

    Philanthropy, B.: Empowering Girls in India (2024)

  22. [22]

    In: International Conference on Machine Learning (2023)

    Li, J., Li, D., Savarese, S., Hoi, S.C.H.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International Conference on Machine Learning (2023)

  23. [23]

    In: Computer Vision– ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I 11, pp

    Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using classemes. In: Computer Vision– ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I 11, pp. 776–789 (2010). https://doi.org/10. 1007/978-3-642-15549-9 56 . Springer

  24. [24]

    Proceedings of the 21st ACM international conference on Multimedia (2013) https://doi.org/10.1145/ 2502081.2502268

    Borth, D., Chen, T., Ji, R., Chang, S.-F.: Sentibank: large-scale ontology and clas- sifiers for detecting sentiment and emo- tions in visual content. Proceedings of the 21st ACM international conference on Multimedia (2013) https://doi.org/10.1145/ 2502081.2502268

  25. [25]

    In: Bengio, Y., LeCun, Y

    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)

  26. [27]

    Neural computation 9(8), 1735–1780 (1997) https://doi.org/10.1162/ neco.1997.9.8.1735

    Hochreiter, S., Schmidhuber, J.: Long short- term memory. Neural computation 9(8), 1735–1780 (1997) https://doi.org/10.1162/ neco.1997.9.8.1735

  27. [28]

    In: European Conference on Computer Vision (2023)

    Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large lan- guage models. In: European Conference on Computer Vision (2023). https://doi.org/ 10.1007/978-3-031-72952-2 19

  28. [29]

    Masked feature prediction for self-supervised visual pre-training

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., et al.: Learning trans- ferable visual models from natural language supervision. In: International Conference on Machine Learning (2021). https://doi.org/ 10.1109/cvpr52688.2022.00101

  29. [30]

    The Llama 3 Herd of Models

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al.: The llama 3 herd of models. ArXiv abs/2407.21783 (2024) https://doi.org/10.1016/s0749-0720(15) 31012-4

  30. [31]

    abstract semantics: From mental representations to functional brain mapping

    Mkrtychian, N.A., Blagovechtchenski, E.D., Kurmakaeva, D., Gnedykh, D.S., Kostro- mina, S.N., et al.: Concrete vs. abstract semantics: From mental representations to functional brain mapping. Frontiers in Human Neuroscience 13 (2019) https://doi. org/10.3389/fnhum.2019.00267

  31. [32]

    Journal of Cognition 6 (2023) https://doi.org/10

    Banks, B., Borghi, A.M., Fargier, R., Fini, C., Jonauskait˙ e, D., et al.: Consensus paper: Current perspectives on abstract concepts and future research directions. Journal of Cognition 6 (2023) https://doi.org/10. 5334/joc.238

  32. [33]

    https://doi.org/10.61508/refl.v25i2

    Bateman, J.A.: Text and Image: A Criti- cal Introduction to the Visual/Verbal Divide (2014). https://doi.org/10.61508/refl.v25i2. 166287

  33. [34]

    Behavior Research Methods 46, 904–911 (2014) https://doi.org/10.3758/ s13428-013-0403-5

    Brysbaert, M., Warriner, A.B., Kuper- man, V.: Concreteness ratings for 40 thousand generally known english word lemmas. Behavior Research Methods 46, 904–911 (2014) https://doi.org/10.3758/ s13428-013-0403-5

  34. [35]

    Cognitive science 29(5), 719–736 (2005) https://doi.org/10.1207/ s15516709cog0000 33

    Katja Wiemer-Hastings, K., Xu, X.: Con- tent differences for abstract and con- crete concepts. Cognitive science 29(5), 719–736 (2005) https://doi.org/10.1207/ s15516709cog0000 33

  35. [36]

    Philosophi- cal Transactions of the Royal Society B 378 (2022) https://doi.org/10.1098/rstb.2021

    Langland-Hassan, P., Davis, C.P.: A context-sensitive and non-linguistic approach to abstract concepts. Philosophi- cal Transactions of the Royal Society B 378 (2022) https://doi.org/10.1098/rstb.2021. 0355

  36. [37]

    Future Humanities (2024) https://doi.org/ 10.22541/au.171181017.78084528/v1

    Pandiani, D.S.M.: The wicked problem of naming the intangible: Abstract concepts, binary thinking, and computer vision labels. Future Humanities (2024) https://doi.org/ 10.22541/au.171181017.78084528/v1

  37. [38]

    In: Electronic Imaging (2006)

    Hare, J.S., Lewis, P.H., Enser, P.G.B., San- dom, C.J.: Mind the gap: another look at the problem of the semantic gap in image retrieval. In: Electronic Imaging (2006). https://doi.org/10.1117/12.647755

  38. [40]

    The communi- cation of ideas 37(1), 136–139 (1960)

    Lasswell, H.D.: The structure and function of communication in society. The communi- cation of ideas 37(1), 136–139 (1960)

  39. [41]

    : Teaching human behavior improves content understanding abilities of vlms

    Singh, S.K., Harini, S., Singla, Y.K., Chen, C., Shah, R.R., et al. : Teaching human behavior improves content understanding abilities of vlms. In: The Thirteenth Inter- national Conference on Learning Represen- tations (2024)

  40. [42]

    Graham, F.Q

    Kinney, R.M., Anastasiades, C., Authur, R., Beltagy, I., Bragg, J., et al.: The semantic scholar open data platform. ArXiv preprint abs/2301.10140 (2023) https://doi.org/ 10.48550/ARXIV.2301.10140 28

  41. [43]

    https://portal.core

    CORE – Conference Ranking Portal: CORE Conference Rankings. https://portal.core. edu.au/conf-ranks/. Accessed: 2025-06-30 (2025)

  42. [46]

    In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp

    Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: Mteb: Massive text embed- ding benchmark. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037 (2023). https://doi.org/10. 18653/v1/2023.eacl-main.148

  43. [47]

    Communication Theory 9, 119–161 (1999) https://doi.org/10.1111/j.1468-2885

    Craig, R.T.: Communication theory as a field. Communication Theory 9, 119–161 (1999) https://doi.org/10.1111/j.1468-2885. 1999.tb00355.x

  44. [48]

    Circle loss: A unified perspective of pair similarity optimization

    Epstein, D., Chen, B., Vondrick, C.: Oops! predicting unintentional action in video. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 916–926 (2019) https://doi.org/10.1109/ cvpr42600.2020.00100

  45. [50]

    : Funqa: Towards sur- prising video comprehension

    Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., et al. : Funqa: Towards sur- prising video comprehension. In: Euro- pean Conference on Computer Vision, pp. 39–57 (2024). https://doi.org/10.1007/ 978-3-031-73232-4 3 . Springer

  46. [51]

    and Joo, J

    Xu, X., Lu, Y., Lu, Z., Xiang, T.: Vid2int: Detecting implicit intention from long dia- log videos. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 3298–3307 (2021) https://doi.org/10.1109/ wacv48630.2021.00334

  47. [52]

    0: A large-scale bench- mark dataset for multimodal intent recogni- tion and out-of-scope detection in conversa- tions

    Zhang, H., Wang, X., Xu, H., Zhou, Q., Gao, K., et al.: Mintrec2. 0: A large-scale bench- mark dataset for multimodal intent recogni- tion and out-of-scope detection in conversa- tions. In: The Twelfth International Confer- ence on Learning Representations (2024)

  48. [54]

    In: CVPR

    Jia, M., Wu, Z., Reiter, A., Cardie, C., Belongie, S.J., et al.: Intentonomy: a dataset and study towards human intent under- standing. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12981–12991 (2020) https://doi. org/10.1109/cvpr46437.2021.01279

  49. [55]

    : The konstanz natu- ral video database (konvid-1k)

    Hosu, V., Hahn, F., Jenadeleh, M., Lin, H., Men, H., et al. : The konstanz natu- ral video database (konvid-1k). In: 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6 (2017). https://doi.org/10.1109/qomex. 2017.7965673 . IEEE

  50. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Ying, Z., Mandal, M., Ghadiyaram, D., Bovik, A.: Patch-vq:’patching up’the video quality problem. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14019–14029 (2021). https://doi.org/10.1109/cvpr46437. 2021.01380

  51. [57]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp

    Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., et al.: Exploring video quality assessment on user generated contents from aesthetic 29 and technical perspectives. In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision, pp. 20144–20154 (2023). https://doi.org/10.1109/iccv51070. 2023.01843

  52. [58]

    2012 IEEE Conference on Computer Vision and Pattern Recognition, 2408–2415 (2012) https://doi.org/10.1109/ cvpr.2012.6247954

    Murray, N., Marchesotti, L., Perronnin, F.: Ava: A large-scale database for aesthetic visual analysis. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2408–2415 (2012) https://doi.org/10.1109/ cvpr.2012.6247954

  53. [59]

    IEEE Transactions on Image Processing 25(1), 372–387 (2015) https://doi.org/10.1109/tip.2015.2500021

    Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25(1), 372–387 (2015) https://doi.org/10.1109/tip.2015.2500021

  54. [60]

    Chudnovsky and S

    Sun, W., Zhou, F., Liao, Q.: Mdid: A mul- tiply distorted image database for image quality assessment. Pattern Recognition 61, 153–168 (2017) https://doi.org/10.1016/j. patcog.2016.07.033

  55. [61]

    arXiv preprint arXiv:1803.08489 (2018) https://doi.org/10

    Lin, H., Hosu, V., Saupe, D.: Koniq- 10k: Towards an ecologically valid and large-scale iqa database. arXiv preprint arXiv:1803.08489 (2018) https://doi.org/10. 1109/tip.2020.2967829

  56. [62]

    Momentum contrast for unsupervised visual representation learning

    Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smart- phone photography. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020). https://doi.org/10.1109/cvpr42600. 2020.00373

  57. [63]

    Oona Rainio, Jarmo Teuho, and Riku Klén

    HariniS, I., Singh, S., Singla, Y.K., Bhat- tacharyya, A., Baths, V., et al.: Long-term ad memorability: Understanding & generat- ing memorable ads. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5707–5718 (2023) https: //doi.org/10.1109/wacv61041.2025.00557

  58. [64]

    : Automatic under- standing of image and video advertisements

    Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., et al. : Automatic under- standing of image and video advertisements. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1705–1715 (2017). https://doi.org/10. 1109/cvpr.2017.123

  59. [65]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Abu-El-Haija, S., Kothari, N., Lee, J., Nat- sev, P., Toderici, G., et al.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016) https://doi.org/10.48550/arXiv.1609.08675

  60. [66]

    In: CVPR

    Wu, C., Kr¨ ahenb¨ uhl, P.: Towards long-form video understanding. 2021 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), 1884–1894 (2021) https://doi.org/10.1109/cvpr46437.2021. 00192

  61. [67]

    Proceedings of the 29th ACM International Conference on Multimedia (2021) https://doi.org/10.1145/ 3474085.3479222

    Wang, Z., Wu, L., Li, Z., Xiong, J., Lu, Q.: Overview of tencent multi-modal ads video understanding. Proceedings of the 29th ACM International Conference on Multimedia (2021) https://doi.org/10.1145/ 3474085.3479222

  62. [68]

    : Mm-au:towards multimodal understanding of advertisement videos

    Bose, D., Hebbar, R., Feng, T., Somande- palli, K., Xu, A., et al. : Mm-au:towards multimodal understanding of advertisement videos. Proceedings of the 31st ACM Inter- national Conference on Multimedia (2023) https://doi.org/10.1145/3581783.3612371

  63. [69]

    2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv

    Zhang, Z., Dou, M., Peng, L., Pan, H., Bagci, U., et al. : Videoads for fast-paced video understanding: Where opensource foundation models beat gpt-4o & gemini- 1.5 pro. arXiv preprint arXiv:2504.09282 (2025) https://doi.org/10.48550/ARXIV. 2504.09282

  64. [70]

    Masked feature prediction for self-supervised visual pre-training

    Gupta, V., Mittal, T., Mathur, P., Mishra, V., Maheshwari, M., et al.: 3massiv: Multilingual, multimodal and multi-aspect dataset of social media short videos. 2022 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 21032–21043 (2022) https://doi. org/10.1109/cvpr52688.2022.02039

  65. [71]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., et al. : Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) https://doi.org/10. 30 48550/arXiv.2111.02114

  66. [73]

    Garcia, N., Vogiatzis, G.: How to read paintings: Semantic art understanding with multi-modal retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018). https://doi.org/10.1007/978-3-030-11012-3_52

  67. [74]

    Alameda-Pineda, X., Pilzer, A., Xu, D., Sebe, N., Ricci, E.: Viraliency: Pooling local virality. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 484–492 (2017) https://doi.org/10.1109/cvpr.2017.59

  68. [75]

    Chen, J., Song, X., Nie, L., Wang, X., Zhang, H., et al.: Micro tells macro: Predicting the popularity of micro-videos via a transductive model. Proceedings of the 24th ACM international conference on Multimedia (2016) https://doi.org/10.1007/s00530-020-00660-x

  69. [76]

    Jiang, L., Miao, Y., Yang, Y., Lan, Z., Hauptmann, A.: Viral video style: A closer look at viral videos on youtube. Proceedings of International Conference on Multimedia Retrieval (2014) https://doi.org/10.1145/2578726.2578754

  70. [77]

    Figueiredo, F., Benevenuto, F., Almeida, J.M.: The tube over time: characterizing popularity growth of youtube videos. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 745–754 (2011). https://doi.org/10.1109/tnsm.2019.2914222

  71. [78]

    Lakkaraju, H., McAuley, J., Leskovec, J.: What’s in a name? understanding the interplay between titles, content, and communities in social media. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, pp. 311–320 (2013). https://doi.org/10.1609/icwsm.v7i1.14408

  72. [79]

    Pang, B., Zha, K., Zhang, Y., Lu, C.: Further understanding videos through adverbs: A new video task. In: AAAI Conference on Artificial Intelligence (2020). https://doi.org/10.1609/aaai.v34i07.6855

  73. [80]

    Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., et al.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10626–10637 (2021) https://doi.org/10.1109/cvpr46437.2021.01049

  74. [81]

    Li, D., Liu, X., Xing, B., Xia, B., Zong, Y., et al.: Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model. ArXiv abs/2405.00574 (2024) https://doi.org/10.1109/uemcon62879.2024.10754673

  75. [82]

    Mazeika, M., Tang, E., Zou, A., Basart, S., Chan, J.S., et al.: How would the viewer feel? estimating wellbeing from video scenarios. Advances in Neural Information Processing Systems 35, 18571–18585 (2022) https://doi.org/10.4324/9780367855383-2

  76. [83]

    Ren, Z., Ortega, J., Wang, Y., Chen, Z., Whitney, D., et al.: Veatic: Video-based emotion and affect tracking in context dataset. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4455–4465 (2023) https://doi.org/10.1109/wacv57701.2024.00441

  77. [84]

    Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., Guibas, L.J.: Artemis: Affective language for visual art. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11569–11579 (2021). https://doi.org/10.1109/cvpr46437.2021.01140

  78. [85]

    Yang, J., Huang, Q., Ding, T., Lischinski, D., Cohen-Or, D., et al.: Emoset: A large-scale visual emotion dataset with rich attributes. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 20326–20337 (2023) https://doi.org/10.1109/iccv51070.2023.01864

  79. [86]

    Lv, J., Liu, W., Zhou, L., Wu, B., Ma, H.: Multi-stream fusion model for social relation recognition from videos. In: Conference on Multimedia Modeling (2018). https://doi.org/10.1007/978-3-319-73603-7_29

  80. [87]

    Liu, X., Liu, W., Zhang, M., Chen, J., Gao, L., et al.: Social relation recognition from videos via multi-scale spatial-temporal reasoning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3561–3569 (2019) https://doi.org/10.1109/cvpr.2019.00368

Showing first 80 references.