Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

Gowreesh Mago; Pascal Mettes; Stevan Rudinac

arxiv: 2508.20765 · v2 · submitted 2025-08-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 20:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords abstract concept recognition · video understanding · foundation models · survey · multimodal models · computer vision

The pith

Foundation models offer an ideal opportunity to recognize abstract concepts like justice and freedom in videos by building on past research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys tasks and datasets developed over time for recognizing abstract concepts in video content. It claims that researchers have repeatedly attempted these problems using the best available tools at each stage. Recent progress in foundation models creates a fresh chance to make real headway on this challenge. Drawing from decades of prior experience can prevent repeating old mistakes when applying multimodal models. Success here would let video understanding systems reason at the high-level semantic layers that match human values and context.

Core claim

Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. The authors argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. The survey examines different tasks and datasets used to understand abstract concepts in video content and advocates that drawing on decades of community experience will help shed light on this important open grand challenge and avoid re-inventing the wheel.

What carries the argument

Survey of tasks and datasets for abstract concept recognition in videos, which shows periodic research efforts using tools available at the time and now positions foundation models as the next step.

If this is right

Video understanding systems would move beyond detecting visible objects and actions to reasoning about high-level ideas.
Models would align more closely with human reasoning and values through contextual multi-level analysis.
Prior research on abstract tasks can be reused rather than restarted when applying new multimodal foundation models.
The field gains a clearer path to solving an open grand challenge without repeating past cycles of effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Video search and recommendation tools could retrieve content by themes such as togetherness or freedom instead of surface features alone.
This line of work might extend to ethical AI systems that interpret social implications in visual media.
Direct comparisons of foundation models against historical dataset results could test whether they truly close the gap in contextual reasoning.

Load-bearing premise

The tasks and datasets covered in the survey adequately represent the problem of abstract concept recognition and foundation models can overcome prior limits without facing new fundamental barriers.

What would settle it

An experiment in which foundation models show no improvement over earlier specialized methods when tested on the surveyed abstract concept video datasets would undermine the central claim.
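
One way to make this test concrete is sketched below in Python, assuming each method has been scored with the same metric on the same datasets. The function name `summarize_gap`, the dictionary layout, and the dataset keys and scores in the usage example are hypothetical placeholders, not results or datasets from the survey.

```python
from statistics import mean


def summarize_gap(baseline, foundation):
    """Compare two methods scored with the same metric on the same datasets.

    Both arguments map a dataset name to a score where higher is better.
    Only datasets present in both mappings are compared.
    """
    shared = sorted(set(baseline) & set(foundation))
    if not shared:
        raise ValueError("no datasets in common")
    # Per-dataset gain of the foundation model over the specialized baseline.
    gains = {name: foundation[name] - baseline[name] for name in shared}
    return {
        "gains": gains,
        "mean_gain": mean(gains.values()),
        "improved_on": sum(1 for g in gains.values() if g > 0),
        "datasets": len(shared),
    }


# Placeholder scores for illustration only; not measured results.
report = summarize_gap(
    {"dataset_a": 0.50, "dataset_b": 0.75},
    {"dataset_a": 0.75, "dataset_b": 0.75},
)
print(report["improved_on"], "of", report["datasets"], "datasets improved")
```

A mean gain near zero with few improved datasets would be the kind of outcome described above. A real study would also need matched evaluation protocols and a significance test, which this sketch leaves out.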

Original abstract

The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid "re-inventing the wheel" as we start revisiting it in the era of multi-modal foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a survey on abstract concept recognition in video understanding. It reviews tasks and datasets for recognizing high-level abstract concepts (e.g., justice, freedom, togetherness) that require contextual and multi-level reasoning beyond concrete visual entities. The central claim is that recent foundation models create an ideal setting for this challenge and that the community should draw on decades of prior research attempts to avoid reinventing the wheel.

Significance. If the survey provides a representative overview of historical tasks/datasets and identifies transferable lessons, it could usefully orient future work on aligning video models with human-like abstract reasoning. The directional argument linking foundation models to this open problem is timely and could help consolidate community efforts.

major comments (2)

[Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.
[Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.

minor comments (2)

Clarify the taxonomy or categorization scheme used to group the surveyed tasks and datasets for easier navigation.
Ensure all cited prior works include publication years and venues in the reference list for historical context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for minor revision. We address the major comments below and will update the manuscript to incorporate the suggested improvements.

Referee: [Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.

Authors: We acknowledge that explicit documentation of the survey methodology is important for establishing the representativeness of the reviewed tasks and datasets. In the revised version, we will introduce a 'Survey Scope and Methodology' subsection in the Introduction. This will specify the inclusion criteria (focusing on tasks that involve reasoning about abstract concepts requiring contextual, multi-level understanding beyond concrete visual elements), the search and selection process, quantitative coverage (e.g., total papers, tasks, and datasets surveyed), and a balanced discussion of limitations and gaps, such as potential biases toward certain video domains or under-explored abstract concepts. This addition will provide a solid foundation for the claims and guide future research. revision: yes
Referee: [Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.

Authors: We agree that providing concrete linkages will make the argument more compelling. We will enhance the review section by adding targeted examples and a summary mapping. Specifically, we will highlight cases where earlier methods (e.g., in video emotion recognition or social scene understanding) were limited by insufficient modeling of long-term context or multimodal integration, and contrast this with foundation models' strengths in self-attention, large-scale pretraining, and zero-shot generalization. A new table or bullet-point summaries will map 4-5 representative limitations to corresponding FM capabilities, while maintaining a cautious tone that these are promising directions rather than guaranteed successes. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

This is a survey paper with no derivations, equations, fitted parameters, or new technical predictions. Its central claim is directional—that foundation models create an ideal setting for abstract concept recognition in video and that prior community experience should inform the effort—without reducing any result to quantities defined by its own choices or self-citations. All referenced tasks, datasets, and prior approaches are drawn from external literature rather than constructed internally, making the paper self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The survey rests on the domain assumption that abstract concepts are recognizable from video context and that historical attempts provide reusable insights; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Reasoning on multiple semantic levels based on contextual information is key to abstract concept recognition in video.
Invoked in the abstract, which states that abstract concept recognition forms a crucial open challenge where such reasoning is key.

pith-pipeline@v0.9.0 · 5752 in / 1134 out tokens · 37565 ms · 2026-05-18T20:23:31.947987+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

taxonomy of abstract concepts … Perception Understanding, Emotions and Social Signals, Narrative & Rhetorical Analysis … datasets such as AVA, MovieGraphs, LVU
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

foundation models … cross-modal understanding … large-scale training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Garcia, N., Vogiatzis, G.: How to read paintings: Semantic art understanding with multi-modal retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018). https: //doi.org/10.1007/978-3-030-11012-3 52

work page doi:10.1007/978-3-030-11012-3 2018
[74]

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 484–492 (2017) https://doi.org/ 10.1109/cvpr.2017.59

Alameda-Pineda, X., Pilzer, A., Xu, D., Sebe, N., Ricci, E.: Viraliency: Pooling local virality. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 484–492 (2017) https://doi.org/ 10.1109/cvpr.2017.59

work page doi:10.1109/cvpr.2017.59 2017
[75]

: Micro tells macro: Predicting the popularity of micro-videos via a transductive model

Chen, J., Song, X., Nie, L., Wang, X., Zhang, H., et al. : Micro tells macro: Predicting the popularity of micro-videos via a transductive model. Proceedings of the 24th ACM international conference on Multimedia (2016) https://doi.org/10.1007/ s00530-020-00660-x

work page 2016
[76]

Proceed- ings of International Conference on Multi- media Retrieval (2014) https://doi.org/10

Jiang, L., Miao, Y., Yang, Y., Lan, Z., Hauptmann, A.: Viral video style: A closer look at viral videos on youtube. Proceed- ings of International Conference on Multi- media Retrieval (2014) https://doi.org/10. 1145/2578726.2578754

work page arXiv 2014
[77]

In: Proceedings of the Fourth ACM Interna- tional Conference on Web Search and Data Mining, pp

Figueiredo, F., Benevenuto, F., Almeida, J.M.: The tube over time: characterizing popularity growth of youtube videos. In: Proceedings of the Fourth ACM Interna- tional Conference on Web Search and Data Mining, pp. 745–754 (2011). https://doi. org/10.1109/tnsm.2019.2914222

work page doi:10.1109/tnsm.2019.2914222 2011
[78]

In: Proceedings of the International AAAI Conference on Web and Social Media, vol

Lakkaraju, H., McAuley, J., Leskovec, J.: What’s in a name? understanding the inter- play between titles, content, and communi- ties in social media. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, pp. 311–320 (2013). https://doi.org/10.1609/icwsm.v7i1.14408

work page doi:10.1609/icwsm.v7i1.14408 2013
[79]

In: AAAI Conference on Artificial Intelligence (2020)

Pang, B., Zha, K., Zhang, Y., Lu, C.: Fur- ther understanding videos through adverbs: A new video task. In: AAAI Conference on Artificial Intelligence (2020). https://doi. org/10.1609/aaai.v34i07.6855

work page doi:10.1609/aaai.v34i07.6855 2020
[80]

In: CVPR

Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., et al.: imigue: An identity-free video dataset for micro-gesture understanding and emo- tion analysis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), 10626–10637 (2021) https: //doi.org/10.1109/cvpr46437.2021.01049

work page doi:10.1109/cvpr46437.2021.01049 2021
[81]

ArXiv abs/2405.00574 (2024) https://doi.org/ 10.1109/uemcon62879.2024.10754673

Li, D., Liu, X., Xing, B., Xia, B., Zong, Y., et al.: Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model. ArXiv abs/2405.00574 (2024) https://doi.org/ 10.1109/uemcon62879.2024.10754673

work page doi:10.1109/uemcon62879.2024.10754673 2024
[82]

: How would the viewer feel? estimating wellbeing from video sce- narios

Mazeika, M., Tang, E., Zou, A., Basart, S., Chan, J.S., et al. : How would the viewer feel? estimating wellbeing from video sce- narios. Advances in Neural Information Pro- cessing Systems 35, 18571–18585 (2022) https://doi.org/10.4324/9780367855383-2

work page doi:10.4324/9780367855383-2 2022
[83]

In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2024)

Ren, Z., Ortega, J., Wang, Y., Chen, Z., Whitney, D., et al.: Veatic: Video-based emotion and affect tracking in context dataset. 2024 IEEE/CVF Winter Confer- ence on Applications of Computer Vision (WACV), 4455–4465 (2023) https://doi. org/10.1109/wacv57701.2024.00441

work page doi:10.1109/wacv57701.2024.00441 2024
[84]

In: CVPR

Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., Guibas, L.J.: Artemis: Affective language for visual art. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11569–11579 (2021). https:// doi.org/10.1109/cvpr46437.2021.01140

work page doi:10.1109/cvpr46437.2021.01140 2021
[85]

Sample4Geo : Hard negative sampling for cross-view geo-localisation

Yang, J., Huang, Q., Ding, T., Lischin- ski, D., Cohen-Or, D., et al.: Emoset: A large-scale visual emotion dataset with rich attributes. 2023 IEEE/CVF Inter- national Conference on Computer Vision 31 (ICCV), 20326–20337 (2023) https://doi. org/10.1109/iccv51070.2023.01864

work page doi:10.1109/iccv51070.2023.01864 2023
[86]

In: Conference on Multimedia Modeling (2018)

Lv, J., Liu, W., Zhou, L., Wu, B., Ma, H.: Multi-stream fusion model for social rela- tion recognition from videos. In: Conference on Multimedia Modeling (2018). https:// doi.org/10.1007/978-3-319-73603-7 29

work page doi:10.1007/978-3-319-73603-7 2018
[87]

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3561–3569 (2019) https://doi.org/ 10.1109/cvpr.2019.00368

Liu, X., Liu, W., Zhang, M., Chen, J., Gao, L., et al.: Social relation recognition from videos via multi-scale spatial-temporal reasoning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3561–3569 (2019) https://doi.org/ 10.1109/cvpr.2019.00368

work page doi:10.1109/cvpr.2019.00368 2019



[55] [61]

arXiv preprint arXiv:1803.08489 (2018) https://doi.org/10

Lin, H., Hosu, V., Saupe, D.: Koniq- 10k: Towards an ecologically valid and large-scale iqa database. arXiv preprint arXiv:1803.08489 (2018) https://doi.org/10. 1109/tip.2020.2967829

work page arXiv 2018

[56] [62]

Momentum contrast for unsupervised visual representation learning

Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smart- phone photography. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020). https://doi.org/10.1109/cvpr42600. 2020.00373

work page doi:10.1109/cvpr42600 2020

[57] [63]

Oona Rainio, Jarmo Teuho, and Riku Klén

HariniS, I., Singh, S., Singla, Y.K., Bhat- tacharyya, A., Baths, V., et al.: Long-term ad memorability: Understanding & generat- ing memorable ads. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5707–5718 (2023) https: //doi.org/10.1109/wacv61041.2025.00557

work page doi:10.1109/wacv61041.2025.00557 2025

[58] [64]

: Automatic under- standing of image and video advertisements

Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., et al. : Automatic under- standing of image and video advertisements. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1705–1715 (2017). https://doi.org/10. 1109/cvpr.2017.123

work page 2017

[59] [65]

YouTube-8M: A Large-Scale Video Classification Benchmark

Abu-El-Haija, S., Kothari, N., Lee, J., Nat- sev, P., Toderici, G., et al.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016) https://doi.org/10.48550/arXiv.1609.08675

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1609.08675 2016

[60] [66]

In: CVPR

Wu, C., Kr¨ ahenb¨ uhl, P.: Towards long-form video understanding. 2021 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), 1884–1894 (2021) https://doi.org/10.1109/cvpr46437.2021. 00192

work page doi:10.1109/cvpr46437.2021 2021

[61] [67]

Proceedings of the 29th ACM International Conference on Multimedia (2021) https://doi.org/10.1145/ 3474085.3479222

Wang, Z., Wu, L., Li, Z., Xiong, J., Lu, Q.: Overview of tencent multi-modal ads video understanding. Proceedings of the 29th ACM International Conference on Multimedia (2021) https://doi.org/10.1145/ 3474085.3479222

work page arXiv 2021

[62] [68]

: Mm-au:towards multimodal understanding of advertisement videos

Bose, D., Hebbar, R., Feng, T., Somande- palli, K., Xu, A., et al. : Mm-au:towards multimodal understanding of advertisement videos. Proceedings of the 31st ACM Inter- national Conference on Multimedia (2023) https://doi.org/10.1145/3581783.3612371

work page doi:10.1145/3581783.3612371 2023

[63] [69]

2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv

Zhang, Z., Dou, M., Peng, L., Pan, H., Bagci, U., et al. : Videoads for fast-paced video understanding: Where opensource foundation models beat gpt-4o & gemini- 1.5 pro. arXiv preprint arXiv:2504.09282 (2025) https://doi.org/10.48550/ARXIV. 2504.09282

work page internal anchor Pith review doi:10.48550/arxiv 2025

[64] [70]

Masked feature prediction for self-supervised visual pre-training

Gupta, V., Mittal, T., Mathur, P., Mishra, V., Maheshwari, M., et al.: 3massiv: Multilingual, multimodal and multi-aspect dataset of social media short videos. 2022 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 21032–21043 (2022) https://doi. org/10.1109/cvpr52688.2022.02039

work page doi:10.1109/cvpr52688.2022.02039 2022

[65] [71]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., et al. : Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) https://doi.org/10. 30 48550/arXiv.2111.02114

work page internal anchor Pith review Pith/arXiv arXiv 2021

[66] [73]

In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)

Garcia, N., Vogiatzis, G.: How to read paintings: Semantic art understanding with multi-modal retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018). https: //doi.org/10.1007/978-3-030-11012-3 52

work page doi:10.1007/978-3-030-11012-3 2018

[67] [74]

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 484–492 (2017) https://doi.org/ 10.1109/cvpr.2017.59

Alameda-Pineda, X., Pilzer, A., Xu, D., Sebe, N., Ricci, E.: Viraliency: Pooling local virality. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 484–492 (2017) https://doi.org/ 10.1109/cvpr.2017.59

work page doi:10.1109/cvpr.2017.59 2017

[68] [75]

: Micro tells macro: Predicting the popularity of micro-videos via a transductive model

Chen, J., Song, X., Nie, L., Wang, X., Zhang, H., et al. : Micro tells macro: Predicting the popularity of micro-videos via a transductive model. Proceedings of the 24th ACM international conference on Multimedia (2016) https://doi.org/10.1007/ s00530-020-00660-x

work page 2016

[69] [76]

Proceed- ings of International Conference on Multi- media Retrieval (2014) https://doi.org/10

Jiang, L., Miao, Y., Yang, Y., Lan, Z., Hauptmann, A.: Viral video style: A closer look at viral videos on youtube. Proceed- ings of International Conference on Multi- media Retrieval (2014) https://doi.org/10. 1145/2578726.2578754

work page arXiv 2014

[70] [77]

In: Proceedings of the Fourth ACM Interna- tional Conference on Web Search and Data Mining, pp

Figueiredo, F., Benevenuto, F., Almeida, J.M.: The tube over time: characterizing popularity growth of youtube videos. In: Proceedings of the Fourth ACM Interna- tional Conference on Web Search and Data Mining, pp. 745–754 (2011). https://doi. org/10.1109/tnsm.2019.2914222

work page doi:10.1109/tnsm.2019.2914222 2011

[71] [78]

In: Proceedings of the International AAAI Conference on Web and Social Media, vol

Lakkaraju, H., McAuley, J., Leskovec, J.: What’s in a name? understanding the inter- play between titles, content, and communi- ties in social media. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, pp. 311–320 (2013). https://doi.org/10.1609/icwsm.v7i1.14408

work page doi:10.1609/icwsm.v7i1.14408 2013

[72] [79]

In: AAAI Conference on Artificial Intelligence (2020)

Pang, B., Zha, K., Zhang, Y., Lu, C.: Fur- ther understanding videos through adverbs: A new video task. In: AAAI Conference on Artificial Intelligence (2020). https://doi. org/10.1609/aaai.v34i07.6855

[73] [80]

Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., et al.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10626–10637 (2021) https://doi.org/10.1109/cvpr46437.2021.01049

[74] [81]

Li, D., Liu, X., Xing, B., Xia, B., Zong, Y., et al.: Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model. ArXiv abs/2405.00574 (2024) https://doi.org/10.1109/uemcon62879.2024.10754673

[75] [82]

Mazeika, M., Tang, E., Zou, A., Basart, S., Chan, J.S., et al.: How would the viewer feel? estimating wellbeing from video scenarios. Advances in Neural Information Processing Systems 35, 18571–18585 (2022) https://doi.org/10.4324/9780367855383-2

[76] [83]

Ren, Z., Ortega, J., Wang, Y., Chen, Z., Whitney, D., et al.: Veatic: Video-based emotion and affect tracking in context dataset. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4455–4465 (2023) https://doi.org/10.1109/wacv57701.2024.00441

[77] [84]

Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., Guibas, L.J.: Artemis: Affective language for visual art. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11569–11579 (2021). https://doi.org/10.1109/cvpr46437.2021.01140

[78] [85]

Yang, J., Huang, Q., Ding, T., Lischinski, D., Cohen-Or, D., et al.: Emoset: A large-scale visual emotion dataset with rich attributes. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 20326–20337 (2023) https://doi.org/10.1109/iccv51070.2023.01864

[79] [86]

Lv, J., Liu, W., Zhou, L., Wu, B., Ma, H.: Multi-stream fusion model for social relation recognition from videos. In: Conference on Multimedia Modeling (2018). https://doi.org/10.1007/978-3-319-73603-7_29

[80] [87]

Liu, X., Liu, W., Zhang, M., Chen, J., Gao, L., et al.: Social relation recognition from videos via multi-scale spatial-temporal reasoning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3561–3569 (2019) https://doi.org/10.1109/cvpr.2019.00368
