Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Pith reviewed 2026-05-18 20:23 UTC · model grok-4.3
The pith
Foundation models offer an ideal opportunity to recognize abstract concepts like justice and freedom in videos by building on past research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. The authors argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. The survey examines different tasks and datasets used to understand abstract concepts in video content and advocates that drawing on decades of community experience will help shed light on this important open grand challenge and avoid re-inventing the wheel.
What carries the argument
A survey of tasks and datasets for abstract concept recognition in videos. It shows that researchers have returned to the problem periodically, each time with the tools available to them, and it positions foundation models as the next step.
If this is right
- Video understanding systems would move beyond detecting visible objects and actions to reasoning about high-level ideas.
- Models would align more closely with human reasoning and values through contextual multi-level analysis.
- Prior research on abstract tasks can be reused rather than restarted when applying new multimodal foundation models.
- The field gains a clearer path to solving an open grand challenge without repeating past cycles of effort.
Where Pith is reading between the lines
- Video search and recommendation tools could retrieve content by themes such as togetherness or freedom instead of surface features alone.
- This line of work might extend to ethical AI systems that interpret social implications in visual media.
- Direct comparisons of foundation models against historical dataset results could test whether they truly close the gap in contextual reasoning.
Load-bearing premise
Two assumptions: that the tasks and datasets covered in the survey adequately represent the problem of abstract concept recognition, and that foundation models can overcome prior limits without facing new fundamental barriers.
What would settle it
An experiment in which foundation models show no improvement over earlier specialized methods when tested on the surveyed abstract concept video datasets would undermine the central claim.
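The comparison described above can be sketched in a few lines. This is a minimal illustration, not a protocol from the paper: the function name `compare_scores`, the dataset names, and the scores are all hypothetical placeholders, and it assumes each dataset reports one score where higher is better.

```python
from typing import Dict, Tuple


def compare_scores(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Tuple[Dict[str, float], bool]:
    """Return per-dataset score deltas and whether any dataset improved.

    Both inputs map a dataset name to a score where higher is better.
    Only datasets present in both mappings are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    any_improvement = any(delta > 0 for delta in deltas.values())
    return deltas, any_improvement


# Placeholder names and numbers for illustration only, not results from the paper.
earlier_methods = {"dataset_a": 0.50, "dataset_b": 0.75}
foundation_model = {"dataset_a": 0.50, "dataset_b": 0.50}

deltas, improved = compare_scores(earlier_methods, foundation_model)
print(deltas)
print(improved)
```

With these placeholder inputs no dataset improves, which is the outcome that would count against the central claim. A real study would also need matched evaluation splits and some measure of variance, which this sketch leaves out.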
Original abstract
The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid "re-inventing the wheel" as we start revisiting it in the era of multi-modal foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey on abstract concept recognition in video understanding. It reviews tasks and datasets for recognizing high-level abstract concepts (e.g., justice, freedom, togetherness) that require contextual and multi-level reasoning beyond concrete visual entities. The central claim is that recent foundation models create an ideal setting for this challenge and that the community should draw on decades of prior research attempts to avoid reinventing the wheel.
Significance. If the survey provides a representative overview of historical tasks/datasets and identifies transferable lessons, it could usefully orient future work on aligning video models with human-like abstract reasoning. The directional argument linking foundation models to this open problem is timely and could help consolidate community efforts.
major comments (2)
- [Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.
- [Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.
minor comments (2)
- Clarify the taxonomy or categorization scheme used to group the surveyed tasks and datasets for easier navigation.
- Ensure all cited prior works include publication years and venues in the reference list for historical context.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and the recommendation for minor revision. We address the major comments below and will update the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [Abstract] Abstract and introduction: the claim that surveyed tasks/datasets are sufficiently representative of abstract concept recognition (and that foundation models will succeed where prior approaches fell short) is presented without explicit inclusion criteria, coverage statistics, or discussion of potential gaps; this is load-bearing for the survey's utility as a foundation for new work.
  Authors: We acknowledge that explicit documentation of the survey methodology is important for establishing the representativeness of the reviewed tasks and datasets. In the revised version, we will introduce a 'Survey Scope and Methodology' subsection in the Introduction. This will specify the inclusion criteria (focusing on tasks that involve reasoning about abstract concepts requiring contextual, multi-level understanding beyond concrete visual elements), the search and selection process, quantitative coverage (e.g., total papers, tasks, and datasets surveyed), and a balanced discussion of limitations and gaps, such as potential biases toward certain video domains or under-explored abstract concepts. This addition will provide a solid foundation for the claims and guide future research.
  Revision: yes
- Referee: [Tasks and datasets review] The advocacy for drawing on 'decades of community experience' would be strengthened by concrete mappings in the tasks/datasets review section showing which specific limitations of earlier methods (e.g., lack of context modeling) are directly addressable by current foundation-model capabilities.
  Authors: We agree that providing concrete linkages will make the argument more compelling. We will enhance the review section by adding targeted examples and a summary mapping. Specifically, we will highlight cases where earlier methods (e.g., in video emotion recognition or social scene understanding) were limited by insufficient modeling of long-term context or multimodal integration, and contrast this with foundation models' strengths in self-attention, large-scale pretraining, and zero-shot generalization. A new table or bullet-point summaries will map 4-5 representative limitations to corresponding FM capabilities, while maintaining a cautious tone that these are promising directions rather than guaranteed successes.
  Revision: yes
Circularity Check
No significant circularity identified
Full rationale
This is a survey paper with no derivations, equations, fitted parameters, or new technical predictions. Its central claim is directional—that foundation models create an ideal setting for abstract concept recognition in video and that prior community experience should inform the effort—without reducing any result to quantities defined by its own choices or self-citations. All referenced tasks, datasets, and prior approaches are drawn from external literature rather than constructed internally, making the paper self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reasoning on multiple semantic levels based on contextual information is key to abstract concept recognition in video.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: taxonomy of abstract concepts … Perception Understanding, Emotions and Social Signals, Narrative & Rhetorical Analysis … datasets such as AVA, MovieGraphs, LVU
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: foundation models … cross-modal understanding … large-scale training
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.