Recognition: 2 theorem links
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Pith reviewed 2026-05-17 03:22 UTC · model grok-4.3
The pith
Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Freezing the language encoder acquired through video-language pretraining, then training additional modality encoders against its features with contrastive learning, maps all modalities into a single feature space. Language functions as the intermediary for multi-modal semantic alignment, letting the framework scale from two modalities to N modalities that include audio, depth, and infrared. The VIDAL-10M dataset supplies the language-centered alignment pairs required for this training.
What carries the argument
LanguageBind, the procedure that uses contrastive learning to map every modality's features into the fixed feature space of a frozen language encoder.
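For concreteness, a minimal sketch of this binding step in PyTorch follows. It assumes CLIP-style encoders and a symmetric InfoNCE objective; the `text_encoder`, `infrared_encoder`, `loader`, and `optimizer` names in the commented loop are illustrative stand-ins, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def binding_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between new-modality features and frozen language features."""
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Only the new modality encoder receives gradients; the language anchor stays frozen.
# for infrared_batch, text_batch in loader:           # hypothetical data loader
#     with torch.no_grad():
#         text_emb = text_encoder(text_batch)         # frozen anchor
#     loss = binding_loss(infrared_encoder(infrared_batch), text_emb)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```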
If this is right
- Audio, depth, and infrared encoders acquire semantic alignment solely through their contrastive links to language.
- The same training recipe can add any new modality without requiring paired data between the new modality and existing non-language modalities (the retrieval sketch after this list illustrates the implied indirect alignment).
- Unified representations improve performance on retrieval and classification tasks across video, audio, depth, and infrared benchmarks.
- Modalities become complementary in downstream applications because each contributes information routed through the common language space.
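If indirect alignment holds, two modality encoders trained only against language can be compared directly. A hedged sketch, assuming both encoders already output embeddings in the shared language-anchored space:

```python
import torch
import torch.nn.functional as F

def rank_audio_by_infrared(infrared_emb: torch.Tensor, audio_embs: torch.Tensor):
    """Rank audio clips against one infrared query in the shared space.

    No infrared-audio pairs were seen in training; the comparison is meaningful
    only because both encoders were bound to the same frozen language features.
    """
    q = F.normalize(infrared_emb, dim=-1)    # shape (d,)
    k = F.normalize(audio_embs, dim=-1)      # shape (N, d)
    scores = k @ q                           # cosine similarities, shape (N,)
    return scores.argsort(descending=True)   # indices of best-matching clips
```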
Where Pith is reading between the lines
- The same frozen-language anchor could be used to incorporate additional sensor streams such as thermal or LiDAR data without redesigning the alignment objective.
- Scaling laws for adding modalities might be measured by tracking how retrieval performance changes as more encoders are trained sequentially against the same language space.
- Applications that already rely on video-language models could gain infrared or depth understanding by simply attaching a new encoder and continuing contrastive training on modest new data.
Load-bearing premise
The language encoder trained only on video-text pairs already contains sufficiently rich semantics to serve as an effective binding anchor for infrared, depth, and audio without direct cross-modal supervision between those modalities.
What would settle it
If contrastive training against a randomly initialized language encoder produces the same retrieval accuracy on infrared-to-video and depth-to-audio tasks as training against the pretrained language encoder, the semantic-binding role of language would be refuted.
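A sketch of that control experiment, under the assumption that `train_binding` and `recall_at_1` wrap the contrastive training and retrieval evaluation sketched earlier (both are hypothetical helpers, not existing APIs):

```python
def settling_experiment(pretrained_text_enc, random_text_enc, train_data, eval_set):
    """Compare binding against a pretrained vs. randomly initialized language anchor."""
    results = {}
    for name, anchor in [("pretrained", pretrained_text_enc),
                         ("random-init", random_text_enc)]:
        infrared_enc = train_binding(anchor, train_data)      # hypothetical: contrastive step as above
        results[name] = recall_at_1(infrared_enc, eval_set)   # hypothetical: e.g. infrared-to-video R@1
    return results  # comparable scores would refute the semantic-binding role
```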
Original abstract
The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities. Code address: https://github.com/PKU-YuanGroup/LanguageBind
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LanguageBind to extend video-language pretraining to N modalities (N>=3) by freezing a language encoder pretrained on video-text pairs and training new encoders for infrared, depth, and audio via contrastive learning against it. This creates a shared semantic space using language as the binding modality. The authors introduce the VIDAL-10M dataset of aligned video/infrared/depth/audio/language pairs collected from short videos. They claim superior results on 15 benchmarks across video, audio, depth, and infrared, plus experimental evidence for indirect alignment and cross-modal complementarity.
Significance. If the results and ablations hold, the work is significant for providing a practical route to N-modal alignment that avoids collecting exhaustive cross-modal pairs. The VIDAL-10M dataset is a concrete community resource, and the extensive evaluation on 15 benchmarks plus code release are positive contributions. The approach builds directly on existing VL contrastive frameworks without introducing new free parameters in the binding step.
Major comments (2)
- §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks, or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training; a probing sketch follows this list.
- §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.
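One way to run the probing suggested in the §3 comment: embed non-visual concept prompts with a frozen CLIP-style text encoder and inspect whether the space separates them before any new-modality training. The OpenCLIP model below is an assumed stand-in for the paper's actual language encoder, and the prompts are illustrative:

```python
import torch
import open_clip

# Assumed stand-in for the frozen language encoder from VL pretraining.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

prompts = [
    "a loud siren wailing",              # acoustic concept
    "a quiet fan humming",               # acoustic concept
    "a hot engine in a thermal image",   # thermal concept
    "a cold metal railing at night",     # thermal concept
]

with torch.no_grad():
    emb = model.encode_text(tokenizer(prompts))
    emb = emb / emb.norm(dim=-1, keepdim=True)
    # Within-domain pairs should score higher than cross-domain pairs if the
    # frozen space carries usable non-visual semantics.
    print((emb @ emb.T).round(decimals=2))
```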
Minor comments (2)
- Abstract: The phrase 'superior performance on a wide range of 15 benchmarks' should be accompanied by at least one or two concrete metric values or benchmark names for immediate clarity.
- Dataset description: Clarify the exact alignment procedure and quality-control steps used to pair infrared/depth/audio with language descriptions in VIDAL-10M; this affects reproducibility of the indirect-alignment results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the opportunity to improve our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where the manuscript will be updated.
Point-by-point responses
- Referee: §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training.
  Authors: We appreciate the referee's emphasis on validating the transferable semantics in the pretrained language encoder. Our manuscript already provides evidence for indirect alignment through multiple experiments demonstrating effective mapping of infrared, depth, and audio modalities to the language space, as well as cross-modal complementarity. To further quantify the semantics present prior to training as suggested, we will add a new subsection with semantic probing of the language embeddings on non-visual concepts and zero-shot transfer results on infrared and depth tasks in the revised manuscript. Revision: yes.
- Referee: §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.
  Authors: We agree that the high-level descriptions in the abstract and introduction would benefit from more specific references to the quantitative results. The full manuscript includes detailed tables in Section 4 with numerical results on all 15 benchmarks, comparisons to baselines, and ablations including those removing the language anchor. In the revision, we will update the abstract and results summary to include key numerical deltas and explicit pointers to these tables and ablation studies to better substantiate the claims. Revision: yes.
Circularity Check
No significant circularity; derivation relies on external pretraining and new dataset.
Full rationale
The paper's core procedure freezes an externally pretrained language encoder (from prior VL work) and applies standard contrastive loss to align new modality encoders on the independently collected VIDAL-10M dataset. No equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. Performance claims rest on empirical evaluation against external benchmarks rather than reducing to the training construction by definition.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: contrastive learning on language-paired data produces semantically meaningful embeddings for non-language modalities.
Invented entities (1)
- VIDAL-10M dataset (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning... all modalities are mapped to a shared feature space"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
  StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
- Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
  SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
- EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
  EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.
- PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
  PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
- Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
  Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.
- MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
  MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
  Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- ReCoVR: Closing the Loop in Interactive Composed Video Retrieval
  ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.
- Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
  TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
- The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
  Contrastive Fusion (ConFu) adds a fused-modality contrastive term to jointly align individual modalities and their combinations, enabling capture of higher-order dependencies like XOR relations while preserving pairwise alignment.
- TempCompass: Do Video LLMs Really Understand Videos?
  TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
  Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.
- Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
  Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- ClimateVID -- Social Media Videos Analysis and Challenges Involved
  Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.