pith. machine review for the scientific record.

arxiv: 2310.01852 · v7 · submitted 2023-10-03 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-modal pretraining · contrastive learning · video-language models · semantic alignment · N-modality extension · cross-modal retrieval · shared embedding space · dataset construction

The pith

Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to extend video-language pretraining to any number of additional modalities by using language as the central binding element. A language encoder pretrained on video-text pairs is frozen, after which encoders for audio, depth, and infrared are trained with contrastive learning so that matching pairs are pulled close in feature space. This produces a unified representation in which modalities align indirectly through their common ties to language rather than through direct pairs among themselves. The authors also release VIDAL-10M, a dataset of short videos with aligned infrared, depth, audio, and language descriptions; the clips come from short-video platforms and carry complete semantics rather than truncated segments of longer videos. Experiments across fifteen benchmarks show improved retrieval and classification results and indicate that the modalities contribute complementary information in the shared space.

Core claim

By freezing the language encoder acquired through video-language pretraining and training additional modality encoders with contrastive learning against language features, all modalities are mapped into a single feature space. This achieves multi-modal semantic alignment in which language functions as the intermediary, enabling the framework to scale from two modalities to N modalities that include audio, depth, and infrared. The VIDAL-10M dataset supplies the required language-centered alignment pairs for this training process.

What carries the argument

LanguageBind, the procedure that maps every modality encoder to the fixed feature space of a frozen language encoder through contrastive learning.
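The binding step can be made concrete with a small sketch. This is a simplified NumPy rendering of a symmetric InfoNCE objective, not the paper's implementation; the batch size, feature dimension, and temperature below are illustrative, and in the actual method the language features come from the frozen text tower and receive no gradient.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit sphere, as in CLIP-style training."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def infonce_loss(modality_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of (modality, language) pairs.

    Matching pairs lie on the diagonal of the similarity matrix; the loss
    pulls them together and pushes every mismatched pair apart.
    """
    zm, zl = l2_normalize(modality_feats), l2_normalize(language_feats)
    logits = zm @ zl.T / temperature              # (B, B) scaled cosine sims
    idx = np.arange(len(logits))

    def xent(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # one retrieval direction per term: modality->text and text->modality
    return 0.5 * (xent(logits) + xent(logits.T))
```

A batch whose modality features already sit near their captions' language features yields a near-zero loss, while a permuted batch yields a loss near the log of the batch size.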

If this is right

  • Audio, depth, and infrared encoders acquire semantic alignment solely through their contrastive links to language.
  • The same training recipe can add any new modality without requiring paired data between the new modality and existing non-language modalities.
  • Unified representations improve performance on retrieval and classification tasks across video, audio, depth, and infrared benchmarks.
  • Modalities become complementary in downstream applications because each contributes information routed through the common language space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same frozen-language anchor could be used to incorporate additional sensor streams such as thermal or LiDAR data without redesigning the alignment objective.
  • Scaling laws for adding modalities might be measured by tracking how retrieval performance changes as more encoders are trained sequentially against the same language space.
  • Applications that already rely on video-language models could gain infrared or depth understanding by simply attaching a new encoder and continuing contrastive training on modest new data.
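The "attach a new encoder and continue contrastive training" recipe in the last bullet can be sketched end to end. This is a toy NumPy version under loud assumptions: the frozen language anchors are random stand-ins for the text tower's caption embeddings, the "LiDAR" stream is hypothetical, the new encoder is a single linear map, and a plain L2 alignment objective stands in for the paper's contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, lang_dim, sensor_dim, steps, lr = 32, 24, 40, 200, 1e-3

# Frozen anchors: stand-ins for the pretrained text tower's caption
# embeddings. They are never updated -- that is the LanguageBind constraint.
language_anchor = rng.normal(size=(batch, lang_dim))

# Raw features from a hypothetical new sensor (say, LiDAR) paired only with
# the captions -- no pairs against audio/depth/infrared are needed.
lidar_raw = (language_anchor @ rng.normal(size=(lang_dim, sensor_dim))
             + 0.1 * rng.normal(size=(batch, sensor_dim)))

# New trainable encoder: one linear map, pulled toward the frozen anchors.
W = 0.01 * rng.normal(size=(sensor_dim, lang_dim))
losses = []
for _ in range(steps):
    err = lidar_raw @ W - language_anchor      # residual in the shared space
    losses.append(float((err ** 2).mean()))
    W -= lr * (lidar_raw.T @ err) / batch      # gradient step on the encoder
```

Because the anchor never moves, previously aligned modalities are untouched while the new encoder drifts toward the shared space; the loss trace decreases monotonically in this toy setting.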

Load-bearing premise

The language encoder trained only on video-text pairs already contains sufficiently rich semantics to serve as an effective binding anchor for infrared, depth, and audio without direct cross-modal supervision between those modalities.

What would settle it

If contrastive training against a randomly initialized language encoder produces the same retrieval accuracy on infrared-to-video and depth-to-audio tasks as training against the pretrained language encoder, the semantic-binding role of language would be refuted.
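The decisive comparison would be scored with a standard cross-modal retrieval metric. A minimal recall@1 implementation (a generic sketch, not the paper's evaluation code; index-aligned query/gallery pairs are assumed):

```python
import numpy as np

def recall_at_1(query_feats, gallery_feats):
    """Fraction of queries whose nearest gallery item by cosine similarity
    is the ground-truth match at the same index."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)
    return float((nearest == np.arange(len(q))).mean())
```

Running this twice, once with encoders trained against the pretrained language anchor and once against a randomly initialized one, on infrared-to-video and depth-to-audio pairs would produce the head-to-head numbers the criterion asks for.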

read the original abstract

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities. Code address: https://github.com/PKU-YuanGroup/LanguageBind

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LanguageBind to extend video-language pretraining to N modalities (N>=3) by freezing a language encoder pretrained on video-text pairs and training new encoders for infrared, depth, and audio via contrastive learning against it. This creates a shared semantic space using language as the binding modality. The authors introduce the VIDAL-10M dataset of aligned video/infrared/depth/audio/language pairs collected from short videos. They claim superior results on 15 benchmarks across video, audio, depth, and infrared, plus experimental evidence for indirect alignment and cross-modal complementarity.

Significance. If the results and ablations hold, the work is significant for providing a practical route to N-modal alignment that avoids collecting exhaustive cross-modal pairs. The VIDAL-10M dataset is a concrete community resource, and the extensive evaluation on 15 benchmarks plus code release are positive contributions. The approach builds directly on existing VL contrastive frameworks without introducing new free parameters in the binding step.

major comments (2)
  1. [§3] §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training.
  2. [§4] §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'superior performance on a wide range of 15 benchmarks' should be accompanied by at least one or two concrete metric values or benchmark names for immediate clarity.
  2. [Dataset section] Dataset description: Clarify the exact alignment procedure and quality control steps used to pair infrared/depth/audio with language descriptions in VIDAL-10M; this affects reproducibility of the indirect-alignment results.
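The probe asked for in major comment 1 could take a simple form: embed class names with the frozen text tower, embed modality samples with the candidate encoder, and classify by nearest language embedding. A generic sketch (the embeddings below are synthetic stand-ins; nothing here is the paper's code):

```python
import numpy as np

def zero_shot_classify(sample_feats, class_text_feats):
    """Assign each sample to the class whose (frozen) language embedding is
    nearest by cosine similarity. High accuracy before any contrastive
    training on the modality would indicate the text tower already carries
    usable semantics for it; chance-level accuracy would not."""
    s = sample_feats / np.linalg.norm(sample_feats, axis=1, keepdims=True)
    c = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    return (s @ c.T).argmax(axis=1)
```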

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to improve our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where the manuscript will be updated.

read point-by-point responses
  1. Referee: §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training.

    Authors: We appreciate the referee's emphasis on validating the transferable semantics in the pretrained language encoder. Our manuscript already provides evidence for indirect alignment through multiple experiments demonstrating effective mapping of infrared, depth, and audio modalities to the language space, as well as cross-modal complementarity. To further quantify the semantics present prior to training as suggested, we will add a new subsection with semantic probing of the language embeddings on non-visual concepts and zero-shot transfer results on infrared and depth tasks in the revised manuscript. revision: yes

  2. Referee: §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.

    Authors: We agree that the high-level descriptions in the abstract and introduction would benefit from more specific references to the quantitative results. The full manuscript includes detailed tables in Section 4 with numerical results on all 15 benchmarks, comparisons to baselines, and ablations including those removing the language anchor. In the revision, we will update the abstract and results summary to include key numerical deltas and explicit pointers to these tables and ablation studies to better substantiate the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on external pretraining and a newly collected dataset.

full rationale

The paper's core procedure freezes an externally pretrained language encoder (from prior VL work) and applies standard contrastive loss to align new modality encoders on the independently collected VIDAL-10M dataset. No equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. Performance claims rest on empirical evaluation against external benchmarks rather than reducing to the training construction by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method assumes language embeddings from VL pretraining already encode semantics rich enough to align other modalities indirectly; the new dataset is the main empirical contribution.

axioms (1)
  • domain assumption Contrastive learning on language-paired data produces semantically meaningful embeddings for non-language modalities
    Invoked when the language encoder is frozen and other encoders are trained to match it.
invented entities (1)
  • VIDAL-10M dataset (no independent evidence)
    purpose: Supply aligned video, infrared, depth, audio, and language pairs for N-modality training
    Newly collected dataset introduced to support the method; no external validation cited in the abstract.

pith-pipeline@v0.9.0 · 5600 in / 1333 out tokens · 65290 ms · 2026-05-17T03:22:41.992852+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.

  2. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  3. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.

  4. PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

  5. Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.

  6. MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

    cs.CV 2025-12 conditional novelty 7.0

    MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.

  7. Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

    cs.CV 2025-11 unverdicted novelty 7.0

    Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

  8. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  9. ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

    cs.IR 2026-05 unverdicted novelty 6.0

    ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.

  10. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 6.0

    EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradien...

  11. Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

    cs.SD 2026-04 unverdicted novelty 6.0

    TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.

  12. The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

    cs.CV 2025-11 unverdicted novelty 6.0

    Contrastive Fusion (ConFu) adds a fused-modality contrastive term to jointly align individual modalities and their combinations, enabling capture of higher-order dependencies like XOR relations while preserving pairwi...

  13. TempCompass: Do Video LLMs Really Understand Videos?

    cs.CV 2024-03 unverdicted novelty 6.0

    TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

  14. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  15. Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.

  16. Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

    cs.IR 2026-03 unverdicted novelty 5.0

    Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.

  17. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  18. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  19. ClimateVID -- Social Media Videos Analysis and Challenges Involved

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.

  20. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

202 extracted references · 202 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017

  2. [3]

    Convolutional neural networks for static and dynamic breast infrared imaging classification

    Matheus de Freitas Oliveira Baffa and Lucas Grassano Lattari. Convolutional neural networks for static and dynamic breast infrared imaging classification. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp.\ 174--181. IEEE, 2018

  3. [4]

    Interactive intrinsic video editing

    Nicolas Bonneel, Kalyan Sunkavalli, James Tompkin, Deqing Sun, Sylvain Paris, and Hanspeter Pfister. Interactive intrinsic video editing. ACM Transactions on Graphics (TOG), 33(6): 1--10, 2014

  4. [5]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pp.\ 961--970, 2015

  5. [6]

    Estimating depth from monocular images as classification using deep fully convolutional residual networks

    Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(11): 3174--3182, 2017

  6. [7]

    Simplifying video editing using metadata

    Juan Casares, A Chris Long, Brad A Myers, Rishi Bhatnagar, Scott M Stevens, Laura Dabbish, Dan Yocum, and Albert Corbett. Simplifying video editing using metadata. In Proceedings of the 4th conference on Designing interactive systems: processes, practices, methods, and techniques, pp.\ 157--166, 2002

  7. [8]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3558--3568, 2021

  8. [9]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp.\ 190--200, 2011

  9. [13]

    Content-based video recommendation system based on stylistic visual features

    Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics, 5: 99--113, 2016

  10. [14]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009

  11. [16]

    Freesound technical demo

    Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp.\ 411--412, 2013

  12. [17]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 776--780. IEEE, 2017

  13. [18]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15180--15190, 2023

  14. [19]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022

  15. [20]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6047--6056, 2018

  16. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  17. [22]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000--16009, 2022

  18. [25]

    Llvip: A visible-infrared paired dataset for low-light vision

    Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3496--3504, 2021

  19. [26]

    Large-scale video classification with convolutional neural networks

    Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.\ 1725--1732, 2014

  20. [28]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, pp. 2, 2019

  21. [29]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

  22. [31]

    Mmact: A large-scale dataset for cross modal human action understanding

    Quan Kong, Ziming Wu, Ziwei Deng, Martin Klinkigt, Bin Tong, and Tomokazu Murakami. Mmact: A large-scale dataset for cross modal human action understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8658--8667, 2019

  23. [32]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556--2563. IEEE, 2011

  24. [33]

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels

    Dong-Guw Lee, Myung-Hwan Jeon, Younggun Cho, and Ayoung Kim. Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 8291--8298. IEEE, 2023

  25. [35]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23390--23400, 2023b

  26. [36]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740--755. Springer, 2014

  27. [37]

    Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508: 293--304, 2022

  28. [38]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 2630--2640, 2019

  29. [39]

    Learning joint embedding with multimodal cues for cross-modal video-text retrieval

    Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp.\ 19--27, 2018

  30. [40]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In European Conference on Computer Vision, pp.\ 407--426. Springer, 2022

  31. [42]

    Esc: Dataset for environmental sound classification

    Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp.\ 1015--1018, 2015

  32. [43]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  33. [44]

    Recognizing human actions: a local svm approach

    Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 3, pp.\ 32--36. IEEE, 2004

  34. [45]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556--2565, 2018

  35. [47]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pp. 746--760. Springer, 2012

  36. [48]

    Two-stream convolutional networks for action recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014

  37. [49]

    Image and video search engine for the world wide web

    John R Smith and Shih-Fu Chang. Image and video search engine for the world wide web. In Storage and Retrieval for Image and Video Databases V, volume 3022, pp. 84--95. SPIE, 1997

  38. [51]

    Free teledyne flir thermal dataset for algorithm training

    Teledyne FLIR. Free teledyne flir thermal dataset for algorithm training. https://www.flir.com/oem/adas/adas-dataset-form/, 2015a. Accessed: 2023-09-16

  39. [52]

    Free teledyne flir thermal dataset for algorithm training

    Teledyne FLIR. Free teledyne flir thermal dataset for algorithm training. https://www.flir.com/oem/adas/adas-dataset-form/, 2015b. Accessed: 2023-09-16

  40. [53]

    Audio-visual event localization in unconstrained videos

    Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European conference on computer vision (ECCV), pp. 247--263, 2018

  41. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  42. [55]

    Omnivl: One foundation model for image-language and video-language tasks

    Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, and Lu Yuan. Omnivl: One foundation model for image-language and video-language tasks. Advances in neural information processing systems, 35: 5696--5710, 2022a

  43. [58]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  44. [59]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288--5296, 2016

  45. [60]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036--5045, 2022

  46. [65]

    Pointclip: Point cloud understanding by clip

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8552--8562, 2022

  47. [67]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pp. 746--760. Springer, 2012

  48. [68]

    Rethinking CNN models for audio classification

    Rethinking CNN models for audio classification. arXiv preprint arXiv:2007.11154, 2020

  49. [69]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036--5045, 2022

  50. [70]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  51. [71]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2019

  52. [72]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  53. [73]

    ESC: Dataset for environmental sound classification

    Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015--1018, 2015

  54. [74]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748--8763. PMLR, 2021

  55. [75]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [76]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18 (7): 1527--1554, 2006

  57. [77]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016

  58. [78]

    Convolutional neural networks for static and dynamic breast infrared imaging classification

    Convolutional neural networks for static and dynamic breast infrared imaging classification. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2018

  59. [79]

    Learning joint embedding with multimodal cues for cross-modal video-text retrieval

    Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19--27, 2018

  60. [80]

    Large-scale video classification with convolutional neural networks

    Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014

  61. [81]

    Estimating depth from monocular images as classification using deep fully convolutional residual networks

    Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 2017

  62. [82]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  63. [83]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778, 2016

  64. [84]

    Interactive intrinsic video editing

    Interactive intrinsic video editing. ACM Transactions on Graphics (TOG), 2014

  65. [85]

    Simplifying video editing using metadata

    Simplifying video editing using metadata. In Proceedings of the 4th conference on Designing interactive systems: processes, practices, methods, and techniques

  66. [86]

    Large-scale content-only video recommendation

    Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision Workshops

  67. [87]

    Content-based video recommendation system based on stylistic visual features

    Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics, 2016

  68. [88]

    Image and video search engine for the world wide web

    John R Smith and Shih-Fu Chang. Image and video search engine for the world wide web. In Storage and Retrieval for Image and Video Databases V, volume 3022, pp. 84--95. SPIE, 1997

  69. [89]

    Towards optimal bag-of-features for object categorization and semantic video retrieval

    Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM international conference on Image and video retrieval

  70. [90]

    Localizing Moments in Video with Temporal Language

    Localizing moments in video with temporal language. arXiv preprint arXiv:1809.01337, 2018

  71. [91]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  72. [92]

    Global-local path networks for monocular depth estimation with vertical cutdepth

    Global-local path networks for monocular depth estimation with vertical cutdepth. arXiv preprint arXiv:2201.07436, 2022

  73. [93]

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023

  74. [94]

    HMDB: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estibaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision. IEEE, 2011

  75. [95]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  76. [96]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 2015

  77. [97]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 2018

  78. [98]

    Ferv39k: a large-scale multi-scene dataset for facial expression recognition in videos

    Ferv39k: a large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  79. [99]

    Hacs: Human action clips and segments dataset for recognition and temporal localization

    Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  80. [100]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, 2017

Showing first 80 references.