pith. machine review for the scientific record.

arxiv: 2310.01852 · v7 · submitted 2023-10-03 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-modal pretraining · contrastive learning · video-language models · semantic alignment · N-modality extension · cross-modal retrieval · shared embedding space · dataset construction

The pith

Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to extend video-language pretraining to any number of additional modalities by using language as the central binding element. A language encoder pretrained on video-text pairs is frozen, after which encoders for audio, depth, and infrared are trained with contrastive learning so that matching pairs are pulled close in feature space. This produces a unified representation in which modalities align indirectly through their common ties to language rather than through direct pairs among themselves. The authors also release VIDAL-10M, a dataset of short videos with aligned infrared, depth, audio, and language descriptions; the clips come from short-video platforms and carry complete semantics rather than truncated segments of longer videos. Experiments across fifteen benchmarks show improved retrieval and classification results and indicate that the modalities contribute complementary information in the shared space.

Core claim

By freezing the language encoder acquired through video-language pretraining and training additional modality encoders with contrastive learning against language features, all modalities are mapped into a single feature space. This achieves multi-modal semantic alignment in which language functions as the intermediary, enabling the framework to scale from two modalities to N modalities that include audio, depth, and infrared. The VIDAL-10M dataset supplies the required language-centered alignment pairs for this training process.

What carries the argument

LanguageBind, the procedure that maps every modality encoder to the fixed feature space of a frozen language encoder through contrastive learning.
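The binding step can be made concrete with a small sketch. This is a simplified NumPy rendering of a symmetric InfoNCE objective, not the paper's implementation; the batch size, feature dimension, and temperature below are illustrative, and in the actual method the language features come from the frozen text tower and receive no gradient.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit sphere, as in CLIP-style training."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def infonce_loss(modality_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of (modality, language) pairs.

    Matching pairs lie on the diagonal of the similarity matrix; the loss
    pulls them together and pushes every mismatched pair apart.
    """
    zm, zl = l2_normalize(modality_feats), l2_normalize(language_feats)
    logits = zm @ zl.T / temperature              # (B, B) scaled cosine sims
    idx = np.arange(len(logits))

    def xent(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # one retrieval direction per term: modality->text and text->modality
    return 0.5 * (xent(logits) + xent(logits.T))
```

A batch whose modality features already sit near their captions' language features yields a near-zero loss, while a permuted batch yields a loss near the log of the batch size.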

If this is right

  • Audio, depth, and infrared encoders acquire semantic alignment solely through their contrastive links to language.
  • The same training recipe can add any new modality without requiring paired data between the new modality and existing non-language modalities.
  • Unified representations improve performance on retrieval and classification tasks across video, audio, depth, and infrared benchmarks.
  • Modalities become complementary in downstream applications because each contributes information routed through the common language space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same frozen-language anchor could be used to incorporate additional sensor streams such as thermal or LiDAR data without redesigning the alignment objective.
  • Scaling laws for adding modalities might be measured by tracking how retrieval performance changes as more encoders are trained sequentially against the same language space.
  • Applications that already rely on video-language models could gain infrared or depth understanding by simply attaching a new encoder and continuing contrastive training on modest new data.
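The "attach a new encoder and continue contrastive training" recipe in the last bullet can be sketched end to end. This is a toy NumPy version under loud assumptions: the frozen language anchors are random stand-ins for the text tower's caption embeddings, the "LiDAR" stream is hypothetical, the new encoder is a single linear map, and a plain L2 alignment objective stands in for the paper's contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, lang_dim, sensor_dim, steps, lr = 32, 24, 40, 200, 1e-3

# Frozen anchors: stand-ins for the pretrained text tower's caption
# embeddings. They are never updated -- that is the LanguageBind constraint.
language_anchor = rng.normal(size=(batch, lang_dim))

# Raw features from a hypothetical new sensor (say, LiDAR) paired only with
# the captions -- no pairs against audio/depth/infrared are needed.
lidar_raw = (language_anchor @ rng.normal(size=(lang_dim, sensor_dim))
             + 0.1 * rng.normal(size=(batch, sensor_dim)))

# New trainable encoder: one linear map, pulled toward the frozen anchors.
W = 0.01 * rng.normal(size=(sensor_dim, lang_dim))
losses = []
for _ in range(steps):
    err = lidar_raw @ W - language_anchor      # residual in the shared space
    losses.append(float((err ** 2).mean()))
    W -= lr * (lidar_raw.T @ err) / batch      # gradient step on the encoder
```

Because the anchor never moves, previously aligned modalities are untouched while the new encoder drifts toward the shared space; the loss trace decreases monotonically in this toy setting.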

Load-bearing premise

The language encoder trained only on video-text pairs already contains sufficiently rich semantics to serve as an effective binding anchor for infrared, depth, and audio without direct cross-modal supervision between those modalities.

What would settle it

If contrastive training against a randomly initialized language encoder produces the same retrieval accuracy on infrared-to-video and depth-to-audio tasks as training against the pretrained language encoder, the semantic-binding role of language would be refuted.
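The decisive comparison would be scored with a standard cross-modal retrieval metric. A minimal recall@1 implementation (a generic sketch, not the paper's evaluation code; index-aligned query/gallery pairs are assumed):

```python
import numpy as np

def recall_at_1(query_feats, gallery_feats):
    """Fraction of queries whose nearest gallery item by cosine similarity
    is the ground-truth match at the same index."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)
    return float((nearest == np.arange(len(q))).mean())
```

Running this twice, once with encoders trained against the pretrained language anchor and once against a randomly initialized one, on infrared-to-video and depth-to-audio pairs would produce the head-to-head numbers the criterion asks for.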

read the original abstract

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities. Code address: https://github.com/PKU-YuanGroup/LanguageBind

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LanguageBind to extend video-language pretraining to N modalities (N>=3) by freezing a language encoder pretrained on video-text pairs and training new encoders for infrared, depth, and audio via contrastive learning against it. This creates a shared semantic space using language as the binding modality. The authors introduce the VIDAL-10M dataset of aligned video/infrared/depth/audio/language pairs collected from short videos. They claim superior results on 15 benchmarks across video, audio, depth, and infrared, plus experimental evidence for indirect alignment and cross-modal complementarity.

Significance. If the results and ablations hold, the work is significant for providing a practical route to N-modal alignment that avoids collecting exhaustive cross-modal pairs. The VIDAL-10M dataset is a concrete community resource, and the extensive evaluation on 15 benchmarks plus code release are positive contributions. The approach builds directly on existing VL contrastive frameworks without introducing new free parameters in the binding step.

major comments (2)
  1. [§3] §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training.
  2. [§4] §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'superior performance on a wide range of 15 benchmarks' should be accompanied by at least one or two concrete metric values or benchmark names for immediate clarity.
  2. [Dataset section] Dataset description: Clarify the exact alignment procedure and quality control steps used to pair infrared/depth/audio with language descriptions in VIDAL-10M; this affects reproducibility of the indirect-alignment results.
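The probe asked for in major comment 1 could take a simple form: embed class names with the frozen text tower, embed modality samples with the candidate encoder, and classify by nearest language embedding. A generic sketch (the embeddings below are synthetic stand-ins; nothing here is the paper's code):

```python
import numpy as np

def zero_shot_classify(sample_feats, class_text_feats):
    """Assign each sample to the class whose (frozen) language embedding is
    nearest by cosine similarity. High accuracy before any contrastive
    training on the modality would indicate the text tower already carries
    usable semantics for it; chance-level accuracy would not."""
    s = sample_feats / np.linalg.norm(sample_feats, axis=1, keepdims=True)
    c = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    return (s @ c.T).argmax(axis=1)
```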

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to improve our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where the manuscript will be updated.

read point-by-point responses
  1. Referee: §3 (Method): The central claim of effective indirect alignment rests on the frozen language encoder (pretrained only on video-text) already containing transferable semantics for infrared thermal signatures, depth geometry, and acoustic events. This assumption is load-bearing; if the embeddings primarily encode RGB scene content, modality-specific information will be lost or distorted. The paper should include a concrete test (e.g., zero-shot transfer of language-derived features to infrared-only tasks or semantic probing of the language space on non-visual concepts) to quantify how much relevant semantics are present before contrastive training.

    Authors: We appreciate the referee's emphasis on validating the transferable semantics in the pretrained language encoder. Our manuscript already provides evidence for indirect alignment through multiple experiments demonstrating effective mapping of infrared, depth, and audio modalities to the language space, as well as cross-modal complementarity. To further quantify the semantics present prior to training as suggested, we will add a new subsection with semantic probing of the language embeddings on non-visual concepts and zero-shot transfer results on infrared and depth tasks in the revised manuscript. revision: yes

  2. Referee: §4 (Experiments): The abstract and results summary assert superior performance on 15 benchmarks and evidence of complementarity, yet provide no numerical deltas, baseline tables, or ablation controls in the high-level description. Without these, it is impossible to assess whether gains exceed standard contrastive scaling or dataset effects. Specific tables comparing against direct multimodal baselines and ablations removing the language anchor would be required to substantiate the N-modality extension claim.

    Authors: We agree that the high-level descriptions in the abstract and introduction would benefit from more specific references to the quantitative results. The full manuscript includes detailed tables in Section 4 with numerical results on all 15 benchmarks, comparisons to baselines, and ablations including those removing the language anchor. In the revision, we will update the abstract and results summary to include key numerical deltas and explicit pointers to these tables and ablation studies to better substantiate the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on external pretraining and a newly collected dataset.

full rationale

The paper's core procedure freezes an externally pretrained language encoder (from prior VL work) and applies standard contrastive loss to align new modality encoders on the independently collected VIDAL-10M dataset. No equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. Performance claims rest on empirical evaluation against external benchmarks rather than reducing to the training construction by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method assumes language embeddings from VL pretraining already encode semantics rich enough to align other modalities indirectly; the new dataset is the main empirical contribution.

axioms (1)
  • domain assumption Contrastive learning on language-paired data produces semantically meaningful embeddings for non-language modalities
    Invoked when the language encoder is frozen and other encoders are trained to match it.
invented entities (1)
  • VIDAL-10M dataset (no independent evidence)
    purpose: Supply aligned video, infrared, depth, audio, and language pairs for N-modality training
    Newly collected dataset introduced to support the method; no external validation cited in the abstract.

pith-pipeline@v0.9.0 · 5600 in / 1333 out tokens · 65290 ms · 2026-05-17T03:22:41.992852+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.

  2. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  3. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.

  4. PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

  5. Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.

  6. MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

    cs.CV 2025-12 conditional novelty 7.0

    MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.

  7. Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

    cs.CV 2025-11 unverdicted novelty 7.0

    Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

  8. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  9. ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

    cs.IR 2026-05 unverdicted novelty 6.0

    ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.

  10. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 6.0

    EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradien...

  11. Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

    cs.SD 2026-04 unverdicted novelty 6.0

    TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.

  12. The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

    cs.CV 2025-11 unverdicted novelty 6.0

    Contrastive Fusion (ConFu) adds a fused-modality contrastive term to jointly align individual modalities and their combinations, enabling capture of higher-order dependencies like XOR relations while preserving pairwi...

  13. TempCompass: Do Video LLMs Really Understand Videos?

    cs.CV 2024-03 unverdicted novelty 6.0

    TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

  14. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  15. Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.

  16. Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

    cs.IR 2026-03 unverdicted novelty 5.0

    Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.

  17. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  18. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  19. ClimateVID -- Social Media Videos Analysis and Challenges Involved

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.

  20. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

202 extracted references · 202 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017

  2. [3]

    Convolutional neural networks for static and dynamic breast infrared imaging classification

    Matheus de Freitas Oliveira Baffa and Lucas Grassano Lattari. Convolutional neural networks for static and dynamic breast infrared imaging classification. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp.\ 174--181. IEEE, 2018

  3. [4]

    Interactive intrinsic video editing

    Nicolas Bonneel, Kalyan Sunkavalli, James Tompkin, Deqing Sun, Sylvain Paris, and Hanspeter Pfister. Interactive intrinsic video editing. ACM Transactions on Graphics (TOG), 33(6): 1--10, 2014

  4. [5]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pp.\ 961--970, 2015

  5. [6]

    Estimating depth from monocular images as classification using deep fully convolutional residual networks

    Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(11): 3174--3182, 2017

  6. [7]

    Simplifying video editing using metadata

    Juan Casares, A Chris Long, Brad A Myers, Rishi Bhatnagar, Scott M Stevens, Laura Dabbish, Dan Yocum, and Albert Corbett. Simplifying video editing using metadata. In Proceedings of the 4th conference on Designing interactive systems: processes, practices, methods, and techniques, pp.\ 157--166, 2002

  7. [8]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3558--3568, 2021

  8. [9]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp.\ 190--200, 2011

  9. [13]

    Content-based video recommendation system based on stylistic visual features

    Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics, 5: 99--113, 2016

  10. [14]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009

  11. [16]

    Freesound technical demo

    Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp.\ 411--412, 2013

  12. [17]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 776--780. IEEE, 2017

  13. [18]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15180--15190, 2023

  14. [19]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022

  15. [20]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6047--6056, 2018

  16. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  17. [22]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000--16009, 2022

  18. [25]

    Llvip: A visible-infrared paired dataset for low-light vision

    Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3496--3504, 2021

  19. [26]

    Large-scale video classification with convolutional neural networks

    Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.\ 1725--1732, 2014

  20. [28]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, pp. 2, 2019

  21. [29]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

  22. [31]

    Mmact: A large-scale dataset for cross modal human action understanding

    Quan Kong, Ziming Wu, Ziwei Deng, Martin Klinkigt, Bin Tong, and Tomokazu Murakami. Mmact: A large-scale dataset for cross modal human action understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8658--8667, 2019

  23. [32]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556--2563. IEEE, 2011

  24. [33]

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels

    Dong-Guw Lee, Myung-Hwan Jeon, Younggun Cho, and Ayoung Kim. Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 8291--8298. IEEE, 2023

  25. [35]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23390--23400, 2023b

  26. [36]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740--755. Springer, 2014

  27. [37]

    Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508: 293--304, 2022

  28. [38]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 2630--2640, 2019

  29. [39]

    Learning joint embedding with multimodal cues for cross-modal video-text retrieval

    Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp.\ 19--27, 2018

  30. [40]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In European Conference on Computer Vision, pp.\ 407--426. Springer, 2022

  31. [42]

    Esc: Dataset for environmental sound classification

    Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp.\ 1015--1018, 2015

  32. [43]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  33. [44]

    Recognizing human actions: a local svm approach

    Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 3, pp.\ 32--36. IEEE, 2004

  34. [45]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556--2565, 2018

  35. [47]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pp. 746--760. Springer, 2012

  36. [48]

    Two-stream convolutional networks for action recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014

  37. [49]

    Image and video search engine for the world wide web

    John R Smith and Shih-Fu Chang. Image and video search engine for the world wide web. In Storage and Retrieval for Image and Video Databases V, volume 3022, pp. 84--95. SPIE, 1997

  38. [51]

    Free teledyne flir thermal dataset for algorithm training

    Teledyne FLIR. Free teledyne flir thermal dataset for algorithm training. https://www.flir.com/oem/adas/adas-dataset-form/, 2015a. Accessed: 2023-09-16

  39. [52]

    Free teledyne flir thermal dataset for algorithm training

    Teledyne FLIR. Free teledyne flir thermal dataset for algorithm training. https://www.flir.com/oem/adas/adas-dataset-form/, 2015b. Accessed: 2023-09-16

  40. [53]

    Audio-visual event localization in unconstrained videos

    Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European conference on computer vision (ECCV), pp. 247--263, 2018

  41. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  42. [55]

    Omnivl: One foundation model for image-language and video-language tasks

    Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, and Lu Yuan. Omnivl: One foundation model for image-language and video-language tasks. Advances in neural information processing systems, 35: 5696--5710, 2022a

  43. [58]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  44. [59]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288--5296, 2016

  45. [60]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036--5045, 2022

  46. [65]

    Pointclip: Point cloud understanding by clip

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8552--8562, 2022

  47. [67]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pp. 746--760. Springer, 2012

  48. [68]

    Rethinking CNN models for audio classification

    Rethinking CNN models for audio classification. arXiv preprint arXiv:2007.11154, 2020

  49. [69]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036--5045, 2022

  50. [70]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  51. [71]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2019

  52. [72]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  53. [73]

    ESC: Dataset for environmental sound classification

    Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015--1018, 2015

  54. [74]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748--8763. PMLR, 2021

  55. [75]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [76]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18 (7): 1527--1554, 2006

  57. [77]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016

  58. [78]

    Convolutional neural networks for static and dynamic breast infrared imaging classification

    Convolutional neural networks for static and dynamic breast infrared imaging classification. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2018

  59. [79]

    Learning joint embedding with multimodal cues for cross-modal video-text retrieval

    Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19--27, 2018

  60. [80]

    Large-scale video classification with convolutional neural networks

    Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014

  61. [81]

    Estimating depth from monocular images as classification using deep fully convolutional residual networks

    Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 2017

  62. [82]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  63. [83]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778, 2016

  64. [84]

    Interactive intrinsic video editing

    Interactive intrinsic video editing. ACM Transactions on Graphics (TOG), 2014

  65. [85]

    Simplifying video editing using metadata

    Simplifying video editing using metadata. In Proceedings of the 4th conference on Designing interactive systems: processes, practices, methods, and techniques

  66. [86]

    Large-scale content-only video recommendation

    Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision Workshops

  67. [87]

    Content-based video recommendation system based on stylistic visual features

    Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics, 2016

  68. [88]

    Image and video search engine for the world wide web

    John R Smith and Shih-Fu Chang. Image and video search engine for the world wide web. In Storage and Retrieval for Image and Video Databases V, volume 3022, pp. 84--95. SPIE, 1997

  69. [89]

    Towards optimal bag-of-features for object categorization and semantic video retrieval

    Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM international conference on Image and video retrieval

  70. [90]

    Localizing Moments in Video with Temporal Language

    Localizing moments in video with temporal language. arXiv preprint arXiv:1809.01337, 2018

  71. [91]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  72. [92]

    Global-local path networks for monocular depth estimation with vertical cutdepth

    Global-local path networks for monocular depth estimation with vertical cutdepth. arXiv preprint arXiv:2201.07436, 2022

  73. [93]

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels

    Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023

  74. [94]

    HMDB: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estibaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision. IEEE, 2011

  75. [95]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  76. [96]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 2015

  77. [97]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 2018

  78. [98]

    Ferv39k: a large-scale multi-scene dataset for facial expression recognition in videos

    Ferv39k: a large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  79. [99]

    Hacs: Human action clips and segments dataset for recognition and temporal localization

    Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  80. [100]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, 2017

Showing first 80 references.