{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:QT2OC6LCZXLTZ47EOLBAFJSMSP","short_pith_number":"pith:QT2OC6LC","schema_version":"1.0","canonical_sha256":"84f4e17962cdd73cf3e472c202a64c93dc2fd8ff7032b1327a7be4af869fdfc7","source":{"kind":"arxiv","id":"2310.01852","version":7},"attestation_state":"computed","paper":{"title":"LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Hongfa Wang, Jiaxi Cui, Junwu Zhang, Li Yuan, Munan Ning, Wancai Zhang, Wei Liu, Wenhao Jiang, Yang Yan, Yatian Pang, ZhiFeng Li, Zongwei Li","submitted_at":"2023-10-03T07:33:27Z","abstract_excerpt":"The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feat"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2310.01852","kind":"arxiv","version":7},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-10-03T07:33:27Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"7e6e9944d2ba0e8600d523445e1059096ea4896ecd671d4aa18fb88ebe4f6991","abstract_canon_sha256":"a89d3f1b0310aacfd85d336f1ea4af2f879bd89dae84a6100731d407e7c398cc"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.256300Z","signature_b64":"yjwH6st+4N/Po4Wr1RrfXKYTaCWB1bUBg1meR6gPj6OIbipkDYUcOProk607GjQ1XZXu7BR4KvOewjYhzH/kDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"84f4e17962cdd73cf3e472c202a64c93dc2fd8ff7032b1327a7be4af869fdfc7","last_reissued_at":"2026-05-17T23:38:15.255731Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.255731Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Hongfa Wang, Jiaxi Cui, Junwu Zhang, Li Yuan, Munan Ning, Wancai Zhang, Wei Liu, Wenhao Jiang, Yang Yan, Yatian Pang, ZhiFeng Li, Zongwei Li","submitted_at":"2023-10-03T07:33:27Z","abstract_excerpt":"The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feat"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a language encoder trained only on video-text pairs already contains sufficiently rich semantics to serve as an effective binding anchor for infrared, depth, and audio without direct cross-modal supervision between those modalities.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3291f03733b970be4d19dec471342430ac0561520121797ae1bc1d137f83eee2"},"source":{"id":"2310.01852","kind":"arxiv","version":7},"verdict":{"id":"e202374a-68cb-4ef4-8e7f-a7458b2a1fca","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T03:22:41.992852Z","strongest_claim":"LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.","one_line_summary":"LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a language encoder trained only on video-text pairs already contains sufficiently rich semantics to serve as an effective binding anchor for infrared, depth, and audio without direct cross-modal supervision between those modalities.","pith_extraction_headline":"Language serves as a semantic anchor to align video, audio, depth, and infrared into one shared feature space."},"references":{"count":202,"sample":[{"doi":"","year":2017,"title":"Localizing moments in video with natural language","work_id":"60648aa4-9c56-4c4e-965b-63ce097f94a6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Convolutional neural networks for static and dynamic breast infrared imaging classification","work_id":"7aa7caf9-15c9-48f1-84eb-0a01dfc3274b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"Interactive intrinsic video editing","work_id":"1091ffc8-2e17-4d60-ba40-1cfec22d8ae8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Activitynet: A large-scale video benchmark for human activity understanding","work_id":"da056c16-524b-48ee-8932-184520fa61cc","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Estimating depth from monocular images as classification using deep fully convolutional residual networks","work_id":"133f4176-a408-468c-b4ee-d39cccb97f9a","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":202,"snapshot_sha256":"6d8055846ff12dd16c3472988b600c5720281ba362e926f713461c60240adc2b","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"319357faa4f43578d3c7d4828459bdafe42a36857d37b028fdb784975e3da570"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.01852","created_at":"2026-05-17T23:38:15.255811+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.01852v7","created_at":"2026-05-17T23:38:15.255811+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.01852","created_at":"2026-05-17T23:38:15.255811+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":22,"internal_anchor_count":22,"sample":[{"citing_arxiv_id":"2409.07825","citing_title":"Deep Multimodal Learning with Missing Modality: A Survey","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2511.12034","citing_title":"Calibrated Multimodal Representation Learning with Missing Modalities","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21998","citing_title":"Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21331","citing_title":"The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2403.00476","citing_title":"TempCompass: Do Video LLMs Really Understand Videos?","ref_index":134,"is_internal_anchor":true},{"citing_arxiv_id":"2512.17492","citing_title":"MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding","ref_index":107,"is_internal_anchor":true},{"citing_arxiv_id":"2602.03342","citing_title":"Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2603.08819","citing_title":"Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2406.04264","citing_title":"MLVU: Benchmarking Multi-task Long Video Understanding","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2311.10122","citing_title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","ref_index":107,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":187,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11043","citing_title":"EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2506.03147","citing_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27968","citing_title":"ClimateVID -- Social Media Videos Analysis and Challenges Involved","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09836","citing_title":"ReCoVR: Closing the Loop in Interactive Composed Video Retrieval","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23198","citing_title":"StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19567","citing_title":"Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11043","citing_title":"EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08147","citing_title":"Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07763","citing_title":"Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2406.07476","citing_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08125","citing_title":"PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction","ref_index":97,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP","json":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP.json","graph_json":"https://pith.science/api/pith-number/QT2OC6LCZXLTZ47EOLBAFJSMSP/graph.json","events_json":"https://pith.science/api/pith-number/QT2OC6LCZXLTZ47EOLBAFJSMSP/events.json","paper":"https://pith.science/paper/QT2OC6LC"},"agent_actions":{"view_html":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP","download_json":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP.json","view_paper":"https://pith.science/paper/QT2OC6LC","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.01852&json=true","fetch_graph":"https://pith.science/api/pith-number/QT2OC6LCZXLTZ47EOLBAFJSMSP/graph.json","fetch_events":"https://pith.science/api/pith-number/QT2OC6LCZXLTZ47EOLBAFJSMSP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP/action/storage_attestation","attest_author":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP/action/author_attestation","sign_citation":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP/action/citation_signature","submit_replication":"https://pith.science/pith/QT2OC6LCZXLTZ47EOLBAFJSMSP/action/replication_record"}},"created_at":"2026-05-17T23:38:15.255811+00:00","updated_at":"2026-05-17T23:38:15.255811+00:00"}