{"work":{"id":"b9701eca-d05e-4d2e-9045-6761df4ba175","openalex_id":null,"doi":null,"arxiv_id":"2729.2023","raw_key":null,"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp","authors":null,"authors_text":"J","year":2023,"venue":null,"abstract":null,"external_url":"https://arxiv.org/abs/2729.2023","cited_by_count":null,"metadata_source":"arxiv_reference","metadata_fetched_at":"2026-05-25T09:05:35.673668+00:00","pith_arxiv_id":null,"created_at":"2026-05-09T18:55:07.585017+00:00","updated_at":"2026-05-25T09:05:35.673668+00:00","title_quality_ok":false,"display_title":"ImageBind One Embedding Space to Bind Them All","render_title":"ImageBind One Embedding Space to Bind Them All"},"hub":{"state":{"work_id":"b9701eca-d05e-4d2e-9045-6761df4ba175","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":119,"external_cited_by_count":null,"distinct_field_count":18,"first_pith_cited_at":"2024-01-07T18:12:20+00:00","last_pith_cited_at":"2026-05-22T17:20:17+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T18:54:55.374121+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":36},{"context_role":"baseline","n":7},{"context_role":"method","n":4},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":37},{"context_polarity":"baseline","n":7},{"context_polarity":"use_method","n":4},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"ImageBind One Embedding Space to Bind Them All","claims":[{"claim_text":"Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data Conference acronym 'XX, June 03-05, 2018, Woodstock, NY Table 4: The classification accuracy of the competing models on the five subsets of the SoyAgeing dataset. Here, ViT-B [ 10] denotes the Visual Transformer base model, while Swin-B [28] represents the Swin Transformer base model. Method Backbone Top 1 Accuracy (%) R1 R3 R4 R5 R6 Average SimCLR ICML20 [6] ResNet-50 53.6 45.7 45.4 50.4 35.9 46.2 BY","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"are widely used for real-time object detection and instance segmentation in robotics applications. Additionally, mul- timodal sensing, which combines visual, LiDAR, and radar data, has improved the robustness of perception systems in autonomous vehicles and drones. Furthermore, modern autonomous vehicles employ occupancy networks to re- construct 3D environments [126] and incorporate flow information to predict object motion [2], further improving situational awareness and navigation. Decision-M","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Situation Analysis MovieGraphs (Q) [89], HLVU (Q) [93], Social-IQ (Q) [94], DeSIQ (Q) [95] Narrative & Rhetorical Analysis Humor/Sarcasm/Satire MUStARD (C) [96], UR-FUNNY (C) [97], MHD (C) [98], WITS (C) [99], ExFunTube (Cap) [100], YesBut (C) [101], V-FLUTE (Cap,C) [102],AVH (R) [103], FOR (C) [103] Visual Metaphors VMC (Cap) [104], V-FLUTE (Cap,C) [102], Mul- tiMET (C) [105], MetaCLUE (C,Q,Cap) [106] Misinformation VMH (C) [107], FakeSV (C) [108], Fake Video Cor- pus (C) [109], NewsCLIPpings (","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"T= 16 consecutive frames to maintain the continuity of the frequency spectrum. Finally, these frames undergo a comprehensive video-level augmentation pipeline to simulate real-world variations. Baselines.For a comprehensive evaluation, we compare SpInShield with the following advanced and representative baselines, which are categorized into:Frame-level methods: SLADD [ 2], SBI [35], UCF [46], IID [ 16], LSDA [ 45], ProDet [ 3], and CDFA [ 23];Video-level methods: TALL [ 44], SLF [5], NACO [52], ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"To achieve an optimal equilibrium among the constituent objective functions, the weighting coeﬀicients λ1, λ2, λ3, and λ4 are empirically assigned values of 4.0, 24.0, 0.5, and 4.5, respectively. To benchmark our network's performance, we compare it against eight recent advanced VIF architectures: EMMA [35], SwinFusion [14], T2EA [36], WaveFusion [37], CDDFuse [38], MaeFuse [39], SPDFusion [40], and TDFusion [18]. Quantitative assessment is carried out using a suite of five established evaluatio","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"full-head fidelity while generalizing to unconstrained real portraits, especially in hard regions such as hair, side/back head, accessories, and teeth. This issue is clearly reflected in existing model designs. Early 3DMM-based methods [2, 11, 14, 16, 42, 55] are mainly front- face oriented, so they often miss full-head regions. Tri-Plane meth- ods [1, 23, 50] rely on implicit representations, which makes geomet- ric constraints weaker and often leads to insufficient 3D consistency across views.","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks ImageBind One Embedding Space to Bind Them All because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (34 contexts).","role_counts":[{"n":34,"context_role":"background"},{"n":6,"context_role":"baseline"},{"n":3,"context_role":"method"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-21T07:12:39.576041+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ad0a11de-6175-4339-a8fa-1f60baf1c99e","orcid":null,"display_name":"doi: 10"}]},"error":null,"updated_at":"2026-05-21T07:12:39.573701+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T12:50:34.489423+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Masked autoencoders are scalable vision learners","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":19},{"title":"Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs","work_id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","shared_citers":17},{"title":"& Vondrick, C","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":16},{"title":"IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":12},{"title":"MambaVision: A hybrid Mamba- Transformer vision backbone","work_id":"d0e5199d-8907-47b1-905a-07ab8b623a4c","shared_citers":12},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":11},{"title":"URL https://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":9},{"title":"Editing conditional radiance fields","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":8},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":5},{"title":"Tomasi and R","work_id":"135418b1-cafd-49fd-803d-1ca6433d4b1b","shared_citers":5},{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Recognizing indoor scenes","work_id":"45b0bfd8-65dc-4252-b2ab-2f6b411d04d0","shared_citers":4},{"title":"URLhttp://dx.doi.org/10.1109/CVPR.2016.90","work_id":"b353bda2-591d-479a-9c8b-22dfcba12431","shared_citers":4},{"title":"why should I trust you?","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":4},{"title":"Conditional prompt learning for vision- language models","work_id":"025819dc-724a-4ff8-ba0a-0ba72c046d8c","shared_citers":3},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":3},{"title":"Heinig, K","work_id":"cf4c4e77-acaa-46b4-b066-ddf045165d05","shared_citers":3},{"title":"In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"caaf86e4-4cdb-450e-80c7-d20d4375abae","shared_citers":3},{"title":"Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y","work_id":"ad0a8ee9-e814-486c-a25b-40126e136b0f","shared_citers":3},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":3}],"time_series":[{"n":1,"year":2024},{"n":55,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T12:50:43.024299+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T12:50:39.572674+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"ImageBind One Embedding Space to Bind Them All","claims":[{"claim_text":"Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data Conference acronym 'XX, June 03-05, 2018, Woodstock, NY Table 4: The classification accuracy of the competing models on the five subsets of the SoyAgeing dataset. Here, ViT-B [ 10] denotes the Visual Transformer base model, while Swin-B [28] represents the Swin Transformer base model. Method Backbone Top 1 Accuracy (%) R1 R3 R4 R5 R6 Average SimCLR ICML20 [6] ResNet-50 53.6 45.7 45.4 50.4 35.9 46.2 BY","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"are widely used for real-time object detection and instance segmentation in robotics applications. Additionally, mul- timodal sensing, which combines visual, LiDAR, and radar data, has improved the robustness of perception systems in autonomous vehicles and drones. Furthermore, modern autonomous vehicles employ occupancy networks to re- construct 3D environments [126] and incorporate flow information to predict object motion [2], further improving situational awareness and navigation. Decision-M","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Situation Analysis MovieGraphs (Q) [89], HLVU (Q) [93], Social-IQ (Q) [94], DeSIQ (Q) [95] Narrative & Rhetorical Analysis Humor/Sarcasm/Satire MUStARD (C) [96], UR-FUNNY (C) [97], MHD (C) [98], WITS (C) [99], ExFunTube (Cap) [100], YesBut (C) [101], V-FLUTE (Cap,C) [102],AVH (R) [103], FOR (C) [103] Visual Metaphors VMC (Cap) [104], V-FLUTE (Cap,C) [102], Mul- tiMET (C) [105], MetaCLUE (C,Q,Cap) [106] Misinformation VMH (C) [107], FakeSV (C) [108], Fake Video Cor- pus (C) [109], NewsCLIPpings (","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"T= 16 consecutive frames to maintain the continuity of the frequency spectrum. Finally, these frames undergo a comprehensive video-level augmentation pipeline to simulate real-world variations. Baselines.For a comprehensive evaluation, we compare SpInShield with the following advanced and representative baselines, which are categorized into:Frame-level methods: SLADD [ 2], SBI [35], UCF [46], IID [ 16], LSDA [ 45], ProDet [ 3], and CDFA [ 23];Video-level methods: TALL [ 44], SLF [5], NACO [52], ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"To achieve an optimal equilibrium among the constituent objective functions, the weighting coeﬀicients λ1, λ2, λ3, and λ4 are empirically assigned values of 4.0, 24.0, 0.5, and 4.5, respectively. To benchmark our network's performance, we compare it against eight recent advanced VIF architectures: EMMA [35], SwinFusion [14], T2EA [36], WaveFusion [37], CDDFuse [38], MaeFuse [39], SPDFusion [40], and TDFusion [18]. Quantitative assessment is carried out using a suite of five established evaluatio","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"full-head fidelity while generalizing to unconstrained real portraits, especially in hard regions such as hair, side/back head, accessories, and teeth. This issue is clearly reflected in existing model designs. Early 3DMM-based methods [2, 11, 14, 16, 42, 55] are mainly front- face oriented, so they often miss full-head regions. Tri-Plane meth- ods [1, 23, 50] rely on implicit representations, which makes geomet- ric constraints weaker and often leads to insufficient 3D consistency across views.","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks ImageBind One Embedding Space to Bind Them All because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (34 contexts).","role_counts":[{"n":34,"context_role":"background"},{"n":6,"context_role":"baseline"},{"n":3,"context_role":"method"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-21T07:12:39.011239+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)","claims":[],"why_cited":"Pith tracks In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T12:50:34.492968+00:00"}},"summary":{"title":"In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)","claims":[],"why_cited":"Pith tracks In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Masked autoencoders are scalable vision learners","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":19},{"title":"Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs","work_id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","shared_citers":17},{"title":"& Vondrick, C","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":16},{"title":"IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":12},{"title":"MambaVision: A hybrid Mamba- Transformer vision backbone","work_id":"d0e5199d-8907-47b1-905a-07ab8b623a4c","shared_citers":12},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":11},{"title":"URL https://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":9},{"title":"Editing conditional radiance fields","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":8},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":5},{"title":"Tomasi and R","work_id":"135418b1-cafd-49fd-803d-1ca6433d4b1b","shared_citers":5},{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Recognizing indoor scenes","work_id":"45b0bfd8-65dc-4252-b2ab-2f6b411d04d0","shared_citers":4},{"title":"URLhttp://dx.doi.org/10.1109/CVPR.2016.90","work_id":"b353bda2-591d-479a-9c8b-22dfcba12431","shared_citers":4},{"title":"why should I trust you?","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":4},{"title":"Conditional prompt learning for vision- language models","work_id":"025819dc-724a-4ff8-ba0a-0ba72c046d8c","shared_citers":3},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":3},{"title":"Heinig, K","work_id":"cf4c4e77-acaa-46b4-b066-ddf045165d05","shared_citers":3},{"title":"In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"caaf86e4-4cdb-450e-80c7-d20d4375abae","shared_citers":3},{"title":"Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y","work_id":"ad0a8ee9-e814-486c-a25b-40126e136b0f","shared_citers":3},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":3}],"time_series":[{"n":1,"year":2024},{"n":55,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"ad0a11de-6175-4339-a8fa-1f60baf1c99e","orcid":null,"display_name":"doi: 10","source":"manual","import_confidence":0.72}]}}