Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. arXiv preprint arXiv:2211.01335

2022 · arXiv 2211.01335

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection

cs.LG · 2025-12-19 · unverdicted · novelty 7.0

DCCF disentangles fact and sentiment in multimodal data, applies dynamic polarization to extract conflicts, and uses a conflict-consensus mechanism to improve fake news detection accuracy by 3.52% on average over baselines.

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

MyoSem is a multimodal alignment framework that maps EMG signals to text-based action semantics for bidirectional retrieval and improved generalization in hand action understanding.

MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

A tri-modal contrastive learning method for EEG-based zero-shot visual decoding reports 54.1% top-1 accuracy on the Things-EEG2 200-way benchmark, outperforming prior baselines of 32.4%.

TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.

Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

cs.IR · 2026-05-17 · unverdicted · novelty 6.0

TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.

DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.

JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication

cs.IR · 2026-02-13 · unverdicted · novelty 5.0

JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

cs.CV · 2026-06-25 · unverdicted · novelty 4.0

JuZhou 1.0 is a 0.387B-parameter T2I diffusion model with 4-step inference achieving 0.69 GenEval, trained on 9M Chinese pairs using Sugon K100 accelerators and deployable on Android/iOS devices.

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

cs.IR · 2026-05-28 · unverdicted · novelty 4.0

UniNote proposes a two-stage trained unified embedding model (contrastive SFT then RL) for multimodal I2I retrieval that claims SOTA results and was deployed at Xiaohongshu with MRL for improved quality and efficiency.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

cs.CV · 2025-11-27

citing papers explorer

Showing 1 of 1 citing paper after filters.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · none · ref 163

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
