Chinese clip: Contrastive vision-language pretraining in chinese

An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou · 2022 · arXiv 2211.01335

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection

cs.LG · 2025-12-19 · unverdicted · novelty 7.0

DCCF disentangles fact and sentiment in multimodal data, applies dynamic polarization to extract conflicts, and uses a conflict-consensus mechanism to improve fake news detection accuracy by 3.52% on average over baselines.

TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.

Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

cs.IR · 2026-05-17 · unverdicted · novelty 6.0

TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.

DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.

JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication

cs.IR · 2026-02-13 · unverdicted · novelty 5.0

JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

cs.CV · 2025-11-27 · unverdicted · novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

citing papers explorer

Showing 8 of 8 citing papers.

Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection cs.LG · 2025-12-19 · unverdicted · none · ref 20
DCCF disentangles fact and sentiment in multimodal data, applies dynamic polarization to extract conflicts, and uses a conflict-consensus mechanism to improve fake news detection accuracy by 3.52% on average over baselines.
TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval cs.IR · 2026-05-18 · unverdicted · none · ref 14
TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.
Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation cs.IR · 2026-05-17 · unverdicted · none · ref 32
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement cs.CV · 2026-04-15 · unverdicted · none · ref 40
DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.
JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication cs.IR · 2026-02-13 · unverdicted · none · ref 34
JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unverdicted · none · ref 86
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 163
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 279
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Chinese clip: Contrastive vision-language pretraining in chinese

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer