DCCF disentangles fact and sentiment in multimodal data, applies dynamic polarization to extract conflicts, and uses a conflict-consensus mechanism to improve fake news detection accuracy by 3.52% on average over baselines.
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. arXiv preprint arXiv:2211.01335
12 Pith papers cite this work.
representative citing papers
MyoSem is a multimodal alignment framework that maps EMG signals to text-based action semantics for bidirectional retrieval and improved generalization in hand action understanding.
A tri-modal contrastive learning method for EEG-based zero-shot visual decoding reports 54.1% top-1 accuracy on the Things-EEG2 200-way benchmark, compared with 32.4% for prior baselines.
TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.
JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
JuZhou 1.0 is a 0.387B-parameter text-to-image (T2I) diffusion model with 4-step inference that achieves a GenEval score of 0.69; it was trained on 9M Chinese pairs using Sugon K100 accelerators and is deployable on Android/iOS devices.
UniNote proposes a two-stage trained unified embedding model (contrastive SFT then RL) for multimodal I2I retrieval that claims SOTA results and was deployed at Xiaohongshu with MRL for improved quality and efficiency.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.