{"total":13,"items":[{"citing_arxiv_id":"2606.28421","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators","primary_cat":"cs.CV","submitted_at":"2026-06-25T17:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JuZhou 1.0 is a 0.387B-parameter T2I diffusion model with 4-step inference achieving 0.69 GenEval, trained on 9M Chinese pairs using Sugon K100 accelerators and deployable on Android/iOS devices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04591","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues","primary_cat":"cs.CL","submitted_at":"2026-06-03T08:29:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00174","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-29T13:47:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MyoSem is a multimodal alignment framework that maps EMG signals to text-based action semantics for bidirectional retrieval and improved generalization in hand action 
understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29287","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniNote: A Unified Embedding Model for Multimodal Representation and Ranking","primary_cat":"cs.IR","submitted_at":"2026-05-28T03:11:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UniNote proposes a two-stage trained unified embedding model (contrastive SFT then RL) for multimodal I2I retrieval that claims SOTA results and was deployed at Xiaohongshu with MRL for improved quality and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24523","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding","primary_cat":"cs.LG","submitted_at":"2026-05-23T11:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A tri-modal contrastive learning method for EEG-based zero-shot visual decoding reports 54.1% top-1 accuracy on the Things-EEG2 200-way benchmark, outperforming prior baselines of 32.4%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18434","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval","primary_cat":"cs.IR","submitted_at":"2026-05-18T14:07:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in 
image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17366","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation","primary_cat":"cs.IR","submitted_at":"2026-05-17T10:20:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13797","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:34:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12941","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication","primary_cat":"cs.IR","submitted_at":"2026-02-13T13:57:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JARVIS 
combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20670","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection","primary_cat":"cs.LG","submitted_at":"2025-12-19T10:20:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DCCF disentangles fact and sentiment in multimodal data, applies dynamic polarization to extract conflicts, and uses a conflict-consensus mechanism to improve fake news detection accuracy by 3.52% on average over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"to Chinese culture. The same model also performs safety assessment by assigning Not-Safe-for-Work (NSFW) scores, allowing for the unified filtering of both semantically irrelevant and inappropriate content. Cross-Modal Consistency and Captioning. The alignment between an image and its textual description is paramount. • Text-Image Correlation: We use CN-CLIP [86] to compute the alignment score between an image and its associated alt caption. 
Pairs with low correlation scores are discarded to ensure the relevance of textual supervision. • Multi-Level Captioning: For all images selected for pre-training, we generate a structured set of captions, including concise tags, short phrases, and detailed long-form descriptions."},{"citing_arxiv_id":"2502.10248","ref_index":279,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","primary_cat":"cs.CV","submitted_at":"2025-02-14T15:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":163,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[161] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288-5296, 2016. 10, 14, 16 [162] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335, 2022. 
7, 8, 10 [163] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752, 2023. 3 [164] Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual uni-"}],"limit":50,"offset":0}