3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED: 3
citing papers explorer
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
  A new cross-cultural benchmark shows vision-language models infer structured cultural metadata from images inconsistently, with fragmented signals and large performance gaps across regions and metadata types.
- Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
  A collaborative parametric knowledge calibration framework for retrieval-augmented KB-VQA enables bidirectional knowledge sharing between retriever and generator, yielding a 4.7% accuracy gain and 7.5% boost to base MLLMs via late interaction and reflective answering.
- Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation
  TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.