arxiv: 2510.21151 · v2 · pith:YDUG4D5Ynew · submitted 2025-10-24 · 💻 cs.IR

VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion

classification 💻 cs.IR
keywords conversational, multimodal, human, recommendation, vogue, dataset, dialogue, fashion

Multimodal conversational recommendation has recently emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet currently available multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, which constrains the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a dataset of 60 human-human dialogues containing 2100 granularly labeled utterances in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles, and post-conversation ratings from both users (Seekers) and recommenders (Assistants). This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth preferences but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals. Our analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue; for example, recommenders frequently recommend items simultaneously in feature-based groups, which creates distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking Multimodal Large Language Models (MLLMs) against human Recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distribution errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE as both a unique resource for studying multimodal conversational systems and a challenge dataset beyond the current recommendation capabilities of existing top-tier multimodal foundation models such as GPT-5-mini and Gemini-2.5-Flash.
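The abstract distinguishes aggregate alignment with ground-truth preferences from calibration against the full rating distribution. The following is a minimal Python sketch of why the two can diverge. Everything in it is an assumption made for illustration: the 1-5 rating scale, the toy ratings, and the choice of mean absolute error and total variation distance as metrics are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical 1-5 rating scale; the paper's actual scale is not stated here.
SCALE = (1, 2, 3, 4, 5)


def mean_absolute_error(predicted, actual):
    """Average per-item gap between predicted and ground-truth ratings."""
    if len(predicted) != len(actual):
        raise ValueError("rating lists must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rating_distribution(ratings, scale=SCALE):
    """Fraction of ratings falling on each point of the scale."""
    counts = Counter(ratings)
    return [counts[r] / len(ratings) for r in scale]


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy data (invented): the predictor avoids the extreme ratings 1 and 5.
seeker_ratings = [1, 2, 3, 4, 5, 1, 5, 3]
model_ratings = [2, 2, 3, 4, 4, 2, 4, 3]

mae = mean_absolute_error(model_ratings, seeker_ratings)
tvd = total_variation_distance(
    rating_distribution(model_ratings),
    rating_distribution(seeker_ratings),
)
print(f"MAE: {mae:.2f}, total variation distance: {tvd:.2f}")
```

In this toy case the predictor is off by at most one point per item, yet it never produces a 1 or a 5, so half of the probability mass sits on the wrong ratings. A per-item error metric alone would hide that kind of systematic distribution error.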

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  2. TRACE: Tourism Recommendation with Accountable Citation Evidence

    cs.IR 2026-05 unverdicted novelty 7.0

    TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...

  3. Evaluating Scene-based In-Situ Item Labeling for Immersive Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    Existing methods for selecting in-situ labels in immersive recommendation scenes often show redundant or incomplete information and fail to anticipate users' proactive information needs.

  4. FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

    cs.CV 2026-04 unverdicted novelty 6.0

    A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identit...