VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion

David Guo; Jiazhou Liang; Minqi Sun; Scott Sanner; Yilun Jiang

arXiv:2510.21151v2 · pith:YDUG4D5Y · submitted 2025-10-24 · cs.IR

David Guo, Minqi Sun, Yilun Jiang, Jiazhou Liang, Scott Sanner

classification cs.IR

keywords conversational · multimodal · human · recommendation · vogue · dataset · dialogue · fashion

Multimodal conversational recommendation has recently emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet currently available multimodal conversational recommendation datasets remain limited: existing resources simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a dataset of 60 human-human dialogues containing 2100 granularly labeled utterances in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles, and post-conversation ratings from both users (Seekers) and recommenders (Assistants). This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth preferences but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals. Our analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue; for example, recommenders frequently recommend several items simultaneously in feature-based groups, which creates distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking Multimodal Large Language Models (MLLMs) against human Recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distribution errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE as both a unique resource for studying multimodal conversational systems and a challenge dataset beyond the current recommendation capabilities of existing top-tier multimodal foundation models such as GPT-5-mini and Gemini-2.5-Flash.
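The abstract distinguishes per-item alignment from calibration against the full rating distribution. A minimal Python sketch of that distinction follows. It is not the paper's evaluation code: the 1-to-5 rating scale, the toy ratings, and the choice of mean absolute error and total variation distance are all assumptions made for illustration, since the abstract does not name the metrics used.

```python
from collections import Counter
from typing import Sequence


def mean_absolute_error(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Per-item alignment: average absolute gap between paired ratings."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def rating_distribution(ratings: Sequence[int], scale: Sequence[int]) -> list[float]:
    """Share of ratings that fall on each point of the scale."""
    counts = Counter(ratings)
    return [counts[r] / len(ratings) for r in scale]


def total_variation(pred: Sequence[int], truth: Sequence[int],
                    scale: Sequence[int]) -> float:
    """Distribution-level gap: 0 for identical histograms, 1 for disjoint ones."""
    p = rating_distribution(pred, scale)
    q = rating_distribution(truth, scale)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical data: a 1-5 scale and six catalogue items (not taken from VOGUE).
SCALE = (1, 2, 3, 4, 5)
seeker_ratings = [1, 5, 1, 5, 3, 3]     # assumed ground-truth Seeker ratings
predicted_ratings = [2, 4, 2, 4, 3, 3]  # assumed model output that avoids extremes

mae = mean_absolute_error(predicted_ratings, seeker_ratings)
tv = total_variation(predicted_ratings, seeker_ratings, SCALE)
print(f"mean absolute error: {mae:.3f}")
print(f"total variation distance: {tv:.3f}")
```

In this toy case no prediction is more than one point from the Seeker's rating, yet the predicted histogram never uses the extreme values 1 and 5 that the Seeker does, so two thirds of the probability mass is misplaced. A per-item error score alone would not surface that pattern, which is the kind of gap a distribution-level comparison is meant to expose.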

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI · 2026-05 · unverdicted · novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

TRACE: Tourism Recommendation with Accountable Citation Evidence
cs.IR · 2026-05 · unverdicted · novelty 7.0

TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...

Evaluating Scene-based In-Situ Item Labeling for Immersive Conversational Recommendation
cs.IR · 2026-04 · unverdicted · novelty 7.0

Existing methods for selecting in-situ labels in immersive recommendation scenes often show redundant or incomplete information and fail to anticipate users' proactive information needs.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI · 2026-05 · unverdicted · novelty 6.0

Goal-Mem decomposes user goals into subgoals for targeted memory retrieval using Natural Language Logic, improving performance on multi-hop reasoning tasks in conversational agents.

FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing
cs.CV · 2026-04 · unverdicted · novelty 6.0

A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identit...

Evaluating Scene-based In-Situ Item Labeling for Immersive Conversational Recommendation
cs.IR · 2026-04 · conditional · novelty 6.0

Existing IR/LLM/VLM methods for in-situ item labels in immersive CRS fail on modality use, visual redundancy, and proactive needs under new explicit-vs-proactive evaluation metrics.