pith. machine review for the scientific record.

arxiv: 2604.11964 · v1 · submitted 2026-04-13 · 💻 cs.HC · cs.MM

Recognition: no theorem link

When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs


Pith reviewed 2026-05-10 15:29 UTC · model grok-4.3

classification 💻 cs.HC cs.MM
keywords sketch · spontaneous speech · multimodal LLMs · intent alignment · design ideation · TalkSketchD · image generation · early-stage design

The pith

Spontaneous speech paired with sketches significantly improves how well multimodal LLMs generate design images that match the designer's intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset of designers sketching everyday objects like toasters while speaking out loud in real time. It then tests whether feeding both the sketch and the spoken words into multimodal LLMs produces images that better reflect what the designer actually meant. A separate reasoning model judges the outputs against the designer's own stated goals, and the addition of speech yields measurable gains in form, function, experience, and overall alignment. Early design work often leaves important details unsaid in drawings alone, so capturing the concurrent speech offers a practical way to make AI tools interpret human creative goals more completely.
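Concretely, the comparison reduces to two generation conditions and one judging pass per sample. Below is a minimal sketch of that loop; `generate_image` and `judge_alignment` are placeholder callables standing in for the actual MLLM calls, which the abstract does not specify.

```python
# Hypothetical sketch of the two-condition comparison. The callables are
# placeholders, not the authors' implementation.

from typing import Callable

DIMENSIONS = ("form", "function", "experience", "overall")

def run_comparison(
    samples: list[dict],                   # each: {"sketch", "transcript", "intent"}
    generate_image: Callable[..., bytes],  # MLLM image generator (placeholder)
    judge_alignment: Callable[..., dict],  # reasoning-MLLM judge -> {dim: score}
) -> dict[str, dict[str, float]]:
    """Mean judged intent alignment for sketch-only vs. sketch+speech."""
    totals = {"sketch_only": dict.fromkeys(DIMENSIONS, 0.0),
              "sketch_speech": dict.fromkeys(DIMENSIONS, 0.0)}
    for s in samples:
        img_a = generate_image(sketch=s["sketch"])                          # sketch-only
        img_b = generate_image(sketch=s["sketch"], speech=s["transcript"])  # speech-augmented
        for cond, img in (("sketch_only", img_a), ("sketch_speech", img_b)):
            scores = judge_alignment(image=img, intent=s["intent"])
            for dim in DIMENSIONS:
                totals[cond][dim] += scores[dim]
    n = len(samples)
    return {cond: {d: v / n for d, v in dims.items()}
            for cond, dims in totals.items()}
```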

Core claim

The study shows that when multimodal large language models receive temporally aligned spontaneous speech transcripts together with rough sketches, the images they generate are judged to align more closely with the original designer's self-reported intent across form, function, experience, and overall intent than when the models receive sketches alone.

What carries the argument

The TalkSketchD dataset of temporally aligned spontaneous speech and freehand sketches collected during early-stage toaster ideation, used to augment sketch inputs for MLLM image generation.
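The abstract does not specify the dataset's schema; as a hedged guess at what "temporally aligned" might look like in practice, here is a hypothetical record type with a helper that recovers the speech uttered during a given stroke window. Field names are illustrative and TalkSketchD's actual format may differ.

```python
# Hypothetical record structure for a temporally aligned sketch-speech sample.

from dataclasses import dataclass

@dataclass
class Stroke:
    points: list[tuple[float, float]]  # (x, y) coordinates along the pen path
    t_start: float                     # seconds from session start
    t_end: float

@dataclass
class SpeechSegment:
    text: str
    t_start: float
    t_end: float

@dataclass
class IdeationSample:
    strokes: list[Stroke]
    speech: list[SpeechSegment]
    stated_intent: str  # designer's post-task, self-reported intent

    def speech_during(self, t0: float, t1: float) -> list[SpeechSegment]:
        """Speech segments overlapping the window [t0, t1]."""
        return [s for s in self.speech if s.t_start < t1 and s.t_end > t0]
```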

If this is right

  • MLLMs interpret user intent more accurately in early design when they receive concurrent speech data.
  • Design ideation tools can produce outputs closer to the creator's goals by accepting spoken explanations alongside drawings.
  • Training multimodal models on aligned sketch-and-speech pairs improves their handling of implicit design requirements.
  • The benefit extends across multiple dimensions of intent including form, function, and user experience.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar gains could appear when applying the same speech-plus-sketch approach to other creative domains such as architecture or fashion.
  • Future MLLM training pipelines might prioritize collection of natural, unscripted verbal data during visual tasks to reduce intent mismatches.
  • The results highlight an opportunity to build interactive design systems that listen in real time rather than requiring users to translate thoughts into text or refined drawings first.

Load-bearing premise

A separate reasoning MLLM can serve as an accurate and unbiased judge of whether generated images match the designer's self-reported intent.
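For concreteness, an MLLM-as-judge call along these lines might look like the sketch below. The rubric, the 1-to-5 scale, and the prompt wording are all assumptions, and `ask_model` abstracts over whatever reasoning MLLM is actually used.

```python
# Hypothetical MLLM-as-judge scoring. The rubric, scale, and prompt text are
# assumptions; `ask_model` stands in for a real multimodal API call.

import json
from typing import Callable

JUDGE_PROMPT = """You are rating how well a generated design image matches a
designer's stated intent. Intent: {intent}
Rate each dimension from 1 (no match) to 5 (exact match) and answer with JSON:
{{"form": _, "function": _, "experience": _, "overall": _}}"""

def judge_alignment(image: bytes, intent: str,
                    ask_model: Callable[[str, bytes], str]) -> dict[str, int]:
    """Return per-dimension intent-alignment scores from the judge model."""
    raw = ask_model(JUDGE_PROMPT.format(intent=intent), image)
    scores = json.loads(raw)  # assumes the judge replies with pure JSON
    assert set(scores) == {"form", "function", "experience", "overall"}
    return {k: int(v) for k, v in scores.items()}
```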

What would settle it

A human-rater replication: if human designers rating the same generated images for intent match show no statistically significant improvement when speech transcripts are added over sketch-only inputs, the MLLM-judged gains would not stand.
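If that replication were run, the settling analysis reduces to a paired comparison on the same images. A minimal sketch, assuming per-image mean ratings on an ordinal scale and a Wilcoxon signed-rank test (the choice of test is an assumption, not taken from the paper):

```python
# Hypothetical settling analysis: paired human ratings of identical images
# under both conditions. The Wilcoxon signed-rank test suits the non-normal,
# ordinal ratings typical of 1-5 scales.

from scipy.stats import wilcoxon

def speech_gain_significant(ratings_sketch_only: list[float],
                            ratings_sketch_speech: list[float],
                            alpha: float = 0.05) -> bool:
    """One-sided test: does adding speech raise human-judged intent match?"""
    stat, p = wilcoxon(ratings_sketch_speech, ratings_sketch_only,
                       alternative="greater")
    return p < alpha  # False here would be the disconfirming outcome
```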

Figures

Figures reproduced from arXiv: 2604.11964 by Dorien Herremans, Kenny Tsu Wei Choo, Weiyan Shi.

Figure 1. Two example instances from the dataset, illustrating how early-stage sketches, real-time speech, and post-task intent […]

Figure 2. Summary of MLLM-as-judge evaluations comparing Sketch-only and Sketch–Speech conditions across the […]

Figure 3. Qualitative comparison of sketch-based image generation under sketch-only and sketch-plus-speech conditions for […]
Original abstract

Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs' ability to interpret user intent in early-stage design ideation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TalkSketchD, a new dataset capturing temporally aligned spontaneous speech and freehand sketches during early-stage toaster design ideation. It reports a comparative study in which multimodal LLMs generate design images from sketch-only inputs versus sketches augmented by concurrent speech transcripts; outputs are scored for alignment to designers' self-reported intent on dimensions of form, function, experience, and overall intent using a separate reasoning MLLM as an automated judge. The central claim is that speech augmentation yields statistically significant gains in judged intent alignment.

Significance. If the empirical results prove robust, the work usefully demonstrates that spontaneous speech can supply intent information that is difficult to convey in sketches alone, with potential applications for multimodal interfaces in creative design. The release of a temporally aligned sketch-speech dataset constitutes a concrete resource for the community. The approach aligns with growing interest in richer multimodal inputs for LLMs, though its immediate impact is tempered by the current evaluation design.

major comments (2)
  1. The abstract and evaluation section assert that quantitative results favor the speech-augmented condition across all four intent dimensions, yet supply no sample size, number of designers or sketches, statistical tests, prompt templates for the judge MLLM, or exclusion criteria. Without these details the central empirical claim cannot be assessed for reliability or replicability.
  2. The evaluation protocol relies on a separate reasoning MLLM as the sole judge of intent alignment without any reported validation against human raters, the original designers, or inter-rater agreement metrics. This assumption is load-bearing: if the judge model shares training data or inductive biases with the generation model, the reported improvements may reflect model-internal consistency rather than true alignment with human intent.
minor comments (1)
  1. The abstract would be strengthened by a concise statement of dataset scale (e.g., number of participants and total aligned sketch-speech pairs) to allow readers to gauge the scope of the quantitative findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important issues of transparency and evaluation validity that we have addressed through targeted revisions to the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [—] The abstract and evaluation section assert that quantitative results favor the speech-augmented condition across all four intent dimensions, yet supply no sample size, number of designers or sketches, statistical tests, prompt templates for the judge MLLM, or exclusion criteria. Without these details the central empirical claim cannot be assessed for reliability or replicability.

    Authors: We agree that these methodological details are necessary for evaluating the reliability and replicability of the results. The full manuscript contains the underlying data and analysis procedures, but they were insufficiently summarized in the abstract and evaluation section. In the revised version we have updated the abstract to report the sample size (number of designers and sketches), added a dedicated methods subsection describing the statistical tests (including p-values and effect sizes), provided the complete prompt templates used for the judge MLLM, and clarified the exclusion criteria applied to the data. These additions directly address the concern. revision: yes

  2. Referee: [—] The evaluation protocol relies on a separate reasoning MLLM as the sole judge of intent alignment without any reported validation against human raters, the original designers, or inter-rater agreement metrics. This assumption is load-bearing: if the judge model shares training data or inductive biases with the generation model, the reported improvements may reflect model-internal consistency rather than true alignment with human intent.

    Authors: We acknowledge that sole reliance on an automated judge without external validation is a substantive limitation. In the revised manuscript we have added a validation subsection that reports agreement between the MLLM judge and human ratings on a held-out subset of images. This includes ratings collected from independent design experts and, where feasible, the original designers, along with inter-rater agreement statistics. We also discuss the risk of shared inductive biases between models and provide evidence that the observed gains are not explained by model-internal consistency alone. These changes strengthen the claim that speech augmentation improves alignment with human intent. revision: yes
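The promised validation reduces to agreement statistics between the MLLM judge and human raters on the same held-out images. One standard choice for ordinal ratings, assumed here since the paper's metric is unspecified, is Cohen's quadratic-weighted kappa:

```python
# Hypothetical judge-vs-human validation via Cohen's weighted kappa.
# Quadratic weights suit ordinal 1-5 alignment scores; the paper's actual
# agreement metric is not stated in the abstract.

from sklearn.metrics import cohen_kappa_score

def judge_human_agreement(judge_scores: list[int],
                          human_scores: list[int]) -> float:
    """Weighted kappa between MLLM-judge and human ratings on held-out images."""
    return cohen_kappa_score(judge_scores, human_scores, weights="quadratic")
```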

Circularity Check

0 steps flagged

No circularity: empirical results rest on new dataset and external judge

Full rationale

The paper's central claim derives from collecting a new TalkSketchD dataset of temporally aligned sketches and spontaneous speech, then running a controlled generation experiment (sketch-only vs. sketch+speech) with MLLMs and scoring outputs via a separate reasoning MLLM against designers' self-reported intent. No equations, fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps; the quantitative improvement is measured directly on fresh data rather than reducing to prior inputs by construction. The setup is self-contained, with no dependence on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of an MLLM-based judge and on the representativeness of the toaster-ideation task; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption An MLLM judge can reliably and without bias evaluate alignment between generated images and designers' self-reported intent
    The evaluation protocol uses a reasoning MLLM as the sole judge of intent alignment.

pith-pipeline@v0.9.0 · 5467 in / 1313 out tokens · 54512 ms · 2026-05-10T15:29:07.800291+00:00 · methodology

