pith. machine review for the scientific record.

arxiv: 2605.04165 · v1 · submitted 2026-05-05 · 💻 cs.MA · cs.HC

Recognition: unknown

FlowEval: Reference-based Evaluation of Generated User Interfaces


Pith reviewed 2026-05-08 17:38 UTC · model grok-4.3

classification 💻 cs.MA cs.HC
keywords UI generation evaluation · reference-based metrics · navigation traces · dynamic time warping · generated interfaces · usability assessment · interaction flows · LLM evaluation

The pith

FlowEval shows that comparing navigation traces on generated UIs with traces on their real reference sites produces scores that strongly match expert human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlowEval to solve the problem of reliably judging AI-generated user interfaces for interaction quality. It works by recording sequences of user actions on both a generated interface and its real-world reference site, then scoring how closely those sequences align. Similarity is measured with reference-based techniques such as dynamic time warping. A small study with expert evaluators found these automated scores track closely with human assessments of whether the interface supports realistic flows. This matters because it offers a middle path between slow expert testing and opaque automated checks.

Core claim

FlowEval is a reference-based evaluation framework that assesses whether generated user interfaces support realistic interaction flows. It does so by collecting navigation traces from generated UIs and their real website counterparts, then applying similarity metrics such as dynamic time warping to quantify alignment. In a small-scale study with expert UI evaluators, these reference-based metrics showed strong correlation with human judgments, indicating they can serve as a scalable proxy for trustworthy evaluation of UI generation systems.

What carries the argument

Reference-based navigation trace comparison via similarity metrics such as dynamic time warping. It records sequences of user interactions on a generated UI and on the corresponding real site, then measures how closely the sequences match to indicate support for realistic user flows.
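To make that concrete, here is a minimal sketch of reference-based trace comparison, assuming a trace is a sequence of (action, target) pairs. The paper's actual trace representation and step-level distance are not specified in the abstract, so the cost function below is illustrative.

```python
# Minimal reference-based trace comparison via dynamic time warping (DTW).
# Assumes each trace is a list of (action, target) pairs; both the
# representation and the step cost below are illustrative, not the paper's.

def step_cost(a, b):
    """Distance between two interaction steps: 0 if identical, 0.5 if the
    action type matches but the target differs, 1 otherwise."""
    if a == b:
        return 0.0
    if a[0] == b[0]:
        return 0.5
    return 1.0

def dtw_distance(ref_trace, gen_trace):
    """Classic DTW alignment cost between a reference-site trace and a
    generated-UI trace; lower means the flows align more closely."""
    n, m = len(ref_trace), len(gen_trace)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = step_cost(ref_trace[i - 1], gen_trace[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # extra step in the reference
                              D[i][j - 1],      # extra step in the generated UI
                              D[i - 1][j - 1])  # the two steps correspond
    return D[n][m]

# Example: a checkout flow on the real site vs. a generated analog.
real = [("click", "product"), ("click", "add_to_cart"), ("click", "checkout")]
gen  = [("click", "product"), ("scroll", "page"),
        ("click", "add_to_cart"), ("click", "checkout")]
print(dtw_distance(real, gen))  # -> 1.0: the extra scroll step costs one unmatched alignment
```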

If this is right

  • Developers can test many more generated interfaces for interaction support without proportional increases in expert time.
  • High trace similarity scores predict positive expert judgments on whether a UI enables realistic user paths.
  • UI generation systems can use the metrics as an objective signal during model training or candidate selection (a selection sketch follows this list).
  • Evaluation becomes more transparent and repeatable than reliance on purely visual or code-based automated judges.
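
On the candidate-selection point above, a hypothetical best-of-k loop, reusing dtw_distance from the earlier sketch; collect_trace is a stand-in for whatever agent or harness drives a candidate UI and records its steps, not anything the paper specifies.

```python
# Hypothetical best-of-k candidate selection with trace similarity as the
# objective. collect_trace(ui) must return an (action, target) trace.

def best_candidate(reference_trace, candidate_uis, collect_trace):
    """Return (score, ui) for the candidate whose trace aligns best with
    the reference-site trace under DTW (lower score is better)."""
    scored = [(dtw_distance(reference_trace, collect_trace(ui)), ui)
              for ui in candidate_uis]
    return min(scored, key=lambda pair: pair[0])
```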

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trace-comparison idea could be tested on mobile or desktop applications beyond web pages.
  • Divergent segments of the traces might point to recurring design mistakes made by current generators.
  • Real usage logs from popular sites could supply reference traces at scale for training better evaluators.

Load-bearing premise

Navigation traces collected from generated UIs are comparable in structure and meaning to traces from real websites, and the chosen similarity metrics capture the interaction qualities that expert evaluators actually care about.

What would settle it

A larger study in which expert ratings and FlowEval similarity scores diverge on several generated UIs would show the correlation does not hold.
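
Were such a study run, the headline statistics the referee report below asks for are straightforward to compute. A minimal sketch, assuming one FlowEval score and one aggregated expert rating per generated UI; all numbers are illustrative placeholders, not data from the paper.

```python
# Hypothetical analysis for a larger validation study: Spearman correlation
# between FlowEval scores and expert ratings, with a bootstrap 95% CI.
import numpy as np
from scipy.stats import spearmanr

floweval_scores = np.array([0.81, 0.42, 0.67, 0.90, 0.55, 0.73, 0.38, 0.62])
expert_ratings  = np.array([4.5, 2.0, 3.5, 5.0, 3.0, 4.0, 2.5, 3.0])

rho, p = spearmanr(floweval_scores, expert_ratings)

rng = np.random.default_rng(0)
n = len(floweval_scores)
boot = []
for _ in range(10_000):
    idx = rng.integers(0, n, n)  # resample UIs with replacement
    boot.append(spearmanr(floweval_scores[idx], expert_ratings[idx])[0])
lo, hi = np.nanpercentile(boot, [2.5, 97.5])

print(f"rho = {rho:.2f}, p = {p:.3g}, 95% CI [{lo:.2f}, {hi:.2f}]")
```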

Figures

Figures reproduced from arXiv: 2605.04165 by Eldon Schoop, Jason Wu, Jeffrey Nichols, Priyan Vaithilingam, Titus Barik.

Figure 1: An overview of our evaluation approach.
Figure 2: A screenshot of our arena interface used by human raters in our experiment.
Original abstract

While large language models (LLMs) and coding agents are often applied to user interface (UI) development, developers find it difficult to reliably assess their proficiency in visual and interaction design. Existing evaluations either rely on human experts, who can accurately assess usability by testing critical flows but are slow and costly, or on automated judges, which are scalable but less accurate and opaque. We present FlowEval, a reference-based framework that measures whether a generated UI supports realistic interaction flows by comparing navigation traces from real websites to traces from generated analogs using reference-based similarity metrics (e.g., dynamic time warping). In a small-scale study with expert UI evaluators, we show that reference-based metrics strongly correlate with human judgments, suggesting that they can provide scalable yet trustworthy evaluation for UI generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FlowEval, a reference-based evaluation framework for generated user interfaces. It collects navigation traces from real websites and generated UI analogs, then applies similarity metrics such as dynamic time warping (DTW) to quantify how well the generated UI supports realistic interaction flows. A small-scale study with expert UI evaluators is reported to show that these reference-based metrics strongly correlate with human judgments of usability, positioning FlowEval as a scalable yet trustworthy alternative to purely human or opaque automated evaluation.

Significance. If the reported correlation is statistically robust and the trace-comparability assumption holds, FlowEval could meaningfully reduce reliance on costly expert evaluation for UI generation systems while providing interpretable, reference-grounded scores. The approach avoids circularity by using independent real-website traces and standard metrics, and the absence of free parameters or fitted entities is a strength. However, the current evidence base is too thin to establish trustworthiness at scale.

major comments (3)
  1. [Evaluation, §4] The claim that reference-based metrics 'strongly correlate' with human judgments is not supported by any reported sample size (number of generated UIs or expert evaluators), correlation coefficient, p-value, or confidence interval. Without these quantities it is impossible to assess whether the observed relationship exceeds chance or selection bias.
  2. [§3, framework description] The assumption that navigation traces collected from generated UIs are structurally comparable to real-website traces (same task semantics and interaction primitives) is stated but not validated; no protocol for trace collection, feature representation for DTW, or control for differing UI affordances is described.
  3. [Abstract and Evaluation] The study is described only as 'small-scale' with no details on controls for confounds (e.g., evaluator expertise, task selection, or UI generation method), making it difficult to determine whether the correlation generalizes beyond the chosen examples.
minor comments (2)
  1. [§3] Notation for the DTW distance and trace representation should be formalized with an equation or pseudocode to improve reproducibility (a sketch of the standard recurrence follows this list).
  2. [§3] The paper should clarify whether the real-website traces are collected under the same task instructions given to the generated UIs.
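
On minor comment 1: the textbook DTW recurrence such a revision would likely formalize, over traces x = (x_1, ..., x_n) and y = (y_1, ..., y_m) with a step-level distance d. This is a sketch of the standard definition; the paper's exact variant, including any length normalization, is not given in the abstract.

```latex
% Standard dynamic time warping recurrence over two traces.
\begin{aligned}
D(0,0) &= 0, \qquad D(i,0) = D(0,j) = \infty \quad (i, j \ge 1), \\
D(i,j) &= d(x_i, y_j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\}, \\
\mathrm{DTW}(x,y) &= D(n,m).
\end{aligned}
```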

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.

point-by-point responses
  1. Referee: [Evaluation, §4] The claim that reference-based metrics 'strongly correlate' with human judgments is not supported by any reported sample size (number of generated UIs or expert evaluators), correlation coefficient, p-value, or confidence interval. Without these quantities it is impossible to assess whether the observed relationship exceeds chance or selection bias.

    Authors: We agree that the current manuscript does not provide sufficient quantitative details to support the correlation claim. In the revised version, we will expand the Evaluation section to report the exact sample sizes (number of generated UIs and expert evaluators), the correlation coefficient, p-value, confidence interval, and the statistical test employed. We will also moderate the language to reflect the preliminary nature of the small-scale study. revision: yes

  2. Referee: [§3, framework description] The assumption that navigation traces collected from generated UIs are structurally comparable to real-website traces (same task semantics and interaction primitives) is stated but not validated; no protocol for trace collection, feature representation for DTW, or control for differing UI affordances is described.

    Authors: We will revise §3 to include a detailed protocol for trace collection from both real websites and generated UIs, ensuring alignment of task semantics. We will specify the feature representation used for DTW (e.g., sequences of states and actions) and discuss any controls or assumptions regarding differences in UI affordances to better validate the comparability of traces (one illustrative schema follows these responses). revision: yes

  3. Referee: [Abstract and Evaluation] The study is described only as 'small-scale' with no details on controls for confounds (e.g., evaluator expertise, task selection, or UI generation method), making it difficult to determine whether the correlation generalizes beyond the chosen examples.

    Authors: We acknowledge that additional details on study design are needed. In the revised abstract and Evaluation section, we will describe the expertise of the evaluators, criteria for task selection, UI generation methods, and steps taken to address potential confounds. We will also explicitly note the limitations on generalizability due to the small-scale study. revision: yes
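
On response 2, one illustrative possibility for the promised state-and-action representation; field names here are hypothetical, not drawn from the paper.

```python
# Illustrative state-action trace record for the revised §3 protocol,
# assuming a trace alternates UI states with the actions taken in them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    state: str   # e.g., a normalized page or screen identifier
    action: str  # e.g., "click", "type", "scroll"
    target: str  # e.g., an accessibility label or CSS selector

trace = [
    Step("home", "click", "search_box"),
    Step("home", "type", "search_box"),
    Step("results", "click", "first_result"),
]
```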

Circularity Check

0 steps flagged

No circularity: FlowEval uses external real-website traces and standard metrics with an independent human study

full rationale

The paper defines FlowEval as a reference-based comparison of navigation traces from real websites against generated UIs, employing off-the-shelf similarity measures such as dynamic time warping. The central empirical claim is a correlation observed in a separate small-scale expert study; this correlation is presented as an external validation rather than a quantity derived from or fitted to the same traces used in the metric definition. No equations reduce a prediction to its own inputs by construction, no self-citation chain bears the load of the core argument, and no ansatz or uniqueness result is smuggled in. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes navigation traces are sufficient proxies for UI quality.

axioms (1)
  • domain assumption: Navigation traces from real websites serve as valid references for evaluating generated UIs.
    Central to the reference-based comparison method described.

pith-pipeline@v0.9.0 · 5435 in / 1173 out tokens · 37670 ms · 2026-05-08T17:38:07.208268+00:00 · methodology

