pith. machine review for the scientific record.

arxiv: 2603.09652 · v3 · submitted 2026-03-10 · 💻 cs.AI

Recognition: no theorem link

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords MiniAppBench · interactive HTML generation · LLM evaluation · agentic evaluation · browser automation · MiniApps · human alignment
0 comments

The pith

Current LLMs struggle to generate high-quality interactive MiniApps, while a new browser-based evaluator aligns closely with human judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes MiniAppBench as the first benchmark specifically for LLMs that must produce dynamic HTML applications with customized interaction logic rather than static text or simple code. It draws 500 tasks from a real-world service that has already produced over ten million generations, covering six practical domains. The authors also introduce MiniAppEval, which runs automated exploratory tests inside a browser to score generated apps on whether they meet user intention, look correct, and behave correctly when used. Experiments show current models still fall short on these interactive requirements, yet the new evaluator produces scores that track human assessments closely enough to serve as a repeatable standard.

Core claim

MiniAppBench distills 500 principle-driven tasks from a real-world application with more than ten million generations, spanning six domains. MiniAppEval is an agentic framework that uses browser automation to perform human-like exploratory testing, scoring each generated MiniApp along three dimensions: Intention, Static, and Dynamic. Together, the experiments demonstrate both that present LLMs face significant challenges in producing high-quality MiniApps and that the evaluation framework itself aligns well with human judgment.

What carries the argument

The MiniAppBench benchmark of 500 tasks, together with MiniAppEval, an agentic browser-automation framework that conducts exploratory testing across the Intention, Static, and Dynamic dimensions without requiring a single ground-truth answer.
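To make the evaluation loop concrete, here is a minimal sketch of what browser-based exploratory scoring could look like, assuming Playwright for automation. The helper `llm_judge` and the interaction heuristics are hypothetical placeholders, not the paper's MiniAppEval implementation.

```python
# Minimal sketch of browser-based exploratory scoring (assumes Playwright;
# llm_judge is a hypothetical placeholder for an LLM scoring call, and the
# heuristics below are illustrative, not the paper's MiniAppEval logic).
from pathlib import Path
from playwright.sync_api import sync_playwright


def llm_judge(prompt: str) -> float:
    """Hypothetical: return a 0-1 score from a model given a text prompt."""
    raise NotImplementedError


def evaluate_miniapp(index_html: Path, user_query: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(index_html.resolve().as_uri())

        # Static: judge layout and UI element coverage from the rendered DOM.
        static_score = llm_judge(
            f"Query: {user_query}\nDOM: {page.content()[:4000]}\n"
            "Score layout quality and UI element coverage from 0 to 1."
        )

        # Dynamic: exercise a few visible controls and check whether the page
        # state changes (a crude stand-in for exploratory interaction testing).
        before = page.evaluate("document.body.innerHTML")
        for button in page.locator("button:visible").all()[:5]:
            button.click()
        after = page.evaluate("document.body.innerHTML")
        dynamic_score = 0.0 if after == before else 1.0

        # Intention: judge whether the app addresses the user's request at all.
        intention_score = llm_judge(
            f"Query: {user_query}\nTitle: {page.title()}\n"
            "Score how well the app addresses the query from 0 to 1."
        )
        browser.close()

    return {"intention": intention_score,
            "static": static_score,
            "dynamic": dynamic_score}
```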

If this is right

  • LLMs will need targeted improvements in generating customized interaction logic that follows real-world principles.
  • Future benchmarks for code generation can move beyond static correctness to test dynamic, user-driven behavior.
  • Agentic evaluation that simulates real browser use can serve as a repeatable substitute for human judgment when no single correct answer exists.
  • Research on LLM-powered assistants can now measure progress toward fully interactive HTML responses using a shared, publicly sourced task set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same browser-testing approach could be applied to other open-ended generation tasks such as web-app or game prototyping.
  • If the benchmark is adopted, training data for LLMs may shift toward examples that emphasize dynamic state and user interaction rather than static markup.
  • Real-world usage logs provide a practical way to keep benchmarks current as user needs evolve.
  • Low scores on the Dynamic dimension point to a specific training gap in handling user-driven state changes that static code checks miss.

Load-bearing premise

The 500 tasks taken from one real-world service are representative of the full range of interactive capabilities users need, and browser-based exploratory testing can assess open-ended interactions in a systematic and unbiased way.

What would settle it

Running MiniAppEval on a fresh set of generated MiniApps and comparing its scores against independent human ratings of the same apps: a high correlation would support the reliability claim, while a low correlation would show the evaluator is not reliable.
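As a concrete form of that test, one could compute a rank correlation between MiniAppEval scores and independent human ratings on the same fresh apps. A minimal sketch follows, assuming SciPy; the score arrays are illustrative placeholders, not reported data.

```python
# Sketch of the falsification test above: correlate evaluator scores with
# independent human ratings on the same MiniApps (illustrative numbers only).
from scipy.stats import spearmanr

evaluator_scores = [0.92, 0.40, 0.75, 0.10, 0.66]  # hypothetical MiniAppEval outputs
human_scores = [0.90, 0.35, 0.80, 0.20, 0.60]      # hypothetical human ratings

rho, p_value = spearmanr(evaluator_scores, human_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A low or near-zero rho on a fresh sample would undercut the reliability claim.
```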

Figures

Figures reproduced from arXiv: 2603.09652 by Chengyue Yu, Chenyi Zhuang, Linjian Mo, Shuai Li, Yuante Li, Zuhao Zhang.

Figure 1
Figure 1. The shift from text to MiniApps. Unlike static text, MiniApps transform abstract explanations into intuitive visualizations and unlock actionable tasks (e.g., diet tracking) that were previously impossible.
Figure 2
Figure 2. Failure cases in principle adherence. MiniApps require models to capture and instantiate relevant real-world principles, while MiniAppEval proves effective due to its multi-component system design (eval-ref, code, Playwright).
Figure 3
Figure 3. Overview of the MiniAppBench dataset and construction process. (a)–(d) illustrate the dataset construction pipeline. (e) summarizes the dataset features and distributions (domain and difficulty), with the distribution of subclasses shown in the side bar charts. (f) presents representative MiniApp examples from six domains.
Figure 4
Figure 4. MiniAppEval vs. previous methods. Unlike brittle scripts or rigid comparisons, MiniAppEval integrates code inspection with dynamic execution. It complements human evaluation by verifying underlying physical principles and automating tedious testing scenarios to ensure robust assessment.
Figure 5
Figure 5. Overall model pass rate on the MiniAppBench dimensions (Intention, Static, Dynamic), where a MiniApp counts as passed only when every dimension score exceeds 0.8, i.e., min(S_i, S_s, S_d) > 0.8. GPT-5.2 achieved the highest performance with an average pass rate of 45.46%, while the overall mean across all models was 17.05%. These results underscore the challenges current models face in generating successful MiniApps. (A sketch of this aggregation follows the figure list.)
Figure 6
Figure 6. Token length and inference time vs. average pass rate.
Figure 7
Figure 7. Multi-dimensional trajectory analysis across nine subplots: (a) tokens vs. step (scatter); (b) token distribution by step (boxplot); (c) average tokens vs. step with dispersion (mean/median/std); (d) cumulative time vs. step; (e) time interval vs. step (log scale); (f) tokens vs. time interval; (g) prompt tokens vs. completion tokens; (h) histogram of step values; (i) token statistics by step.
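The pass criterion quoted in the Figure 5 caption (every dimension score above 0.8) implies a simple aggregation. Below is a minimal sketch of that computation, assuming each app receives one score per dimension; the dictionary keys and example numbers are illustrative, not the paper's data format.

```python
# Sketch of the pass-rate aggregation implied by the Figure 5 caption:
# an app passes when min(S_i, S_s, S_d) > 0.8 across Intention, Static, Dynamic.
def passes(scores: dict, threshold: float = 0.8) -> bool:
    return min(scores["intention"], scores["static"], scores["dynamic"]) > threshold

def pass_rate(all_scores: list) -> float:
    return sum(passes(s) for s in all_scores) / len(all_scores)

# Example with three hypothetical evaluated MiniApps, one of which passes:
demo = [
    {"intention": 0.90, "static": 0.85, "dynamic": 0.90},  # pass
    {"intention": 0.95, "static": 0.70, "dynamic": 0.90},  # fails Static
    {"intention": 0.60, "static": 0.90, "dynamic": 0.50},  # fails two dimensions
]
print(f"{pass_rate(demo):.2%}")  # 33.33%
```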
Original abstract

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our homepage is available at miniappbench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniAppBench, a benchmark of 500 tasks distilled from a single real-world application with 10M+ generations across six domains (Games, Science, Tools, etc.), to evaluate LLMs on generating principle-driven interactive HTML MiniApps. It also proposes MiniAppEval, an agentic browser-automation framework for exploratory testing that assesses generated apps along Intention, Static, and Dynamic dimensions without requiring a single ground truth. Experiments are reported to show that current LLMs face significant challenges in producing high-quality MiniApps while MiniAppEval achieves high alignment with human judgment.

Significance. If the representativeness and evaluation reliability claims hold, the work would establish a practically grounded benchmark for the emerging paradigm of interactive HTML generation, moving beyond static layout or algorithmic correctness benchmarks. The agentic evaluation approach for open-ended interactions is a methodological contribution that could influence future assessment of dynamic LLM outputs.

major comments (2)
  1. [Benchmark Construction] The 500 tasks are sourced exclusively from one real-world application. This leaves open whether the benchmark adequately covers the full range of interaction principles (e.g., complex state machines, multi-user coordination, or accessibility-driven flows) needed to support the central claim that LLMs struggle with high-quality MiniApps in general.
  2. [MiniAppEval Framework] The Dynamic dimension depends on agentic browser exploration without ground truth. Any bias in the agent's exploration policy (path coverage, click ordering, timeout handling) directly impacts the reported human alignment; the manuscript must provide quantitative validation (e.g., inter-rater agreement statistics or ablation on exploration parameters) to substantiate the reliability claim.
minor comments (2)
  1. [Abstract] The abstract states that experiments reveal LLM challenges and high alignment but supplies no quantitative metrics (e.g., scores per dimension or correlation coefficients); adding a one-sentence summary of key results would improve the abstract.
  2. [Evaluation Methodology] Clarify the exact scoring rubrics and aggregation method for the three evaluation dimensions (Intention, Static, Dynamic) and how browser automation implements human-like exploratory testing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, clarifying our design choices while committing to revisions that strengthen the manuscript's claims on representativeness and evaluation reliability.

read point-by-point responses
  1. Referee: [Benchmark Construction] The 500 tasks are sourced exclusively from one real-world application. This leaves open whether the benchmark adequately covers the full range of interaction principles (e.g., complex state machines, multi-user coordination, or accessibility-driven flows) needed to support the central claim that LLMs struggle with high-quality MiniApps in general.

    Authors: We acknowledge that MiniAppBench originates from a single production application with over 10M generations. This choice was deliberate to ground the benchmark in authentic, high-volume user interactions rather than synthetic tasks. The 500 tasks were distilled to span six diverse domains and a wide range of interaction principles observable in that corpus. While we agree this does not exhaustively cover every possible principle (such as multi-user coordination or advanced accessibility flows), the benchmark still reveals substantial gaps in current LLMs' ability to generate principle-driven interactive MiniApps. We will add an expanded limitations subsection that explicitly discusses scope, generalizability, and planned extensions to additional interaction types. revision: partial

  2. Referee: [MiniAppEval Framework] The Dynamic dimension depends on agentic browser exploration without ground truth. Any bias in the agent's exploration policy (path coverage, click ordering, timeout handling) directly impacts the reported human alignment; the manuscript must provide quantitative validation (e.g., inter-rater agreement statistics or ablation on exploration parameters) to substantiate the reliability claim.

    Authors: We appreciate this point on evaluation robustness. The current manuscript reports strong overall alignment between MiniAppEval and human judgments, but we agree that additional quantitative evidence is required. In the revised version we will include (1) inter-rater agreement statistics (Cohen's kappa) between the agentic evaluator and multiple human raters on a held-out subset, and (2) ablation results on key exploration parameters including path coverage depth, click ordering heuristics, and timeout thresholds. These additions will be placed in the MiniAppEval evaluation subsection to directly address potential policy biases. revision: yes
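The rebuttal above commits to reporting Cohen's kappa between MiniAppEval verdicts and human raters. As a minimal sketch of that agreement check, assuming scikit-learn and binary pass/fail labels (the verdict arrays below are illustrative placeholders, not results from the paper):

```python
# Sketch of the promised inter-rater agreement check: Cohen's kappa between
# MiniAppEval pass/fail verdicts and one human rater on the same held-out apps.
from sklearn.metrics import cohen_kappa_score

evaluator_verdicts = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical MiniAppEval verdicts
human_verdicts = [1, 0, 1, 0, 0, 0, 1, 0]      # hypothetical human verdicts

kappa = cohen_kappa_score(evaluator_verdicts, human_verdicts)
print(f"Cohen's kappa = {kappa:.2f}")  # values near 1.0 indicate strong agreement
```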

Circularity Check

0 steps flagged

No significant circularity: benchmark and evaluation derived from external data and new framework

full rationale

The paper constructs MiniAppBench by distilling 500 tasks from an external real-world application (10M+ generations) across six domains and introduces MiniAppEval as an independent agentic browser-based evaluation framework for Intention, Static, and Dynamic dimensions. No equations, fitted parameters, self-citations, or derivations reduce any claim to its own inputs by construction. The central results (LLM challenges in MiniApp generation and alignment of MiniAppEval with human judgment) rest on externally sourced tasks and a newly proposed evaluation method without self-referential loops or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the representativeness of tasks drawn from a single real-world source and the assumption that automated browser exploration can proxy human judgment on open-ended interactions; no free parameters are fitted in the described approach.

axioms (2)
  • domain assumption Tasks distilled from one application with 10M+ generations represent the required interactive capabilities across six domains
    Stated in the abstract as the source for the 500 tasks.
  • domain assumption Browser automation performing exploratory testing can systematically assess intention, static quality, and dynamic behavior in the absence of ground truth
    Core premise of the MiniAppEval framework described in the abstract.
invented entities (2)
  • MiniApps no independent evidence
    purpose: Dynamic, interactive HTML-based applications generated by LLMs that adhere to real-world principles
    New term introduced to describe the target output paradigm.
  • MiniAppEval no independent evidence
    purpose: Agentic evaluation framework that uses browser automation for human-like testing across three dimensions
    New framework proposed to solve the open-ended evaluation problem.

pith-pipeline@v0.9.0 · 5541 in / 1610 out tokens · 59721 ms · 2026-05-15T13:39:32.162386+00:00 · methodology

discussion (0)

