InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

Fei Yuan; Gong Cheng; Kai Chen; Lei Li; Qiaosheng Chen; Qipeng Guo; Yang Liu

arxiv: 2510.09724 · v2 · pith:FOG4ROWRnew · submitted 2025-10-10 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-21 20:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM code generation · interactive demonstrations · scientific visualization · benchmark evaluation · programmatic testing · visual grounding · front-end development · educational tools

The pith

InteractScience is the first benchmark to automatically evaluate LLMs on generating interactive scientific demonstration code by combining programmatic tests with visual snapshot checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can now generate full applications from text instructions, opening possibilities for interactive scientific demonstrations that explain concepts through working code. Current benchmarks test either pure knowledge or static web pages but miss the integrated skill of accurate science plus responsive front-end interactions. The authors created a hybrid framework that runs unit tests on interaction logic and compares rendered outputs against reference images. They built InteractScience with questions in five scientific domains, each supplied with tests, snapshots, and checklists, then ran it on thirty leading models. Results show consistent shortfalls in blending domain knowledge with correct interactive coding, establishing a measurable baseline for future work on educationally useful generators.

Core claim

The paper claims that a hybrid framework pairing rigorous programmatic functional testing of interaction logic with visually-grounded checks against reference snapshots can automatically assess the combined ability of LLMs to produce accurate scientific knowledge and correctly behaving interactive front-end code, as shown by evaluation results across thirty models on a new benchmark of domain-specific questions.

What carries the argument

The hybrid evaluation framework that runs unit tests for interaction logic and compares rendered outputs to reference snapshots.
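To make the two halves of the framework concrete, here is a minimal sketch of how outcomes from programmatic functional tests (PFT) and visually-grounded qualitative tests (VQT) might be summarized for one generated demo. The function name, the field names, and the assumption that visual scores are already scaled to [0, 1] are all hypothetical; the paper's actual scoring formulas are not reproduced here.

```javascript
// Minimal sketch. All names here are hypothetical, not taken from the paper.
// pftResults: one entry per programmatic functional test, e.g. { name, passed }.
// vqtScores: one visual score per snapshot, assumed already scaled to [0, 1].
function summarizeHybridResults(pftResults, vqtScores) {
  const passedCount = pftResults.filter((r) => r.passed === true).length;
  const pftPassRate =
    pftResults.length === 0 ? 0 : passedCount / pftResults.length;

  const vqtTotal = vqtScores.reduce((sum, score) => sum + score, 0);
  const vqtMean = vqtScores.length === 0 ? 0 : vqtTotal / vqtScores.length;

  // The two numbers are reported side by side rather than merged, because
  // how (or whether) to weight them is a design decision left open here.
  return { pftPassRate, vqtMean };
}

// Made-up outcomes for a single demo.
const summary = summarizeHybridResults(
  [
    { name: "slider is visible", passed: true },
    { name: "slider default value", passed: true },
    { name: "drag updates plot", passed: false },
    { name: "reset restores defaults", passed: true },
  ],
  [0.5, 1.0]
);
console.log(summary);
```

Keeping the functional and visual numbers separate mirrors the idea that a demo can pass its interaction tests while still rendering the wrong picture, or the reverse.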

If this is right

Current LLMs show measurable gaps when required to merge scientific accuracy with responsive user interfaces.
The benchmark supplies a repeatable way to track improvements in generating functional educational demonstrations.
Reliable code generation in this setting would directly support new teaching methods and research communication tools.
The public release of questions, tests, and data enables standardized comparison across future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hybrid testing could be adapted for interactive code in engineering or medical visualization tasks.
Models that improve on this benchmark might enable rapid creation of customized online science lessons.
Automated scoring of both logic and visuals could support iterative refinement loops during generation.
Widespread adoption might shift how scientific outreach materials are produced and validated.

Load-bearing premise

The manually designed questions, unit tests, reference snapshots, and checklists accurately reflect what counts as a correct and educationally useful interactive scientific demonstration.

What would settle it

The premise would be undermined if domain experts routinely judged high-scoring model outputs to be scientifically inaccurate or pedagogically ineffective, even though those outputs pass the benchmark's tests and visual checks.

Figures

Figures reproduced from arXiv: 2510.09724 by Fei Yuan, Gong Cheng, Kai Chen, Lei Li, Qiaosheng Chen, Qipeng Guo, Yang Liu.

**Figure 1.** Illustration of three tasks. (a) Knowledge Question Answering: given a query about the forces acting on a block placed on an inclined plane, an LLM can output a correct textual explanation. (b) Webpage Code Generation: given an instruction to write a blog webpage, an LLM can generate functional static HTML code. (c) Scientific Demonstration Code Generation: generating an interactive demo for the inclined plan…

**Figure 2.** Pipeline of data collection and evaluation suite synthesis. The data collection step retrieves [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Performance of LLMs across different difficulty levels. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** Performance of LLMs across different disciplines. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Performance of multimodal LLMs under varying numbers of reference snapshot inputs. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 6.** Example snapshots that illustrate the complementarity of CLIP and VLM-judge scores. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Reference and generated snapshots of different models for a Fields of Magnet Array [PITH_FULL_IMAGE:figures/full_fig_p016_7.png]

**Figure 8.** Reference and generated snapshots of different models for an Interwoven Spherical Trian [PITH_FULL_IMAGE:figures/full_fig_p017_8.png]

**Figure 9.** System prompt for implementation plan synthesis. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png]

**Figure 10.** System prompt for PFT test case synthesis. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png]

**Figure 11.** System prompt for VQT test case synthesis. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png]

**Figure 12.** System prompt for PFT unit test script synthesis. [PITH_FULL_IMAGE:figures/full_fig_p021_12.png]

**Figure 13.** System prompt for VQT unit test script synthesis. [PITH_FULL_IMAGE:figures/full_fig_p022_13.png]

**Figure 14.** System prompt for VQT checklist synthesis Part 1. [PITH_FULL_IMAGE:figures/full_fig_p023_14.png]

**Figure 15.** System prompt for VQT checklist synthesis Part 2. [PITH_FULL_IMAGE:figures/full_fig_p024_15.png]

**Figure 16.** System prompt for VQT checklist synthesis Part 3. [PITH_FULL_IMAGE:figures/full_fig_p025_16.png]

**Figure 17.** System prompt for VQT VLM-as-Judge. [PITH_FULL_IMAGE:figures/full_fig_p026_17.png]
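Figure 6 contrasts CLIP scores with VLM-judge scores. As a rough illustration of the former, an embedding-based image score of this kind is commonly computed as the cosine similarity between two embedding vectors. The sketch below assumes the embeddings have already been produced by some image encoder; the function name and the example vectors are made up, and the paper's exact CLIP scoring setup may differ.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// In this sketch the vectors stand in for image embeddings of a
// reference snapshot and a generated snapshot (hypothetical inputs).
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("embeddings must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up embeddings: identical directions score 1, orthogonal ones score 0.
console.log(cosineSimilarity([3, 4], [3, 4]));
console.log(cosineSimilarity([1, 0], [0, 1]));
```

A global similarity like this says little about whether a specific scientific detail is drawn correctly, which is one plausible reason to pair it with a checklist-driven judge.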

Original abstract

Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.
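The abstract says each question is paired with unit tests, reference snapshots, and checklists. A minimal sketch of what such a record and a completeness check could look like follows. Every key name and value here is an assumption made for illustration; the released dataset at the URL above may be organized differently.

```javascript
// Hypothetical record layout; the released dataset may use different keys.
const REQUIRED_ARTIFACTS = ["unitTests", "referenceSnapshots", "checklists"];

// Returns the names of required artifacts that are absent or empty.
function missingArtifacts(item) {
  return REQUIRED_ARTIFACTS.filter(
    (key) => !Array.isArray(item[key]) || item[key].length === 0
  );
}

// Made-up example item with an empty checklist list.
const exampleItem = {
  id: "demo-001",
  domain: "physics",
  question: "Build an interactive inclined-plane demo.",
  unitTests: ["demo-001.spec.js"],
  referenceSnapshots: ["demo-001-1.png"],
  checklists: [],
};

// The empty checklists array is reported as missing.
console.log(missingArtifacts(exampleItem));
```

A check of this kind is cheap to run over a whole benchmark before evaluation, so that an item lacking one of its three artifacts is caught early rather than silently scored.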

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

InteractScience adds a hybrid benchmark for LLMs on interactive scientific demos but its manually built tests lack reported validation.

The main thing to know is that this paper presents InteractScience as the first benchmark for automatically evaluating LLMs on generating interactive scientific demonstration code, using both programmatic tests and visual grounding. It does something useful by addressing a gap where prior benchmarks either test scientific knowledge in isolation or static web code without the interactive science angle. The hybrid framework makes sense: unit tests check if the interactions work as expected, while snapshots and checklists assess the visual and qualitative aspects. Evaluating thirty models and releasing the dataset publicly gives others a starting point to measure progress in this area. The results point to ongoing issues with models blending accurate science with functional interactive code, which seems plausible.

Where it falls short is in the details of benchmark construction. The questions and associated tests appear manually created without any reported steps for validation, such as expert review for scientific accuracy or pedagogical effectiveness, or measures like inter-rater reliability. This leaves the central claim resting on unverified choices about what makes a good interactive demo. If the checklists reflect narrow views rather than standard criteria, the findings on model weaknesses won't be as reliable.

This paper is for folks building or studying benchmarks for LLM applications in science education and research communication. A reader focused on code generation evaluation or AI-assisted teaching tools would get practical value from seeing the framework and the model comparisons, provided they can access and review the actual test items.

It deserves a serious referee because the topic is relevant and the public release is a strength. Reviewers could help address the validation gaps to strengthen the work.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces InteractScience, a benchmark for evaluating LLMs on generating interactive scientific demonstration code that integrates accurate domain knowledge with functional interactive front-end implementations. It proposes a hybrid evaluation framework that pairs programmatic unit tests for interaction logic with visually-grounded checks against reference snapshots and checklists. The benchmark spans questions across five scientific domains; the authors evaluate 30 open- and closed-source models, report persistent weaknesses in the integrated capability, and position InteractScience as the first automatic benchmark for this combined task. All code and data are released publicly.

Significance. If the benchmark construction and evaluation criteria prove reliable, the work would supply a reproducible, publicly available resource for measuring progress on a practically relevant capability that sits between pure knowledge QA and static web-code generation. The hybrid testing approach and broad model evaluation provide a concrete baseline that could guide future model development for science-education applications.

major comments (2)

[§4] §4 (Benchmark Construction): The description of how the questions, unit tests, reference snapshots, and checklists were created provides no information on pilot studies, expert review for scientific accuracy, or inter-rater reliability for the visual checks. Because the central claim—that the benchmark automatically measures 'correct and educationally useful' interactive demonstrations—rests on these artifacts faithfully operationalizing the target capability, the absence of validation details is load-bearing.
[§5] §5 (Evaluation Setup): Exclusion criteria for test cases and any filtering applied before scoring are not stated. Without these details it is difficult to interpret the reported model weaknesses or to assess whether the results generalize beyond the specific manually chosen items.

minor comments (2)

[Abstract] The abstract states that the benchmark contains 'a substantial set' of questions but does not give the exact count per domain; adding these numbers would improve precision.
[Figure 2] Figure 2 (framework diagram) would benefit from explicit arrows or labels indicating the flow between programmatic tests and snapshot comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and indicate the revisions that will be made to improve clarity and reproducibility.

Referee: [§4] §4 (Benchmark Construction): The description of how the questions, unit tests, reference snapshots, and checklists were created provides no information on pilot studies, expert review for scientific accuracy, or inter-rater reliability for the visual checks. Because the central claim—that the benchmark automatically measures 'correct and educationally useful' interactive demonstrations—rests on these artifacts faithfully operationalizing the target capability, the absence of validation details is load-bearing.

Authors: We agree that the current description of benchmark construction is insufficiently detailed. In the revised manuscript we will expand §4 with a new subsection describing the iterative question-design process, the steps taken to verify scientific accuracy through internal expert review by the author team (who include domain specialists), and the multi-author consensus process used to finalize reference snapshots and checklists. We will also note that formal pilot studies and quantitative inter-rater reliability statistics were not performed; instead we relied on repeated internal review cycles. These additions will directly address the load-bearing concern raised.
revision: yes
Referee: [§5] §5 (Evaluation Setup): Exclusion criteria for test cases and any filtering applied before scoring are not stated. Without these details it is difficult to interpret the reported model weaknesses or to assess whether the results generalize beyond the specific manually chosen items.

Authors: We acknowledge that the absence of explicit exclusion and filtering criteria limits interpretability. In the revised §5 we will add a dedicated paragraph specifying the criteria applied during test-case selection (e.g., requirement that each item admit a non-trivial interactive behavior, exclusion of cases with ambiguous visual references, and removal of any items that failed internal sanity checks). We will also clarify that no post-hoc filtering was performed on model outputs before scoring.
revision: yes
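The three selection criteria named in this simulated response can be written down as an explicit filter, which is one way such criteria could be made reproducible. This is only a sketch: the boolean flag names are hypothetical, and the response does not say how the criteria were actually recorded or applied.

```javascript
// Hypothetical boolean flags; the response names the criteria but not how
// they are stored. A candidate is kept only if it meets all three.
function selectTestCases(candidates) {
  return candidates.filter(
    (c) =>
      c.hasNonTrivialInteraction === true &&
      c.hasAmbiguousVisualReference === false &&
      c.passedSanityChecks === true
  );
}

// Made-up candidates: only the first satisfies every criterion.
const candidates = [
  { id: "a", hasNonTrivialInteraction: true, hasAmbiguousVisualReference: false, passedSanityChecks: true },
  { id: "b", hasNonTrivialInteraction: false, hasAmbiguousVisualReference: false, passedSanityChecks: true },
  { id: "c", hasNonTrivialInteraction: true, hasAmbiguousVisualReference: true, passedSanityChecks: true },
];
console.log(selectTestCases(candidates).map((c) => c.id));
```

Recording the flags per item, rather than only the surviving items, would also let a reader see how many candidates each criterion excluded.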

Circularity Check

0 steps flagged

No circularity: benchmark is newly constructed from manual design inputs

The paper presents InteractScience as a new benchmark consisting of manually designed questions, unit tests, reference snapshots, and checklists across five scientific domains, evaluated via a hybrid programmatic and visually-grounded framework. No equations, fitted parameters, predictions, or derivations are present that reduce to the paper's own inputs by construction. The central claim of being the first to measure combined scientific knowledge and interactive front-end coding is positioned as arising from the novelty of the described artifacts and testing approach, without load-bearing self-citations or ansatzes imported from prior author work. The construction details serve as the independent starting point rather than an output derived from the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the hand-crafted test cases and reference materials rather than on learned parameters or new physical entities.

axioms (1)

Domain assumption: Reference snapshots and checklists provide reliable ground truth for both visual correctness and educational utility of generated demonstrations.
Invoked when scoring model outputs against the benchmark.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1] ACM, 2025. doi: 10.1145/3696410.3714889. URL https://doi.org/10.1145/3696410.3714889. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Dan Hendrycks, Collin B...

[2] and BigCodeBench (Zhuo et al., 2024) evaluate models through hidden unit tests on programming or competitive coding problems, capturing algorithmic correctness but ignoring interactivity and visual fidelity. In parallel, visualization-oriented benchmarks such as VisCoder (Ni et al., 2025), ChartCoder (Zhao et al., 2025), DrawingPandas (Galimzyanov et al...

[3] The demo must be implemented as a single standalone HTML file with inline HTML, CSS, and JavaScript

[4] External libraries may only be included via **CDN** (e.g., p5.js, three.js, D3.js, Plotly.js, MathJax)

[5] Do **not** reference any external or uploaded media assets (images, videos, audio), and do **not** use base64-encoded binaries

[6] The plan **must fully describe the observed UI state and behavior** in the provided screenshots (including default values, rendered formulas, slider settings, etc.)

[7] Require the large language model that uses this plan to **strictly follow the implementation instructions**—any missing information will lead to incorrect results. ### Output Format (strictly follow this structure, no extra commentary or code):

[8] Page Content Structure Describe each logical UI section (e.g., Title, Description, Control Panel, Graph Area, Formula Display) and its role

[9] HTML Components List **all required HTML elements**, grouped by section. Include their types (e.g., <div>, <input type="range">, <canvas>, <button>, <select>). Note if MathJax is required for formula rendering

[10] Component IDs and State For every interactive component (sliders, checkboxes, dropdowns, buttons, etc.): - Assign a unique id (e.g., slider-angle, btn-play) - Provide: Initial/default value; Minimum and maximum (and step, if applicable); Label text or tooltip, if any

[11] Interaction Logic Explain **exactly** how each control affects the interface. For each user interaction, describe: - What changes in the visual (e.g., redraws, updates) - What dependent values or formulas update - Whether animation or resets are triggered Do not omit any interaction shown in the screenshots

[12] The resulting plan must be detailed enough that a large language model can accurately reproduce the entire original demo, including all interactions and visuals
Visualization Techniques Specify the rendering strategy and technology for each visual element: - p5.js or Canvas API for custom 2D graphics - three.js for 3D scenes - D3.js or SVG for dynamic diagrams - Plotly.js for charts or plots - leaflet.js for maps - MathJax for math formula rendering - CSS for styling and layout (e.g., flex/grid, transitions, colo...

[13] The component is visible on page load

[14] The component has the correct default value or state (as defined in the plan)

[15] The component can be interacted with correctly (e.g., drag slider, click button)

[16] Boundary behavior should be tested (e.g., min/max values, reset, toggle on/off)

[17] The interaction causes some change to the diagram, equation, UI element, or output (verify change occurred, not correctness). ### Output Format: For each component, write one test case in the following format: - Title: [Short description of the control being tested] - Steps & Assertions:

[18] Assert: [Component is visible]

[19] Assert: [Component has correct default value or state]

[20] Action: [Perform a realistic user interaction]

[21] Assert: [UI update or state change occurred]

[22] Action: [Boundary interaction or reset]

[23] Assert: [System handles boundary or reset with some change] ### Guidelines: - The difficulty of each test case should be **moderate**—not overly simple, not overly complex. - Do **not** invent behavior not described in the implementation plan. - Use only what is described or visible in the plan and screenshots. - One case per component. - Focus on detecti...

[24] Action: [Simulate the first interaction to reach this state]

[25] drag to 70% of the bar
Action: [Next interaction, if any] ... N. Action: [Final interaction needed to match the screenshot] N+1. Assert: Take a screenshot of the current UI state ### Guidelines: - All interactions must be **derived strictly from the screenshot and the design plan**. - Test cases must follow the **exact order of input screenshots** (first test case for first scr...

[31] Does not rely on function readiness unless stated—only perform initial navigation and DOM load. ### Test Setup: Load the HTML file using: const fileUrl='file://'+require('path').resolve(__dirname,'../pages/{id}.html'); No need to wait for external scripts or function readiness beyond page load. ### Output format: - You must generate only valid complete P...

[32] Navigates to the local HTML page using the code below

[33] Performs **only the exact user actions** listed in each test case (e.g., drag, click, input)

[34] Uses the DOM structure and component IDs specified in the plan—do not guess or infer selectors beyond what is provided

[35] After executing all actions, takes a full-page screenshot of the resulting UI state and saves it as: await page.screenshot({{path:'./snapshots/{id}-[i].png',fullPage:true}}); where [i] is the index of the test case starting from 1

[36] Ensures that tests are strictly based on the input plan and test cases—do not invent new behaviors or UI logic

[37] Does not rely on function readiness unless stated—only perform initial navigation and DOM load. ### Test Setup: Load the HTML file using: const fileUrl='file://'+require('path').resolve(__dirname,'../pages/{id}.html'); No need to wait for external scripts or function readiness beyond page load. ### Output format: - You must generate only valid complete Pl...

[38] A **detailed implementation plan** of an interactive scientific demo, describing the UI structure, control elements (sliders, buttons, dropdowns, text fields), and the theorem or scientific principle the demo explains

[39] A set of **screenshots of the demo under different input states**. Each screenshot shows: * The **input snapshot** (current state of controls such as slider values, button toggles, dropdown selections). * The **visual output** (graph, diagram, simulation, or formula rendering) produced by the demo under that input. Your task is to generate a **checklist f...

[40] **Checklist is output-oriented** * Do not check whether buttons, sliders, or controls are styled correctly. * Treat control states only as **inputs** that determine what the visualization should show. * Focus all checklist items on verifying whether the **visual output image** is scientifically correct

[41] **Connect input to output explicitly** * Every checklist item must link the given **input state** to the expected **output visualization**. * Example: *If the angle slider is set to 45°, then the plotted projectile trajectory must peak at the midpoint of its range.*

[42] **Scientific focus** * For math demos: formulas, curves, intersections, asymptotes. * For physics demos: motion paths, conservation laws, force vectors. * For geometry demos: shapes, proportions, congruency. * For statistics/plots: distributions, scaling, labeled values

[43] **Visual verification** * All checklist items must be verifiable by comparing the **screenshot output image** with a reference screenshot. * Do not assume hidden internal states or unseen code behavior

[44] screenshot_id
**Do not go beyond the plan** * Only include checklist items that are **explicitly described in the implementation plan** *and* are **visible in the screenshot**. * Do not invent or infer behavior that is not both in the plan and observable in the screenshot. * If something is in the plan but not visible in the screenshot, **do not include it**. Figure 14...

[45] A **reference screenshot** that represents the correct output of the demo under a specific input state

[46] A **generated screenshot** from a candidate implementation under the same input state

[47] A **checklist** of verification items describing what scientific properties must be visible and correct in the output image. ### Your task

[48] Carefully compare the **generated screenshot** with the **reference screenshot**

[49] For each **checklist item**, assign a score from **1 to 5** using the rubric below

[50] checklist_results
Provide a short justification for each score. ### Scoring Rubric * **5 (Perfect / Fully Correct)** * Output image matches the reference screenshot precisely for this checklist item. * No scientific or visual errors observed. * **4 (Minor Deviation)** * Output image mostly matches, but there are small differences (e.g., slight shift in curve, small scaling...
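The last entries above describe a judge that assigns each checklist item a score from 1 to 5 with a short justification. A minimal sketch of how such per-item scores could be validated and aggregated for one screenshot follows. The result shape, the function name, and the linear rescaling to [0, 1] are assumptions made for illustration; the extracted text does not state how the paper aggregates these scores.

```javascript
// results: [{ item, score, justification }] with score an integer from 1 to 5.
// The field names are hypothetical stand-ins for the judge's output.
function aggregateChecklistScores(results) {
  if (results.length === 0) {
    throw new Error("no checklist results to aggregate");
  }
  for (const r of results) {
    if (!Number.isInteger(r.score) || r.score < 1 || r.score > 5) {
      throw new RangeError(`score out of range for item: ${r.item}`);
    }
  }
  const mean = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  // Assumption: linear rescaling so that 1 maps to 0 and 5 maps to 1.
  const normalized = (mean - 1) / 4;
  return { mean, normalized };
}

// Made-up judge output for one screenshot.
const judged = aggregateChecklistScores([
  { item: "trajectory peaks at midpoint", score: 5, justification: "matches reference" },
  { item: "axis labels present", score: 4, justification: "slight offset" },
  { item: "force vectors drawn", score: 3, justification: "one vector missing" },
]);
console.log(judged);
```

Rejecting out-of-range scores up front is a deliberate choice in this sketch: a judge model can return malformed output, and failing loudly is safer than averaging it in.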

[1] [1]

doi: 10.1145/3696410.3714889

ACM, 2025. doi: 10.1145/3696410.3714889. URLhttps://doi.org/10.1145/ 3696410.3714889. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. Dan Hendrycks, Collin B...

work page doi:10.1145/3696410.3714889 2025

[2] [2]

and BigCodeBench (Zhuo et al., 2024) evaluate models through hidden unit tests on program- ming or competitive coding problems, capturing algorithmic correctness but ignoring interactivity and visual fidelity. In parallel, visualization-oriented benchmarks such as VisCoder (Ni et al., 2025), ChartCoder (Zhao et al., 2025), DrawingPandas (Galimzyanov et al...

work page 2024

[3] [3]

The demo must be implemented as a single standalone HTML file with inline HTML, CSS, and JavaScript

work page

[4] [4]

External libraries may only be included via **CDN** (e.g., p5.js, three.js, D3.js, Plotly.js, MathJax)

work page

[5] [5]

Do **not** reference any external or uploaded media assets (images, videos, audio), and do **not** use base64-encoded binaries

work page

[6] [6]

The plan **must fully describe the observed UI state and behavior** in the provided screenshots (including default values, rendered formulas, slider settings, etc.)

work page

[7] [7]

### Output Format (strictly follow this structure, no extra commentary or code):

Require the large language model that uses this plan to **strictly follow the implementation instructions**—any missing information will lead to incorrect results. ### Output Format (strictly follow this structure, no extra commentary or code):

work page

[8] [8]

Page Content Structure Describe each logical UI section (e.g., Title, Description, Control Panel, Graph Area, Formula Display) and its role

work page

[9] [9]

Include their types (e.g.,<div>, <input type="range">,<canvas>,<button>,<select>)

HTML Components List **all required HTML elements**, grouped by section. Include their types (e.g.,<div>, <input type="range">,<canvas>,<button>,<select>). Note if MathJax is required for formula rendering

work page

[10] [10]

Component IDs and State For every interactive component (sliders, checkboxes, dropdowns, buttons, etc.): - Assign a uniqueid(e.g.,slider-angle,btn-play) - Provide: Initial/default value; Minimum and maximum (and step, if applicable); Label text or tooltip, if any

work page

[11] [11]

Interaction Logic Explain **exactly** how each control affects the interface. For each user interaction, describe: - What changes in the visual (e.g., redraws, updates) - What dependent values or formulas update - Whether animation or resets are triggered Do not omit any interaction shown in the screenshots

work page

[12] [12]

The resulting plan must be detailed enough that a large language model can accurately reproduce the entire original demo, including all interactions and visuals

Visualization Techniques Specify the rendering strategy and technology for each visual element: - p5.js or Canvas API for custom 2D graphics - three.js for 3D scenes - D3.js or SVG for dynamic diagrams - Plotly.js for charts or plots - leaflet.js for maps - MathJax for math formula rendering - CSS for styling and layout (e.g., flex/grid, transitions, colo...

work page

[13] [13]

The component is visible on page load

work page

[14] [14]

The component has the correct default value or state (as defined in the plan)

work page

[15] [15]

The component can be interacted with correctly (e.g., drag slider, click button)

work page

[16] [16]

Boundary behavior should be tested (e.g., min/max values, reset, toggle on/off)

work page

[17] [17]

### Output Format: For each component, write one test case in the following format: - Title: [Short description of the control being tested] - Steps & Assertions:

The interaction causes some change to the diagram, equation, UI element, or output (verify change occurred, not correctness). ### Output Format: For each component, write one test case in the following format: - Title: [Short description of the control being tested] - Steps & Assertions:

work page

[18] [18]

Assert: [Component is visible]


Assert: [Component has correct default value or state]


Action: [Perform a realistic user interaction]


Assert: [UI update or state change occurred]


Action: [Boundary interaction or reset]


Assert: [System handles boundary or reset with some change]

### Guidelines:
- The difficulty of each test case should be **moderate**: not overly simple, not overly complex.
- Do **not** invent behavior not described in the implementation plan.
- Use only what is described or visible in the plan and screenshots.
- One case per component.
- Focus on detecti...
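The test-case format described above (a title followed by alternating Assert and Action steps) can be sketched as a small data structure plus a formatter. This is only an illustration: the object shape, the `kind` field, the step numbering, the sample step text, and the function `formatTestCase` are all hypothetical and not taken from the paper's pipeline.

```javascript
// Hypothetical in-memory form of one functional test case.
// The five steps mirror the Assert/Action pattern in the prompt; the wording is illustrative.
const testCase = {
  title: 'Angle slider',
  steps: [
    { kind: 'assert', text: 'Component is visible' },
    { kind: 'assert', text: 'Component has correct default value or state' },
    { kind: 'action', text: 'Drag the slider to a new value' },
    { kind: 'assert', text: 'UI update or state change occurred' },
    { kind: 'action', text: 'Drag the slider to its maximum' },
  ],
};

// Render the test case in the "Title / Steps & Assertions" text layout.
function formatTestCase(tc) {
  const lines = ['- Title: ' + tc.title, '- Steps & Assertions:'];
  tc.steps.forEach((step, i) => {
    const label = step.kind === 'assert' ? 'Assert' : 'Action';
    lines.push('  ' + (i + 1) + '. ' + label + ': ' + step.text);
  });
  return lines.join('\n');
}
```

Keeping steps as structured data makes it straightforward to check properties such as "one case per component" before the text is handed to a code-generation step.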


Action: [Simulate the first interaction to reach this state]


drag to 70% of the bar

Action: [Next interaction, if any]
...
N. Action: [Final interaction needed to match the screenshot]
N+1. Assert: Take a screenshot of the current UI state

### Guidelines:
- All interactions must be **derived strictly from the screenshot and the design plan**.
- Test cases must follow the **exact order of input screenshots** (first test case for first scr...




Navigates to the local HTML page using the code below


Performs **only the exact user actions** listed in each test case (e.g., drag, click, input)


Uses the DOM structure and component IDs specified in the plan—do not guess or infer selectors beyond what is provided


After executing all actions, takes a full-page screenshot of the resulting UI state and saves it as: `await page.screenshot({{path:'./snapshots/{id}-[i].png',fullPage:true}});` where [i] is the index of the test case starting from 1


Ensures that tests are strictly based on the input plan and test cases—do not invent new behaviors or UI logic


Does not rely on function readiness unless stated; only perform initial navigation and DOM load.

### Test Setup:
Load the HTML file using: `const fileUrl = 'file://' + require('path').resolve(__dirname, '../pages/{id}.html');`
No need to wait for external scripts or function readiness beyond page load.

### Output format:
- You must generate only valid complete Pl...
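The page-loading and snapshot-naming conventions in this prompt can be sketched as two Node.js helpers. This is a minimal sketch under stated assumptions: `pageUrl` and `snapshotPath` are hypothetical names, `baseDir` stands in for `__dirname`, and POSIX-style paths are assumed. Only `path.resolve` is a real library call; the browser automation itself is left out.

```javascript
const path = require('path');

// Build the file:// URL for a generated page, following the pattern in the prompt.
// `baseDir` replaces __dirname so the helper can be called from any directory (an assumption).
function pageUrl(baseDir, id) {
  return 'file://' + path.resolve(baseDir, '../pages/' + id + '.html');
}

// Snapshot path for the i-th test case; the prompt states that the index starts from 1.
function snapshotPath(id, i) {
  if (!Number.isInteger(i) || i < 1) {
    throw new RangeError('test case index starts at 1');
  }
  return './snapshots/' + id + '-' + i + '.png';
}
```

In a test script, the first value would be passed to the page navigation call and the second to the screenshot call's `path` option.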


A **detailed implementation plan** of an interactive scientific demo, describing the UI structure, control elements (sliders, buttons, dropdowns, text fields), and the theorem or scientific principle the demo explains


A set of **screenshots of the demo under different input states**. Each screenshot shows:
* The **input snapshot** (current state of controls such as slider values, button toggles, dropdown selections).
* The **visual output** (graph, diagram, simulation, or formula rendering) produced by the demo under that input.

Your task is to generate a **checklist f...


**Checklist is output-oriented**
* Do not check whether buttons, sliders, or controls are styled correctly.
* Treat control states only as **inputs** that determine what the visualization should show.
* Focus all checklist items on verifying whether the **visual output image** is scientifically correct


**Connect input to output explicitly**
* Every checklist item must link the given **input state** to the expected **output visualization**.
* Example: *If the angle slider is set to 45°, then the plotted projectile trajectory must peak at the midpoint of its range.*


**Scientific focus**
* For math demos: formulas, curves, intersections, asymptotes.
* For physics demos: motion paths, conservation laws, force vectors.
* For geometry demos: shapes, proportions, congruency.
* For statistics/plots: distributions, scaling, labeled values


**Visual verification**
* All checklist items must be verifiable by comparing the **screenshot output image** with a reference screenshot.
* Do not assume hidden internal states or unseen code behavior


screenshot_id

**Do not go beyond the plan**
* Only include checklist items that are **explicitly described in the implementation plan** *and* are **visible in the screenshot**.
* Do not invent or infer behavior that is not both in the plan and observable in the screenshot.
* If something is in the plan but not visible in the screenshot, **do not include it**.

Figure 14...


A **reference screenshot** that represents the correct output of the demo under a specific input state


A **generated screenshot** from a candidate implementation under the same input state


A **checklist** of verification items describing what scientific properties must be visible and correct in the output image.

### Your task


Carefully compare the **generated screenshot** with the **reference screenshot**


For each **checklist item**, assign a score from **1 to 5** using the rubric below


checklist_results

Provide a short justification for each score.

### Scoring Rubric
* **5 (Perfect / Fully Correct)**
  * Output image matches the reference screenshot precisely for this checklist item.
  * No scientific or visual errors observed.
* **4 (Minor Deviation)**
  * Output image mostly matches, but there are small differences (e.g., slight shift in curve, small scaling...
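To make the 1 to 5 rubric concrete, per-item scores could be combined into a single number per screenshot. The sketch below is one possible aggregation, not necessarily the one used in the paper: the entry fields (`item`, `score`, `reason`), the sample data, and the linear mapping onto [0, 1] are all assumptions made for illustration.

```javascript
// Hypothetical judge output: one entry per checklist item, each scored 1 to 5.
// Field names and sample values are illustrative assumptions.
const results = [
  { item: 'Trajectory peaks at the midpoint of its range', score: 5, reason: 'matches reference' },
  { item: 'Axes are labeled', score: 3, reason: 'y-axis label missing' },
];

// Map the mean 1-5 score linearly onto [0, 1]; this normalization is an assumption.
function normalizedScore(entries) {
  if (entries.length === 0) throw new Error('no checklist items to score');
  for (const e of entries) {
    if (!Number.isInteger(e.score) || e.score < 1 || e.score > 5) {
      throw new RangeError('score out of range: ' + e.score);
    }
  }
  const mean = entries.reduce((sum, e) => sum + e.score, 0) / entries.length;
  return (mean - 1) / 4;
}
```

For the two sample entries the mean is 4, so `normalizedScore(results)` evaluates to 0.75. Rejecting out-of-range scores up front guards against malformed judge output being silently averaged in.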
