
arxiv: 2510.09724 · v2 · pith:FOG4ROWRnew · submitted 2025-10-10 · 💻 cs.SE · cs.AI

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

Pith reviewed 2026-05-21 20:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generation · interactive demonstrations · scientific visualization · benchmark evaluation · programmatic testing · visual grounding · front-end development · educational tools

The pith

InteractScience is the first benchmark to automatically evaluate LLMs on generating interactive scientific demonstration code by combining programmatic tests with visual snapshot checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can now generate full applications from text instructions, opening possibilities for interactive scientific demonstrations that explain concepts through working code. Current benchmarks test either pure knowledge or static web pages but miss the integrated skill of accurate science plus responsive front-end interactions. The authors created a hybrid framework that runs unit tests on interaction logic and compares rendered outputs against reference images. They built InteractScience with questions in five scientific domains, each supplied with tests, snapshots, and checklists, then ran it on thirty leading models. Results show consistent shortfalls in blending domain knowledge with correct interactive coding, establishing a measurable baseline for future work on educationally useful generators.

Core claim

The paper claims that a hybrid framework pairing rigorous programmatic functional testing of interaction logic with visually-grounded checks against reference snapshots can automatically assess the combined ability of LLMs to produce accurate scientific knowledge and correctly behaving interactive front-end code, as shown by evaluation results across thirty models on a new benchmark of domain-specific questions.

What carries the argument

The hybrid evaluation framework that runs unit tests for interaction logic and compares rendered outputs to reference snapshots.
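
As a rough illustration of how the two halves of such a framework might be summarized for one benchmark item, the Python sketch below computes a pass rate over programmatic test results and a normalized visual score from per-checklist ratings. The names `ItemResult`, `pft_pass_rate`, and `vqt_score` are hypothetical, the 1-to-5 checklist scale is assumed from the paper's VLM-as-Judge prompt, and reporting the two numbers side by side rather than merging them is an assumption, not the authors' stated aggregation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ItemResult:
    """Hypothetical container for one benchmark item's raw results."""
    pft_passed: List[bool]        # one entry per programmatic unit test
    checklist_scores: List[int]   # one 1-5 rating per checklist item (assumed scale)


def pft_pass_rate(result: ItemResult) -> float:
    """Fraction of programmatic functional tests that passed."""
    if not result.pft_passed:
        return 0.0
    return sum(result.pft_passed) / len(result.pft_passed)


def vqt_score(result: ItemResult) -> float:
    """Mean checklist rating rescaled from [1, 5] to [0, 1]."""
    if not result.checklist_scores:
        return 0.0
    mean = sum(result.checklist_scores) / len(result.checklist_scores)
    return (mean - 1) / 4


# Invented example: three of four unit tests pass, checklist ratings average 4.
item = ItemResult(pft_passed=[True, True, False, True], checklist_scores=[5, 3, 4])
print(pft_pass_rate(item), vqt_score(item))
```

Keeping the functional and visual scores separate makes it visible when a demo responds to controls correctly but renders the wrong science, or the reverse.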

If this is right

  • Current LLMs show measurable gaps when required to merge scientific accuracy with responsive user interfaces.
  • The benchmark supplies a repeatable way to track improvements in generating functional educational demonstrations.
  • Reliable code generation in this setting would directly support new teaching methods and research communication tools.
  • The public release of questions, tests, and data enables standardized comparison across future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hybrid testing could be adapted for interactive code in engineering or medical visualization tasks.
  • Models that improve on this benchmark might enable rapid creation of customized online science lessons.
  • Automated scoring of both logic and visuals could support iterative refinement loops during generation.
  • Widespread adoption might shift how scientific outreach materials are produced and validated.
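
The refinement-loop idea in the third bullet can be sketched as a generate, score, and retry cycle. Everything here is hypothetical: `generate` and `evaluate` stand in for a model call and a benchmark-style scorer, the toy stand-ins only count feedback rounds, and the stopping threshold is arbitrary.

```python
from typing import Callable, Tuple


def refine(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    prompt: str,
    target: float = 0.9,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Regenerate code until the score reaches `target` or rounds run out."""
    best_code, best_score = "", float("-inf")
    for _ in range(max_rounds):
        code = generate(prompt)
        score, feedback = evaluate(code)
        if score > best_score:
            best_code, best_score = code, score
        if score >= target:
            break
        # Feed the scorer's feedback back into the next attempt.
        prompt = f"{prompt}\n\nPrevious attempt feedback: {feedback}"
    return best_code, best_score


# Toy stand-ins (assumptions): each feedback note in the prompt raises the score.
def toy_generate(p: str) -> str:
    return f"attempt-{p.count('feedback:')}"


def toy_evaluate(code: str) -> Tuple[float, str]:
    rounds = int(code.rsplit("-", 1)[1])
    return 0.5 + 0.25 * rounds, "slider does not update the plot"


print(refine(toy_generate, toy_evaluate, "Build an inclined-plane demo"))
```

The loop keeps the best attempt seen so far, so a later regression does not discard an earlier, better result.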

Load-bearing premise

The manually designed questions, unit tests, reference snapshots, and checklists accurately reflect what counts as a correct and educationally useful interactive scientific demonstration.

What would settle it

The premise would be undermined if domain experts routinely judged high-scoring model outputs to be scientifically inaccurate or pedagogically ineffective despite those outputs passing the benchmark tests and visual checks.
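
One way to make this test concrete is to measure how often outputs that the benchmark scores highly are rejected by expert reviewers. The sketch below is illustrative only: the record layout, the `threshold` value, and the function name `false_accept_rate` are assumptions, and the sample data are invented.

```python
from typing import List, Tuple


def false_accept_rate(records: List[Tuple[float, bool]], threshold: float = 0.8) -> float:
    """Share of high-scoring outputs that experts judged unacceptable.

    Each record is (benchmark_score in [0, 1], expert_accepts).
    """
    high = [accepted for score, accepted in records if score >= threshold]
    if not high:
        return 0.0
    return sum(1 for accepted in high if not accepted) / len(high)


# Invented example data: four outputs, three of which clear the threshold.
sample = [(0.95, True), (0.85, False), (0.90, True), (0.40, False)]
print(false_accept_rate(sample))
```

A rate near zero would support the premise, while a large rate would indicate that the tests and snapshots accept demonstrations that experts reject.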

Figures

Figures reproduced from arXiv: 2510.09724 by Fei Yuan, Gong Cheng, Kai Chen, Lei Li, Qiaosheng Chen, Qipeng Guo, Yang Liu.

Figure 1: Illustration of three tasks. (a) Knowledge Question Answering: given the query about forces act of a block placed on an inclined plane, an LLM can output a correct textual explanation. (b) Webpage Code Generation: given the instruction of write a blog webpage, an LLM can generate functional static HTML code. (c) Scientific Demonstration Code Generation: generating an interactive demo for the inclined plan…
Figure 2: Pipeline of data collection and evaluation suite synthesis. The data collection step retrieves…
Figure 3: Performance of LLMs across different difficulty levels.
Figure 4: Performance of LLMs across different disciplines.
Figure 5: Performance of multimodal LLMs under varying numbers of reference snapshot inputs.
Figure 6: Example snapshots that illustrate the complementarity of CLIP and VLM-judge scores.
Figure 7: Reference and generated snapshots of different models for a Fields of Magnet Array…
Figure 8: Reference and generated snapshots of different models for a Interwoven Spherical Trian…
Figure 9: System prompt for implementation plan synthesis.
Figure 10: System prompt for PFT test case synthesis.
Figure 11: System prompt for VQT test case synthesis.
Figure 12: System prompt for PFT unit test script synthesis.
Figure 13: System prompt for VQT unit test script synthesis.
Figure 14: System prompt for VQT checklist synthesis Part 1.
Figure 15: System prompt for VQT checklist synthesis Part 2.
Figure 16: System prompt for VQT checklist synthesis Part 3.
Figure 17: System prompt for VQT VLM-as-Judge.
Original abstract

Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces InteractScience, a benchmark for evaluating LLMs on generating interactive scientific demonstration code that integrates accurate domain knowledge with functional interactive front-end implementations. It proposes a hybrid evaluation framework that pairs programmatic unit tests for interaction logic with visually-grounded checks against reference snapshots and checklists. The benchmark spans questions across five scientific domains; the authors evaluate 30 open- and closed-source models, report persistent weaknesses in the integrated capability, and position InteractScience as the first automatic benchmark for this combined task. All code and data are released publicly.

Significance. If the benchmark construction and evaluation criteria prove reliable, the work would supply a reproducible, publicly available resource for measuring progress on a practically relevant capability that sits between pure knowledge QA and static web-code generation. The hybrid testing approach and broad model evaluation provide a concrete baseline that could guide future model development for science-education applications.

major comments (2)
  1. [§4] §4 (Benchmark Construction): The description of how the questions, unit tests, reference snapshots, and checklists were created provides no information on pilot studies, expert review for scientific accuracy, or inter-rater reliability for the visual checks. Because the central claim—that the benchmark automatically measures 'correct and educationally useful' interactive demonstrations—rests on these artifacts faithfully operationalizing the target capability, the absence of validation details is load-bearing.
  2. [§5] §5 (Evaluation Setup): Exclusion criteria for test cases and any filtering applied before scoring are not stated. Without these details it is difficult to interpret the reported model weaknesses or to assess whether the results generalize beyond the specific manually chosen items.
minor comments (2)
  1. [Abstract] The abstract states that the benchmark contains 'a substantial set' of questions but does not give the exact count per domain; adding these numbers would improve precision.
  2. [Figure 2] Figure 2 (framework diagram) would benefit from explicit arrows or labels indicating the flow between programmatic tests and snapshot comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and indicate the revisions that will be made to improve clarity and reproducibility.

  1. Referee: [§4] §4 (Benchmark Construction): The description of how the questions, unit tests, reference snapshots, and checklists were created provides no information on pilot studies, expert review for scientific accuracy, or inter-rater reliability for the visual checks. Because the central claim—that the benchmark automatically measures 'correct and educationally useful' interactive demonstrations—rests on these artifacts faithfully operationalizing the target capability, the absence of validation details is load-bearing.

    Authors: We agree that the current description of benchmark construction is insufficiently detailed. In the revised manuscript we will expand §4 with a new subsection describing the iterative question-design process, the steps taken to verify scientific accuracy through internal expert review by the author team (who include domain specialists), and the multi-author consensus process used to finalize reference snapshots and checklists. We will also note that formal pilot studies and quantitative inter-rater reliability statistics were not performed; instead we relied on repeated internal review cycles. These additions will directly address the load-bearing concern raised. revision: yes

  2. Referee: [§5] §5 (Evaluation Setup): Exclusion criteria for test cases and any filtering applied before scoring are not stated. Without these details it is difficult to interpret the reported model weaknesses or to assess whether the results generalize beyond the specific manually chosen items.

    Authors: We acknowledge that the absence of explicit exclusion and filtering criteria limits interpretability. In the revised §5 we will add a dedicated paragraph specifying the criteria applied during test-case selection (e.g., requirement that each item admit a non-trivial interactive behavior, exclusion of cases with ambiguous visual references, and removal of any items that failed internal sanity checks). We will also clarify that no post-hoc filtering was performed on model outputs before scoring. revision: yes
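
The inter-rater reliability statistic raised in the first major comment could be computed with a standard agreement measure such as Cohen's kappa. The sketch below is a generic implementation for two raters assigning categorical labels to the same items. The example labels are invented, and the choice of kappa is an assumption, not something the authors commit to.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example: two reviewers judging four generated snapshots.
a = ["pass", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass"]
print(cohens_kappa(a, b))
```

Unlike raw percent agreement, kappa discounts the agreement two reviewers would reach by chance given how often each uses every label.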

Circularity Check

0 steps flagged

No circularity: benchmark is newly constructed from manual design inputs


The paper presents InteractScience as a new benchmark consisting of manually designed questions, unit tests, reference snapshots, and checklists across five scientific domains, evaluated via a hybrid programmatic and visually-grounded framework. No equations, fitted parameters, predictions, or derivations are present that reduce to the paper's own inputs by construction. The central claim of being the first to measure combined scientific knowledge and interactive front-end coding is positioned as arising from the novelty of the described artifacts and testing approach, without load-bearing self-citations or ansatzes imported from prior author work. The construction details serve as the independent starting point rather than an output derived from the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the hand-crafted test cases and reference materials rather than on learned parameters or new physical entities.

axioms (1)
  • domain assumption: Reference snapshots and checklists provide reliable ground truth for both visual correctness and educational utility of generated demonstrations.
    Invoked when scoring model outputs against the benchmark.

pith-pipeline@v0.9.0 · 5786 in / 1038 out tokens · 27978 ms · 2026-05-21T20:34:51.840262+00:00 · methodology

