Introduces WorldCoder-Bench and StateProbe for evaluating LLM-generated physically grounded 3D browser worlds, with frontier models reaching at most 27.8% verification coverage.
Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.
citing papers explorer
-
WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
Introduces WorldCoder-Bench and StateProbe for evaluating LLM-generated physically grounded 3D browser worlds, with frontier models reaching at most 27.8% verification coverage.
-
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.
-
I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.