Generative Visual Code Mobile World Models

Jamin Shin; Segyu Lee; Se-Young Yun; Sungjun Han; Woosung Koh


arxiv 2602.01576 v2 pith:6ITMBXYH submitted 2026-02-02 cs.LG cs.AI cs.CV


Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

classification cs.LG cs.AI cs.CV

keywords visual, mobile, code, data, gworld, models, world, generation


Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at both training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
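The paradigm described in the abstract, predicting the next GUI state as renderable web code rather than as pixels, can be illustrated with a small sketch. This is not the authors' implementation: `Action`, `stub_vlm`, `predict_next_state_code`, and `visible_text` are hypothetical names, the stub stands in for a real VLM call, and a real pipeline would render the returned HTML in a browser engine to obtain pixels.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Action:
    """Hypothetical GUI action: an interaction kind and the label it targets."""
    kind: str
    target: str


def stub_vlm(screen_html: str, action: Action) -> str:
    """Stand-in for a VLM call (assumption, not the paper's model).

    A real world model would condition on a screenshot and the action;
    one transition is hard-coded here so the sketch runs offline.
    """
    if action.kind == "tap" and action.target == "Settings":
        return "<html><body><h1>Settings</h1><button>Wi-Fi</button></body></html>"
    return screen_html  # unknown action: predict no change


def predict_next_state_code(screen_html: str, action: Action, model=stub_vlm) -> str:
    """World-model step: (current state, action) -> next state as web code."""
    return model(screen_html, action)


class TextCollector(HTMLParser):
    """Collects visible text nodes so the predicted state can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.texts.append(data.strip())


def visible_text(html: str) -> list[str]:
    """Return the text strings contained in the predicted HTML."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


current = "<html><body><h1>Home</h1><button>Settings</button></body></html>"
next_html = predict_next_state_code(current, Action("tap", "Settings"))
print(visible_text(next_html))
```

Because the predicted state is code, on-screen text exists as exact strings in the output rather than as generated glyphs, which is the property the abstract attributes to this approach.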


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

Qwen-AgentWorld: Language World Models for General Agents
cs.CL 2026-06 unverdicted novelty 6.0
Qwen-AgentWorld are language world models that simulate multi-domain agent environments and boost general agent capabilities via decoupled RL simulation and unified foundation model training.

How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
cs.CR 2026-04 unverdicted novelty 6.0
Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.

How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 4.0
World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 conditional novelty 4.0
A survey proposing a three-level capability taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) for world models across physical, digital, social, and scientific domains.