Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Fangyu Liu; Hexiang Hu; Iulia Turc; Julian Eisenschlos; Kenton Lee; Kristina Toutanova; Mandar Joshi; Ming-Wei Chang; Peter Shaw; Urvashi Khandelwal

arxiv: 2210.03347 · v2 · pith:GFNRU7NUnew · submitted 2022-10-07 · 💻 cs.CL · cs.CV


keywords: language, pretraining, model, pix2struct, pretrained, tasks, visual, data


Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
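The abstract does not spell out how the variable-resolution input representation works. One plausible reading, sketched below purely as an illustration, is to rescale each image so that its grid of fixed-size patches preserves the aspect ratio while staying within a patch budget. The function name `patch_grid` and the defaults `patch_size=16` and `max_patches=2048` are assumptions made for this sketch, not values taken from the abstract.

```python
import math


def patch_grid(image_height, image_width, patch_size=16, max_patches=2048):
    """Pick a (rows, cols) patch grid that roughly preserves aspect ratio.

    Hypothetical helper: the image is assumed to be resized to
    (rows * patch_size, cols * patch_size) before patch extraction,
    with rows * cols never exceeding max_patches.
    """
    if image_height <= 0 or image_width <= 0:
        raise ValueError("image dimensions must be positive")

    # Uniform scale factor at which the number of patches equals max_patches
    # (before rounding down to whole patches).
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )

    # Round down to whole patches; clamp so that very thin images still get
    # at least one row/column and never exceed the budget.
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    for h, w in [(1000, 1000), (500, 2000), (2000, 500)]:
        rows, cols = patch_grid(h, w)
        print(f"{h}x{w} image -> {rows}x{cols} patches ({rows * cols} total)")
```

Under this sketch a wide screenshot receives more columns than rows instead of being squashed into a square, which is the property a variable-resolution representation is meant to provide.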


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
cs.CL 2026-02 unverdicted novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
cs.AI 2026-01 conditional novelty 7.0

PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
cs.AI 2026-06 unverdicted novelty 6.0

Visual-SDPO distills visual feedback from rendered code outputs into a student policy via grounded credit weighting and GRPO, yielding over 10-point gains on chart/UI/slide benchmarks.

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
cs.CV 2026-06 unverdicted novelty 6.0

Retraining all 31 subsets of five vision encoders shows Capacity and Necessity are distinct, pre-projector effective rank predicts residual performance at fixed parameter count, and high-Capacity plus adaptive complem...

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

GPT-4V(ision) is a Generalist Web Agent, if Grounded
cs.IR 2024-01 conditional novelty 6.0

GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding
cs.HC 2026-05 unverdicted novelty 5.0

MUIAnno is an expert-annotated dataset of mobile UI screens from iOS apps with structured JSON labels and baseline results for UI element detection.

From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
cs.CV 2026-04 unverdicted novelty 4.0

Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.