Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
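The abstract mentions a variable-resolution input representation without giving details. One plausible reading, sketched here purely as an assumption, is to rescale each image so that it keeps its aspect ratio while the number of fixed-size patches stays within a sequence-length budget. The function name `patch_grid` and the default values for `patch_size` and `max_patches` are hypothetical and are not taken from the abstract.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=2048):
    """Pick a (rows, cols) patch grid that preserves aspect ratio.

    Illustrative sketch only: the abstract does not specify this procedure,
    and patch_size / max_patches are assumed values.
    """
    # Scale factor such that (scale*height/patch) * (scale*width/patch)
    # equals max_patches before rounding down.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down so the grid never exceeds the budget; keep at least one patch.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


# A square image and a 1:4 image get differently shaped grids
# under the same patch budget.
square = patch_grid(512, 512, patch_size=16, max_patches=1024)
wide = patch_grid(256, 1024, patch_size=16, max_patches=1024)
```

Under this assumption, a wide screenshot is not squashed into a square: the grid simply gets more columns than rows, and the total patch count stays within the budget.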
Forward citations
Cited by 8 Pith papers
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
  Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
- PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
  PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.
- Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
  Visual-SDPO distills visual feedback from rendered code outputs into a student policy via grounded credit weighting and GRPO, yielding over 10-point gains on chart/UI/slide benchmarks.
- Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
  Retraining all 31 subsets of five vision encoders shows Capacity and Necessity are distinct, pre-projector effective rank predicts residual performance at fixed parameter count, and high-Capacity plus adaptive complem...
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
  GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
- MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding
  MUIAnno is an expert-annotated dataset of mobile UI screens from iOS apps with structured JSON labels and baseline results for UI element detection.
- From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
  Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.