MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3
The pith
A generative pixel-based language model trained on eight languages spanning multiple scripts improves multilingual task performance and handles unseen languages more robustly than prior approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIXAR is the first generative pixel-based language model trained on eight languages spanning a range of scripts. It delivers substantial gains on both discriminative and generative multilingual tasks, remains effective on languages never seen during training, and, once scaled to 0.5 billion parameters, improves further on generative benchmarks such as LAMBADA while also gaining resistance to orthographic attacks.
What carries the argument
The MIXAR autoregressive architecture that ingests text as pixel images, allowing it to process diverse scripts without any tokenization step.
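The tokenizer-free pipeline this claim rests on can be sketched in a few lines: render each line of text as a fixed-height image, then cut it into square patches that play the role of tokens. A minimal sketch assuming a pre-rendered grayscale line; the function name and toy input are illustrative, and the actual rendering details (fonts, resolution, patch geometry) are those specified in the paper:

```python
def image_to_patch_sequence(img: list[list[float]], patch: int = 32) -> list[list[float]]:
    """Cut a rendered text line (height `patch`, width a multiple of `patch`)
    into a left-to-right sequence of flattened square patches: the pixel
    analogue of a token sequence, with no tokenizer involved."""
    height, width = len(img), len(img[0])
    assert height == patch and width % patch == 0
    seq = []
    for j in range(width // patch):  # one "token" per patch, left to right
        patch_vec = [img[r][j * patch + c] for r in range(patch) for c in range(patch)]
        seq.append(patch_vec)
    return seq

# Toy "rendered line": 32 pixels tall, 128 wide -> 4 patches of 1024 pixels each.
line = [[float(r * 128 + c) for c in range(128)] for r in range(32)]
seq = image_to_patch_sequence(line)
print(len(seq), len(seq[0]))  # 4 1024
```

Because the model only ever sees pixel patches, a new script changes the input distribution but not the input interface, which is the mechanism behind the no-tokenization claim.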
If this is right
- Substantial performance gains appear on both discriminative and generative multilingual tasks relative to earlier pixel-based and tokenizer-based models.
- The model exhibits robustness on languages absent from its training data.
- Scaling to 0.5 billion parameters produces additional gains on generative benchmarks such as LAMBADA.
- Robustness to orthographic attacks increases with model scale.
Where Pith is reading between the lines
- Pixel-level processing may reduce the preprocessing overhead that tokenizers impose when new scripts are added.
- The same architecture could be tested on even larger numbers of scripts to determine whether the observed generalization continues.
- If pixel representations prove sufficient, downstream applications might avoid maintaining separate tokenizers for each language family.
Load-bearing premise
That training on pixels from eight languages is enough to overcome the perceptual differences between scripts and produce generalization and robustness without tokenization.
What would settle it
A head-to-head test in which a tokenizer-based model trained on the identical eight-language data outperforms MIXAR on the same multilingual discriminative and generative tasks, or where MIXAR shows no advantage on a controlled set of previously unseen languages.
Original abstract
Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during the training. These results are further strengthened when scaling the model to 0.5B parameters which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MIXAR, the first generative pixel-based autoregressive language model trained on eight languages spanning multiple scripts. It claims substantial performance gains over prior pixel-based and tokenizer-based models on both discriminative and generative multilingual tasks, robustness to languages and scripts unseen during training, and additional benefits from scaling to 0.5B parameters, including improved LAMBADA scores and greater resistance to orthographic attacks.
Significance. If the empirical claims are substantiated with detailed, reproducible results, this would represent a meaningful advance in multilingual language modeling by showing that pixel-based autoregressive models can address script diversity without tokenization. The scaling behavior and robustness findings, if rigorously demonstrated, would provide concrete evidence for the advantages of pixel representations in handling perceptual variation across languages.
major comments (2)
- [Abstract and §4 (Experiments)] The manuscript asserts 'substantial performance improvement' and 'robustness' on multilingual tasks but supplies no quantitative metrics, baseline models, evaluation details, error bars, or statistical significance tests. Without these, the central empirical claims cannot be assessed for magnitude or reliability.
- [§3 (Data and Training)] No information is provided on training data composition, including per-language or per-script data volumes, balance across the eight languages, or rendering details such as image resolution and font choices. This information is load-bearing for the robustness claims, as dominance by a subset of scripts (e.g., Latin) could confound apparent generalization to unseen languages rather than demonstrating an inherent advantage of the pixel approach.
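The call for significance testing above can be met with a standard recipe such as a paired bootstrap over per-example scores; the sketch below is illustrative (the systems, toy scores, and resample count are not drawn from the paper):

```python
import random

def paired_bootstrap(scores_a: list, scores_b: list, n_resamples: int = 2000, seed: int = 0) -> float:
    """Paired bootstrap test: fraction of resamples in which system A's
    summed per-example score strictly beats system B's. Values near 1.0
    suggest the observed gain is unlikely to be a sampling artifact."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test items with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Toy per-example accuracies: A is correct on a superset of B's items.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] * 20
b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1] * 20
print(paired_bootstrap(a, b))  # near 1.0 here, since A never scores below B
```

Reporting this win fraction (or the equivalent p-value) alongside each claimed gain would address the reliability concern directly.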
minor comments (1)
- [Abstract] The abstract references LAMBADA without clarifying whether the standard English version or a multilingual adaptation is used, and does not specify the exact orthographic attack types evaluated.
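For concreteness, one common member of the orthographic-attack family is an adjacent-character swap. The paper's exact attack types are not specified in the excerpt, so the perturbation below is purely illustrative:

```python
import random

def orthographic_attack(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Illustrative orthographic perturbation: randomly swap adjacent
    alphabetic characters at the given rate. Character identity is preserved,
    so a tokenizer-free, pixel-level reader sees only a small visual change."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

print(orthographic_attack("multilingual pixel models", rate=0.3))
```

Naming the specific perturbation operators and their rates in the abstract (or pointing to the section that defines them) would resolve the ambiguity.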
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the clarity and completeness of our empirical claims and data description. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The manuscript asserts 'substantial performance improvement' and 'robustness' on multilingual tasks but supplies no quantitative metrics, baseline models, evaluation details, error bars, or statistical significance tests. Without these, the central empirical claims cannot be assessed for magnitude or reliability.
Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised manuscript, we will update the abstract to report key metrics (e.g., accuracy or perplexity improvements on the multilingual tasks), name the main baseline models, and briefly note the evaluation setup. For §4, the experiments section already includes comparisons against prior pixel-based models and tokenizer-based models on both discriminative and generative tasks, along with results demonstrating robustness to unseen languages and benefits from scaling to 0.5B parameters. However, we acknowledge the value of additional rigor: we will add error bars from multiple random seeds, more explicit evaluation details (datasets, prompts, and metrics), and statistical significance tests for the reported gains. These changes will allow readers to better assess the magnitude and reliability of the improvements. revision: yes
- Referee: [§3 (Data and Training)] No information is provided on training data composition, including per-language or per-script data volumes, balance across the eight languages, or rendering details such as image resolution and font choices. This information is load-bearing for the robustness claims, as dominance by a subset of scripts (e.g., Latin) could confound apparent generalization to unseen languages rather than demonstrating an inherent advantage of the pixel approach.
Authors: We agree that these details are essential for interpreting the robustness results and for ruling out potential confounds from data imbalance. In the revised version, we will substantially expand §3 to include a per-language and per-script breakdown of the training data volumes, the overall balance across the eight languages and scripts, and the rendering specifications (image resolution and font choices used for each script). This added information will clarify the data composition and support that the observed robustness to unseen languages and scripts stems from the pixel-based modeling approach rather than from Latin-script dominance. revision: yes
Circularity Check
No circularity: purely empirical claims without derivations or self-referential reductions
full rationale
The paper presents MIXAR as a trained generative model evaluated on multilingual tasks, with claims of robustness to unseen languages and scaling benefits supported by experimental results. No equations, parameter fittings presented as predictions, uniqueness theorems, or ansatzes appear in the provided text. Central claims rest on benchmark comparisons and training descriptions rather than any step that reduces by construction to its own inputs or prior self-citations. This is a standard empirical ML paper whose results are externally falsifiable via replication on the stated tasks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, and Yulia Tsvetkov. Do all languages cost the same? Tokenization in the era of commercial language models. In Proceedings of EMNLP 2023, pp. 9904–9923, 2023.
- [2] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP 2018, pp. 2475–2485, 2018.
- [3] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for Chinese natural language processing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 657–668, 2020. doi: 10.18653/v1/2020.findings-emnlp.58.
- [4] Falcon Dai and Zheng Cai. Glyph-aware embedding of Chinese characters. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 64–69.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pp. 4171–4186, 2019.
- [6] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436.
- [7] Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, and Antonio Vergari. Fast and expressive multi-token prediction with probabilistic circuits. arXiv preprint arXiv:2511.11346.
- [8] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
- [9] Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, and Desmond Elliott. Multilingual pretraining for pixel language models. In Proceedings of EMNLP 2025, pp. 29582–29599, 2025.
- [10] Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, et al. textless-lib: A library for textless spoken language processing. In Proceedings of NAACL-HLT 2022, 2022.
- [11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- [12] Sander Land and Max Bartolo. Fishing for Magikarp: Automatically detecting under-trained tokens in large language models. In Proceedings of EMNLP 2024, pp. 11631–11646, 2024.
- [13] Patrick Levi and Christoph P. Neumann. Vocabulary attack to hijack large language model applications. arXiv preprint arXiv:2404.02637.
- [14] Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of EMNLP 2021, pp. 10467–10485, 2021.
- [15] Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. Learning character-level compositionality with visual features. arXiv preprint arXiv:1704.04859.
- [16] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- [17] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [18] Jonas Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. In Proceedings of EMNLP 2023, pp. 10155–10172, 2023.
- [19] Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, and Yova Kementchedjhieva. Overcoming vocabulary constraints with pixel-level fallback. arXiv preprint arXiv:2504.02122.
- [20] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
- [21] Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. Language modelling with pixels. arXiv preprint arXiv:2207.06991.
- [22] Elizabeth Salesky, David Etter, and Matt Post. Robust open-vocabulary translation from visual text representations. In Proceedings of EMNLP 2021, pp. 7235–7252, 2021.
- [23] Elizabeth Salesky, Neha Verma, Philipp Koehn, and Matt Post. Multilingual pixel representations for translation and effective cross-lingual transfer. In Proceedings of EMNLP 2023, pp. 13845–13861, 2023.
- [24] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.
- [25] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- [26] Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. Super characters: A conversion from sentiment classification to image classification. arXiv preprint arXiv:1810.07653.
- [27] Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu, and Jiwei Li. ChineseBERT: Chinese pretraining enhanced by glyph and Pinyin information. arXiv preprint arXiv:2106.16038.
- [28] Yintao Tai, Xiyang Liao, Alessandro Suglia, and Antonio Vergari. PIXAR: Auto-regressive language modeling in pixel space. arXiv preprint arXiv:2401.03321.
- [29] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [30] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- [31] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
- [32] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Excerpt from the paper (on patch size): "For this reason, we go beyond what was studied in Lotz et al. (2023) and increase the patch size to 32×32 pixels as an essential step to correctly represent these languages. While this facilitates encoding more complex scripts, it also increases the complexity of training due to the increased image resolution. Moreover, modeling a higher dimensional distribu…"
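The trade-off the excerpt describes can be made concrete with back-of-the-envelope arithmetic: for a fixed rendered line width, moving from 16×16 to 32×32 patches halves the patch-sequence length but quadruples the pixels the model must predict per step. The 4096-pixel width below is illustrative; only the 32×32 patch size comes from the paper:

```python
def patch_stats(width_px: int, patch: int) -> tuple[int, int]:
    """For a single rendered text line of height `patch` and width `width_px`,
    return (sequence length in patches, pixels modeled per autoregressive step)."""
    assert width_px % patch == 0
    return width_px // patch, patch * patch

# The same 4096-pixel-wide rendered line under two patch sizes.
for p in (16, 32):
    seq_len, dims = patch_stats(4096, p)
    print(f"patch {p}x{p}: seq_len={seq_len}, pixels_per_step={dims}")
# patch 16x16: seq_len=256, pixels_per_step=256
# patch 32x32: seq_len=128, pixels_per_step=1024
```

This is why the larger patch both helps (complex glyphs fit in one patch) and hurts (each step models a higher-dimensional distribution), as the excerpt notes.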
- Excerpt from the paper's appendix (GLUE fine-tuning hyperparameters, partially recovered):

| | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | WNLI |
|---|---|---|---|---|---|---|---|---|---|
| 85M PIXAR stage-1 lr | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 6e-5 | 3e-5 | 3e-5 |
| 116M MIXAR stage-1 lr | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 6e-5 | 3e-5 | 6e-5 |
| 477M MIXAR stage-1 lr | 3e-5 | 3e-5 | 3e-5 | 3e-5 | 6e-5 | 3e-5 | 6e-5 | 3e-5 | 6e-5 |
| Weight decay | 0.1 | 0.1 | 0.1 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |

Optimizer: AdamW; warmup schedule: linear; warmup steps: 1000, 1000, 500, … 2000.