Tokenizing Semantic Segmentation with Run Length Encoding

Abhineet Singh; Justin Rozeboom; Nilanjan Ray

arxiv: 2602.21627 · v3 · submitted 2026-02-25 · 💻 cs.CV

Pith reviewed 2026-05-15 19:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords: semantic segmentation · run length encoding · autoregressive modeling · video segmentation · panoptic segmentation · token sequences · Pix2Seq

The pith

Run-length encoding converts semantic segmentation masks into token sequences that autoregressive language models can predict for both images and videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that semantic segmentation masks, normally dense pixel maps, can instead be represented as compact sequences of discrete tokens. Run length encoding discretizes each mask, after which an adapted Pix2Seq autoregressive decoder learns to emit the token sequence directly from the input image or video frame. Novel compression steps shorten these sequences enough to handle video lengths, and instance identifiers are folded into the tokens to support panoptic output as well. The resulting models reach competitive accuracy on two domain-specific datasets despite severe compute limits, showing that language-model-style generation can serve as a unified backbone for dense vision tasks.

Core claim

Segmentation masks are discretized into token sequences via run length encoding and then predicted autoregressively; compression strategies reduce sequence length to make video extension practical, while instance labels are incorporated into the same token stream to produce panoptic results.

What carries the argument

Run length encoding of masks into discrete tokens, combined with compression tokenization to shorten sequences for autoregressive generation.
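As a rough illustration of what run length encoding a mask into tokens can look like, the sketch below flattens a small class mask in row-major order and emits one (start, length, class) triple per non-background run. The triple layout, the row-major order, and the names `rle_encode` and `rle_decode` are assumptions made for this example; the summary above does not specify the paper's actual token format or its compression steps.

```python
from typing import List, Tuple

# Hypothetical token layout: (start index, run length, class id).
Run = Tuple[int, int, int]


def rle_encode(mask: List[List[int]], background: int = 0) -> List[Run]:
    """Encode a 2D class mask as runs over its row-major flattening."""
    flat = [value for row in mask for value in row]
    runs: List[Run] = []
    i = 0
    while i < len(flat):
        j = i
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        if flat[i] != background:  # background runs stay implicit
            runs.append((i, j - i, flat[i]))
        i = j
    return runs


def rle_decode(runs: List[Run], height: int, width: int,
               background: int = 0) -> List[List[int]]:
    """Rebuild the 2D mask from its runs."""
    flat = [background] * (height * width)
    for start, length, cls in runs:
        flat[start:start + length] = [cls] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
]
runs = rle_encode(mask)                      # [(1, 2, 1), (5, 2, 1), (8, 2, 2)]
tokens = [t for run in runs for t in run]    # flat sequence a decoder would emit
restored = rle_decode(runs, height=3, width=4)
```

The encoding is lossless here: decoding the runs reproduces the original mask, and the sequence length grows with the number of runs rather than the number of pixels.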

If this is right

Image and video segmentation share the same autoregressive decoder architecture once masks are tokenized.
Panoptic segmentation is obtained simply by adding instance identifiers to the same token vocabulary.
Video inputs become feasible because the compression steps keep token sequences short enough for practical generation.
The same framework can be trained on domain-specific data and still match specialized models under limited compute.
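One possible way to read the second point, adding instance identifiers to the same token vocabulary, is to pack the class and the instance index into a single integer token. The sketch below is purely illustrative: the packing rule, the cap `MAX_INSTANCES`, and both function names are hypothetical and are not taken from the paper.

```python
from typing import Tuple

MAX_INSTANCES = 8  # hypothetical cap on instances per class


def to_panoptic_token(class_id: int, instance_id: int) -> int:
    """Pack (class, instance) into one vocabulary index."""
    if not 0 <= instance_id < MAX_INSTANCES:
        raise ValueError("instance_id out of range")
    return class_id * MAX_INSTANCES + instance_id


def from_panoptic_token(token: int) -> Tuple[int, int]:
    """Recover (class, instance) from a packed token."""
    return divmod(token, MAX_INSTANCES)


token = to_panoptic_token(class_id=2, instance_id=3)   # 2 * 8 + 3 = 19
class_id, instance_id = from_panoptic_token(token)     # (2, 3)
```

Under this assumed scheme the vocabulary grows by a factor of `MAX_INSTANCES`, which is the cost of keeping the sequence length unchanged.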

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-sequence formulation could be applied to other dense prediction problems such as depth or surface normal estimation by defining analogous encodings.
Larger language-model backbones would likely narrow the remaining accuracy gap on general benchmarks without changing the tokenization pipeline.
Public code release allows direct testing of whether the approach scales to standard video segmentation datasets once more compute is available.

Load-bearing premise

Run length encoding plus the compression strategies retain enough spatial detail for the autoregressive model to reconstruct accurate masks, even across long video sequences.

What would settle it

Direct IoU comparison on a video dataset containing small or thin objects between the tokenized autoregressive outputs and a standard convolutional segmentation baseline, checking whether fine boundary detail is lost in the encoding step.
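To see why thin objects are the sensitive case for such a comparison, the following minimal sketch computes IoU for a one-pixel-wide and a three-pixel-wide structure, each predicted one column off. The masks and the `iou` helper are made up for illustration and do not come from the paper's datasets or code.

```python
from typing import List


def iou(pred: List[List[int]], target: List[List[int]], cls: int = 1) -> float:
    """Intersection over union for one class on two equally sized masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(p == cls and t == cls)
            union += int(p == cls or t == cls)
    return inter / union if union else 1.0


# One-pixel-wide vertical line, prediction shifted right by one column.
thin_target = [[0, 0, 1, 0, 0, 0] for _ in range(4)]
thin_pred = [[0, 0, 0, 1, 0, 0] for _ in range(4)]

# Three-pixel-wide bar with the same one-column shift.
thick_target = [[0, 1, 1, 1, 0, 0] for _ in range(4)]
thick_pred = [[0, 0, 1, 1, 1, 0] for _ in range(4)]

thin_iou = iou(thin_pred, thin_target)     # 0.0
thick_iou = iou(thick_pred, thick_target)  # 0.5
```

The same one-column error removes all overlap for the thin line but only half of it for the thicker bar, which is why a dataset with small or thin objects makes the test informative.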

Original abstract

This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain-specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to foster future competitiveness in general-purpose applications, and facilitate this by making our code and models publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RLE tokenization adapts Pix2Seq for video and panoptic segmentation but evaluations stay preliminary due to compute limits.

The core of this paper is adapting Pix2Seq to output segmentation masks as RLE token sequences instead of direct pixel predictions, with new compression steps to handle video lengths and an instance encoding trick for panoptic output. That combination is the actual new piece: not just RLE on its own, but the specific ways they compress runs and fold in panoptic info so autoregressive modeling stays tractable across images and short videos. Releasing code and models is useful for anyone who wants to test the tokenization directly. The approach is straightforward and the compression ideas address a real practical bottleneck in turning dense masks into sequences.

On the soft side, the results rest on two domain-specific datasets with the authors openly noting severe compute constraints, so claims of competitiveness lack the usual detailed baselines, failure cases, or scaling tests. The RLE flattening of 2D structure does carry the risk the stress-test note flags: short runs around small objects or boundaries can be fragile under prediction noise or compression, and the paper does not yet show how often that happens in practice.

This is for computer vision researchers already working on sequence or autoregressive models for dense tasks who want a concrete tokenization recipe to try. A reader looking for unified image-video frameworks would find the compression strategies worth examining. I would send it for peer review; the tokenization work is concrete enough that referees could usefully pressure the error analysis and generalizability without needing major new experiments upfront.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a unified approach to semantic segmentation for images and videos by discretizing masks via run-length encoding (RLE) and training autoregressive models (adapted from Pix2Seq) to predict the resulting token sequences. Novel compression strategies are introduced to shorten sequences for video extension, instance information is incorporated for panoptic segmentation, and models are evaluated on two domain-specific datasets where they are reported competitive with SOTA despite severe compute limits; code and models are released publicly.

Significance. If the RLE-plus-compression pipeline can be shown to preserve boundary and small-object fidelity at scale, the work would usefully extend language-modeling paradigms to dense prediction tasks and simplify multimodal integration. The public code release supports reproducibility, but the narrow evaluation scope currently limits broader significance.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation section: competitiveness is asserted on two domain-specific datasets without reported detailed baselines, per-class error analysis, or ablation on boundary precision; this leaves the central claim that RLE tokenization plus compression preserves sufficient spatial structure untested for general cases.
[Tokenization strategies] Tokenization and compression strategies: the paper states that proposed compression makes video sequences practicable, yet provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression), which is load-bearing for the video-extension claim.
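The before/after measurement requested in the second comment could be set up along the following lines. In this sketch, nearest-neighbour downsampling by a factor `k` stands in for a lossy compression step; it is NOT one of the paper's tokenization strategies, which are not described here, and every function name is hypothetical. The point is only the shape of the experiment: count runs before and after, then score the restored mask against the original.

```python
from typing import List

Mask = List[List[int]]


def downsample(mask: Mask, k: int) -> Mask:
    """Keep every k-th row and column (stand-in for a lossy compression)."""
    return [row[::k] for row in mask[::k]]


def upsample(mask: Mask, k: int) -> Mask:
    """Repeat each pixel k times along both axes."""
    return [[v for v in row for _ in range(k)] for row in mask for _ in range(k)]


def count_runs(mask: Mask) -> int:
    """Number of non-background runs in the row-major flattening."""
    flat = [v for row in mask for v in row]
    return sum(1 for i, v in enumerate(flat)
               if v != 0 and (i == 0 or flat[i - 1] != v))


def iou(pred: Mask, target: Mask, cls: int = 1) -> float:
    """Intersection over union for one class."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(p == cls and t == cls)
            union += int(p == cls or t == cls)
    return inter / union if union else 1.0


mask = [
    [0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
]
small = downsample(mask, 2)
restored = upsample(small, 2)

runs_before = count_runs(mask)    # runs the decoder must emit without compression
runs_after = count_runs(small)    # runs after the stand-in compression
fidelity = iou(restored, mask)    # IoU of the restored mask against the original
```

Reporting `runs_before`, `runs_after`, and `fidelity` side by side for each compression step would give the sequence-length saving and the information loss in one table; a boundary F-score could be added in the same way.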

minor comments (1)

[Abstract] The abstract refers to 'limited computational resources' without specifying model sizes, sequence lengths, or hardware used, which would help readers assess the reported competitiveness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract and Evaluation] Abstract and Evaluation section: competitiveness is asserted on two domain-specific datasets without reported detailed baselines, per-class error analysis, or ablation on boundary precision; this leaves the central claim that RLE tokenization plus compression preserves sufficient spatial structure untested for general cases.

Authors: The manuscript explicitly frames its evaluation on two domain-specific datasets due to severe compute limits, claiming competitiveness only in those scenarios while proposing directions for general-purpose use. We did not assert broad general-case performance. To strengthen the presentation, we will expand the evaluation section with more detailed baselines and per-class error analysis on the reported datasets; we will also add a boundary-precision ablation computed via the public code release.

revision: partial

Referee: [Tokenization strategies] Tokenization and compression strategies: the paper states that proposed compression makes video sequences practicable, yet provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression), which is load-bearing for the video-extension claim.

Authors: We agree that explicit quantification of information loss is important for the video claim. The original manuscript emphasized sequence-length reduction and end-task performance. In revision we will report the delta in mask IoU and boundary F-score before versus after each compression step to directly measure fidelity impact.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tokenization method evaluated on external data

The paper proposes an empirical pipeline: RLE discretization of masks, compression tokenization strategies, adaptation of the Pix2Seq autoregressive framework, and instance-aware extensions for panoptic segmentation. All central claims are supported by training and evaluation on independent external datasets under stated compute constraints. No equations, predictions, or self-citations reduce any result to its own inputs by construction. The method is falsifiable via pixel-level metrics on held-out data, with no self-definitional loops, fitted-input predictions, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that RLE sequences retain enough spatial information for autoregressive prediction to succeed, plus free parameters in the novel compression strategies.

free parameters (1)

RLE compression parameters
Parameters controlling tokenization strategies to shorten sequences for video feasibility, chosen to make the approach practicable.

axioms (1)

Domain assumption: autoregressive language models can capture the spatial structure of RLE-tokenized segmentation masks.
Invoked when adapting the Pix2Seq framework to output masks as token sequences.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences..."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] D. Acuna, H. Ling, A. Kar, and S. Fidler, Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++, CVPR (2018)
[2] L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler, Annotating Object Instances with a Polygon-RNN, CVPR (2017), 4485–4493
[3] T. Chen, Pix2Seq Codebase: Multi-tasks with generative modeling, online: https://github.com/google-research/pix2seq
[4] T. Chen, S. Saxena, L. Li, T.-Y. Lin, D. J. Fleet, and G. Hinton, A Unified Sequence Interface for Vision Tasks (2022)
[5] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton, Pix2seq: A Language Modeling Framework for Object Detection (2022)
[6] T. Chen, L. Li, S. Saxena, G. E. Hinton, and D. J. Fleet, A Generalist Framework for Panoptic Segmentation of Images and Videos, ICCV (2023)
[7] K. Chitta, J. M. Álvarez, and M. Hebert, Quadtree Generating Networks: Efficient Hierarchical Scene Parsing with Sparse Convolutions, WACV (2020), 2009–2018
[8] M. Cordts et al., The cityscapes dataset for semantic urban scene understanding (2016), 3213–3223
[9] L. R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology 26 (1945), no. 3, 297–302
[10] J. Lazarow, W. Xu, and Z. Tu, Instance segmentation with mask-supervised polygonal boundary transformers (2022)
[11] J. Liang, N. Homayounfar, W.-C. Ma, Y. Xiong, R. Hu, and R. Urtasun, PolyTransform: Deep Polygon Transformer for Instance Segmentation (2020)
[12] T.-Y. Lin et al., Microsoft COCO: Common Objects in Context (2014), 740–755
[13] Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV (2021), 9992–10002
[14] A. Robinson and C. Cherry, Results of a prototype television bandwidth compression scheme, Proceedings of the IEEE 55 (1967), no. 3, 356–364
[15] A. Singh, P2S-Video: Extension of Pix2Seq for Video Detection and Segmentation, online: https://github.com/abhineet123/p2s-video
[16] A. Singh, Video Detection and Segmentation with Language Modeling, online: https://webdocs.cs.ualberta.ca/~asingh1/p2s/
[17] A. Singh, Object detection and segmentation with deep learning: From fixed to variable-length representations, Ph.D. thesis, University of Alberta, Edmonton, Canada, 2025
[18] A. Singh and N. Ray, Improving token-based object detection with video, IEEE Access (2025), 1–1
[19] A. Singh, H. Kalke, M. R. Loewen, and N. Ray, River Ice Segmentation With Deep Learning, IEEE Transactions on Geoscience and Remote Sensing 58 (2020), 7570–7579
[20] A. Singh, I. Jasra, O. Mouhammed, N. Dadheech, N. Ray, and J. Shapiro, Towards Early Prediction of Human iPSC Reprogramming Success, Machine Learning for Biomedical Imaging 2 (2023), 390–407
[21] P. Soille, Morphological Image Analysis: Principles and Applications, Springer-Verlag, 2003
[22] T. Sørensen, A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons (1948)
