Tokenizing Semantic Segmentation with Run Length Encoding
Pith reviewed 2026-05-15 19:56 UTC · model grok-4.3
The pith
Run-length encoding converts semantic segmentation masks into token sequences that autoregressive language models can predict for both images and videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Segmentation masks are discretized into token sequences via run length encoding and then predicted autoregressively; compression strategies reduce sequence length to make video extension practical, while instance labels are incorporated into the same token stream to produce panoptic results.
What carries the argument
Run length encoding of masks into discrete tokens, combined with compression tokenization to shorten sequences for autoregressive generation.
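To make the mechanism concrete, here is a minimal sketch of run length encoding a small class mask into a flat token list. It assumes row-major flattening, one (start, length, class) triple per run, and a skipped background class; these choices and the names `rle_encode`, `rle_decode`, and `to_tokens` are illustrative assumptions, not the paper's tokenizer.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (start index, run length, class id)


def rle_encode(mask: List[List[int]], background: int = 0) -> List[Run]:
    """Flatten the mask row by row and emit one run per stretch of equal labels."""
    flat = [v for row in mask for v in row]
    runs: List[Run] = []
    i = 0
    while i < len(flat):
        j = i
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        if flat[i] != background:  # background runs are implied, so skip them
            runs.append((i, j - i, flat[i]))
        i = j
    return runs


def rle_decode(runs: List[Run], height: int, width: int,
               background: int = 0) -> List[List[int]]:
    """Rebuild the 2-D mask from the runs."""
    flat = [background] * (height * width)
    for start, length, cls in runs:
        flat[start:start + length] = [cls] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def to_tokens(runs: List[Run]) -> List[int]:
    """Concatenate the triples into one flat integer sequence."""
    return [value for run in runs for value in run]


mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
]
runs = rle_encode(mask)
tokens = to_tokens(runs)
print(tokens)  # [1, 2, 1, 5, 2, 1, 8, 2, 2]
```

In this toy layout the sequence length grows with the number of runs rather than the number of pixels, which is why shortening or merging runs matters for long video sequences.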
If this is right
- Image and video segmentation share the same autoregressive decoder architecture once masks are tokenized.
- Panoptic segmentation is obtained simply by adding instance identifiers to the same token vocabulary.
- Video inputs become feasible because the compression steps keep token sequences short enough for practical generation.
- The same framework can be trained on domain-specific data and still match specialized models under limited compute.
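The second point, adding instance identifiers to the same vocabulary, can be illustrated with a hypothetical vocabulary layout in which start, length, class, and instance values each occupy their own contiguous range of token ids. The sizes, offsets, and function names below are assumptions chosen for the example; the paper's actual vocabulary layout is not given here.

```python
from typing import List, Tuple

# Hypothetical sizes for a 3x4 mask with 3 classes and up to 4 instances.
N_POS = 12
N_CLASSES = 3
MAX_INSTANCES = 4

# Each field gets its own contiguous slice of one shared vocabulary.
POS_OFFSET = 0
LEN_OFFSET = POS_OFFSET + N_POS          # lengths 0..N_POS need N_POS + 1 ids
CLS_OFFSET = LEN_OFFSET + N_POS + 1
INS_OFFSET = CLS_OFFSET + N_CLASSES
VOCAB_SIZE = INS_OFFSET + MAX_INSTANCES


def encode_run(start: int, length: int, cls: int, inst: int) -> List[int]:
    """Map one panoptic run to four token ids in the shared vocabulary."""
    return [POS_OFFSET + start, LEN_OFFSET + length,
            CLS_OFFSET + cls, INS_OFFSET + inst]


def decode_run(tokens: List[int]) -> Tuple[int, int, int, int]:
    """Invert encode_run by subtracting each field's offset."""
    start_t, len_t, cls_t, ins_t = tokens
    return (start_t - POS_OFFSET, len_t - LEN_OFFSET,
            cls_t - CLS_OFFSET, ins_t - INS_OFFSET)


run_tokens = encode_run(start=5, length=2, cls=1, inst=3)
print(run_tokens, VOCAB_SIZE)  # [5, 14, 26, 31] 32
```

Dropping the fourth token from each run gives back a purely semantic sequence, so the decoder architecture stays the same and only the vocabulary grows.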
Where Pith is reading between the lines
- The token-sequence formulation could be applied to other dense prediction problems such as depth or surface normal estimation by defining analogous encodings.
- Larger language-model backbones would likely narrow the remaining accuracy gap on general benchmarks without changing the tokenization pipeline.
- Public code release allows direct testing of whether the approach scales to standard video segmentation datasets once more compute is available.
Load-bearing premise
Run length encoding plus the compression strategies retain enough spatial detail for the autoregressive model to reconstruct accurate masks, even across long video sequences.
What would settle it
Direct IoU comparison on a video dataset containing small or thin objects between the tokenized autoregressive outputs and a standard convolutional segmentation baseline, checking whether fine boundary detail is lost in the encoding step.
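The comparison described above rests on per-class intersection over union. A minimal sketch of that metric on small label grids follows; the function names and the toy masks are assumptions for illustration, and a real evaluation would run over a full dataset.

```python
import math
from typing import List, Sequence

Mask = List[List[int]]


def class_iou(pred: Mask, target: Mask, cls: int) -> float:
    """IoU for one class; NaN when the class is absent from both masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = (p == cls), (t == cls)
            inter += int(in_pred and in_target)
            union += int(in_pred or in_target)
    return inter / union if union else float("nan")


def mean_iou(pred: Mask, target: Mask, classes: Sequence[int]) -> float:
    """Average the per-class IoU over classes that appear in either mask."""
    scores = [class_iou(pred, target, c) for c in classes]
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else float("nan")


target = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0]]   # one foreground pixel missed

print(class_iou(pred, target, 1))  # 0.75
```

A single missed pixel costs a quarter of the score on a four-pixel object, which is why small and thin objects are the sensitive case for any lossy encoding.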
Original abstract
This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain-specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to foster future competitiveness in general-purpose applications, and facilitate this by making our code and models publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a unified approach to semantic segmentation for images and videos by discretizing masks via run-length encoding (RLE) and training autoregressive models (adapted from Pix2Seq) to predict the resulting token sequences. Novel compression strategies are introduced to shorten sequences for video extension, instance information is incorporated for panoptic segmentation, and models are evaluated on two domain-specific datasets where they are reported competitive with SOTA despite severe compute limits; code and models are released publicly.
Significance. If the RLE-plus-compression pipeline can be shown to preserve boundary and small-object fidelity at scale, the work would usefully extend language-modeling paradigms to dense prediction tasks and simplify multimodal integration. The public code release supports reproducibility, but the narrow evaluation scope currently limits broader significance.
major comments (2)
- [Abstract and Evaluation] Competitiveness is asserted on two domain-specific datasets without detailed baselines, per-class error analysis, or an ablation on boundary precision. This leaves the central claim, that RLE tokenization plus compression preserves sufficient spatial structure, untested for general cases.
- [Tokenization strategies] The paper states that the proposed compression makes video sequences practicable, yet it provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression). That measure is load-bearing for the video-extension claim.
minor comments (1)
- [Abstract] The abstract refers to 'limited computational resources' without specifying model sizes, sequence lengths, or hardware used, which would help readers assess the reported competitiveness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract and Evaluation] Competitiveness is asserted on two domain-specific datasets without detailed baselines, per-class error analysis, or an ablation on boundary precision. This leaves the central claim, that RLE tokenization plus compression preserves sufficient spatial structure, untested for general cases.
  Authors: The manuscript explicitly frames its evaluation on two domain-specific datasets due to severe compute limits, claiming competitiveness only in those scenarios while proposing directions for general-purpose use. We did not assert broad general-case performance. To strengthen the presentation, we will expand the evaluation section with more detailed baselines and per-class error analysis on the reported datasets; we will also add a boundary-precision ablation computed via the public code release.
  Revision: partial
- Referee: [Tokenization strategies] The paper states that the proposed compression makes video sequences practicable, yet it provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression). That measure is load-bearing for the video-extension claim.
  Authors: We agree that explicit quantification of information loss is important for the video claim. The original manuscript emphasized sequence-length reduction and end-task performance. In revision we will report the delta in mask IoU and boundary F-score before versus after each compression step to directly measure fidelity impact.
  Revision: yes
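The before-versus-after measurement promised in this response can be sketched as a round trip: encode a mask, apply a compression step, decode, and compare against the original. The example below uses a deliberately simple stand-in for a lossy step (dropping runs shorter than `min_len`), which is an assumption for illustration and not one of the paper's compression strategies; only IoU is computed, not boundary F-score.

```python
from typing import Dict, List, Tuple

Run = Tuple[int, int]  # (start index, run length) of a foreground stretch


def encode(flat: List[int]) -> List[Run]:
    """Run length encode the foreground of a flattened binary mask."""
    runs: List[Run] = []
    i = 0
    while i < len(flat):
        if flat[i]:
            j = i
            while j < len(flat) and flat[j]:
                j += 1
            runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def decode(runs: List[Run], size: int) -> List[int]:
    """Rebuild the flattened binary mask from its runs."""
    flat = [0] * size
    for start, length in runs:
        flat[start:start + length] = [1] * length
    return flat


def iou(a: List[int], b: List[int]) -> float:
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def drop_short_runs(runs: List[Run], min_len: int) -> List[Run]:
    """Hypothetical lossy step: discard runs shorter than min_len."""
    return [(s, n) for s, n in runs if n >= min_len]


def fidelity_report(flat: List[int], min_len: int) -> Dict[str, float]:
    """Token counts and IoU against the original, before and after the lossy step."""
    runs = encode(flat)
    kept = drop_short_runs(runs, min_len)
    return {
        "tokens_before": 2 * len(runs),
        "tokens_after": 2 * len(kept),
        "iou_before": iou(flat, decode(runs, len(flat))),
        "iou_after": iou(flat, decode(kept, len(flat))),
    }


flat_mask = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
report = fidelity_report(flat_mask, min_len=2)
print(report)
```

On this toy mask the step removes two of six tokens and lowers IoU from 1.0 to 0.875; reporting both numbers side by side for each real compression step is the kind of table the referee is asking for.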
Circularity Check
No circularity: empirical tokenization method evaluated on external data
Full rationale
The paper proposes an empirical pipeline: RLE discretization of masks, compression tokenization strategies, adaptation of the Pix2Seq autoregressive framework, and instance-aware extensions for panoptic segmentation. All central claims are supported by training and evaluation on independent external datasets under stated compute constraints. No equations, predictions, or self-citations reduce any result to its own inputs by construction. The method is falsifiable via pixel-level metrics on held-out data, with no self-definitional loops, fitted-input predictions, or load-bearing self-citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- RLE compression parameters
axioms (1)
- Domain assumption: autoregressive language models can capture the spatial structure of RLE-tokenized segmentation masks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences..."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [2] L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler, Annotating Object Instances with a Polygon-RNN, CVPR (2017), 4485–4493
- [3] T. Chen, Pix2Seq Codebase: Multi-tasks with generative modeling, online: https://github.com/google-research/pix2seq
- [4] T. Chen, S. Saxena, L. Li, T.-Y. Lin, D. J. Fleet, and G. Hinton, A Unified Sequence Interface for Vision Tasks (2022)
- [5] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton, Pix2seq: A Language Modeling Framework for Object Detection (2022)
- [6] T. Chen, L. Li, S. Saxena, G. E. Hinton, and D. J. Fleet, A Generalist Framework for Panoptic Segmentation of Images and Videos, ICCV (2023)
- [8] M. Cordts et al., The cityscapes dataset for semantic urban scene understanding (2016), 3213–3223
- [9] L. R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology 26 (1945), no. 3, 297–302
- [10] J. Lazarow, W. Xu, and Z. Tu, Instance segmentation with mask-supervised polygonal boundary transformers (2022)
- [12] T.-Y. Lin et al., Microsoft COCO: Common Objects in Context (2014), 740–755
- [13] Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV (2021), 9992–10002
- [14] A. Robinson and C. Cherry, Results of a prototype television bandwidth compression scheme, Proceedings of the IEEE 55 (1967), no. 3, 356–364
- [15] A. Singh, P2S-Video: Extension of Pix2Seq for Video Detection and Segmentation, online: https://github.com/abhineet123/p2s-video
- [16] A. Singh, Video Detection and Segmentation with Language Modeling, online: https://webdocs.cs.ualberta.ca/~asingh1/p2s/
- [17] A. Singh, Object detection and segmentation with deep learning: From fixed to variable-length representations, Ph.D. thesis, University of Alberta, Edmonton, Canada, 2025
- [18] A. Singh and N. Ray, Improving token-based object detection with video, IEEE Access (2025), 1–1
- [21] P. Soille, Morphological Image Analysis: Principles and Applications, Springer-Verlag, 2003
- [22] T. Sørensen, T. Sørensen, T. Biering-Sørensen, T. Sørensen, and J. T. Sorensen, A method of establishing group of equal amplitude in plant sociobiology based on similarity of species content and its application to analyses of the vegetation on Danish commons (1948)
How to cite this article: Singh A., Rozeboom J., Ray N. Tokenizing Semantic Segmentatio...