arXiv: 2602.21627 · v3 · submitted 2026-02-25 · 💻 cs.CV

Tokenizing Semantic Segmentation with Run Length Encoding

Pith reviewed 2026-05-15 19:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentation · run length encoding · autoregressive modeling · video segmentation · panoptic segmentation · token sequences · Pix2Seq

The pith

Run-length encoding converts semantic segmentation masks into token sequences that autoregressive language models can predict for both images and videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that semantic segmentation masks, normally dense pixel maps, can instead be represented as compact sequences of discrete tokens. Run length encoding discretizes each mask, after which an adapted Pix2Seq autoregressive decoder learns to emit the token sequence directly from the input image or video frame. Novel compression steps shorten these sequences enough to handle video lengths, and instance identifiers are folded into the tokens to support panoptic output as well. The resulting models reach competitive accuracy on two domain-specific datasets despite severe compute limits, showing that language-model-style generation can serve as a unified backbone for dense vision tasks.

Core claim

Segmentation masks are discretized into token sequences via run length encoding and then predicted autoregressively; compression strategies reduce sequence length to make video extension practical, while instance labels are incorporated into the same token stream to produce panoptic results.
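
As a rough illustration of the discretization step, the sketch below run-length encodes a small class mask over its row-major flattened pixels and decodes it again. The `(start, length, class)` run layout, the implicit background, and the function names `rle_encode` and `rle_decode` are assumptions made for this example; the paper's actual token format may differ.

```python
from typing import List, Tuple

# One run: (start index in row-major order, run length, class id).
Run = Tuple[int, int, int]

def rle_encode(mask: List[List[int]], background: int = 0) -> List[Run]:
    """Encode a 2D class mask as runs over its row-major flattened pixels."""
    flat = [c for row in mask for c in row]
    runs: List[Run] = []
    i = 0
    while i < len(flat):
        j = i
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        if flat[i] != background:  # background stays implicit and costs no tokens
            runs.append((i, j - i, flat[i]))
        i = j
    return runs

def rle_decode(runs: List[Run], height: int, width: int,
               background: int = 0) -> List[List[int]]:
    """Rebuild the 2D mask from its runs."""
    flat = [background] * (height * width)
    for start, length, cls in runs:
        flat[start:start + length] = [cls] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]

mask = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 2, 2, 2],
]
runs = rle_encode(mask)
print(runs)  # [(2, 4, 1), (9, 3, 2)]
```

Note that the first run wraps across the row boundary, which is a consequence of encoding the flattened mask. Plain RLE of this kind is lossless, so any loss of detail would have to come from later compression or from prediction errors.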

What carries the argument

Run length encoding of masks into discrete tokens, combined with compression tokenization to shorten sequences for autoregressive generation.

If this is right

  • Image and video segmentation share the same autoregressive decoder architecture once masks are tokenized.
  • Panoptic segmentation is obtained simply by adding instance identifiers to the same token vocabulary.
  • Video inputs become feasible because the compression steps keep token sequences short enough for practical generation.
  • The same framework can be trained on domain-specific data and still match specialized models under limited compute.
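
One way to picture the panoptic extension in the second point is to fold an instance index into the class token. This is a minimal sketch under assumed conventions: the cap `MAX_INSTANCES`, the `class_id * MAX_INSTANCES + instance_id` layout, and both function names are hypothetical and not taken from the paper.

```python
from typing import Tuple

MAX_INSTANCES = 8  # hypothetical cap on instances per class

def panoptic_token(class_id: int, instance_id: int) -> int:
    """Pack a class id and an instance id into a single label token."""
    if not 0 <= instance_id < MAX_INSTANCES:
        raise ValueError("instance_id out of range")
    return class_id * MAX_INSTANCES + instance_id

def split_panoptic_token(token: int) -> Tuple[int, int]:
    """Recover (class_id, instance_id) from a packed label token."""
    return divmod(token, MAX_INSTANCES)

# Two runs of the same class (1) that belong to different instances (0 and 1).
runs = [(2, 4, 1, 0), (9, 3, 1, 1)]  # (start, length, class, instance)
tokens = [(s, n, panoptic_token(c, k)) for s, n, c, k in runs]
print(tokens)  # [(2, 4, 8), (9, 3, 9)]
```

Under this layout the label vocabulary grows by a factor of `MAX_INSTANCES`, while the sequence length stays the same.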

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-sequence formulation could be applied to other dense prediction problems such as depth or surface normal estimation by defining analogous encodings.
  • Larger language-model backbones would likely narrow the remaining accuracy gap on general benchmarks without changing the tokenization pipeline.
  • Public code release allows direct testing of whether the approach scales to standard video segmentation datasets once more compute is available.

Load-bearing premise

Run length encoding plus the compression strategies retain enough spatial detail for the autoregressive model to reconstruct accurate masks, even across long video sequences.

What would settle it

A direct IoU comparison between the tokenized autoregressive outputs and a standard convolutional segmentation baseline, run on a video dataset containing small or thin objects, would show whether fine boundary detail is lost in the encoding step.
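
For reference, per-class IoU on a thin structure can be computed as in the sketch below. The toy masks and the function name `class_iou` are illustrative assumptions; they are not data or code from the paper.

```python
from typing import List, Optional

def class_iou(pred: List[List[int]], target: List[List[int]],
              cls: int) -> Optional[float]:
    """Intersection over union for one class; None if the class is absent from both."""
    inter = 0
    union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            in_pred, in_target = (p == cls), (t == cls)
            if in_pred and in_target:
                inter += 1
            if in_pred or in_target:
                union += 1
    return inter / union if union else None

# A one-pixel-wide vertical line, with a single pixel missed by the prediction.
target = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
pred = [[0, 1, 0],
        [0, 0, 0],
        [0, 1, 0]]
iou = class_iou(pred, target, cls=1)  # 2 shared pixels out of 3 in the union
```

The example shows why thin objects are the informative case: one missed pixel already removes a third of the score.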

Original abstract

This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain-specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to foster future competitiveness in general-purpose applications, and facilitate this by making our code and models publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a unified approach to semantic segmentation for images and videos by discretizing masks via run-length encoding (RLE) and training autoregressive models (adapted from Pix2Seq) to predict the resulting token sequences. Novel compression strategies are introduced to shorten sequences for video extension, instance information is incorporated for panoptic segmentation, and models are evaluated on two domain-specific datasets where they are reported competitive with SOTA despite severe compute limits; code and models are released publicly.

Significance. If the RLE-plus-compression pipeline can be shown to preserve boundary and small-object fidelity at scale, the work would usefully extend language-modeling paradigms to dense prediction tasks and simplify multimodal integration. The public code release supports reproducibility, but the narrow evaluation scope currently limits broader significance.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: competitiveness is asserted on two domain-specific datasets without reported detailed baselines, per-class error analysis, or ablation on boundary precision; this leaves the central claim that RLE tokenization plus compression preserves sufficient spatial structure untested for general cases.
  2. [Tokenization strategies] Tokenization and compression strategies: the paper states that proposed compression makes video sequences practicable, yet provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression), which is load-bearing for the video-extension claim.
minor comments (1)
  1. [Abstract] The abstract refers to 'limited computational resources' without specifying model sizes, sequence lengths, or hardware used, which would help readers assess the reported competitiveness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: competitiveness is asserted on two domain-specific datasets without reported detailed baselines, per-class error analysis, or ablation on boundary precision; this leaves the central claim that RLE tokenization plus compression preserves sufficient spatial structure untested for general cases.

    Authors: The manuscript explicitly frames its evaluation on two domain-specific datasets due to severe compute limits, claiming competitiveness only in those scenarios while proposing directions for general-purpose use. We did not assert broad general-case performance. To strengthen the presentation, we will expand the evaluation section with more detailed baselines and per-class error analysis on the reported datasets; we will also add a boundary-precision ablation computed via the public code release. revision: partial

  2. Referee: [Tokenization strategies] Tokenization and compression strategies: the paper states that proposed compression makes video sequences practicable, yet provides no quantitative measure of information loss (e.g., change in mask IoU or boundary F-score before/after compression), which is load-bearing for the video-extension claim.

    Authors: We agree that explicit quantification of information loss is important for the video claim. The original manuscript emphasized sequence-length reduction and end-task performance. In revision we will report the delta in mask IoU and boundary F-score before versus after each compression step to directly measure fidelity impact. revision: yes
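
The fidelity measurement discussed in the second response could look roughly like the following. Two simplifications are assumed here: the lossy step is stood in for by plain subsampling followed by nearest-neighbour upsampling (the paper's actual compression strategies are not described on this page), and the boundary F-score uses exact pixel matching, whereas standard variants usually allow a small distance tolerance. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]

def subsample_roundtrip(mask: Mask, factor: int) -> Mask:
    """Stand-in lossy step: keep every factor-th pixel, then upsample by repetition."""
    return [[mask[(r // factor) * factor][(c // factor) * factor]
             for c in range(len(mask[0]))]
            for r in range(len(mask))]

def boundary(mask: Mask, cls: int) -> Set[Tuple[int, int]]:
    """Pixels of `cls` with a 4-neighbour that is outside the image or another class."""
    h, w = len(mask), len(mask[0])
    points = set()
    for r in range(h):
        for c in range(w):
            if mask[r][c] != cls:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or mask[rr][cc] != cls:
                    points.add((r, c))
                    break
    return points

def boundary_f_score(pred: Mask, target: Mask, cls: int) -> float:
    """Zero-tolerance boundary F-score: harmonic mean of boundary precision and recall."""
    bp, bt = boundary(pred, cls), boundary(target, cls)
    if not bp or not bt:
        return 0.0
    tp = len(bp & bt)
    precision, recall = tp / len(bp), tp / len(bt)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

original = [[0, 0, 0, 0],
            [0, 1, 1, 1],
            [0, 1, 1, 1],
            [0, 1, 1, 1]]
compressed = subsample_roundtrip(original, factor=2)
f_before = boundary_f_score(original, original, cls=1)    # 1.0 by construction
f_after = boundary_f_score(compressed, original, cls=1)   # 0.5 for this toy mask
delta = f_before - f_after
```

Reporting `delta` for each compression step, alongside the matching change in mask IoU, would give the per-step fidelity figures the referee asks for.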

Circularity Check

0 steps flagged

No circularity: empirical tokenization method evaluated on external data

Full rationale

The paper proposes an empirical pipeline: RLE discretization of masks, compression tokenization strategies, adaptation of the Pix2Seq autoregressive framework, and instance-aware extensions for panoptic segmentation. All central claims are supported by training and evaluation on independent external datasets under stated compute constraints. No equations, predictions, or self-citations reduce any result to its own inputs by construction. The method is falsifiable via pixel-level metrics on held-out data, with no self-definitional loops, fitted-input predictions, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that RLE sequences retain enough spatial information for autoregressive prediction to succeed, plus free parameters in the novel compression strategies.

free parameters (1)
  • RLE compression parameters
    Parameters controlling tokenization strategies to shorten sequences for video feasibility, chosen to make the approach practicable.
axioms (1)
  • domain assumption: Autoregressive language models can capture the spatial structure of RLE-tokenized segmentation masks
    Invoked when adapting the Pix2Seq framework to output masks as token sequences.
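
To make the free parameter concrete, the sketch below counts how many tokens a mask needs at different settings of a single hypothetical compression knob. Subsampling by `factor` is only a stand-in for the paper's tokenization strategies, and the cost of three tokens per run (start, length, class) is likewise an assumption.

```python
from typing import List

Mask = List[List[int]]
TOKENS_PER_RUN = 3  # assumed layout: start, length, class

def count_runs(mask: Mask, background: int = 0) -> int:
    """Number of non-background runs in the row-major flattened mask."""
    runs = 0
    prev = background
    for row in mask:
        for c in row:
            if c != background and c != prev:
                runs += 1  # a new run starts whenever the label changes
            prev = c
    return runs

def subsample(mask: Mask, factor: int) -> Mask:
    """Hypothetical compression knob: keep every factor-th row and column."""
    return [row[::factor] for row in mask[::factor]]

def token_count(mask: Mask) -> int:
    return TOKENS_PER_RUN * count_runs(mask)

mask = [[0, 0, 0, 0],
        [0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1]]
full_tokens = token_count(mask)                 # 3 runs -> 9 tokens
coarse_tokens = token_count(subsample(mask, 2)) # 1 run  -> 3 tokens
```

A larger `factor` shortens the sequence, which is what makes video inputs tractable, but each setting has to be checked against a fidelity metric before the shorter sequence can be called equivalent.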

pith-pipeline@v0.9.0 · 5445 in / 1158 out tokens · 30848 ms · 2026-05-15T19:56:36.910962+00:00 · methodology



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. D. Acuna, H. Ling, A. Kar, and S. Fidler, Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++, CVPR (2018)
  2. L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler, Annotating Object Instances with a Polygon-RNN, CVPR (2017), 4485–4493
  3. T. Chen, Pix2Seq Codebase: Multi-tasks with generative modeling, online: https://github.com/google-research/pix2seq
  4. T. Chen, S. Saxena, L. Li, T.-Y. Lin, D. J. Fleet, and G. Hinton, A Unified Sequence Interface for Vision Tasks (2022)
  5. T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton, Pix2seq: A Language Modeling Framework for Object Detection (2022)
  6. T. Chen, L. Li, S. Saxena, G. E. Hinton, and D. J. Fleet, A Generalist Framework for Panoptic Segmentation of Images and Videos, ICCV (2023)
  7. K. Chitta, J. M. Álvarez, and M. Hebert, Quadtree Generating Networks: Efficient Hierarchical Scene Parsing with Sparse Convolutions, WACV (2020), 2009–2018
  8. M. Cordts et al., The cityscapes dataset for semantic urban scene understanding (2016), 3213–3223
  9. L. R. Dice, Measures of the Amount of Ecologic Association Between Species, Ecology 26 (1945), no. 3, 297–302
  10. J. Lazarow, W. Xu, and Z. Tu, Instance segmentation with mask-supervised polygonal boundary transformers (2022)
  11. J. Liang, N. Homayounfar, W.-C. Ma, Y. Xiong, R. Hu, and R. Urtasun, PolyTransform: Deep Polygon Transformer for Instance Segmentation (2020)
  12. T.-Y. Lin et al., Microsoft COCO: Common Objects in Context (2014), 740–755
  13. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV (2021), 9992–10002
  14. A. Robinson and C. Cherry, Results of a prototype television bandwidth compression scheme, Proceedings of the IEEE 55 (1967), no. 3, 356–364
  15. A. Singh, P2S-Video: Extension of Pix2Seq for Video Detection and Segmentation, online: https://github.com/abhineet123/p2s-video
  16. A. Singh, Video Detection and Segmentation with Language Modeling, online: https://webdocs.cs.ualberta.ca/~asingh1/p2s/
  17. A. Singh, Object detection and segmentation with deep learning: From fixed to variable-length representations, Ph.D. thesis, University of Alberta, Edmonton, Canada, 2025
  18. A. Singh and N. Ray, Improving token-based object detection with video, IEEE Access (2025), 1–1
  19. A. Singh, H. Kalke, M. R. Loewen, and N. Ray, River Ice Segmentation With Deep Learning, IEEE Transactions on Geoscience and Remote Sensing 58 (2020), 7570–7579
  20. A. Singh, I. Jasra, O. Mouhammed, N. Dadheech, N. Ray, and J. Shapiro, Towards Early Prediction of Human iPSC Reprogramming Success, Machine Learning for Biomedical Imaging 2 (2023), 390–407
  21. P. Soille, Morphological Image Analysis: Principles and Applications, Springer-Verlag, 2003
  22. T. Sørensen, T. Sørensen, T. Biering-Sørensen, T. Sørensen, and J. T. Sorensen, A method of establishing group of equal amplitude in plant sociobiology based on similarity of species content and its application to analyses of the vegetation on Danish commons (1948)

How to cite this article: Singh A., Rozeboom J., Ray N. Tokenizing Semantic Segmentatio...