pith. machine review for the scientific record.

arxiv: 2604.16070 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

TableSeq: Unified Generation of Structure, Content, and Layout

Amine Tamasna, Laziz Hamdi, Pascal Boisson, Thierry Paquet

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords tableseq · cell · competitive · content · recognition · structure · table · unified

The pith

TableSeq unifies table structure recognition, content extraction, and cell localization by generating an interleaved autoregressive sequence of HTML tags, cell text, and discretized coordinate tokens from an input image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The system takes a table image and feeds it through a simple encoder made of a high-resolution network and a small transformer. A decoder then predicts tokens one after another: HTML tags that describe rows and columns, the words inside each cell, and rounded numbers that say where each cell sits on the page. Because everything comes out in one sequence, the model learns to keep the structure, text, and positions consistent without extra post-processing steps or separate text readers. The authors test this on standard table datasets and report high scores on structure and content accuracy metrics.
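To make the single-sequence idea concrete, here is a minimal sketch (not the authors' code) of how such an interleaved target might be assembled: HTML structure tags, discretized coordinate tokens, and cell text share one token stream. The bin count, tag set, and per-cell token layout are assumptions for illustration.

```python
# Illustrative sketch of a TableSeq-style interleaved target sequence.
# Bin count, tag vocabulary, and cell format are assumptions, not the
# paper's exact tokenization.

N_BINS = 1000  # hypothetical size of the coordinate vocabulary


def quantize(coord: float, extent: float, n_bins: int = N_BINS) -> int:
    """Map a pixel coordinate to a discrete bin index in [0, n_bins)."""
    return min(n_bins - 1, int(coord * n_bins / extent))


def build_target(cells, img_w, img_h):
    """cells: list of dicts with 'text' and 'bbox' = (x1, y1, x2, y2)."""
    tokens = ["<table>", "<tr>"]
    for cell in cells:
        x1, y1, x2, y2 = cell["bbox"]
        tokens.append("<td>")
        # Discretized geometry tokens, interleaved with the structure tags.
        tokens += [f"<x_{quantize(x1, img_w)}>", f"<y_{quantize(y1, img_h)}>",
                   f"<x_{quantize(x2, img_w)}>", f"<y_{quantize(y2, img_h)}>"]
        tokens += list(cell["text"])  # character-level content tokens
        tokens.append("</td>")
    tokens += ["</tr>", "</table>"]
    return tokens


seq = build_target([{"text": "42", "bbox": (10, 5, 60, 25)}],
                   img_w=100, img_h=50)
# One stream carries structure, geometry, and content, so a decoder
# trained on it never has to reconcile outputs from separate heads.
```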

Core claim

TableSeq reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol while using a compact architecture without external OCR or auxiliary decoders.

Load-bearing premise

That a single autoregressive decoder can reliably produce correctly interleaved HTML structure, accurate cell text, and sufficiently precise discretized coordinates without external OCR, auxiliary heads, or complex post-processing.
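For readers unfamiliar with the SciTSR numbers quoted above: the CAR protocol scores predicted cell-adjacency relations against ground truth with set-level precision, recall, and F1. A hedged sketch of that scoring idea (relation encoding simplified to tuples and sets for illustration):

```python
# Simplified sketch of cell-adjacency-relation (CAR) scoring: relations
# are (cell_a, cell_b, direction) tuples; precision/recall/F1 compare
# the predicted set against ground truth. Real protocols match cells by
# text and position; this toy uses cell labels directly.

def car_scores(pred_relations, gt_relations):
    pred, gt = set(pred_relations), set(gt_relations)
    tp = len(pred & gt)  # correctly recovered adjacency relations
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gt = {("A", "B", "horizontal"), ("A", "C", "vertical"), ("B", "D", "vertical")}
pred = {("A", "B", "horizontal"), ("A", "C", "vertical"), ("B", "E", "vertical")}
p, r, f = car_scores(pred, gt)  # p = r = f = 2/3 here
```

One wrong relation costs both precision and recall, which is why near-perfect CAR scores imply that structure, text, and geometry all landed correctly at once.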

Original abstract

We present TableSeq, an image-only, end-to-end framework for joint table structure recognition, content recognition, and cell localization. The model formulates these tasks as a single sequence-generation problem: one decoder produces an interleaved stream of HTML tags, cell text, and discretized coordinate tokens, thereby aligning logical structure, textual content, and cell geometry within a unified autoregressive sequence. This design avoids external OCR, auxiliary decoders, and complex multi-stage post-processing. TableSeq combines a lightweight high-resolution FCN-H16 encoder with a minimal structure-prior head and a single-layer transformer encoder, yielding a compact architecture that remains effective on challenging layouts. Across standard benchmarks, TableSeq achieves competitive or state-of-the-art results while preserving architectural simplicity. It reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol, while remaining competitive on PubTables-1M under GriTS. Beyond TSR/TCR, the same sequence interface generalizes to index-based table querying without task-specific heads, achieving the best IRDR score and competitive ICDR/ICR performance. We also study multi-token prediction for faster blockwise decoding and show that it reduces inference latency with only limited accuracy degradation. Overall, TableSeq provides a practical and reproducible single-stream baseline for unified table recognition, and the source code will be made publicly available at https://github.com/hamdilaziz/TableSeq.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; coordinate discretization and the choice of single-layer transformer are design decisions whose impact is not quantified here.

pith-pipeline@v0.9.0 · 5612 in / 1096 out tokens · 39796 ms · 2026-05-10T08:53:05.297928+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 9 canonical work pages · 1 internal anchor

1. Xu Zhong, Elaheh ShafieiBavani, Antonio Jimeno-Yepes. Image-based table recognition: data, model, and evaluation. ECCV, 564–580, 2020
2. Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, Sheraz Ahmed. DeepDeSRT: deep learning for detection and structure recognition of tables in document images. ICDAR, 2017
3. Sachin Raja, Ajoy Mondal, C. V. Jawahar. Table structure recognition using top-down and bottom-up cues. ECCV, 2020
4. Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Huihui Zhu, Baocai Yin, Bing Yin, Cong Liu. SEMv2: table separation line detection based on conditional convolution. CoRR, abs/2303.04384, 2023
5. Zhenrong Zhang, Jianshu Zhang, Jun Du, Fengren Wang. Split, embed and merge: an accurate table structure recognizer. Pattern Recognit., 126:108565, 2022
6. Shubham Singh Paliwal, D. Vishwanath, Rohit Rahul, Monika Sharma, Lovekesh Vig. TableNet: deep learning model for end-to-end table detection and tabular data extraction from scanned document images. ICDAR, 128–133, 2019
7. Liang Qiao, Zaisheng Li, Zhanzhan Cheng, Peng Zhang, Shiliang Pu, Yi Niu, Wenqi Ren, Wenming Tan, Fei Wu. LGPMA: complicated table structure recognition with local and global pyramid mask alignment. ICDAR, 99–114, 2021
8. Brandon Smock, Rohith Pesala, Robin Abraham. Aligning benchmark datasets for table structure recognition. ICDAR, 371–386, 2023
9. Darshan Adiga, Shabir Ahmad Bhat, Muzaffar Bashir Shah, Viveka Vyeth. Table structure recognition based on cell relationship, a bottom-up approach. RANLP, 1–8, 2019
10. Arushi Jain, Shubham Paliwal, Monika Sharma, Lovekesh Vig. TSR-DSAW: table structure recognition via deep spatial association of words. CoRR, abs/2203.06873, 2022
11. Chris Tensmeyer, Vlad I. Morariu, Brian Price, Scott Cohen, Tony Martinez. Deep splitting and merging for table structure decomposition. ICDAR, 114–121, 2019
12. Sachin Raja, Ajoy Mondal, C. V. Jawahar. Visual understanding of complex table structures from document images. WACV, 2299–2308, 2022
13. Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, Xian-Ling Mao. Complicated table structure recognition. CoRR, abs/1908.04729, 2019
14. Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, Kavita Sultanpure. CascadeTabNet: an approach for end-to-end table detection and structure recognition from image-based documents. CVPR Workshops, 2020
15. Johan Fernandes, Bin Xiao, Murat Simsek, Burak Kantarci, Shahzad Khan, Ala Abu Alkheir. TableStrRec: framework for table structure recognition in data sheet images. Int. J. Document Anal. Recognit., 27(2):127–145, 2024
16. Brandon Smock, Rohith Pesala, Robin Abraham. PubTables-1M: towards comprehensive table extraction from unstructured documents. CVPR, 2022
17. Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang. TRUST: an accurate and end-to-end table structure recognizer using splitting-based transformers. CoRR, abs/2208.14687, 2022
18. Wenyuan Xue, Baosheng Yu, Wen Wang, Dacheng Tao, Qingyong Li. TGRNet: a table graph reconstruction network for table structure recognition. ICCV, 2021
19. Hao Liu, Xin Li, Bing Liu, Deqiang Jiang, Yinsong Liu, Bo Ren. Neural collaborative graph machines for table structure recognition. CVPR, 4533–4542, 2022
20. Denis Coquenet, Clément Chatelain, Thierry Paquet. End-to-end handwritten paragraph text recognition using a vertical attention network. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):508–524, 2023
21. Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Peter Staar. Optimized table tokenization for table structure recognition. ICDAR, 37–50, 2023
22. Leiyuan Chen, Chengsong Huang, Xiaoqing Zheng, Jinshu Lin, Xuan-Jing Huang. TableVLM: multi-modal pre-training for table structure recognition. ACL, 2437–2449, 2023
23. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461, 2019
24. Hongyi Wang, Yang Xue, Jiaxin Zhang, Lianwen Jin. Scene table structure recognition with segmentation collaboration and alignment. Pattern Recognit. Lett., 165:146–153, 2023
25. Weihong Lin, Zheng Sun, Chixiang Ma, Mingze Li, Jiawei Wang, Lei Sun, Qiang Huo. TSRFormer: table structure recognition with transformers. ACM Multimedia, 6473–6482, 2022
26. Wenyuan Xue, Qingyong Li, Dacheng Tao. Res2TIM: reconstruct syntactic structures from table images. ICDAR, 749–755, 2019
27. Qiyu Hou, Jun Wang. TABLET: table structure recognition using encoder-only transformers. CoRR, abs/2506.07015, 2025
28. Jiaquan Ye, Xianbiao Qi, Yelin He, Yihao Chen, Dengyi Gu, Peng Gao, Rong Xiao. PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML. CoRR, abs/2105.01848, 2021
29. Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, Nancy Xin Ru Wang. Global Table Extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. WACV, 697–706, 2021
30. Zhenrong Zhang, Shuhang Liu, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Yu Hu. UniTabNet: bridging vision and language models for enhanced table structure recognition. Findings of ACL: EMNLP, 6131–6143, 2024
31. Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang. OmniParser: a unified framework for text spotting, key information extraction and table recognition. CVPR, 15641–15653, 2024
32. Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. OCR-free document understanding transformer. ECCV, 498–517, 2022
33. Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar. TableFormer: table structure understanding with transformers. CVPR, 4614–4623, 2022
34. Nam Tuan Ly, Atsuhiro Takasu. An end-to-end multi-task learning model for image-based table recognition. VISAPP, 626–634, 2023
35. Yongshuai Huang, Ning Lu, Dapeng Chen, Yibo Li, Zecheng Xie, Shenggao Zhu, Liangcai Gao, Wei Peng. Improving table structure recognition with visual-alignment sequential coordinate modeling. CVPR, 11134–11143, 2023
36. Jiani Huang, Haihua Chen, Fengchang Yu, Wei Lu. From detection to application: recent advances in understanding scientific tables and figures. ACM Comput. Surv., 56(10):1–39, 2024
37. Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve. Better & faster large language models via multi-token prediction. ICML, 15706–15734, 2024
38. Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li. TableBank: table benchmark for image-based table detection and recognition. LREC, 1918–1925, 2020
39. Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, Yunfeng Liu. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
40. Chenglong Yu, Weibin Li, Wei Li, Zixuan Zhu, Ruochen Liu, Biao Hou, Licheng Jiao. A survey for table recognition based on deep learning. Neurocomputing, 600:128154, 2024
41. Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, Gui-Song Xia. Parsing table structures in the wild. ICCV, 944–952, 2021
42. Youngmin Baek, Daehyun Nam, Jaeheung Surh, Seung Shin, Seonghyeon Kim. TRACE: table reconstruction aligned to corner and edges. ICDAR, 472–489, 2023
43. Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. Pix2Struct: screenshot parsing as pretraining for visual language understanding. ICML, 18893–18912, 2023
44. Yitong Zhou, Mingyue Cheng, Qingyang Mao, Qi Liu, Feiyang Xu, Xin Li, Enhong Chen. Enhancing table recognition with vision LLMs: a benchmark and neighbor-guided toolchain reasoner. IJCAI, 2503–2511, 2025
45. Hangdi Xing, Feiyu Gao, Rujiao Long, Jiajun Bu, Qi Zheng, Liangcheng Li, Cong Yao, Zhi Yu. LORE: logical location regression network for table structure recognition. AAAI, 37(3):2992–3000, 2023
46. Minsoo Khang, Teakgyu Hong. TFLOP: table structure recognition framework with layout pointer mechanism. CoRR, abs/2501.11800, 2025
47. Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey E. Hinton. Pix2Seq: a language modeling framework for object detection. ICLR, 2022
48. Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey E. Hinton. A unified sequence interface for vision tasks. NeurIPS, 2022
49. Sachin Raja, Ajoy Mondal, C. V. Jawahar. Treading towards privacy-preserving table structure recognition. WACV, 2311–2321, 2025
50. Taeho Kil, Seonghyeon Kim, Sukmin Seo, Yoonsik Kim, Daehee Kim. Towards unified scene text spotting based on sequence generation. CVPR, 15223–15232, 2023
51. Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, Michalis Raptis. Hierarchical text spotter for joint text spotting and layout analysis. WACV, 892–902, 2024
52. Stefan Elfwing, Eiji Uchibe, Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw., 107:3–11, 2018
53. Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. Exploring plain vision transformer backbones for object detection. ECCV, 280–296, 2022
54. Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang. General OCR theory: towards OCR-2.0 via a unified end-to-end model. CoRR, abs/2409.01704, 2024
55. Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang. Vary: scaling up the vision vocabulary for large vision-language model. ECCV, 408–424, 2024