Recognition: no theorem link
The Power of Order: Fooling LLMs with Adversarial Table Permutations
Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3
The pith
Semantically identical rearrangements of table rows and columns can cause large language models to output wrong or inconsistent answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern large language models exhibit a significant vulnerability to the layout of tabular data: semantically invariant permutations of rows and columns are sometimes sufficient to cause incorrect or inconsistent model outputs. To probe this vulnerability systematically, the authors introduce Adversarial Table Permutation (ATP), a gradient-based attack that identifies worst-case permutations designed to maximally disrupt model performance. Extensive experiments demonstrate that the attack significantly degrades the performance of a wide range of LLMs across model sizes and architectures.
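The invariance at the heart of the claim is easy to make concrete. The sketch below (our own illustration, not the paper's code or data) shows that reordering rows and columns leaves every cell's row-column association intact, so the answer to a content-based lookup question cannot change:

```python
# Illustrative only: a row/column permutation preserves each cell's
# row-column pairing, so a lookup answer is invariant under it.
header = ["city", "population"]
rows = [["Oslo", 700000], ["Lima", 9700000], ["Pune", 3100000]]

def lookup(header, rows, key_col, key, target_col):
    """Answer: what is <target_col> in the row where <key_col> == key?"""
    ki, ti = header.index(key_col), header.index(target_col)
    for row in rows:
        if row[ki] == key:
            return row[ti]

def permute(header, rows, row_perm, col_perm):
    """Reorder rows and columns; cells move together with their labels."""
    new_header = [header[j] for j in col_perm]
    new_rows = [[rows[i][j] for j in col_perm] for i in row_perm]
    return new_header, new_rows

h2, r2 = permute(header, rows, row_perm=[2, 0, 1], col_perm=[1, 0])
assert lookup(header, rows, "city", "Lima", "population") == \
       lookup(h2, r2, "city", "Lima", "population") == 9700000
```

The attack's premise is that, despite this invariance, model outputs do change under some such reorderings.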
What carries the argument
Adversarial Table Permutation, a gradient-based optimization procedure that searches over row and column reorderings to maximize model error while preserving all semantic information in the table.
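Searching over discrete permutations with gradients typically requires a continuous relaxation; the bibliography's citations to Sinkhorn [43] and Gumbel-Sinkhorn [33] suggest that family of techniques. A minimal sketch of the core relaxation, with an illustrative score matrix (the function and inputs are ours, not taken from the paper):

```python
# Hedged sketch (cf. refs [33], [43]): alternately normalize the rows and
# columns of a positive score matrix so it approaches a doubly stochastic
# matrix -- a differentiable stand-in for a hard permutation matrix.
def sinkhorn(scores, n_iters=50):
    m = [row[:] for row in scores]
    for _ in range(n_iters):
        # normalize rows to sum to 1
        m = [[v / sum(row) for v in row] for row in m]
        # normalize columns to sum to 1
        col_sums = [sum(m[i][j] for i in range(len(m)))
                    for j in range(len(m[0]))]
        m = [[m[i][j] / col_sums[j] for j in range(len(m[0]))]
             for i in range(len(m))]
    return m

m = sinkhorn([[3.0, 1.0], [1.0, 2.0]])
assert all(abs(sum(row) - 1.0) < 1e-6 for row in m)   # rows sum to 1
assert all(abs(m[0][j] + m[1][j] - 1.0) < 1e-6        # cols sum to 1
           for j in range(2))
```

In an attack of this shape, the score matrix would be optimized by gradient ascent on the model's loss, then rounded to a hard permutation (e.g., via the Hungarian method [24]).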
Load-bearing premise
The chosen permutations truly leave the table's meaning unchanged for the purpose of the question being asked.
What would settle it
A controlled test in which every tested model returns identical correct answers on a fixed set of tables under all possible semantically equivalent row-and-column permutations would falsify the central claim.
Original abstract
Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper demonstrates that modern LLMs exhibit a significant vulnerability to the layout of tabular data. Specifically, we show that semantically-invariant permutations of rows and columns - rearrangements that do not alter the table's underlying information - are sometimes sufficient to cause incorrect or inconsistent model outputs. To systematically probe this vulnerability, we introduce Adversarial Table Permutation, a novel, gradient-based attack that efficiently identifies worst-case permutations designed to maximally disrupt model performance. Our extensive experiments demonstrate that ATP significantly degrades the performance of a wide range of LLMs. This reveals a pervasive vulnerability across different model sizes and architectures, including the most recent and popular models. Our findings expose a fundamental weakness in how current LLMs process structured data, underscoring the urgent need to develop permutation-robust models for reliable, real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs are vulnerable to row and column permutations in tabular inputs for question-answering tasks. It introduces Adversarial Table Permutation (ATP), a gradient-based attack that searches for permutations claimed to be semantically invariant yet sufficient to cause incorrect or inconsistent model outputs, and reports that ATP degrades performance across a range of models and sizes.
Significance. If the central claim holds, the work would highlight a practical robustness gap in how LLMs serialize and reason over structured data, with direct relevance to deployed table-QA systems. The empirical, attack-driven approach and breadth of models tested are strengths; the absence of circularity or fitted parameters in the attack definition further supports its potential value as a diagnostic tool.
major comments (2)
- [§3] §3 (ATP procedure): the optimization directly maximizes output disruption but contains no post-hoc verification (e.g., re-deriving the ground-truth answer from the linearized permuted table or human equivalence rating) that every selected permutation preserves the original semantics and correct answer. This check is load-bearing for the claim that observed degradation reflects order sensitivity rather than altered table content.
- [§4] §4 (experimental results): the reported performance drops are presented without explicit baseline comparisons to random or non-adversarial permutations, without the precise metrics and aggregation method used, and without statistical significance tests across runs or models. These omissions make it difficult to quantify how much of the degradation is attributable to the adversarial search versus generic ordering effects.
minor comments (2)
- The abstract and introduction should explicitly state the datasets, table sizes, and question types used so readers can assess the scope of the claimed vulnerability.
- Notation for row/column permutations and the linearization step into the prompt should be formalized (e.g., with a small example equation) to avoid ambiguity when describing how tables are serialized.
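The requested formalization can be made concrete: with row permutation σ and column permutation τ, the permuted cell is T'[i][j] = T[σ(i)][τ(j)], and a linearization ℓ joins header and rows into the prompt string. A sketch under one assumed serialization (the pipe-separated format below is ours; the paper may serialize tables differently):

```python
# Assumed pipe-separated linearization -- illustrative, not the paper's.
def linearize(header, rows):
    lines = [" | ".join(header)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

header = ["name", "score"]
rows = [["a", 1], ["b", 2]]
sigma, tau = [1, 0], [1, 0]   # reverse the rows and the columns
permuted = [[rows[i][j] for j in tau] for i in sigma]
prompt = linearize([header[j] for j in tau], permuted)
assert prompt == "score | name\n2 | b\n1 | a"
```

Fixing one such ℓ makes the attack surface precise: the model sees ℓ(T') rather than T, so any order sensitivity is a property of the model, not of the table's content.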
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and outline revisions that will strengthen the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (ATP procedure): the optimization directly maximizes output disruption but contains no post-hoc verification (e.g., re-deriving the ground-truth answer from the linearized permuted table or human equivalence rating) that every selected permutation preserves the original semantics and correct answer. This check is load-bearing for the claim that observed degradation reflects order sensitivity rather than altered table content.
Authors: We agree that making the invariance explicit strengthens the claim. Row and column permutations are applied while preserving every cell value and its row-column association exactly; thus the table's information content is identical and the ground-truth answer to any content-based question is unchanged by construction. In the revision we will add to §3 a formal argument establishing this invariance together with a human verification study on a random sample of 100 permutations confirming that annotators judge the ground-truth answers to be identical before and after permutation. This directly addresses the concern. revision: yes
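Part of the verification the referee asks for can be automated. A hedged sketch of such a post-hoc check (our own construction, not the authors' procedure): a legal permutation must preserve the multiset of rows, where each row is an order-free set of (column name, value) pairs.

```python
# Hypothetical invariance check: the table's content, viewed as a multiset
# of {(column, value)} row-sets, must be identical before and after.
def content_multiset(header, rows):
    return sorted(sorted((header[j], str(row[j])) for j in range(len(header)))
                  for row in rows)

def preserves_semantics(orig_header, orig_rows, perm_header, perm_rows):
    return content_multiset(orig_header, orig_rows) == \
           content_multiset(perm_header, perm_rows)

header, rows = ["id", "val"], [["a", 1], ["b", 2]]
# legal permutation: rows reversed, columns swapped
assert preserves_semantics(header, rows, ["val", "id"], [[2, "b"], [1, "a"]])
# corrupted table: values swapped across rows, so content changed
assert not preserves_semantics(header, rows, ["id", "val"],
                               [["a", 2], ["b", 1]])
```

Running such a check over every selected permutation would turn the by-construction argument into a verified empirical guarantee.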
-
Referee: [§4] §4 (experimental results): the reported performance drops are presented without explicit baseline comparisons to random or non-adversarial permutations, without the precise metrics and aggregation method used, and without statistical significance tests across runs or models. These omissions make it difficult to quantify how much of the degradation is attributable to the adversarial search versus generic ordering effects.
Authors: We accept that these details improve interpretability. In the revised §4 we will include (i) explicit baselines consisting of random permutations and simple non-adversarial heuristics (e.g., lexicographic row sorting), (ii) precise definitions of the metrics (exact-match accuracy and token-level F1) and the aggregation procedure (macro-average over questions), and (iii) statistical significance results (paired t-tests with p-values) computed over five independent runs per model. Updated tables and figures will report these quantities. revision: yes
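The promised reporting can be sketched in a few lines. The run-level accuracies below are invented for illustration, and only the paired t statistic is computed (a p-value would additionally need the t distribution's CDF):

```python
import math
import statistics

def exact_match(preds, golds):
    """Fraction of predictions exactly matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def paired_t(xs, ys):
    """Paired t statistic over matched runs (df = len - 1)."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

assert abs(exact_match(["a", "b", "c"], ["a", "x", "c"]) - 2 / 3) < 1e-12

# Made-up accuracies for five runs on original vs. ATP-permuted tables.
orig_acc = [0.82, 0.80, 0.83, 0.81, 0.82]
atp_acc  = [0.55, 0.58, 0.54, 0.57, 0.56]
t = paired_t(orig_acc, atp_acc)
assert t > 2.78   # exceeds the two-sided 95% critical value for df=4
```

Reporting the same statistic against random-permutation baselines would isolate how much degradation the adversarial search adds over generic ordering effects.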
Circularity Check
No circularity: empirical attack with independent definition and evaluation
full rationale
The paper is an empirical robustness study that defines the ATP attack as a gradient-based search over row/column permutations to maximize output disruption on table QA tasks, then measures the resulting accuracy drops across LLMs. No derivation chain exists that reduces a claimed result to its own inputs by construction; the attack objective is stated independently of the measured performance, and the semantic-invariance claim is an assumption whose validity is tested via the experimental outcomes rather than presupposed in the method. No self-citations, uniqueness theorems, or ansatzes are load-bearing, and no parameters are fitted then relabeled as predictions. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: semantically equivalent table rearrangements should produce identical model outputs if the model correctly understands the underlying data.
Reference graph
Works this paper leans on
- [1] Ryan Prescott Adams and Richard S. Zemel. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Kushal Raj Bhandari, Sixue Xing, Soham Dan, and Jianxi Gao. Exploring the robustness of language models for tabular question answering via attention analysis. Trans. Mach. Learn. Res., 2025.
- [4] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5:147–154, 1946.
- [5] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
- [6] Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, and George Karypis. HyTrel: Hypergraph-enhanced tabular data representation learning. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- [7] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26, 2017.
- [8] Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Binding language models in symbolic languages. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- [9] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
- [10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [11] Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, and Frederic Sala. Tabby: A language model architecture for tabular and structured data synthesis. arXiv preprint arXiv:2503.02152, 2025.
- [12] Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, and Rada Mihalcea. Tables as texts or images: Evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024.
- [13] Xinshuai Dong, Anh Tuan Luu, Rongrong Ji, and Hong Liu. Towards robustness against natural language word substitutions. In ICLR, 2021.
- [14] Xinshuai Dong, Anh Tuan Luu, Min Lin, Shuicheng Yan, and Hanwang Zhang. How should pre-trained language models be fine-tuned towards adversarial robustness? Advances in Neural Information Processing Systems, 34:4356–4369, 2021.
- [15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, arXiv:2407, 2024.
- [16] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [17] Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024.
- [18] Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. arXiv preprint arXiv:2506.05346, 2025.
- [19] Kuan-Hao Huang and Kai-Wei Chang. Generating syntactically controlled paraphrases without using annotated parallel pairs. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), pages 1022–1033, 2021.
- [20] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- [21] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 2023.
- [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [23] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- [24] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
- [25] Harold W. Kuhn. Variants of the Hungarian method for assignment problems. Naval Research Logistics Quarterly, 3(4):253–258, 1956.
- [26] Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, and Junbo Zhao. Table as a modality for large language models. In Advances in Neural Information Processing Systems, 2025.
- [27] Chen Liang, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, and Kam-Fai Wong. PEARL: Towards permutation-resilient LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [28] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [29] Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adria de Gispert, and Gonzalo Iglesias. An inner table retriever for robust table question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [30] Tianyang Liu, Fei Wang, and Muhao Chen. Rethinking tabular data understanding with large language models. In Proceedings of NAACL 2024 (Volume 1: Long Papers), pages 450–482, Mexico City, Mexico, 2024.
- [31] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- [32] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [33] Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018.
- [34] Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, et al. FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
- [35] Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, and Freddy Lécué. Interpretable LLM-based table question answering. Trans. Mach. Learn. Res., 2025.
- [36] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519, 2017.
- [37] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- [38] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
- [39] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- [40] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498, 2022.
- [41] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- [42] Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by LLMs, 2024.
- [43] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
- [44] Aivin V. Solatorio and Olivier Dupriez. REaLTabFormer: Generating realistic relational and tabular data using transformers. arXiv preprint arXiv:2302.02041, 2023.
- [45] Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, et al. TableGPT2: A large multimodal model with tabular data integration. arXiv preprint arXiv:2411.02059, 2024.
- [46] Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095, 2023.
- [47] Yuxiang Wang, Junhao Gan, Shengxiang Gao, Shenghao Ye, Zhengyi Yang, and Jianzhong Qi. Beyond linearization: Attributed table graphs for table reasoning. arXiv preprint arXiv:2601.08444, 2026.
- [48] Zifeng Wang and Jimeng Sun. TransTab: Learning transferable tabular transformers across tables. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- [49] Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, et al. Chain-of-table: Evolving tables in the reasoning chain for table understanding. arXiv preprint arXiv:2401.04398, 2024.
- [50] Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, and Heng Ji. Eliminating position bias of language models: A mechanistic approach. In The Thirteenth International Conference on Learning Representations, 2025.
- [51] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, 2019.
- [52] Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul. TableFormer: Robust transformer modeling for table-text encoding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, 2022.
- [53] Muchao Ye, Chenglin Miao, Ting Wang, and Fenglong Ma. TextHoaxer: Budgeted hard-label adversarial attacks on text. In Thirty-Sixth AAAI Conference on Artificial Intelligence. AAAI Press, 2022.
- [54] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 174–184, 2023.
- [55] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), Taipei, Taiwan. ACM, 2023.
- [56] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
- [57] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, 2024.
- [58] Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Li Yang, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, et al. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. arXiv preprint arXiv:2403.19318, 2024.
- [59] Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng, and Hanwang Zhang. Certified robustness against natural language attacks by causal intervention. In International Conference on Machine Learning, pages 26958–26970. PMLR, 2022.
- [60] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [61] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021.
- [62] Yongshuo Zong, Tingyang Yu, Ruchika Chavhan, Bingchen Zhao, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
Appendix A.1 Metric Used for Alignment Scores by LLM-as-judge
We follow [60] and use an LLM-as-judge to measure semantic alignment between the ground-truth …