Recognition: no theorem link
The Power of Order: Fooling LLMs with Adversarial Table Permutations
Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3
The pith
Semantically identical rearrangements of table rows and columns can cause large language models to output wrong or inconsistent answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern large language models exhibit a significant vulnerability to the layout of tabular data: semantically invariant permutations of rows and columns are sometimes sufficient to cause incorrect or inconsistent model outputs. To probe this vulnerability systematically, the authors introduce Adversarial Table Permutation (ATP), a gradient-based attack that identifies worst-case permutations designed to maximally disrupt model performance. Extensive experiments demonstrate that the attack significantly degrades the performance of a wide range of LLMs across model sizes and architectures.
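The invariance at the heart of the claim is easy to make concrete. The sketch below (our own illustration, not the paper's code or data) shows that reordering rows and columns leaves every cell's row-column association intact, so the answer to a content-based lookup question cannot change:

```python
# Illustrative only: a row/column permutation preserves each cell's
# row-column pairing, so a lookup answer is invariant under it.
header = ["city", "population"]
rows = [["Oslo", 700000], ["Lima", 9700000], ["Pune", 3100000]]

def lookup(header, rows, key_col, key, target_col):
    """Answer: what is <target_col> in the row where <key_col> == key?"""
    ki, ti = header.index(key_col), header.index(target_col)
    for row in rows:
        if row[ki] == key:
            return row[ti]

def permute(header, rows, row_perm, col_perm):
    """Reorder rows and columns; cells move together with their labels."""
    new_header = [header[j] for j in col_perm]
    new_rows = [[rows[i][j] for j in col_perm] for i in row_perm]
    return new_header, new_rows

h2, r2 = permute(header, rows, row_perm=[2, 0, 1], col_perm=[1, 0])
assert lookup(header, rows, "city", "Lima", "population") == \
       lookup(h2, r2, "city", "Lima", "population") == 9700000
```

The attack's premise is that, despite this invariance, model outputs do change under some such reorderings.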
What carries the argument
Adversarial Table Permutation, a gradient-based optimization procedure that searches over row and column reorderings to maximize model error while preserving all semantic information in the table.
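Searching over discrete permutations with gradients typically requires a continuous relaxation; the bibliography's citations to Sinkhorn [43] and Gumbel-Sinkhorn [33] suggest that family of techniques. A minimal sketch of the core relaxation, with an illustrative score matrix (the function and inputs are ours, not taken from the paper):

```python
# Hedged sketch (cf. refs [33], [43]): alternately normalize the rows and
# columns of a positive score matrix so it approaches a doubly stochastic
# matrix -- a differentiable stand-in for a hard permutation matrix.
def sinkhorn(scores, n_iters=50):
    m = [row[:] for row in scores]
    for _ in range(n_iters):
        # normalize rows to sum to 1
        m = [[v / sum(row) for v in row] for row in m]
        # normalize columns to sum to 1
        col_sums = [sum(m[i][j] for i in range(len(m)))
                    for j in range(len(m[0]))]
        m = [[m[i][j] / col_sums[j] for j in range(len(m[0]))]
             for i in range(len(m))]
    return m

m = sinkhorn([[3.0, 1.0], [1.0, 2.0]])
assert all(abs(sum(row) - 1.0) < 1e-6 for row in m)   # rows sum to 1
assert all(abs(m[0][j] + m[1][j] - 1.0) < 1e-6        # cols sum to 1
           for j in range(2))
```

In an attack of this shape, the score matrix would be optimized by gradient ascent on the model's loss, then rounded to a hard permutation (e.g., via the Hungarian method [24]).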
Load-bearing premise
The chosen permutations truly leave the table's meaning unchanged for the purpose of the question being asked.
What would settle it
A controlled test in which every tested model returns identical correct answers on a fixed set of tables under all possible semantically equivalent row-and-column permutations would falsify the central claim.
Original abstract
Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper demonstrates that modern LLMs exhibit a significant vulnerability to the layout of tabular data. Specifically, we show that semantically-invariant permutations of rows and columns - rearrangements that do not alter the table's underlying information - are sometimes sufficient to cause incorrect or inconsistent model outputs. To systematically probe this vulnerability, we introduce Adversarial Table Permutation, a novel, gradient-based attack that efficiently identifies worst-case permutations designed to maximally disrupt model performance. Our extensive experiments demonstrate that ATP significantly degrades the performance of a wide range of LLMs. This reveals a pervasive vulnerability across different model sizes and architectures, including the most recent and popular models. Our findings expose a fundamental weakness in how current LLMs process structured data, underscoring the urgent need to develop permutation-robust models for reliable, real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs are vulnerable to row and column permutations in tabular inputs for question-answering tasks. It introduces Adversarial Table Permutation (ATP), a gradient-based attack that searches for permutations claimed to be semantically invariant yet sufficient to cause incorrect or inconsistent model outputs, and reports that ATP degrades performance across a range of models and sizes.
Significance. If the central claim holds, the work would highlight a practical robustness gap in how LLMs serialize and reason over structured data, with direct relevance to deployed table-QA systems. The empirical, attack-driven approach and breadth of models tested are strengths; the absence of circularity or fitted parameters in the attack definition further supports its potential value as a diagnostic tool.
major comments (2)
- [§3] §3 (ATP procedure): the optimization directly maximizes output disruption but contains no post-hoc verification (e.g., re-deriving the ground-truth answer from the linearized permuted table or human equivalence rating) that every selected permutation preserves the original semantics and correct answer. This check is load-bearing for the claim that observed degradation reflects order sensitivity rather than altered table content.
- [§4] §4 (experimental results): the reported performance drops are presented without explicit baseline comparisons to random or non-adversarial permutations, without the precise metrics and aggregation method used, and without statistical significance tests across runs or models. These omissions make it difficult to quantify how much of the degradation is attributable to the adversarial search versus generic ordering effects.
minor comments (2)
- The abstract and introduction should explicitly state the datasets, table sizes, and question types used so readers can assess the scope of the claimed vulnerability.
- Notation for row/column permutations and the linearization step into the prompt should be formalized (e.g., with a small example equation) to avoid ambiguity when describing how tables are serialized.
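The requested formalization can be made concrete: with row permutation σ and column permutation τ, the permuted cell is T'[i][j] = T[σ(i)][τ(j)], and a linearization ℓ joins header and rows into the prompt string. A sketch under one assumed serialization (the pipe-separated format below is ours; the paper may serialize tables differently):

```python
# Assumed pipe-separated linearization -- illustrative, not the paper's.
def linearize(header, rows):
    lines = [" | ".join(header)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

header = ["name", "score"]
rows = [["a", 1], ["b", 2]]
sigma, tau = [1, 0], [1, 0]   # reverse the rows and the columns
permuted = [[rows[i][j] for j in tau] for i in sigma]
prompt = linearize([header[j] for j in tau], permuted)
assert prompt == "score | name\n2 | b\n1 | a"
```

Fixing one such ℓ makes the attack surface precise: the model sees ℓ(T') rather than T, so any order sensitivity is a property of the model, not of the table's content.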
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and outline revisions that will strengthen the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (ATP procedure): the optimization directly maximizes output disruption but contains no post-hoc verification (e.g., re-deriving the ground-truth answer from the linearized permuted table or human equivalence rating) that every selected permutation preserves the original semantics and correct answer. This check is load-bearing for the claim that observed degradation reflects order sensitivity rather than altered table content.
Authors: We agree that making the invariance explicit strengthens the claim. Row and column permutations are applied while preserving every cell value and its row-column association exactly; thus the table's information content is identical and the ground-truth answer to any content-based question is unchanged by construction. In the revision we will add to §3 a formal argument establishing this invariance together with a human verification study on a random sample of 100 permutations confirming that annotators judge the ground-truth answers to be identical before and after permutation. This directly addresses the concern. revision: yes
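Part of the verification the referee asks for can be automated. A hedged sketch of such a post-hoc check (our own construction, not the authors' procedure): a legal permutation must preserve the multiset of rows, where each row is an order-free set of (column name, value) pairs.

```python
# Hypothetical invariance check: the table's content, viewed as a multiset
# of {(column, value)} row-sets, must be identical before and after.
def content_multiset(header, rows):
    return sorted(sorted((header[j], str(row[j])) for j in range(len(header)))
                  for row in rows)

def preserves_semantics(orig_header, orig_rows, perm_header, perm_rows):
    return content_multiset(orig_header, orig_rows) == \
           content_multiset(perm_header, perm_rows)

header, rows = ["id", "val"], [["a", 1], ["b", 2]]
# legal permutation: rows reversed, columns swapped
assert preserves_semantics(header, rows, ["val", "id"], [[2, "b"], [1, "a"]])
# corrupted table: values swapped across rows, so content changed
assert not preserves_semantics(header, rows, ["id", "val"],
                               [["a", 2], ["b", 1]])
```

Running such a check over every selected permutation would turn the by-construction argument into a verified empirical guarantee.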
-
Referee: [§4] §4 (experimental results): the reported performance drops are presented without explicit baseline comparisons to random or non-adversarial permutations, without the precise metrics and aggregation method used, and without statistical significance tests across runs or models. These omissions make it difficult to quantify how much of the degradation is attributable to the adversarial search versus generic ordering effects.
Authors: We accept that these details improve interpretability. In the revised §4 we will include (i) explicit baselines consisting of random permutations and simple non-adversarial heuristics (e.g., lexicographic row sorting), (ii) precise definitions of the metrics (exact-match accuracy and token-level F1) and the aggregation procedure (macro-average over questions), and (iii) statistical significance results (paired t-tests with p-values) computed over five independent runs per model. Updated tables and figures will report these quantities. revision: yes
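The promised reporting can be sketched in a few lines. The run-level accuracies below are invented for illustration, and only the paired t statistic is computed (a p-value would additionally need the t distribution's CDF):

```python
import math
import statistics

def exact_match(preds, golds):
    """Fraction of predictions exactly matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def paired_t(xs, ys):
    """Paired t statistic over matched runs (df = len - 1)."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

assert abs(exact_match(["a", "b", "c"], ["a", "x", "c"]) - 2 / 3) < 1e-12

# Made-up accuracies for five runs on original vs. ATP-permuted tables.
orig_acc = [0.82, 0.80, 0.83, 0.81, 0.82]
atp_acc  = [0.55, 0.58, 0.54, 0.57, 0.56]
t = paired_t(orig_acc, atp_acc)
assert t > 2.78   # exceeds the two-sided 95% critical value for df=4
```

Reporting the same statistic against random-permutation baselines would isolate how much degradation the adversarial search adds over generic ordering effects.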
Circularity Check
No circularity: empirical attack with independent definition and evaluation
full rationale
The paper is an empirical robustness study that defines the ATP attack as a gradient-based search over row/column permutations to maximize output disruption on table QA tasks, then measures the resulting accuracy drops across LLMs. No derivation chain exists that reduces a claimed result to its own inputs by construction; the attack objective is stated independently of the measured performance, and the semantic-invariance claim is an assumption whose validity is tested via the experimental outcomes rather than presupposed in the method. No self-citations, uniqueness theorems, or ansatzes are load-bearing, and no parameters are fitted then relabeled as predictions. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: semantically equivalent table rearrangements should produce identical model outputs if the model correctly understands the underlying data.
Reference graph
Works this paper leans on
- [1] Ryan Prescott Adams and Richard S. Zemel. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Kushal Raj Bhandari, Sixue Xing, Soham Dan, and Jianxi Gao. Exploring the robustness of language models for tabular question answering via attention analysis. Trans. Mach. Learn. Res., 2025.
- [4] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5:147–154, 1946.
- [5] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
- [6] Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, and George Karypis. HyTrel: Hypergraph-enhanced tabular data representation learning. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- [7] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26, 2017.
- [8] Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Binding language models in symbolic languages. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- [9] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
- [10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [11] Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, and Frederic Sala. Tabby: A language model architecture for tabular and structured data synthesis. arXiv preprint arXiv:2503.02152, 2025.
- [12] Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, and Rada Mihalcea. Tables as texts or images: Evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024.
- [13] Xinshuai Dong, Anh Tuan Luu, Rongrong Ji, and Hong Liu. Towards robustness against natural language word substitutions. In ICLR, 2021.
- [14] Xinshuai Dong, Anh Tuan Luu, Min Lin, Shuicheng Yan, and Hanwang Zhang. How should pre-trained language models be fine-tuned towards adversarial robustness? Advances in Neural Information Processing Systems, 34:4356–4369, 2021.
- [15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, arXiv:2407, 2024.
- [16] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [17] Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024.
- [18] Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. arXiv preprint arXiv:2506.05346, 2025.
- [19] Kuan-Hao Huang and Kai-Wei Chang. Generating syntactically controlled paraphrases without using annotated parallel pairs. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), pages 1022–1033, 2021.
- [20] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- [21] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 2023.
- [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [23] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- [24] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
- [25] Harold W. Kuhn. Variants of the Hungarian method for assignment problems. Naval Research Logistics Quarterly, 3(4):253–258, 1956.
- [26] Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, and Junbo Zhao. Table as a modality for large language models. In Advances in Neural Information Processing Systems, 2025.
- [27] Chen Liang, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, and Kam-Fai Wong. PEARL: Towards permutation-resilient LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [28] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [29] Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adria de Gispert, and Gonzalo Iglesias. An inner table retriever for robust table question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [30] Tianyang Liu, Fei Wang, and Muhao Chen. Rethinking tabular data understanding with large language models. In Proceedings of NAACL 2024 (Volume 1: Long Papers), pages 450–482, Mexico City, Mexico, 2024.
- [31] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- [32] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [33] Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018.
- [34] Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, et al. FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
- [35] Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, and Freddy Lécué. Interpretable LLM-based table question answering. Trans. Mach. Learn. Res., 2025.
- [36] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519, 2017.
- [37] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- [38] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
- [39] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- [40] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498, 2022.
- [41] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- [42] Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by LLMs, 2024.
- [43] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
- [44] Aivin V. Solatorio and Olivier Dupriez. REaLTabFormer: Generating realistic relational and tabular data using transformers. arXiv preprint arXiv:2302.02041, 2023.
- [45] Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, et al. TableGPT2: A large multimodal model with tabular data integration. arXiv preprint arXiv:2411.02059, 2024.
- [46] Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095, 2023.
- [47] Yuxiang Wang, Junhao Gan, Shengxiang Gao, Shenghao Ye, Zhengyi Yang, and Jianzhong Qi. Beyond linearization: Attributed table graphs for table reasoning. arXiv preprint arXiv:2601.08444, 2026.
- [48] Zifeng Wang and Jimeng Sun. TransTab: Learning transferable tabular transformers across tables. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- [49] Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, et al. Chain-of-table: Evolving tables in the reasoning chain for table understanding. arXiv preprint arXiv:2401.04398, 2024.
- [50] Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, and Heng Ji. Eliminating position bias of language models: A mechanistic approach. In The Thirteenth International Conference on Learning Representations, 2025.
- [51] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, 2019.
- [52] Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul. TableFormer: Robust transformer modeling for table-text encoding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, 2022.
- [53] Muchao Ye, Chenglin Miao, Ting Wang, and Fenglong Ma. TextHoaxer: Budgeted hard-label adversarial attacks on text. In Thirty-Sixth AAAI Conference on Artificial Intelligence. AAAI Press, 2022.
- [54] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 174–184, 2023.
- [55] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), Taipei, Taiwan. ACM, 2023.
- [56] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
- [57] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, 2024.
- [58] Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Li Yang, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, et al. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. arXiv preprint arXiv:2403.19318, 2024.
- [59] Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng, and Hanwang Zhang. Certified robustness against natural language attacks by causal intervention. In International Conference on Machine Learning, pages 26958–26970. PMLR, 2022.
- [60] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [61] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021.
- [62] Yongshuo Zong, Tingyang Yu, Ruchika Chavhan, Bingchen Zhao, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
Appendix A.1 Metric Used for Alignment Scores by LLM-as-judge
We follow [60] and use an LLM-as-judge to measure semantic alignment between the ground-truth …