LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
Pith reviewed 2026-05-19 16:28 UTC · model grok-4.3
The pith
Logic-preserving difficulty scaling finds problem variations that cause performance drops up to five times larger than those found by random sampling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Logic-preserving difficulty scaling quantifies the difficulty of allowable problem variations while keeping the core logic fixed and searches the space of such variations to maximize difficulty for a given model. This process shows that performance declines and errors in reasoning chains become more pronounced as difficulty increases. The method finds variations that induce performance drops up to 5 times larger than random sampling, and fine-tuning on the difficult variations produces more consistent robustness gains than fine-tuning on easier ones.
What carries the argument
Logic-preserving difficulty scaling (LPDS), a framework that assigns difficulty scores to logic-preserving problem variations and performs a targeted search to maximize those scores for a specific model.
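As a rough illustration of the two-step idea (score variations, then search for the hardest one), the sketch below compares a budgeted random baseline with a greedy coordinate search over a toy variation space. Everything in it is hypothetical: the `SLOTS` table, the `toy_difficulty` function, and the hill-climbing strategy are stand-ins chosen for readability, not the paper's scoring function or search procedure.

```python
import random

# Hypothetical variation space: each slot can be refilled independently
# without changing the problem's underlying logic.
SLOTS = {
    "name": ["Ann", "Bartholomew", "Xiuying"],
    "quantity": [3, 47, 1289],
    "price": [2, 19, 305],
}

def toy_difficulty(variation):
    """Stand-in difficulty score (an assumption, not the paper's metric):
    longer names and numbers with more digits count as harder."""
    return (len(variation["name"])
            + len(str(variation["quantity"]))
            + len(str(variation["price"])))

def random_baseline(score, budget, rng):
    """Score `budget` uniformly sampled variations and return the hardest one."""
    samples = [{k: rng.choice(v) for k, v in SLOTS.items()} for _ in range(budget)]
    return max(samples, key=score)

def greedy_search(score, start):
    """Coordinate-wise hill climbing: for each slot in turn, keep the value
    that maximizes the score; repeat until no slot changes."""
    current = dict(start)
    improved = True
    while improved:
        improved = False
        for slot, values in SLOTS.items():
            best = max(values, key=lambda v: score({**current, slot: v}))
            if score({**current, slot: best}) > score(current):
                current[slot] = best
                improved = True
    return current

rng = random.Random(0)
start = {k: v[0] for k, v in SLOTS.items()}
hardest = greedy_search(toy_difficulty, start)
baseline = random_baseline(toy_difficulty, budget=3, rng=rng)
```

Because the toy score is a sum of per-slot terms, the greedy search reaches the global maximum here; a real difficulty score need not be separable, in which case a greedy search would only find a local optimum.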
If this is right
- Model performance declines steadily as the quantified difficulty of logic-preserving variations increases.
- Errors in the models' reasoning chains become more pronounced at higher difficulty levels.
- LPDS identifies variations that produce performance drops up to 5 times larger than those from random sampling.
- Fine-tuning on more difficult variations produces more consistent robustness gains than fine-tuning on easier variations.
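The first implication above (accuracy falling as difficulty rises) could be checked by binning evaluated variations by difficulty score and comparing per-bin accuracy. The sketch below is a minimal version of that check; the function name, the equal-size binning, and the sample records are all illustrative assumptions, and the data is made up.

```python
from statistics import mean

def accuracy_by_difficulty(records, n_bins=3):
    """records: list of (difficulty_score, is_correct) pairs.
    Returns per-bin accuracy, ordered from the easiest to the hardest bin.
    Assumes len(records) >= n_bins."""
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered) // n_bins
    bins = []
    for b in range(n_bins):
        lo = b * size
        # The last bin absorbs any remainder.
        hi = len(ordered) if b == n_bins - 1 else lo + size
        bins.append(mean(1.0 if ok else 0.0 for _, ok in ordered[lo:hi]))
    return bins

# Made-up records, for illustration only.
records = [(0.1, True), (0.2, True), (0.3, True),
           (0.4, True), (0.5, False), (0.6, True),
           (0.7, False), (0.8, False), (0.9, True)]
bins = accuracy_by_difficulty(records)
is_monotone = all(a >= b for a, b in zip(bins, bins[1:]))
```

With real data, a non-monotone sequence of bin accuracies would be evidence against the claimed steady decline.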
Where Pith is reading between the lines
- Similar difficulty scaling could be applied to other robustness properties, such as consistency across different prompt phrasings or output formats.
- A training schedule that gradually introduces harder logic-preserving variations might build more stable reasoning than fixed-difficulty datasets.
- Before deployment in settings where small input changes should not alter outcomes, developers could run LPDS-style searches to surface hidden inconsistencies.
Load-bearing premise
That difficulty scores for logic-preserving variations can be assigned in a way that reliably predicts actual model failures rather than just marking differences in the inputs.
What would settle it
Running the same set of problems on multiple models and comparing accuracy drops on LPDS-selected variations versus randomly selected variations to check whether the fivefold difference holds or disappears.
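The comparison described above reduces to a ratio of two accuracy drops, each measured against the unmodified problems. A minimal sketch follows, assuming per-problem outcomes are available as 0/1 lists; the function names and the numbers are hypothetical and not taken from the paper.

```python
def accuracy(results):
    """Fraction of correct answers in a list of 0/1 outcomes."""
    return sum(results) / len(results)

def drop_ratio(original, lpds_selected, random_selected):
    """Accuracy drop on LPDS-selected variations divided by the drop on
    randomly selected variations, both relative to the original problems."""
    base = accuracy(original)
    targeted_drop = base - accuracy(lpds_selected)
    random_drop = base - accuracy(random_selected)
    if random_drop <= 0:
        raise ValueError("random sampling produced no drop; ratio undefined")
    return targeted_drop / random_drop

# Made-up outcomes (1 = correct, 0 = wrong), for illustration only.
original = [1] * 18 + [0] * 2         # accuracy 0.90
random_selected = [1] * 17 + [0] * 3  # accuracy 0.85
lpds_selected = [1] * 13 + [0] * 7    # accuracy 0.65
ratio = drop_ratio(original, lpds_selected, random_selected)  # about 5
```

Repeating this per model, with confidence intervals over several random draws, would show whether the ratio is stable or driven by a single favorable case.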
Original abstract
As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities – such as names, numbers, or other contextual details – have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models' reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.
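To make "logic-preserving variation" concrete, the sketch below fills a word-problem template with different names and numbers while the answer is always derived by the same rule. The template text, the slot values, and the function names are invented for illustration and do not come from the paper's benchmarks.

```python
import random

# Hypothetical template: only the entities change, never the logic.
TEMPLATE = ("{name} buys {n} notebooks at ${price} each. "
            "How much does {name} pay in total?")

def instantiate(name, n, price):
    """Fill the template; the answer always follows the same rule (n * price)."""
    return TEMPLATE.format(name=name, n=n, price=price), n * price

def sample_variation(rng):
    """Draw one random variation, as a random-sampling evaluation would."""
    name = rng.choice(["Ava", "Oluwaseun", "Wei"])
    return instantiate(name, rng.randint(2, 50), rng.randint(1, 20))

rng = random.Random(0)
question, answer = sample_variation(rng)
```

A model that is robust in the abstract's sense should answer every instantiation correctly, since each one requires the same single multiplication.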
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Logic-Preserving Difficulty Scaling (LPDS), a framework that quantifies the difficulty of logic-preserving variations of problems (e.g., changes to names, numbers, or context while preserving underlying logic) and employs a search procedure to identify high-difficulty instances. It claims that model performance declines consistently as difficulty increases, that LPDS uncovers variations producing performance drops up to 5 times larger than those found by random sampling, and that fine-tuning on difficult variations yields more consistent robustness improvements than training on easier ones.
Significance. If the difficulty quantification proves independent of target-model outputs and the search reliably surfaces genuinely harder instances rather than model-specific weaknesses, LPDS could offer a more systematic alternative to random variation testing for LLM robustness evaluation. The reported 5x performance-drop differential and the fine-tuning results would then provide concrete evidence that targeted exposure to scaled difficulty improves consistency, addressing a recognized gap in current evaluation practices.
major comments (2)
- [§3] §3 (LPDS Framework): The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.
- [§5] §5 (Experiments): The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.
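The ablation requested in the second major comment needs a difficulty score that depends only on the problem text. A minimal sketch of such a score is shown below; the chosen features (token count, number count, longest number) and their weights are arbitrary assumptions for illustration, not features the paper is known to use.

```python
import re

def intrinsic_difficulty(problem_text):
    """Hypothetical score from surface features only; no model is queried."""
    tokens = problem_text.split()
    numbers = re.findall(r"\d+", problem_text)
    max_digits = max((len(n) for n in numbers), default=0)
    # Weights are arbitrary; a real ablation would need to justify or tune them.
    return len(tokens) + 2 * len(numbers) + 3 * max_digits

easy = "Ann buys 3 pens at $2 each. What is the total?"
hard = "Bartholomew buys 1289 pens at $305 each. What is the total?"
easy_score = intrinsic_difficulty(easy)
hard_score = intrinsic_difficulty(hard)
```

If variations ranked by a score like this produced drops comparable to the full method, that would support the view that difficulty is a property of the problem; a much smaller drop would suggest the full score relies on model-specific signal.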
minor comments (2)
- [Abstract] The abstract states 'up to 5 times larger' without specifying the exact models, datasets, or number of runs; the main text should provide these details together with confidence intervals.
- [§3] Notation for the difficulty function and the search objective should be introduced once and used consistently; currently the transition from the quantification step to the search algorithm is abrupt.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications on the model-independence of our difficulty scoring and agreeing to include additional ablations in the revised version.
- Referee: [§3] §3 (LPDS Framework): The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.
  Authors: In the LPDS framework presented in §3, the difficulty score is computed using only logic-preserving properties of the problem variations, including metrics such as the degree of entity modification, contextual complexity, and logical equivalence checks, all of which are determined without any access to or reference to the target LLM's outputs, reasoning traces, or error patterns. This ensures that the search for high-difficulty instances is not circular but identifies variations that are inherently more challenging due to their structural properties. We will update the manuscript to include a dedicated subsection or appendix explicitly demonstrating and stating this independence.
  Revision: yes
- Referee: [§5] §5 (Experiments): The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.
  Authors: We acknowledge the value of the suggested ablation. While our difficulty scoring function is already based on problem-intrinsic features without model responses, as clarified above, we will perform and report an additional ablation study in §5. This will involve recomputing difficulty scores using only basic intrinsic features like syntactic complexity and entity count, and comparing the resulting performance drops to those from the full LPDS scoring. We expect this to confirm that the 5x larger drops are attributable to the more comprehensive difficulty scaling rather than exploitation of model weaknesses. The revised manuscript will include these results.
  Revision: yes
Circularity Check
No significant circularity in LPDS derivation or claims
The paper introduces LPDS as an independent quantification of difficulty for logic-preserving variations, followed by a search procedure whose outputs are evaluated against a random-sampling baseline. This comparison supplies an external benchmark that is not derived from the difficulty metric itself. No equations, definitions, or self-citations reduce the reported performance drops or robustness gains to fitted parameters or prior author results by construction. The framework remains self-contained because difficulty scaling and failure exposure are measured empirically rather than tautologically.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Logic-preserving variations exist and can be systematically enumerated or searched for a given problem.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We quantify the difficulty of a problem variation $p_i \in T$ as the distance between the model's response $Y_i$ and the template's reasoning graph... $\mathrm{MD}_H = (H_i^{(l)} - \mu_H)^\top \Sigma_H^{-1} (H_i^{(l)} - \mu_H)$"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations"
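The first quoted passage above gives a quadratic form with the shape of a squared Mahalanobis distance. The sketch below evaluates such a form in dependency-free Python. The reading of the symbols is an assumption: `h` stands for a representation vector, and `mu` and `sigma` for a reference mean and covariance. The quoted passage does not define them, so the function names and the toy inputs are illustrative only.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def mahalanobis_sq(h, mu, sigma):
    """(h - mu)^T Sigma^{-1} (h - mu), computed by solving Sigma z = (h - mu)
    instead of forming the inverse explicitly."""
    d = [a - b for a, b in zip(h, mu)]
    z = solve(sigma, d)
    return sum(di * zi for di, zi in zip(d, z))

# Toy inputs: variance 4 along the first axis, 1 along the second.
mu = [0.0, 0.0]
sigma = [[4.0, 0.0], [0.0, 1.0]]
along_wide_axis = mahalanobis_sq([2.0, 0.0], mu, sigma)    # 2^2 / 4
along_narrow_axis = mahalanobis_sq([0.0, 2.0], mu, sigma)  # 2^2 / 1
```

The two toy points are equally far from the mean in Euclidean terms, but the one displaced along the low-variance axis scores higher, which is the property that makes this kind of distance useful for flagging atypical inputs.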
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.