
arxiv: 2605.15393 · v1 · pith:ENWG2XKPnew · submitted 2026-05-14 · 💻 cs.LG

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

Pith reviewed 2026-05-19 16:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM robustness · logic-preserving variations · difficulty scaling · evaluation framework · reasoning errors · fine-tuning · robustness testing

The pith

Logic-preserving difficulty scaling finds problem variations that produce performance drops up to five times larger than randomly sampled variations do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces logic-preserving difficulty scaling to systematically identify harder versions of problems where the underlying logic stays the same but details such as names or numbers change. It demonstrates that as difficulty rises according to the measure, models show declining performance and more errors in their reasoning steps. The targeted search for difficult variations produces performance drops up to five times larger than those from random sampling of allowable changes. Fine-tuning on the harder variations yields more consistent robustness improvements compared to training on easier versions.

Core claim

Logic-preserving difficulty scaling quantifies the difficulty of allowable problem variations while keeping the core logic fixed and searches the space of such variations to maximize difficulty for a given model. This process shows that performance declines and errors in reasoning chains become more pronounced as difficulty increases. The method finds variations that induce performance drops up to 5 times larger than random sampling, and fine-tuning on the difficult variations produces more consistent robustness gains than fine-tuning on easier ones.

What carries the argument

Logic-preserving difficulty scaling (LPDS), a framework that assigns difficulty scores to logic-preserving problem variations and performs a targeted search to maximize those scores for a specific model.
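
As a rough illustration of the two steps named above (score a variation, then search for a high-scoring one), the following minimal sketch may help. Everything in it is assumed for illustration: the template, the slot values, the `toy_difficulty` function (a stand-in that simply rewards larger numbers and longer names), and the coordinate-wise greedy search. The paper's actual difficulty measure and search procedure are not described in this summary, so this should not be read as the authors' method.

```python
import itertools
import random

# Hypothetical problem template: the logic (a * b) is fixed, only the slots change.
TEMPLATE = "{name} buys {a} boxes with {b} pens each. How many pens does {name} have?"

# Hypothetical allowable values for each slot.
SLOTS = {
    "name": ["Al", "Maria", "Bartholomew"],
    "a": [2, 17, 489],
    "b": [3, 24, 731],
}


def render(assignment):
    """Fill the template and return (question, ground-truth answer)."""
    return TEMPLATE.format(**assignment), assignment["a"] * assignment["b"]


def toy_difficulty(assignment):
    """Stand-in score (an assumption): more digits and longer names count as harder."""
    digits = len(str(assignment["a"])) + len(str(assignment["b"]))
    return digits + len(assignment["name"]) / 10


def random_variation(rng):
    """Baseline: pick every slot value uniformly at random."""
    return {slot: rng.choice(values) for slot, values in SLOTS.items()}


def greedy_search(score, start):
    """Coordinate-wise hill climbing: change one slot at a time while the score improves."""
    current = dict(start)
    improved = True
    while improved:
        improved = False
        for slot, values in SLOTS.items():
            for value in values:
                candidate = {**current, slot: value}
                if score(candidate) > score(current):
                    current, improved = candidate, True
    return current


# Exhaustive enumeration is feasible only because this toy space has 27 points.
all_assignments = [dict(zip(SLOTS, combo)) for combo in itertools.product(*SLOTS.values())]

rng = random.Random(0)
start = random_variation(rng)
hardest = greedy_search(toy_difficulty, start)
question, answer = render(hardest)
```

Because the toy score is a sum of per-slot terms, the greedy search reaches the global maximum here; a real difficulty measure with interacting slots would offer no such guarantee, and a real space of variations would usually be too large to enumerate.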

If this is right

  • Model performance declines steadily as the quantified difficulty of logic-preserving variations increases.
  • Errors in the models' reasoning chains become more pronounced at higher difficulty levels.
  • LPDS identifies variations that produce performance drops up to 5 times larger than those from random sampling.
  • Fine-tuning on more difficult variations produces more consistent robustness gains than fine-tuning on easier variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar difficulty scaling could be applied to other robustness properties, such as consistency across different prompt phrasings or output formats.
  • A training schedule that gradually introduces harder logic-preserving variations might build more stable reasoning than fixed-difficulty datasets.
  • Before deployment in settings where small input changes should not alter outcomes, developers could run LPDS-style searches to surface hidden inconsistencies.

Load-bearing premise

That difficulty scores for logic-preserving variations can be assigned in a way that reliably predicts actual model failures rather than just marking differences in the inputs.

What would settle it

Running the same set of problems on multiple models and comparing accuracy drops on LPDS-selected variations versus randomly selected variations to check whether the fivefold difference holds or disappears.
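
The comparison described above reduces to a ratio of two accuracy drops per model. The sketch below shows one way to compute it; the function names and the correctness records are made up for illustration and are not results from the paper.

```python
def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def drop_ratio(original, random_variants, targeted_variants):
    """Accuracy drop on targeted variants divided by the drop on random variants."""
    base = accuracy(original)
    random_drop = base - accuracy(random_variants)
    targeted_drop = base - accuracy(targeted_variants)
    if random_drop <= 0:
        return None  # ratio is undefined when random variants cause no drop
    return targeted_drop / random_drop


# Made-up correctness records for one hypothetical model (True = correct answer).
original = [True] * 9 + [False] * 1           # 90% on the unmodified problems
random_variants = [True] * 8 + [False] * 2    # 80% on randomly sampled variations
targeted_variants = [True] * 5 + [False] * 5  # 50% on search-selected variations

ratio = drop_ratio(original, random_variants, targeted_variants)
```

With these invented numbers the ratio comes out at about 4. Repeating the calculation per model, with confidence intervals over multiple runs, would show whether the reported fivefold gap is stable or specific to one setup.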

Original abstract

As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities–such as names, numbers, or other contextual details–have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models' reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Logic-Preserving Difficulty Scaling (LPDS), a framework that quantifies the difficulty of logic-preserving variations of problems (e.g., changes to names, numbers, or context while preserving underlying logic) and employs a search procedure to identify high-difficulty instances. It claims that model performance declines consistently as difficulty increases, that LPDS uncovers variations producing performance drops up to 5 times larger than those found by random sampling, and that fine-tuning on difficult variations yields more consistent robustness improvements than training on easier ones.

Significance. If the difficulty quantification proves independent of target-model outputs and the search reliably surfaces genuinely harder instances rather than model-specific weaknesses, LPDS could offer a more systematic alternative to random variation testing for LLM robustness evaluation. The reported 5x performance-drop differential and the fine-tuning results would then provide concrete evidence that targeted exposure to scaled difficulty improves consistency, addressing a recognized gap in current evaluation practices.

major comments (2)
  1. [§3] LPDS Framework: The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.
  2. [§5] Experiments: The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.
minor comments (2)
  1. [Abstract] The abstract states 'up to 5 times larger' without specifying the exact models, datasets, or number of runs; the main text should provide these details together with confidence intervals.
  2. [§3] Notation for the difficulty function and the search objective should be introduced once and used consistently; currently the transition from the quantification step to the search algorithm is abrupt.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications on the model-independence of our difficulty scoring and agreeing to include additional ablations in the revised version.

  1. Referee: [§3] LPDS Framework: The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.

    Authors: In the LPDS framework presented in §3, the difficulty score is computed only from properties of the problem variations themselves, including the degree of entity modification, contextual complexity, and logical equivalence checks. None of these requires access to the target LLM's outputs, reasoning traces, or error patterns. The search for high-difficulty instances is therefore not circular: it identifies variations that are more challenging because of their structural properties. We will update the manuscript with a dedicated subsection or appendix that states this independence explicitly and demonstrates it. revision: yes

  2. Referee: [§5] Experiments: The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.

    Authors: We acknowledge the value of the suggested ablation. While our difficulty scoring function is already based on problem-intrinsic features without model responses, as clarified above, we will perform and report an additional ablation study in §5. This will involve recomputing difficulty scores using only basic intrinsic features like syntactic complexity and entity count, and comparing the resulting performance drops to those from the full LPDS scoring. We expect this to confirm that the 5x larger drops are attributable to the more comprehensive difficulty scaling rather than exploitation of model weaknesses. The revised manuscript will include these results. revision: yes
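
To make the proposed control concrete, a score built only from problem-intrinsic features might look like the sketch below. The feature set, the weights, and the function names are assumptions chosen for illustration: token count is a crude proxy for syntactic complexity and number count a crude proxy for entity count. Nothing here is taken from the paper.

```python
import re


def intrinsic_features(question):
    """Features computed from the problem text alone, with no model in the loop."""
    numbers = re.findall(r"\d+", question)
    tokens = question.split()
    return {
        "token_count": len(tokens),      # crude proxy for syntactic complexity
        "number_count": len(numbers),    # crude proxy for entity count
        "max_digits": max((len(n) for n in numbers), default=0),
    }


# Hypothetical weights; they are not taken from the paper.
WEIGHTS = {"token_count": 0.1, "number_count": 0.5, "max_digits": 1.0}


def intrinsic_score(question):
    """Weighted sum of the text-only features."""
    features = intrinsic_features(question)
    return sum(WEIGHTS[name] * value for name, value in features.items())


easy = "Al has 2 bags with 3 apples each. How many apples?"
hard = "Bartholomew has 489 bags with 731 apples each. How many apples?"

easy_score = intrinsic_score(easy)
hard_score = intrinsic_score(hard)
```

Since the score never queries a model, any accuracy gap between variations it ranks high and variations it ranks low cannot come from prior knowledge of that model's failures, which is the point of the requested ablation.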

Circularity Check

0 steps flagged

No significant circularity in LPDS derivation or claims


The paper introduces LPDS as an independent quantification of difficulty for logic-preserving variations, followed by a search procedure whose outputs are evaluated against a random-sampling baseline. This comparison supplies an external benchmark that is not derived from the difficulty metric itself. No equations, definitions, or self-citations reduce the reported performance drops or robustness gains to fitted parameters or prior author results by construction. The framework remains self-contained because difficulty scaling and failure exposure are measured empirically rather than tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that logic preservation can be defined and verified for the chosen tasks, plus whatever internal parameters are used to score difficulty and run the search; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: logic-preserving variations exist and can be systematically enumerated or searched for a given problem.
    Invoked when the framework is defined in the abstract.

pith-pipeline@v0.9.0 · 5785 in / 1231 out tokens · 39649 ms · 2026-05-19T16:28:43.234195+00:00 · methodology


