LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

Dieuwke Hupkes; Jesse Dodge; Philipp Mondorf; Samuel J. Bell

arxiv: 2605.15393 · v1 · pith:ENWG2XKP · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-05-19 16:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM robustness · logic-preserving variations · difficulty scaling · evaluation framework · reasoning errors · fine-tuning · robustness testing

The pith

Logic-preserving difficulty scaling finds problem variations that cause language models to fail up to five times more often than random tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces logic-preserving difficulty scaling to systematically identify harder versions of problems where the underlying logic stays the same but details such as names or numbers change. It demonstrates that as difficulty rises according to the measure, models show declining performance and more errors in their reasoning steps. The targeted search for difficult variations produces performance drops up to five times larger than those from random sampling of allowable changes. Fine-tuning on the harder variations yields more consistent robustness improvements compared to training on easier versions.

Core claim

Logic-preserving difficulty scaling quantifies the difficulty of allowable problem variations while keeping the core logic fixed and searches the space of such variations to maximize difficulty for a given model. This process shows that performance declines and errors in reasoning chains become more pronounced as difficulty increases. The method finds variations that induce performance drops up to 5 times larger than random sampling, and fine-tuning on the difficult variations produces more consistent robustness gains than fine-tuning on easier ones.

What carries the argument

Logic-preserving difficulty scaling (LPDS), a framework that assigns difficulty scores to logic-preserving problem variations and performs a targeted search to maximize those scores for a specific model.
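
The score-then-search idea can be sketched in a few lines. This is only an illustration under stated assumptions: the slot table `SLOTS`, the scoring function `toy_difficulty`, and the greedy one-slot-at-a-time strategy are all hypothetical stand-ins, because this page does not specify the paper's actual scoring function or search algorithm.

```python
import random
from typing import Callable, Dict, List

Variation = Dict[str, object]

# Hypothetical slot values for one problem template.
# The underlying logic (compute a + b) is the same for every combination.
SLOTS: Dict[str, List[object]] = {
    "name": ["Ann", "Bartholomew", "Xiuying"],
    "a": [3, 47, 9981],
    "b": [4, 58, 7643],
}


def toy_difficulty(v: Variation) -> float:
    """Placeholder score (an assumption): longer names and numbers count as harder."""
    return float(len(str(v["name"])) + len(str(v["a"])) + len(str(v["b"])))


def random_variation(rng: random.Random) -> Variation:
    """Baseline: draw every slot uniformly at random."""
    return {slot: rng.choice(values) for slot, values in SLOTS.items()}


def greedy_search(score: Callable[[Variation], float], start: Variation) -> Variation:
    """Change one slot at a time and keep any change that raises the score."""
    best = dict(start)
    improved = True
    while improved:
        improved = False
        for slot, values in SLOTS.items():
            for value in values:
                candidate = {**best, slot: value}
                if score(candidate) > score(best):
                    best, improved = candidate, True
    return best


rng = random.Random(0)
start = random_variation(rng)
hardest = greedy_search(toy_difficulty, start)
random_scores = [toy_difficulty(random_variation(rng)) for _ in range(5)]
```

In a real setting the placeholder score would be replaced by whatever difficulty measure the framework defines, and the slot table by the template's allowable substitutions.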

If this is right

Model performance declines steadily as the quantified difficulty of logic-preserving variations increases.
Errors in the models' reasoning chains become more pronounced at higher difficulty levels.
LPDS identifies variations that produce performance drops up to 5 times larger than those from random sampling.
Fine-tuning on more difficult variations produces more consistent robustness gains than fine-tuning on easier variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar difficulty scaling could be applied to other robustness properties, such as consistency across different prompt phrasings or output formats.
A training schedule that gradually introduces harder logic-preserving variations might build more stable reasoning than fixed-difficulty datasets.
Before deployment in settings where small input changes should not alter outcomes, developers could run LPDS-style searches to surface hidden inconsistencies.

Load-bearing premise

That difficulty scores for logic-preserving variations can be assigned in a way that reliably predicts actual model failures rather than just marking differences in the inputs.

What would settle it

Running the same set of problems on multiple models and comparing accuracy drops on LPDS-selected variations versus randomly selected variations to check whether the fivefold difference holds or disappears.
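
A minimal sketch of that comparison follows, assuming per-problem correctness has already been recorded for each model on the original problems, on randomly sampled variations, and on search-selected variations. The model names and outcome lists are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Sequence


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(outcomes) / len(outcomes)


def drop_ratio(
    original: Sequence[bool],
    random_vars: Sequence[bool],
    targeted_vars: Sequence[bool],
) -> float:
    """Accuracy drop on targeted variations divided by the drop on random ones."""
    base = accuracy(original)
    random_drop = base - accuracy(random_vars)
    targeted_drop = base - accuracy(targeted_vars)
    if random_drop <= 0:
        raise ValueError("no drop under random sampling; ratio is undefined")
    return targeted_drop / random_drop


# Invented outcomes for two hypothetical models (True = correct answer).
results: Dict[str, Dict[str, List[bool]]] = {
    "model_a": {
        "original": [True] * 10,
        "random": [True] * 8 + [False] * 2,
        "targeted": [True] * 4 + [False] * 6,
    },
    "model_b": {
        "original": [True] * 10,
        "random": [True] * 9 + [False] * 1,
        "targeted": [True] * 8 + [False] * 2,
    },
}

ratios = {
    name: drop_ratio(r["original"], r["random"], r["targeted"])
    for name, r in results.items()
}
```

If the ratio stays well above 1 across models, the targeted search is finding something random sampling misses; if it collapses toward 1, the reported gap does not generalize.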

Original abstract

As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities–such as names, numbers, or other contextual details–have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models' reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LPDS gives a directed search for logic-preserving variations that produce larger robustness drops than random sampling.

LPDS is a practical way to generate harder logic-preserving test cases for LLMs that expose bigger robustness gaps than random sampling does. The new part is the two-part setup: first scoring difficulty of variations, then searching for the high-scoring ones that keep the logic intact. This is more directed than prior random-variation tests.

The results show performance dropping steadily with higher difficulty scores, and the targeted search producing drops five times larger than random picks. They also find that fine-tuning on those hard cases improves consistency more than training on easy ones. That last bit is a nice practical takeaway for making models more reliable.

The potential issue is whether the difficulty score really measures something model-agnostic. If it relies on features that already track where this particular LLM fails, the whole thing becomes a smarter way to find weaknesses rather than a general hardness scale. The abstract does not spell out the exact scoring function, so it is worth checking the methods to see if it stays independent of the evaluated model. If it does, the comparison to random sampling holds up well.

This paper is aimed at researchers building evaluation suites for LLMs, especially those focused on robustness under minimal oversight. Anyone running robustness benchmarks will find the method worth trying out.

I would send this to peer review. The core idea is sound and the experiments make a clear case for the approach, though the independence of the difficulty measure needs explicit confirmation in the writeup.

Referee Report

2 major / 2 minor

Summary. The paper introduces Logic-Preserving Difficulty Scaling (LPDS), a framework that quantifies the difficulty of logic-preserving variations of problems (e.g., changes to names, numbers, or context while preserving underlying logic) and employs a search procedure to identify high-difficulty instances. It claims that model performance declines consistently as difficulty increases, that LPDS uncovers variations producing performance drops up to 5 times larger than those found by random sampling, and that fine-tuning on difficult variations yields more consistent robustness improvements than training on easier ones.

Significance. If the difficulty quantification proves independent of target-model outputs and the search reliably surfaces genuinely harder instances rather than model-specific weaknesses, LPDS could offer a more systematic alternative to random variation testing for LLM robustness evaluation. The reported 5x performance-drop differential and the fine-tuning results would then provide concrete evidence that targeted exposure to scaled difficulty improves consistency, addressing a recognized gap in current evaluation practices.

major comments (2)

[§3] LPDS Framework: The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.
[§5] Experiments: The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.
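
To make the requested control concrete, here is a rough sketch of a score that uses only features of the problem text and never touches a model response. The feature set and the `WEIGHTS` values are hypothetical choices for illustration; the referee names syntactic complexity and entity count only as examples, and nothing here is taken from the paper.

```python
import re
from typing import Dict


def intrinsic_features(problem: str) -> Dict[str, int]:
    """Surface features computed from the problem text alone."""
    numbers = re.findall(r"\d+", problem)
    return {
        "tokens": len(problem.split()),            # whitespace-separated tokens
        "numbers": len(numbers),                   # how many numerals appear
        "digits": sum(len(n) for n in numbers),    # total digits across numerals
    }


# Hypothetical weights; a real ablation would need to justify or tune these.
WEIGHTS: Dict[str, float] = {"tokens": 0.1, "numbers": 1.0, "digits": 0.5}


def intrinsic_score(problem: str) -> float:
    """Weighted sum of intrinsic features; no model outputs are consulted."""
    feats = intrinsic_features(problem)
    return sum(WEIGHTS[name] * value for name, value in feats.items())


easy = "Ann has 3 apples and buys 4 more. How many apples does Ann have?"
hard = (
    "Bartholomew has 9981 apples and buys 7643 more. "
    "How many apples does Bartholomew have?"
)
easy_score = intrinsic_score(easy)
hard_score = intrinsic_score(hard)
```

Because the two example problems share the same template, only the digit count separates them here, which is the kind of model-independent signal the ablation is meant to isolate.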

minor comments (2)

[Abstract] The abstract states 'up to 5 times larger' without specifying the exact models, datasets, or number of runs; the main text should provide these details together with confidence intervals.
[§3] Notation for the difficulty function and the search objective should be introduced once and used consistently; currently the transition from the quantification step to the search algorithm is abrupt.
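
One common way to attach confidence intervals to an accuracy drop, as the first minor comment asks, is a percentile bootstrap. The sketch below is an assumption about how that could be done, not the paper's procedure, and the outcome lists are invented.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_drop_ci(
    random_outcomes: Sequence[bool],
    targeted_outcomes: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(random) - accuracy(targeted)."""
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample each condition with replacement, keeping its original size.
        r = rng.choices(random_outcomes, k=len(random_outcomes))
        t = rng.choices(targeted_outcomes, k=len(targeted_outcomes))
        diffs.append(sum(r) / len(r) - sum(t) / len(t))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented outcomes: 80/100 correct on random variations, 40/100 on targeted ones.
random_outcomes = [True] * 80 + [False] * 20
targeted_outcomes = [True] * 40 + [False] * 60
lower, upper = bootstrap_drop_ci(random_outcomes, targeted_outcomes)
```

Reporting the interval alongside the point estimate shows whether a gap between targeted and random variations is larger than run-to-run noise.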

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications on the model-independence of our difficulty scoring and agreeing to include additional ablations in the revised version.

Referee: [§3] LPDS Framework: The difficulty scoring function must be shown to be computed without reference to the target LLM's outputs, reasoning traces, or error patterns on the candidate variations. If any component of the score incorporates model-specific information, the central comparison to random sampling becomes circular, because the search is then guided toward already-known failure modes rather than independently harder logic-preserving instances.

Authors: In the LPDS framework presented in §3, the difficulty score is computed using only logic-preserving properties of the problem variations, including metrics such as the degree of entity modification, contextual complexity, and logical equivalence checks, all of which are determined without any access to or reference to the target LLM's outputs, reasoning traces, or error patterns. This ensures that the search for high-difficulty instances is not circular but identifies variations that are inherently more challenging due to their structural properties. We will update the manuscript to include a dedicated subsection or appendix explicitly demonstrating and stating this independence.
Revision: yes

Referee: [§5] Experiments: The reported performance drops and 5x improvement over random sampling should be accompanied by an ablation that recomputes difficulty scores using only problem-intrinsic features (e.g., syntactic complexity or entity count) with no access to model responses. Without this control, it remains unclear whether the larger drops reflect true difficulty scaling or simply more effective exploitation of the evaluated model's weaknesses.

Authors: We acknowledge the value of the suggested ablation. While our difficulty scoring function is already based on problem-intrinsic features without model responses, as clarified above, we will perform and report an additional ablation study in §5. This will involve recomputing difficulty scores using only basic intrinsic features like syntactic complexity and entity count, and comparing the resulting performance drops to those from the full LPDS scoring. We expect this to confirm that the 5x larger drops are attributable to the more comprehensive difficulty scaling rather than exploitation of model weaknesses. The revised manuscript will include these results.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in LPDS derivation or claims

The paper introduces LPDS as an independent quantification of difficulty for logic-preserving variations, followed by a search procedure whose outputs are evaluated against a random-sampling baseline. This comparison supplies an external benchmark that is not derived from the difficulty metric itself. No equations, definitions, or self-citations reduce the reported performance drops or robustness gains to fitted parameters or prior author results by construction. The framework remains self-contained because difficulty scaling and failure exposure are measured empirically rather than tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that logic preservation can be defined and verified for the chosen tasks, plus whatever internal parameters are used to score difficulty and run the search; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Logic-preserving variations exist and can be systematically enumerated or searched for a given problem.
Invoked when the framework is defined in the abstract.

pith-pipeline@v0.9.0 · 5785 in / 1231 out tokens · 39649 ms · 2026-05-19T16:28:43.234195+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We quantify the difficulty of a problem variation $p_i \in T$ as the distance between the model’s response $Y_i$ and the template’s reasoning graph... $\mathrm{MD}_H = (H_i^{(l)} - \mu_H)^\top \Sigma_H^{-1} (H_i^{(l)} - \mu_H)$
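
The expression in the quoted passage has the form of a squared Mahalanobis distance: a vector minus a mean, weighted by an inverse covariance matrix. The sketch below computes that quantity in plain Python. It assumes the vector is some fixed-size representation and that the mean and covariance are estimated from a reference set; the excerpt does not say how these are obtained, so the `reference` data, the `ridge` term, and all function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of the reference vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows: List[Vector], mu: Vector, ridge: float = 1e-6) -> Matrix:
    """Sample covariance plus a small ridge term so the matrix stays invertible."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return cov


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            factor = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (m[i][d] - tail) / m[i][i]
    return x


def mahalanobis_sq(h: Vector, mu: Vector, cov: Matrix) -> float:
    """(h - mu)^T cov^{-1} (h - mu), computed without forming the inverse."""
    diff = [x - m for x, m in zip(h, mu)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))


# Hypothetical 2-D reference vectors standing in for real representations.
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu = mean_vector(reference)
cov = covariance(reference, mu)
near = mahalanobis_sq([1.0, 1.0], mu, cov)  # a point at the reference mean
far = mahalanobis_sq([5.0, 5.0], mu, cov)   # a point far from the reference set
```

Solving the linear system instead of inverting the covariance matrix is a numerical convenience; with real high-dimensional vectors a linear algebra library would be the usual choice.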

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

[1]

Yu, Qiang Yang, and Xing Xie

Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. ACM Trans. Intell. Syst. Technol. , month = mar, articleno =. 2024 , issue_date =. doi:10.1145/36412...

work page doi:10.1145/3641289 2024
[2]

A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35 (2):1–72, 2026

Jiang, Juyong and Wang, Fan and Shen, Jiasi and Kim, Sungju and Kim, Sunghun , title =. ACM Trans. Softw. Eng. Methodol. , month = jul, keywords =. 2025 , publisher =. doi:10.1145/3747588 , abstract =

work page doi:10.1145/3747588 2025
[3]

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Kamalloo, Ehsan and Dziri, Nouha and Clarke, Charles and Rafiei, Davood. Evaluating Open-Domain Question Answering in the Era of Large Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.307

work page doi:10.18653/v1/2023.acl-long.307 2023
[4]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Towards reasoning era: A survey of long chain-of-thought for reasoning large language models , author=. arXiv preprint arXiv:2503.09567 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z
[6]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

The rise and potential of large language model based agents: a survey.Sci

Xi, Zhiheng and Chen, Wenxiang and Guo, Xin and He, Wei and Ding, Yiwen and Hong, Boyang and Zhang, Ming and Wang, Junzhe and Jin, Senjie and Zhou, Enyu and Zheng, Rui and Fan, Xiaoran and Wang, Xiao and Xiong, Limao and Zhou, Yuhao and Wang, Weiran and Jiang, Changhao and Zou, Yicheng and Liu, Xiangyang and Yin, Zhangyue and Dou, Shihan and Weng, Rongxia...

work page doi:10.1007/s11432-024-4222-0
[8]

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems , author=. arXiv preprint arXiv:2504.01990 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

First Conference on Language Modeling , year=

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey , author=. First Conference on Language Modeling , year=

work page
[13]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

A survey on large language model benchmarks, 2025a

A survey on large language model benchmarks , author=. arXiv preprint arXiv:2508.15361 , year=

work page arXiv
[15]

Advances in

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu , booktitle =. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understandi...

work page doi:10.52202/079017-3018
[16]

Bowman , booktitle=

David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman , booktitle=. 2024 , url=

work page 2024
[17]

SciCode: A Research Coding Benchmark Curated by Scientists , url =

Tian, Minyang and Gao, Luyu and Zhang, Shizhuo Dylan and Chen, Xinan and Fan, Cunwei and Guo, Xuefei and Haas, Roland and Ji, Pan and Krongchon, Kittithat and Li, Yao and Liu, Shengyan and Luo, Di and Ma, Yutao and Tong, Hao and Trinh, Kha and Tian, Chenyu and Wang, Zihan and Wu, Bohao and Xiong, Yanyu and Yin, Shengzhu and Zhu, Minhui and Lieret, Kilian ...

work page doi:10.52202/079017-0963
[18]

2025 , url=

Seyed Iman Mirzadeh and Keivan Alizadeh and Hooman Shahrokhi and Oncel Tuzel and Samy Bengio and Mehrdad Farajtabar , booktitle=. 2025 , url=

work page 2025
[19]

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap, 2024

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap , author=. arXiv preprint arXiv:2402.19450 , year=

work page arXiv
[20]

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =

Wu, Zhaofeng and Qiu, Linlu and Ross, Alexis and Aky. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v...

work page doi:10.18653/v1/2024.naacl-long.102 2024
[21]

and Wang, Xuezhi and Zhou, Denny , title =

Chen, Xinyun and Chi, Ryan A. and Wang, Xuezhi and Zhou, Denny , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024
[22]

Large language models can be easily distracted by irrelevant context , year =

Shi, Freda and Chen, Xinyun and Misra, Kanishka and Scales, Nathan and Dohan, David and Chi, Ed and Sch\". Large language models can be easily distracted by irrelevant context , year =. Proceedings of the 40th International Conference on Machine Learning , articleno =

work page
[23]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Measuring Mathematical Problem Solving With the

Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , booktitle=. Measuring Mathematical Problem Solving With the. 2021 , url=

work page 2021
[25]

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning , author=. arXiv preprint arXiv:2506.02515 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

arXiv preprint arXiv:2511.01650 , year=

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning , author=. arXiv preprint arXiv:2511.01650 , year=

work page arXiv
[27]

Automatic Engineering of Long Prompts

Hsieh, Cho-Jui and Si, Si and Yu, Felix and Dhillon, Inderjit. Automatic Engineering of Long Prompts. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.634

work page doi:10.18653/v1/2024.findings-acl.634 2024
[28]

The Twelfth International Conference on Learning Representations , year=

Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers , author=. The Twelfth International Conference on Learning Representations , year=

work page
[29]

Efficient Prompt Optimization Through the Lens of Best Arm Identification , volume =

Shi, Chengshuai and Yang, Kun and Chen, Zihan and Li, Jundong and Yang, Jing and Shen, Cong , booktitle =. Efficient Prompt Optimization Through the Lens of Best Arm Identification , volume =. doi:10.52202/079017-3161 , editor =

work page doi:10.52202/079017-3161
[30]

Toward Human Readable Prompt Tuning: Kubrick ' s The Shining is a good movie, and a good prompt too?

Shi, Weijia and Han, Xiaochuang and Gonen, Hila and Holtzman, Ari and Tsvetkov, Yulia and Zettlemoyer, Luke. Toward Human Readable Prompt Tuning: Kubrick ' s The Shining is a good movie, and a good prompt too?. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.733

work page doi:10.18653/v1/2023.findings-emnlp.733 2023
[31]

Gradient-based Adversarial Attacks against Text Transformers

Guo, Chuan and Sablayrolles, Alexandre and J \'e gou, Herv \'e and Kiela, Douwe. Gradient-based Adversarial Attacks against Text Transformers. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.464

work page doi:10.18653/v1/2021.emnlp-main.464 2021
[32]

and Wallace, Eric and Singh, Sameer

Shin, Taylor and Razeghi, Yasaman and Logan IV, Robert L. and Wallace, Eric and Singh, Sameer. A uto P rompt: E liciting K nowledge from L anguage M odels with A utomatically G enerated P rompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.346

work page doi:10.18653/v1/2020.emnlp-main.346 2020
[33]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling , url =

Zhao, Yiran and Zheng, Wenyue and Cai, Tianle and Long, Xuan and Kawaguchi, Kenji and Goyal, Anirudh and Shieh, Michael Qizhe , booktitle =. Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling , url =. doi:10.52202/079017-1701 , editor =

work page doi:10.52202/079017-1701
[35]

Journal of Artificial Intelligence Research , volume=

Visualisation and'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure , author=. Journal of Artificial Intelligence Research , volume=

work page
[36]

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning

Weber, Lucas and Bruni, Elia and Hupkes, Dieuwke. Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). 2023. doi:10.18653/v1/2023.conll-1.20

work page doi:10.18653/v1/2023.conll-1.20 2023
[37]

The Eleventh International Conference on Learning Representations , year=

Out-of-Distribution Detection and Selective Generation for Conditional Language Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[38]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Scalable Best-of-N Selection for Large Language Models via Self-Certainty , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[39]

Thomas McCoy, Ellie Pavlick, and Tal Linzen

McCoy, R. Thomas and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1334

work page doi:10.18653/v1/p19-1334 2019
[40]

Toolformer: language models can teach themselves to use tools , year =

Schick, Timo and Dwivedi-Yu, Jane and Dess\'. Toolformer: language models can teach themselves to use tools , year =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

work page
[41]

and Beutel, Alex , title =

Garg, Sahaj and Perot, Vincent and Limtiaco, Nicole and Taly, Ankur and Chi, Ed H. and Beutel, Alex , title =. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society , pages =. 2019 , isbn =. doi:10.1145/3306618.3317950 , abstract =

work page doi:10.1145/3306618.3317950 2019
[42]

T., Wu, T., Guestrin, C., and Singh, S

Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.442

work page doi:10.18653/v1/2020.acl-main.442 2020
[43]

H ate C heck: Functional Tests for Hate Speech Detection Models

R. H ate C heck: Functional Tests for Hate Speech Detection Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.4

work page doi:10.18653/v1/2021.acl-long.4 2021
[44]

NeurIPS 2025 Workshop on Efficient Reasoning , year=

Deep Think with Confidence , author=. NeurIPS 2025 Workshop on Efficient Reasoning , year=

work page 2025
[45]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Forty-second International Conference on Machine Learning , year=

Layer by Layer: Uncovering Hidden Representations in Language Models , author=. Forty-second International Conference on Machine Learning , year=

work page
[47]

2025 , journal=

Qwen2.5 Technical Report , author=. 2025 , journal=

work page 2025
[48]

Phi-4 Technical Report

Phi-4 technical report , author=. arXiv preprint arXiv:2412.08905 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , booktitle =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =

work page
[50]

The Twelfth International Conference on Learning Representations , year=

Large Language Models Are Not Robust Multiple Choice Selectors , author=. The Twelfth International Conference on Learning Representations , year=

work page
[51]

and Hruschka, E

Pezeshkpour, Pouya and Hruschka, Estevam. Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.130

work page doi:10.18653/v1/2024.findings-naacl.130 2024
[52]

The Twelfth International Conference on Learning Representations , year=

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting , author=. The Twelfth International Conference on Learning Representations , year=

work page
[53]

arXiv preprint arXiv:2510.05152 , year=

A single character can make or break your LLM evals , author=. arXiv preprint arXiv:2510.05152 , year=

work page arXiv
[54]

Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo. Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales? doi:10.52202/079017-3936
[55]

Gan, Esther and Zhao, Yiran and Cheng, Liying and Yancan, Mao and Goyal, Anirudh and Kawaguchi, Kenji and Kan, Min-Yen and Shieh, Michael. Reasoning Robustness of LLMs to Adversarial Typographical Errors. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.584
[56]

Wang, Bin and Wei, Chengwei and Liu, Zhengyuan and Lin, Geyu and Chen, Nancy F. Resilience of Large Language Models for Noisy Instructions. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.697
[57]

Aryan Gulati and Brando Miranda and Eric Chen and Emily Xia and Kai Fronsdal and Bruno de Moraes Dumont and Sanmi Koyejo. Putnam-. 2024.
[58]

Mizrahi, Moran and Kaplan, Guy and Malkin, Dan and Dror, Rotem and Shahaf, Dafna and Stanovsky, Gabriel. State of What Art? A Call for Multi-Prompt LLM Evaluation. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00681
[59]

Evaluating the Zero-shot Robustness of Instruction-tuned Language Models. The Twelfth International Conference on Learning Representations.
[60]

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics. Transactions on Machine Learning Research. 2025.
[61]

Robustness of LLMs to perturbations in text. arXiv preprint arXiv:2407.08989.
[62]

Large Language Models Can Be Easily Distracted by Irrelevant Context. Proceedings of the 40th International Conference on Machine Learning. 2023.
[63]

Wang, and Sadid Hasan

Does prompt formatting have any impact on LLM performance? arXiv preprint arXiv:2411.10541.
[64]

On the Worst Prompt Performance of Large Language Models. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[65]

Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions. arXiv preprint arXiv:2506.11111.
[66]

A systematic survey of automatic prompt optimization techniques. arXiv preprint arXiv:2502.16923.
[67]

A survey of automatic prompt engineering: An optimization perspective. arXiv preprint arXiv:2502.11560.
[68]

A Survey on Prompt Tuning. ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models.
[69]

Lester, Brian and Al-Rfou, Rami and Constant, Noah. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.243
[70]

Wu, Hui and Shi, Xiaodong. Adversarial Soft Prompt Tuning for Cross-Domain Sentiment Analysis. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.174
[71]

Huyen, Chip. The Gradient.
[72]

Gonen, Hila and Iyer, Srini and Blevins, Terra and Smith, Noah and Zettlemoyer, Luke. Demystifying Prompts in Language Models via Perplexity Estimation. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.679
[73]

Mora-Cross, Maria and Calderon-Ramirez, Saul. Uncertainty Estimation in Large Language Models to Support Biodiversity Conservation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track). 2024. doi:10.18653/v1/2024.naacl-industry.31
[74]

A survey of uncertainty estimation in LLMs: Theory meets practice. arXiv preprint arXiv:2410.15326.

