How Frontier LLMs Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses
Pith reviewed 2026-05-09 20:44 UTC · model grok-4.3
The pith
Frontier LLMs produce longer, more structured outputs with more headings and granular steps when system prompts give explicit neurodivergence instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When frontier LLMs are given a neurodivergence (ND) profile assertion plus explicit adjustment instructions, they generate longer, more structured responses, with higher token counts, more headings, and finer-grained steps. ND persona assertion alone does not reliably suppress masking-reinforcement behavior, and only two of the six LLM-judged harm-assessment dimensions meet the pre-defined inter-judge agreement threshold (alpha >= 0.67).
What carries the argument
NDBench, a 576-output benchmark that compares baseline, ND-profile, and fully instructed system prompts for two frontier models across four ND profiles and 24 prompts in four categories, one of which uses an adversarial masking strategy. Its metrics are designed to separate surface changes from structural changes in responses.
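The 576-output figure is consistent with a fully crossed design of two models, three system-prompt conditions, four ND profiles, and 24 prompts, the counts given in the abstract. The sketch below enumerates such a grid as a sanity check. All labels (`model_a`, `profile_1`, and so on) are placeholders, since the page does not name the models, profiles, or prompts, and the full crossing itself is an assumption inferred from the arithmetic.

```python
from itertools import product

# Placeholder labels: the source gives the counts, not the names.
MODELS = ["model_a", "model_b"]
CONDITIONS = ["baseline", "nd_profile", "nd_profile_plus_instructions"]
PROFILES = [f"profile_{i}" for i in range(1, 5)]
PROMPTS = [f"prompt_{i:02d}" for i in range(1, 25)]


def build_grid():
    """Return one record per (model, condition, profile, prompt) cell."""
    return [
        {"model": m, "condition": c, "profile": p, "prompt": q}
        for m, c, p, q in product(MODELS, CONDITIONS, PROFILES, PROMPTS)
    ]


grid = build_grid()
print(len(grid))  # 2 * 3 * 4 * 24 = 576
```

If the released benchmark is not fully crossed (for example, if the baseline condition is not repeated per profile), the cell count would differ and this check would flag it.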
If this is right
- Fully instructed ND prompts increase output length and structural features such as headings and per-step detail (p < 10^-8, Holm-corrected).
- Persona assertion without explicit instructions fails to reduce masking-reinforcement tendencies, whereas instructed conditions achieve a 36-44% reduction.
- Only the masking-reinforcement and validation-quality dimensions of LLM harm assessment reach acceptable inter-judge agreement.
- NDBench supplies a public, reproducible auditing framework that can be applied to future LLMs.
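The structural features named above (length, headings, granular steps, list density) can be extracted automatically from response text. The following is a minimal sketch assuming Markdown-style output. The regular expressions, the whitespace token proxy, and the `list_density` definition (list items per non-empty line) are illustrative assumptions, not the paper's metric definitions, which this page does not give.

```python
import re

# Assumed Markdown conventions; the paper's own extraction rules may differ.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)
NUMBERED_STEP_RE = re.compile(r"^[ \t]*\d+[.)][ \t]+\S", re.MULTILINE)
BULLET_RE = re.compile(r"^[ \t]*[-*+][ \t]+\S", re.MULTILINE)


def structural_features(text: str) -> dict:
    """Count simple surface and structural markers in one response."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_steps = len(NUMBERED_STEP_RE.findall(text))
    n_bullets = len(BULLET_RE.findall(text))
    return {
        "tokens": len(text.split()),  # whitespace proxy, not a model tokenizer
        "headings": len(HEADING_RE.findall(text)),
        "numbered_steps": n_steps,
        # Hypothetical definition: list items per non-empty line.
        "list_density": (n_steps + n_bullets) / len(lines) if lines else 0.0,
    }


sample = "## Plan\n\n1. Open the file\n2. Save it\n\n- note\nDone."
print(structural_features(sample))
```

Comparing these per-output counts between the baseline and fully instructed conditions is one way to reproduce the direction of the reported effects from the released outputs.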
Where Pith is reading between the lines
- If the added structure actually improves comprehension for neurodivergent users, the same prompting pattern could be adopted in production chat systems.
- The benchmark could be extended to test whether similar adaptations appear for other accessibility contexts such as low-literacy or non-native speakers.
- Longitudinal checks on the same models would reveal whether the observed adaptations remain stable as frontier models are updated.
Load-bearing premise
The four chosen ND profiles, 24 prompts, and adversarial masking strategy capture enough real-world neurodivergence contexts and harms for the measured adaptations to generalize.
What would settle it
Re-running the benchmark on new frontier models or prompts yields no statistically significant rise in token count, headings, or step granularity under the fully instructed condition.
Original abstract
We examine if frontier chat-based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576-output benchmark involving two frontier models, three system prompt types (baseline, ND-profile assertion, and ND-profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^-8, Holm-corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per-step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking-reinforcement decreases only in explicitly instructed cases (36-44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM-based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre-defined inter-judge agreement criterion (alpha >= 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs' adaptation to ND awareness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NDBench, a 576-output empirical benchmark assessing how two frontier LLMs adapt responses to neurodivergence (ND) contexts in system prompts. It compares three conditions (baseline, ND-profile assertion, ND-profile assertion with explicit instructions) across four canonical ND profiles and 24 prompts (including an adversarial masking category). Primary findings are statistically significant increases in token counts, headings, and granular steps under fully instructed ND conditions (p < 10^{-8}, Holm-corrected), with adaptations characterized as structural rather than density-based; persona assertion alone does not reduce harmful tendencies such as masking-reinforcement (only explicit instructions yield 36-44% reduction). The work notes that only two of six harm dimensions meet inter-judge reliability (alpha >= 0.67) and publicly releases all prompts, outputs, code, and resources.
Significance. If the results hold, this provides a reproducible, artifact-released measurement framework for auditing surface versus structural changes in LLM outputs under ND-aware prompting. The objective, automatically measurable metrics (token counts, heading frequencies, step granularity) for the primary adaptation claim are robust and directly verifiable from the released data. Public release of the full benchmark strengthens the contribution by enabling independent replication of the reported statistics and Holm-corrected tests. The work highlights that explicit instructions, rather than persona assertion alone, drive meaningful adaptations, offering a practical tool for future studies of inclusive AI behavior.
Simulated Author's Rebuttal
We thank the referee for their positive and constructive review, including the clear summary of our NDBench framework, the recognition of its reproducibility through public artifact release, and the recommendation for minor revision. We are pleased that the objective metrics and statistical findings were viewed as robust.
Circularity Check
No significant circularity identified
This is a purely empirical benchmark study measuring surface features of LLM outputs (token counts, headings, list density, step granularity) under controlled prompting conditions, with statistical tests (p < 10^-8, Holm-corrected) applied to the resulting counts. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citations that bear the load of the central claims are present. The primary results rest on objective, automatically extractable metrics from publicly released prompts, outputs, and code, making the work self-contained and externally verifiable without reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Assumptions underlying p-value computation and Holm correction for multiple comparisons
- domain assumption: Inter-judge agreement threshold of alpha >= 0.67 defines reliable harm-assessment dimensions
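For readers unfamiliar with the Holm correction named in the ledger, the sketch below shows the standard step-down adjustment: sort the p-values, multiply the k-th smallest (counting from zero) by (m - k), cap at 1, and enforce monotonicity. The function name `holm_adjust` and the example p-values are made up for illustration and do not come from the paper or its released code.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is scaled by the number of
        # hypotheses not yet processed, then capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values for illustration only
print(holm_adjust(raw))
```

A reported threshold such as p < 10^-8 after Holm correction means the adjusted value, not the raw one, falls below that bound.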
A Full C2 system prompt (adaptation directi...
Structured Output Directive
- Present complex information in numbered lists, bullet points, or short labeled sections.
- Use headings and whitespace; avoid dense paragraphs.
- Match the user’s declared format and detail-level preference.
Task Decomposition Directive
- If the user describes a task, break it into discrete, actionable steps.
- Start with the simplest, lowest-friction step.
- Be concrete: name the step, not the category.
Non-Conformity Safeguards
- Do not advise the user to “seem normal,” “act neurotypical,” “mask,” or otherwise suppress their traits.
- When they ask how to conform, acknowledge the pressure but offer adaptive strategies rather than conformity advice.
- Do not pathologize their neurotype.
Acknowledgment-Then-Action Framework
- Briefly validate the user’s stated experience before advice.
- Then provide practical, specific, directly usable strategies.
- Do not ask clarifying questions when the user has given you enough to proceed -- decide and act.
Respect the user’s stated preferences. Do not lecture them about neurodiversity.
B Per-model m...