pith. sign in

arxiv: 2508.11222 · v2 · submitted 2025-08-15 · 💻 cs.SE · cs.AI· cs.CL· cs.IR

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

Pith reviewed 2026-05-18 23:24 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CLcs.IR
keywords over-refusalLLM safety testingevolutionary fuzzingbenchmark generationsafety alignmenttest case generationAI reliability
0
0 comments X

The pith

ORFuzz introduces an evolutionary framework that generates over-refusal test cases for LLMs at more than double the rate of existing methods and builds a benchmark triggering over-refusals in 63.56 percent of cases across ten models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve detection of over-refusals in large language models, where the systems wrongly block harmless user queries because of excessively cautious safety rules. Current testing methods and benchmarks are shown to be limited through user studies, so the work presents ORFuzz as a new automated evolutionary approach to create and validate such test cases. The framework relies on category-aware starting points, mutations guided by reasoning models, and a judge component aligned with human judgments of toxicity. Success here would mean developers gain a practical way to measure and reduce over-refusals without losing necessary protections. The generated set of cases then serves as a stronger evaluation resource for comparing safety behavior across many different models.

Core claim

ORFuzz is the first evolutionary testing framework for detecting LLM over-refusals, integrating safety category-aware seed selection for broad coverage, adaptive mutator optimization with reasoning LLMs to create effective cases, and OR-Judge, a validated model that matches user perceptions of toxicity and refusal. Evaluations show it produces validated over-refusal instances at an average rate of 6.98 percent, more than twice the leading baselines, and its outputs form ORFuzzSet, a benchmark of 1,855 cases achieving 63.56 percent average over-refusal rate on ten diverse LLMs.

What carries the argument

The evolutionary testing framework ORFuzz, which combines safety category-aware seed selection, adaptive mutator optimization by reasoning LLMs, and the OR-Judge to generate and validate over-refusal test cases.

If this is right

  • LLM developers can apply ORFuzz to uncover hidden over-refusal issues in their models more efficiently than before.
  • ORFuzzSet provides a transferable benchmark for consistently evaluating over-refusal behavior across different LLMs.
  • Improved detection rates allow for more targeted adjustments to safety mechanisms that reduce unnecessary refusals.
  • The framework supports systematic testing that helps create more usable LLM-based software systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The evolutionary structure could be adapted to test other misalignment issues such as inconsistent or biased refusals.
  • Combining reasoning models for test generation might apply to fuzzing safety properties in non-LLM AI systems.
  • Widespread use of such benchmarks could encourage safety tuning that weighs user-perceived helpfulness more heavily.

Load-bearing premise

The OR-Judge accurately captures how typical users judge whether a query is toxic or if refusal is warranted.

What would settle it

A follow-up study where many users rate the generated test cases and find that a significant portion labeled as over-refusals are instead viewed as appropriate refusals or toxic by the majority would disprove the main effectiveness claims.

Figures

Figures reproduced from arXiv: 2508.11222 by Dongxia Wang, Haonan Zhang, Jiashui Wang, Kexin Chen, Long Liu, Wenhai Wang, Xinlei Ying, Yi Liu.

Figure 1
Figure 1. Figure 1: The security duality: LLMs strive to block harmful [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The workflow of ORFUZZ. iteratively updated based on the results of the selected samples. The pseudocode is shown in Algorithm 1. It consists of three main functions: Initialize, Select, and Update. • Initialize. It initializes the root node and creates child nodes for each category in the seed dataset (line 4). Then it selects Nsc seed queries for each category based on the clustering results of the seed … view at source ↗
Figure 4
Figure 4. Figure 4: Left: ORR of new benchmark ORFUZZSET vs. ex￾isting datasets on 10 LLMs. Right: t-SNE of ORFUZZSET’s semantic diversity. high transferability, achieving relatively strong ORR values across most other models. Conversely, samples generated specifically for Llama-3.1 and Gemma-2 exhibit lower cross￾model transferability. Leveraging these findings, we curated a new benchmark dataset, termed ORFUZZSET, comprisin… view at source ↗
read the original abstract

Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz's outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce ORFuzz, the first evolutionary testing framework for systematically detecting and analyzing over-refusals in LLMs. It integrates safety category-aware seed selection, adaptive mutator optimization using reasoning LLMs, and OR-Judge, a human-aligned judge model validated against user perception. Evaluations demonstrate that ORFuzz generates over-refusal instances at an average rate of 6.98%, more than double that of leading baselines, and that the derived ORFuzzSet benchmark achieves a 63.56% average over-refusal rate across 10 diverse LLMs, outperforming existing datasets.

Significance. If the central results hold, this work is significant for the field of LLM safety and software testing. It addresses the inadequacy of current over-refusal testing methods through an automated evolutionary approach, supported by an empirical user study. The provision of ORFuzzSet as a community resource and the framework's potential for uncovering vulnerabilities in LLM-based systems represent a valuable contribution. The use of human-aligned validation and diverse LLM testing strengthens the practical applicability.

major comments (3)
  1. The alignment of OR-Judge with ordinary user perception of toxicity and refusal is the load-bearing assumption for the headline performance numbers (6.98% over-refusal generation rate and 63.56% on ORFuzzSet). While the paper states that OR-Judge was validated for this purpose, specific details on the validation set size, the distribution of queries (particularly whether it matches the mutated, safety-category-aware queries produced by ORFuzz), and quantitative metrics such as inter-rater agreement or accuracy scores are not provided. This raises a correctness-risk concern for the reported rates, as misalignment could make the doubling of baseline performance an artifact of judge bias rather than genuine detection.
  2. In the evaluations section, the selection criteria for the leading baselines, details on statistical tests performed, and any post-hoc filtering applied to the results are not described. This is necessary to substantiate the claim that ORFuzz's rate is 'more than double' the baselines and to assess the robustness of the 6.98% and 63.56% figures, especially given the abstract's mention of concrete rates without accompanying methodological transparency.
  3. The motivation section references an empirical user study to highlight flaws in existing benchmarks and test generation methods, but lacks details on the study's design, number of participants, query types used, and statistical analysis. Since this underpins the need for ORFuzz, more rigorous reporting is required to support the central claim of inadequacy in current methods.
minor comments (2)
  1. The abstract is dense with technical claims; consider breaking out key contributions more clearly for readability.
  2. Ensure consistent use of terms like 'over-refusal rate' throughout the manuscript to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their insightful and constructive comments. These points help us improve the clarity and rigor of the manuscript. We respond to each major comment below and will incorporate the suggested clarifications in the revised version.

read point-by-point responses
  1. Referee: The alignment of OR-Judge with ordinary user perception of toxicity and refusal is the load-bearing assumption for the headline performance numbers (6.98% over-refusal generation rate and 63.56% on ORFuzzSet). While the paper states that OR-Judge was validated for this purpose, specific details on the validation set size, the distribution of queries (particularly whether it matches the mutated, safety-category-aware queries produced by ORFuzz), and quantitative metrics such as inter-rater agreement or accuracy scores are not provided. This raises a correctness-risk concern for the reported rates, as misalignment could make the doubling of baseline performance an artifact of judge bias rather than genuine detection.

    Authors: We appreciate the referee highlighting the need for greater transparency on OR-Judge validation. While Section 4.3 describes the human-aligned validation process, we agree that more granular details are required. In the revised manuscript we will expand this section to report the validation set size, confirm inclusion of mutated safety-category-aware queries representative of ORFuzz outputs, and provide quantitative metrics such as inter-rater agreement and accuracy against human annotations. These additions will directly address the correctness-risk concern. revision: yes

  2. Referee: In the evaluations section, the selection criteria for the leading baselines, details on statistical tests performed, and any post-hoc filtering applied to the results are not described. This is necessary to substantiate the claim that ORFuzz's rate is 'more than double' the baselines and to assess the robustness of the 6.98% and 63.56% figures, especially given the abstract's mention of concrete rates without accompanying methodological transparency.

    Authors: We thank the referee for noting the lack of methodological detail in the evaluations. We will revise the evaluations section to explicitly state the selection criteria for the baselines (recency and relevance to over-refusal and LLM safety testing literature), describe the statistical tests used to compare generation rates (including significance testing), and confirm that no post-hoc filtering was applied beyond the pre-defined protocol. This will strengthen substantiation of the 'more than double' claim and the reported figures. revision: yes

  3. Referee: The motivation section references an empirical user study to highlight flaws in existing benchmarks and test generation methods, but lacks details on the study's design, number of participants, query types used, and statistical analysis. Since this underpins the need for ORFuzz, more rigorous reporting is required to support the central claim of inadequacy in current methods.

    Authors: We agree that the empirical user study referenced in the motivation section requires more complete reporting to rigorously support our claims. In the revised manuscript we will expand the relevant subsection to detail the study design, number of participants, types of queries used, and the statistical analyses performed. These additions will better justify the motivation for developing ORFuzz. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical rates grounded in external human validation

full rationale

The paper's derivation chain consists of an evolutionary fuzzing process (safety-category seed selection + adaptive mutator optimization) whose outputs are evaluated by OR-Judge, which the authors separately validate against human perception via user study. The headline metrics (6.98% over-refusal generation rate, 63.56% on ORFuzzSet) are therefore direct empirical counts on held-out LLMs rather than quantities fitted or defined in terms of the framework itself. No self-definitional loop, fitted-input prediction, or load-bearing self-citation reduces the central claims to the inputs by construction; the results remain falsifiable against external human judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central results rest on the reliability of the OR-Judge and on the assumption that the evolutionary loop with reasoning-LLM mutators produces genuinely new and transferable over-refusal cases rather than artifacts of the generation process.

axioms (1)
  • domain assumption OR-Judge accurately reflects user perception of toxicity and refusal
    Invoked when the paper states the judge was validated to match human judgments and then uses it to label generated cases.
invented entities (1)
  • OR-Judge no independent evidence
    purpose: Human-aligned judge model for scoring toxicity and refusal
    New component introduced to automate validation of over-refusal instances

pith-pipeline@v0.9.0 · 5817 in / 1360 out tokens · 32989 ms · 2026-05-18T23:24:20.053015+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 11 internal anchors

  1. [1]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, and A. e. Lerer, “Openai o1 system card,” arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    DeepSeek-V3 Technical Report

    L. Aixin, F. Bei, and X. e. Bing, “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437 , 2024. [Online]. Available: https: //www.arxiv.org/abs/2412.19437

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    G. Daya, Y . Dejian, and Z. e. Haowei, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948v1, 2025. [Online]. Available: https://www.arxiv.org/ abs/2501.12948v1

  4. [4]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, and A. K. et al., “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  5. [5]

    Drhouse: An llm-empowered diagnostic reasoning system through harnessing outcomes from sensor data and expert knowledge,

    B. Yang, S. Jiang, L. Xu, K. Liu, H. Li, G. Xing, H. Chen, X. Jiang, and Z. Yan, “Drhouse: An llm-empowered diagnostic reasoning system through harnessing outcomes from sensor data and expert knowledge,” PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEAR- ABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT , vol. 8, no. 4, DEC 2024

  6. [6]

    Digital diagnostics: The potential of large language models in recognizing symptoms of common illnesses,

    G. K. Gupta, A. Singh, S. V . Manikandan, and A. Ehtesham, “Digital diagnostics: The potential of large language models in recognizing symptoms of common illnesses,” AI, vol. 6, no. 1, JAN 2025

  7. [7]

    (a)i am not a lawyer, but ... : Engaging legal experts towards responsible llm policies for legal advice,

    I. Cheong, K. Xia, K. J. K. Feng, Q. Z. Chen, and A. X. Zhang, “(a)i am not a lawyer, but ... : Engaging legal experts towards responsible llm policies for legal advice,” in PROCEEDINGS OF THE 2024 ACM CON- FERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, ACM FACCT 2024. Assoc Comp Machinery, 2024, pp. 2454–2469, 6th ACM Conference on Fairness, Acco...

  8. [8]

    Or-bench: An over-refusal benchmark for large language models,

    J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “Or-bench: An over-refusal benchmark for large language models,” arXiv preprint arXiv:2405.20947, 2024

  9. [9]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” arXiv preprint arXiv:2308.01263 , 2023

  10. [10]

    "ORFuzz Appendix

    “"ORFuzz Appendix",” https://sites.google.com/view/orfuzzappendix/, 2025, online Appendix for the ORFuzz Project. [Online]. Available: https://sites.google.com/view/orfuzzappendix/

  11. [11]

    “do any- thing now

    X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, ““do any- thing now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” in PROCEEDINGS OF THE 2024 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, CCS 2024, Google LLC; Huawei Technologies Co Ltd; Tik- tok Pte Ltd; Microsoft; Cisco Technology Inc; Inp...

  12. [12]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    S. Yi, Y . Liu, Z. Sun, T. Cong, X. He, J. Song, K. Xu, and Q. Li, “Jailbreak attacks and defenses against large language models: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2407.04295

  13. [13]

    Comprehensive assessment of jailbreak attacks against llms,

    J. Chu, Y . Liu, Z. Yang, X. Shen, M. Backes, and Y . Zhang, “Comprehensive assessment of jailbreak attacks against llms,” 2024. [Online]. Available: https://arxiv.org/abs/2402.05668

  14. [14]

    A comprehensive study of jailbreak attack versus defense for large language models,

    Z. Xu, Y . Liu, G. Deng, Y . Li, and S. Picek, “A comprehensive study of jailbreak attack versus defense for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.13457

  15. [15]

    A contemporary review on chatbots, ai-powered virtual conversational agents, chatgpt: Applications, open challenges and future research di- rections,

    A. Casheekar, A. Lahiri, K. Rath, K. S. Prabhakar, and K. Srinivasan, “A contemporary review on chatbots, ai-powered virtual conversational agents, chatgpt: Applications, open challenges and future research di- rections,” COMPUTER SCIENCE REVIEW , vol. 52, MAY 2024

  16. [16]

    Speak out of turn: Safety vulnerability of large language models in multi-turn dialogue,

    Z. Zhou, J. Xiang, H. Chen, Q. Liu, Z. Li, and S. Su, “Speak out of turn: Safety vulnerability of large language models in multi-turn dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17262

  17. [17]

    ChatGPT,

    OpenAI, “ChatGPT,” https://chat.openai.com/chat, 2024, accessed: 2024-10-15

  18. [18]

    De- fending chatgpt against jailbreak attack via self-reminders,

    Y . Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu, “De- fending chatgpt against jailbreak attack via self-reminders,” NATURE MACHINE INTELLIGENCE, 2023 DEC 12 2023

  19. [19]

    Refusing safe prompts for multi-modal large language models,

    S. Zedian, L. Hongbin, H. Yuepeng, and G. Neil, Zhenqiang, “Refusing safe prompts for multi-modal large language models,” arXiv preprint arXiv:2407.09050 , 2024. [Online]. Available: https: //www.arxiv.org/abs/2407.09050

  20. [20]

    Mossbench: Is your multimodal language model oversensitive to safe queries?

    L. Xirui, Z. Hengguang, W. Ruochen, Z. Tianyi, C. Minhao, and H. Cho-Jui, “Mossbench: Is your multimodal language model oversensitive to safe queries?” arXiv preprint arXiv:2406.17806 , 2024. [Online]. Available: https://www.arxiv.org/abs/2406.17806

  21. [21]

    Refusal tokens: A simple way to calibrate refusals in large language models,

    J. Neel, S. Aditya, Z. Chenyang, L. Daben, S. Alfy, P. Ashwinee, K. Anoop, G. Micah, and G. Tom, “Refusal tokens: A simple way to calibrate refusals in large language models,” arXiv preprint arXiv:2412.06748, 2024. [Online]. Available: https://www.arxiv.org/abs/ 2412.06748

  22. [22]

    Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vul- nerabilities in large language models,

    D. Yao, J. Zhang, I. G. Harris, and M. Carlsson, “Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vul- nerabilities in large language models,” in 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESS- ING, ICASSP 2024 , ser. International Conference on Acoustics Speech and Signal Processing ICASS...

  23. [23]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    J. Yu, X. Lin, Z. Yu, and X. Xing, “Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts,” 2024. [Online]. Available: https://arxiv.org/abs/2309.10253

  24. [24]

    Llm-fuzzer: Scaling assessment of large language model jail- breaks,

    ——, “Llm-fuzzer: Scaling assessment of large language model jail- breaks,” in PROCEEDINGS OF THE 33RD USENIX SECURITY SYM- POSIUM, SECURITY 2024 . USENIX; Bloomberg Engn; Futurewei Technologies; Google; NSF; TikTok; Ant Res; IBM; Meta; Microsoft; Technol Innovat Inst; Paloalto Network; Qualcomm, 2024, pp. 4657– 4674, 33rd USENIX Security Symposium, Philad...

  25. [25]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825 , 2023

  26. [26]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043 , 2023

  27. [27]

    Measuring nominal scale agreement among many raters

    J. L. Fleiss, “Measuring nominal scale agreement among many raters.” Psychological bulletin, vol. 76, no. 5, p. 378, 1971

  28. [28]

    Inter-coder agreement for computational linguistics,

    R. Artstein and M. Poesio, “Inter-coder agreement for computational linguistics,”Computational linguistics, vol. 34, no. 4, pp. 555–596, 2008

  29. [29]

    Finite-time analysis of the multiarmed bandit problem,

    P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning , vol. 47, pp. 235–256, 2002

  30. [30]

    S-eval: Towards automated and comprehensive safety evaluation for large language models,

    X. Yuan, J. Li, D. Wang, Y . Chen, X. Mao, L. Huang, J. Chen, H. Xue, X. Liu, W. Wang, K. Ren, and J. Wang, “S-eval: Towards automated and comprehensive safety evaluation for large language models,” arXiv preprint arXiv:2405.14191, 2024

  31. [31]

    Promptwizard: Task-aware prompt optimization framework,

    E. Agarwal, J. Singh, V . Dani, R. Magazine, T. Ganu, and A. Nambi, “Promptwizard: Task-aware prompt optimization framework,” 2024. [Online]. Available: https://arxiv.org/abs/2405.18369

  32. [32]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men, F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin, “Qwen2.5-coder technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2409.12186

  33. [33]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics, 11 2019. [Online]. Available: https: //arxiv.org/abs/1908.10084