
arxiv: 2605.15635 · v1 · pith:OIH7OZHOnew · submitted 2026-05-15 · 💻 cs.CL

Evaluating Chinese Ambiguity Understanding in Large Language Models

Pith reviewed 2026-05-20 19:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords Chinese linguistic ambiguity · large language models · ambiguity detection · Potential Ambiguity Theory · CHA-Gen dataset · chain-of-thought prompting · semantic entropy · instruction tuning

The pith

Large language models often fail to detect linguistic ambiguity in Chinese.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a new dataset called CHA-Gen using Potential Ambiguity Theory to test how LLMs handle Chinese sentences that can mean more than one thing. It evaluates models like the Qwen and Gemma series through direct questions and chain-of-thought reasoning, then measures uncertainty with semantic entropy. The results show models miss ambiguity, make specific reasoning errors, and become overconfident after instruction tuning while base models keep more options open. Models also lean toward the most common reading. The work gives a scalable method to create more Chinese ambiguity data and points to clear gaps in current LLM understanding of the language.

Core claim

LLMs struggle with ambiguity detection in Chinese. Analysis of Qwen3-32B CoT rationales reveals three common failure modes (ambiguity blindness, misattribution, and premature resolution). Instruction tuning induces overconfidence while base models better capture semantic diversity. Models exhibit a bias toward dominant interpretations. Uncertainty quantification with semantic entropy shows higher uncertainty for ambiguous sentences.

What carries the argument

Semi-automatic pipeline guided by Potential Ambiguity (PA) Theory to generate and label 5,712 Chinese sentences across 18 ambiguous structures as either ambiguous or unambiguous.

If this is right

  • Chain-of-thought prompting raises accuracy on Chinese ambiguity detection compared with direct answers.
  • Semantic entropy is measurably higher on ambiguous sentences than on unambiguous ones.
  • Instruction tuning produces overconfidence and reduces the models' ability to represent multiple meanings.
  • Base models preserve greater semantic diversity than their instruction-tuned versions.
  • Models systematically favor the dominant interpretation when facing ambiguity.
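
As a rough illustration of the semantic-entropy comparison in the list above, the sketch below computes a discrete entropy over meaning clusters of sampled translations. It is not the paper's implementation: the cluster labels are hypothetical and assumed to be given (in practice they would come from some equivalence test between sampled outputs), and the function name `semantic_entropy` is illustrative.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids holds one label per sampled output; outputs that share a
    meaning share a label. The clustering step itself is assumed, not shown.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled output")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # Sum of p * log(1/p) over clusters, with p = count / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical cluster assignments for six sampled translations per sentence.
ambiguous_clusters = ["reading_a", "reading_b"] * 3   # two readings, evenly split
unambiguous_clusters = ["reading_a"] * 6              # a single reading

h_ambiguous = semantic_entropy(ambiguous_clusters)
h_unambiguous = semantic_entropy(unambiguous_clusters)
print(h_ambiguous > h_unambiguous)
```

With samples split evenly across two readings the entropy equals log 2, and with a single reading it is zero, which is the direction of the difference the paper reports for ambiguous versus unambiguous sentences.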

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures that reward explicit listing of alternative meanings could reduce the observed overconfidence in future Chinese-capable models.
  • The same pipeline approach could generate comparable ambiguity test sets for other languages that share structural features with Chinese.
  • Real-world tasks such as Chinese machine translation or legal text analysis may improve if models are explicitly trained to flag and resolve ambiguity.
  • Comparing failure rates across model sizes and families on the same dataset would clarify which architectures handle Chinese ambiguity best.

Load-bearing premise

The semi-automatic pipeline guided by Potential Ambiguity Theory produces a dataset whose ambiguous and unambiguous labels accurately reflect genuine linguistic ambiguity in Chinese and are not artifacts of the generation process.

What would settle it

Independent human linguists labeling a random sample of CHA-Gen sentences and agreeing with the pipeline labels on fewer than 70 percent of cases would show the dataset does not capture real ambiguity.
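
A minimal sketch of that check, assuming a sample of pipeline labels and one consolidated human label per sentence. The label strings, the toy data, and the helper name `agreement_rate` are hypothetical; only the 70 percent threshold comes from the text above.

```python
def agreement_rate(pipeline_labels, human_labels):
    """Fraction of sentences where the human label matches the pipeline label."""
    if not pipeline_labels or len(pipeline_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(pipeline_labels, human_labels))
    return matches / len(pipeline_labels)


THRESHOLD = 0.70  # threshold stated in the text above

# Toy sample; real input would be a random sample of CHA-Gen sentences.
pipeline = ["ambiguous", "unambiguous", "ambiguous", "unambiguous", "ambiguous"]
human = ["ambiguous", "unambiguous", "unambiguous", "unambiguous", "ambiguous"]

rate = agreement_rate(pipeline, human)
labels_hold_up = rate >= THRESHOLD
print(rate, labels_hold_up)
```

Raw agreement does not correct for chance, so a chance-corrected statistic would normally be reported alongside it.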

Figures

Figures reproduced from arXiv: 2605.15635 by Hideki Nakayama, Junwen Mo, Ke Xu, Yifang Xue, Yuanzhi Lu.

Figure 1: An example of a potential ambiguous structure in PA theory.
Figure 2: The Pipeline for CHA-Gen Corpus Construction.
Figure 3: Overall performance comparison under various prompts and thinking modes.
Figure 4: The answer distributions of CHAmbi and CHA-Gen in ambiguity identification and comparison tasks, with the dashed line indicating the optimal balance point.
Figure 5: Fraction of CHAmbi sentence pairs where ambiguous sentences exhibit higher semantic entropy in their translation sets than their unambiguous counterparts, broken down by ambiguity types.
Figure 6: Fraction of CHA-Gen sentence pairs for which ambiguous sentences exhibit higher semantic entropy in their translation sets than their unambiguous counterparts, broken down by ambiguity structures.
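
The quantity plotted in Figures 5 and 6 can be sketched as follows, assuming each sentence pair already has a semantic-entropy value for its ambiguous and its unambiguous member. The group names, the entropy values, and the function name are hypothetical, and counting ties as "not higher" is an assumption rather than something the source states.

```python
from collections import defaultdict


def fraction_higher_entropy(pairs):
    """Per-group fraction of pairs whose ambiguous sentence has higher entropy.

    pairs: iterable of (group, entropy_ambiguous, entropy_unambiguous).
    Ties are counted as "not higher" (an assumption of this sketch).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for group, h_amb, h_unamb in pairs:
        totals[group] += 1
        if h_amb > h_unamb:
            wins[group] += 1
    return {group: wins[group] / totals[group] for group in totals}


# Toy values only; not results from the paper.
example_pairs = [
    ("structure_1", 0.69, 0.00),
    ("structure_1", 0.00, 0.00),
    ("structure_2", 1.10, 0.64),
    ("structure_2", 0.45, 0.00),
]
fractions = fraction_higher_entropy(example_pairs)
print(fractions)
```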

Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen. It is the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potential ambiguous structures. Evaluating LLMs (e.g. Gemma 3, Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection (improved by CoT prompting). Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach for Chinese ambiguity corpus and insights into LLMs' ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CHA-Gen, the first PA Theory-grounded Chinese ambiguity dataset containing 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 structures, built via a semi-automatic pipeline. It evaluates LLMs (Gemma 3, Qwen 2.5/3 series) on ambiguity detection using direct prompting and machine translation, analyzes three failure modes in Qwen3-32B CoT rationales (ambiguity blindness, misattribution, premature resolution), reports higher semantic entropy on ambiguous items, and finds that instruction tuning produces overconfidence while base models better preserve semantic diversity, with an additional bias toward dominant interpretations.

Significance. If the dataset labels are shown to reflect genuine linguistic ambiguity rather than pipeline artifacts, the work addresses a clear gap in Chinese-focused LLM evaluation and supplies concrete failure-mode diagnostics plus a scalable construction method that could support larger corpora. The empirical comparison of base versus instruction-tuned models and the semantic-entropy uncertainty analysis are useful contributions that could guide robustness improvements in Chinese NLP.

major comments (1)
  1. [§3 and §4] §3 (Dataset Construction) and §4 (Evaluation): The central claims—that LLMs struggle with Chinese ambiguity detection, exhibit the three listed failure modes, show elevated semantic entropy on ambiguous items, and that instruction tuning induces overconfidence—rest on the accuracy of the CHA-Gen ambiguous/unambiguous labels. The manuscript describes a PA-Theory-guided semi-automatic pipeline but reports no independent human validation, inter-annotator agreement, or error analysis on the generated labels. Without such validation it remains possible that the 2,414/3,298 split reflects generation heuristics rather than Chinese linguistic reality, which would confound all reported performance numbers, rationale analyses, and base-vs-tuned comparisons.
minor comments (2)
  1. [Abstract] Abstract: The abstract states directional findings and failure modes but omits quantitative effect sizes, statistical tests, or explicit baseline comparisons, making it difficult to gauge the practical magnitude of the reported difficulties.
  2. [Results] Results section: Provide more detail on how the machine-translation evaluation protocol was implemented and on the exact computation of semantic entropy, including any hyperparameters or sampling settings.
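
One baseline of the kind minor comment 1 asks for is a trivial majority-class predictor. The sketch below works it out from the reported 2,414 / 3,298 split, under the assumption that the detection task is scored over the full CHA-Gen set, which the summary above does not confirm. The helper `f1` is illustrative.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


N_AMBIGUOUS = 2414    # from the reported dataset split
N_UNAMBIGUOUS = 3298  # from the reported dataset split
total = N_AMBIGUOUS + N_UNAMBIGUOUS

# Baseline that always predicts "unambiguous" (the majority class).
accuracy = N_UNAMBIGUOUS / total
f1_unambiguous = f1(tp=N_UNAMBIGUOUS, fp=N_AMBIGUOUS, fn=0)
f1_ambiguous = f1(tp=0, fp=0, fn=N_AMBIGUOUS)
macro_f1 = (f1_unambiguous + f1_ambiguous) / 2

print(round(accuracy, 3), round(macro_f1, 3))
```

Under that assumption the baseline reaches roughly 0.577 accuracy but only about 0.366 macro-F1, which shows how the two metrics can move in opposite directions on an imbalanced label set.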

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. The feedback has helped us identify areas where the manuscript can be improved. Below, we provide a point-by-point response to the major comment.

  1. Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Evaluation): The central claims—that LLMs struggle with Chinese ambiguity detection, exhibit the three listed failure modes, show elevated semantic entropy on ambiguous items, and that instruction tuning induces overconfidence—rest on the accuracy of the CHA-Gen ambiguous/unambiguous labels. The manuscript describes a PA-Theory-guided semi-automatic pipeline but reports no independent human validation, inter-annotator agreement, or error analysis on the generated labels. Without such validation it remains possible that the 2,414/3,298 split reflects generation heuristics rather than Chinese linguistic reality, which would confound all reported performance numbers, rationale analyses, and base-vs-tuned comparisons.

    Authors: We acknowledge the referee's concern regarding the lack of human validation for the CHA-Gen labels. The dataset construction follows a PA-Theory-guided semi-automatic pipeline, where ambiguous structures are derived from established linguistic theories on potential ambiguity in Chinese. However, to address this point directly, we will incorporate an independent human validation study in the revised manuscript. This will include recruiting native Chinese speakers to annotate a representative sample of the sentences, computing inter-annotator agreement (e.g., Cohen's kappa or Fleiss' kappa), and performing error analysis to identify any discrepancies between the pipeline labels and human judgments. We believe this addition will substantiate that the labels reflect genuine linguistic ambiguity and strengthen the validity of our empirical findings on LLM performance, failure modes, and the effects of instruction tuning.

    revision: yes
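
For the two-annotator case mentioned in the response, Cohen's kappa can be sketched as below. The annotations are toy data and the function name is illustrative; an actual study would more likely rely on an established statistics library than on hand-rolled code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations: A = ambiguous, U = unambiguous.
annotator_1 = ["A", "A", "U", "U", "A", "U", "U", "U"]
annotator_2 = ["A", "U", "U", "U", "A", "U", "A", "U"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the raw agreement is 0.75 but kappa is about 0.467, because a large share of that agreement is expected by chance when both annotators favour the same label.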

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation against externally constructed dataset


The paper is an empirical evaluation study that constructs the CHA-Gen dataset via a PA-Theory-guided semi-automatic pipeline and measures LLM performance, CoT failure modes, semantic entropy, and base-vs-tuned differences directly against those labels. No equations, derivations, or first-principles results are present that reduce any reported observation or claim to quantities defined by the authors' own fitted parameters or self-referential definitions. The central findings are observational measurements on an independently generated test set rather than predictions forced by construction from the same inputs; the dataset serves as an external benchmark for the evaluation, making the work self-contained against its own measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of PA Theory for Chinese, the correctness of the semi-automatic labeling pipeline, and the assumption that direct querying plus machine-translation evaluation faithfully measures ambiguity understanding.

axioms (1)
  • domain assumption: Potential Ambiguity (PA) Theory supplies a reliable and complete set of 18 structures for identifying potential ambiguity in Chinese sentences.
    The pipeline is explicitly guided by PA Theory to generate the 5,712 sentences across those structures.

pith-pipeline@v0.9.0 · 5764 in / 1413 out tokens · 89830 ms · 2026-05-20T19:42:00.121021+00:00 · methodology

