arxiv: 2510.23536 · v2 · submitted 2025-10-27 · 💻 cs.CL

IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering

Pith reviewed 2026-05-18 04:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords intent identification · personalized question answering · benchmark · core intents · satisficing theory · language model evaluation · user history modeling · answer selection behavior

The pith

Current language models fail to identify core user intents from answer selection histories in personalized question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IPQA, a benchmark that measures how well systems can identify the core intents users prioritize when selecting answers to satisfy their information needs. Existing evaluations focus only on response quality or retrieval, skipping this foundational step of intent identification. Core intents are extracted from observable answer selection patterns, following the principle that users accept answers meeting a personal threshold rather than seeking optimal ones. Experiments across multiple state-of-the-art models show they perform poorly at recovering these intents from user histories, and accuracy falls further as question complexity rises. Without this capability, personalized systems cannot produce responses aligned with individual priorities.

Core claim

The paper establishes IPQA as a benchmark for core intent identification in personalized question answering, where core intents are the prioritized intents derived from observable answer selection behavior patterns according to satisficing theory. The dataset is built across domains via systematic filtering, LLM-based annotation, and quality control that combines automated checks with human validation. Evaluations reveal that state-of-the-art language models struggle to identify core intents from user histories, with performance degrading as question complexity increases.

What carries the argument

The IPQA benchmark, which derives core intents from observable answer selection behaviors using satisficing theory to create reliable labels for evaluating personalized intent identification.

If this is right

  • Personalized QA systems must first solve core intent identification before they can reliably generate responses that match individual information needs.
  • Model performance on intent identification drops with increasing question complexity, implying that current history modeling techniques are insufficient for complex cases.
  • The benchmark supplies a concrete metric to track progress on intent identification separate from retrieval or generation quality.
  • Public release of the dataset enables direct comparisons and targeted improvements across different model architectures.
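
The complexity effect in the second bullet can be checked by stratifying accuracy by complexity level. The snippet below is a minimal sketch, not the paper's evaluation code. It assumes each evaluated question carries an integer complexity level and a boolean correctness flag. The function name `accuracy_by_complexity` and the toy records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_complexity(records):
    """Return accuracy per complexity level.

    records: iterable of (complexity_level, is_correct) pairs.
    The integer complexity level is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Made-up toy records, for illustration only (not results from the paper).
toy_records = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_complexity(toy_records))
```

A per-level table like this makes it possible to see whether accuracy falls steadily with complexity or drops at a single level.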

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If selection patterns reliably reveal prioritized intents, then systems could learn user models directly from implicit choice data rather than requiring explicit preference statements.
  • The degradation with complexity suggests that similar intent identification challenges may appear in other multi-turn or context-rich settings such as dialogue systems.
  • The approach could be extended by testing whether models trained on the benchmark generalize to real-time user interactions where answer selections are observed live.
  • Neighboring tasks like personalized recommendation might benefit from analogous benchmarks that derive core preferences from choice behavior instead of stated ratings.

Load-bearing premise

That core intents can be accurately derived from observable answer selection behavior patterns using satisficing theory and that the combination of LLM annotation with automated and human quality control produces reliable labels without significant bias or noise.

What would settle it

A study in which actual users review the benchmark questions and state their own prioritized intents, showing low agreement with the derived core intent labels from answer selections.
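
One way such a study could quantify agreement is a set overlap between the intents users state and the labels derived from their answer selections. The snippet below is a rough sketch under strong assumptions: intents are short strings that can be compared by exact match, and Jaccard overlap is the agreement measure. Neither choice comes from the paper, and free-text intents would need semantic matching instead. The names `jaccard` and `mean_agreement` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of intent labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as full agreement
    return len(a & b) / len(a | b)


def mean_agreement(stated, derived):
    """Average Jaccard overlap over the questions present in both mappings.

    stated / derived: dict mapping question id -> set of intent labels.
    """
    shared = stated.keys() & derived.keys()
    if not shared:
        raise ValueError("no questions in common")
    return sum(jaccard(stated[q], derived[q]) for q in shared) / len(shared)


# Made-up toy labels, for illustration only.
stated_intents = {"q1": {"cost", "safety"}, "q2": {"speed"}}
derived_intents = {"q1": {"cost"}, "q2": {"durability"}}
print(mean_agreement(stated_intents, derived_intents))
```

A low average on real user statements would count against the derived labels. A high average would support them.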

Figures

Figures reproduced from arXiv: 2510.23536 by Dongha Lee, Jieyong Kim, Maryam Amirizaniani, Soojin Yoon.

Figure 1: In information seeking scenarios, users ask ques…
Figure 2: Overview of the IPQA dataset construction pipeline: data collection from CQA dataset, LLM-based intent annotation, and quality control through LLM verification and human validation.
… personalization, the system receives user profile $P_u = \{(q_i, s_i)\}_{i=1}^{|P_u|}$ containing historical questions $q_i$ paired with source information $s_i$, following previous personalization studies [11, 12, 14, 26, 27]. Since users do …
Figure 3: Core intent identification performance of User Profile (Intents) with varying profile history sizes (…
Figure 4: Core intent identification performance of User Pro…
Figure 5: Core intent identification performance of User Profile (Intent) with varying profile history sizes…

Original abstract

Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces IPQA, a benchmark for core intent identification in personalized question answering. Core intents are defined as those users prioritize when selecting answers to satisfy information needs and are derived from observable answer-selection behavior using satisficing theory. The dataset is constructed across domains via systematic filtering, LLM-based annotation, and quality control that combines automated checks with human validation. Experiments on state-of-the-art language models show that current systems struggle to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be released publicly.

Significance. If the core-intent labels prove to be faithful proxies for user priorities, the work fills a clear gap by isolating intent identification from downstream response quality or retrieval metrics in personalized QA. The public release of the benchmark and code is a concrete strength that supports reproducibility and follow-on research. The reported degradation with question complexity, if robust, would point to a specific modeling challenge in handling user history.

major comments (1)
  1. [Dataset construction] Dataset construction (as described in the abstract and methods): the central claim that models fail at core intent identification rests on the assumption that answer-selection behavior, interpreted through satisficing theory, reliably encodes the user's prioritized intent. No validation is presented showing that selected answers reflect an acceptance threshold rather than confounds such as answer length, position bias, source familiarity, or lexical overlap. If this behavioral-to-intent mapping is noisy or systematically biased, the observed performance degradation with complexity may reflect label artifacts instead of genuine model limitations.
minor comments (1)
  1. [Abstract and Methods] The abstract states that quality control combines automated verification with human validation but does not report inter-annotator agreement, exact filtering criteria, or the proportion of LLM-annotated items that required human correction; these quantitative details should be added to the methods section for reproducibility.
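
The confound concern in the major comment can be probed with simple heuristic baselines: how often is the selected answer the first one shown, or the longest one? The snippet below is a minimal sketch of that check, not something the paper reports. It assumes each example lists its candidate answers in display order and records the index of the selected answer. The field names `answers` and `selected` and the function `heuristic_hit_rates` are hypothetical.

```python
def heuristic_hit_rates(examples):
    """Fraction of examples where a trivial heuristic picks the selected answer.

    examples: list of dicts with
      'answers'  - candidate answers as strings, in display order (assumed)
      'selected' - index of the answer the user chose
    """
    n = len(examples)
    if n == 0:
        raise ValueError("no examples")
    first_hits = 0
    longest_hits = 0
    for ex in examples:
        answers = ex["answers"]
        # Index of the longest answer; ties resolve to the earliest one.
        longest_idx = max(range(len(answers)), key=lambda i: len(answers[i]))
        first_hits += int(ex["selected"] == 0)
        longest_hits += int(ex["selected"] == longest_idx)
    return {"first_position": first_hits / n, "longest_answer": longest_hits / n}


# Made-up toy examples, for illustration only.
toy_examples = [
    {"answers": ["short", "a much longer answer"], "selected": 1},
    {"answers": ["the longest answer here", "brief"], "selected": 0},
    {"answers": ["aa", "bbbb", "c"], "selected": 2},
    {"answers": ["x", "yy"], "selected": 0},
]
print(heuristic_hit_rates(toy_examples))
```

If either rate sits well above the chance level implied by the number of candidates per question, selections may partly reflect position or length rather than an acceptance threshold.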

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing the IPQA benchmark. We address the major comment on dataset construction below, clarifying our validation approach while acknowledging the need for additional discussion of potential confounds.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (as described in the abstract and methods): the central claim that models fail at core intent identification rests on the assumption that answer-selection behavior, interpreted through satisficing theory, reliably encodes the user's prioritized intent. No validation is presented showing that selected answers reflect an acceptance threshold rather than confounds such as answer length, position bias, source familiarity, or lexical overlap. If this behavioral-to-intent mapping is noisy or systematically biased, the observed performance degradation with complexity may reflect label artifacts instead of genuine model limitations.

    Authors: We appreciate the referee's emphasis on validating the mapping from observable answer-selection behavior to core intents under satisficing theory. Our dataset construction incorporates systematic filtering to reduce obvious confounds (e.g., extreme length disparities and high lexical overlap) followed by LLM-based annotation and a human validation stage in which annotators assess whether the chosen answer satisfies the user's information need at an acceptance threshold. These steps are detailed in the methods section. That said, we agree that an explicit analysis or ablation addressing position bias and source familiarity would further strengthen the claims. We will revise the manuscript to expand the description of the human validation protocol, report inter-annotator agreement on intent alignment, and add a dedicated limitations subsection discussing residual confounds and their potential impact on the observed complexity degradation.
    revision: partial
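
The inter-annotator agreement promised in this response could be reported with a chance-corrected statistic. The snippet below sketches Cohen's kappa for two annotators. The choice of kappa is an assumption made here, since neither the paper nor the rebuttal names a statistic. The function name `cohens_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up toy annotations, for illustration only.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(annotator_1, annotator_2))
```

In the toy data the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, so kappa works out to 0.5.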

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction from external theory and annotation

full rationale

The paper defines core intents via satisficing theory applied to observable answer-selection behavior, constructs labels through systematic filtering plus LLM annotation and human QC, then reports empirical model performance on the resulting dataset. No equations, fitted parameters, or self-citations appear in the provided text that would make any central claim reduce to its own inputs by construction. The reported degradation in model performance is an observation on an independently constructed benchmark rather than a tautological renaming or self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the new concept of core intents and the domain assumption from satisficing theory that users select answers meeting acceptance thresholds; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Users choose answers based on satisficing theory, selecting those that meet acceptance thresholds rather than optimizing for all possible criteria.
    Invoked to derive core intents from observable behavior patterns in answer selection.
invented entities (1)
  • core intents (no independent evidence)
    Purpose: To represent the prioritized intents users have when selecting answers to satisfy their information needs in PQA.
    Introduced as the key evaluation target; derived from behavior but lacks independent external validation in the abstract.
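
The axiom above contrasts satisficing with optimizing, and a toy example makes the difference concrete. The snippet below is an illustrative sketch only. It assumes answers are examined in order and each has a single numeric score, which follows the classic sequential reading of satisficing rather than anything the abstract specifies. The names `satisfice`, `maximize`, and the scores are hypothetical.

```python
def satisfice(options, score, threshold):
    """Return the first option whose score meets the threshold, else None."""
    for option in options:
        if score(option) >= threshold:
            return option
    return None


def maximize(options, score):
    """Return the option with the highest score."""
    return max(options, key=score)


# Made-up toy scores, for illustration only.
toy_scores = {"A": 0.4, "B": 0.7, "C": 0.9}
candidates = ["A", "B", "C"]

print(satisfice(candidates, toy_scores.get, threshold=0.6))  # B
print(maximize(candidates, toy_scores.get))                  # C
```

The satisficer stops at B because it clears the 0.6 threshold, while the maximizer keeps going to C. A selected answer therefore shows what was good enough for the user, not necessarily what was best.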

pith-pipeline@v0.9.0 · 5735 in / 1408 out tokens · 38495 ms · 2026-05-18T04:09:04.572860+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Denise E Agosto. 2002. Bounded rationality and satisficing in young people’s Web-based decision making. Journal of the American Society for Information Science and Technology 53, 1 (2002), 16–27.

  2. [2]

    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss (Eds.). Association for Computational ...

  3. [3]

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient Intent Detection with Dual Sentence Encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah...

  4. [4]

    Long Chen, Dell Zhang, and Levene Mark. 2012. Understanding user intent in community question answering. In Proceedings of the 21st international conference on world wide web. 823–828.

  5. [5]

    Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190 (2018).

  6. [6]

    Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. 2024. PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), Kam-Fai Wong, Min Zhang, Ruifeng Xu, ...

  7. [7]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407.

  8. [8]

    Ryang Heo, Yongsik Seo, Junseong Lee, and Dongha Lee. 2025. Can Large Language Models be Effective Online Opinion Miners? arXiv:2505.15695 [cs.CL] https://arxiv.org/abs/2505.15695

  9. [9]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024).

  10. [10]

    Pranav Kasela, Marco Braga, Gabriella Pasi, and Raffaele Perego. 2024. SE-PQA: Personalized Community Question Answering. In Companion Proceedings of the ACM Web Conference 2024 (Singapore, Singapore) (WWW ’24). Association for Computing Machinery, New York, NY, USA, 1095–1098. doi:10.1145/3589335.3651445

  11. [11]

    Jieyong Kim, Hyunseo Kim, Hyunjin Cho, SeongKu Kang, Buru Chang, Jinyoung Yeo, and Dongha Lee. 2025. Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Compu...

  12. [12]

    Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, and Dongha Lee. 2025. RPM: Reasoning-Level Personalization for Black-Box Large Language Models. arXiv:2505.21082 [cs.CL] https://arxiv.org/abs/2505.21082

  13. [13]

    Jan-Christoph Klie, Richard Eckart de Castilho, and Iryna Gurevych. 2024. Analyzing Dataset Annotation Quality Management in the Wild. Computational Linguistics 50, 3 (Sept. 2024), 817–866. doi:10.1162/coli_a_00516

  14. [14]

    Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani. 2024. LongLaMP: A Benchmark for Personalized Long-form Text Generation. arXiv:2407.11016 [cs.CL] https://arxiv.org/abs/2407.11016

  15. [15]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626

  16. [16]

    Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. 2019. An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing...

  17. [17]

    Chin-Yew Lin and Eduard Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 150–157. https://aclanthology.org/N03-1020/

  18. [18]

    Langming Liu, Shilei Liu, Yujin Yuan, Yizhen Zhang, Bencheng Yan, Zhiyuan Zeng, Zihao Wang, Jiaqi Liu, Di Wang, Wenbo Su, Pengjie Wang, Jian Xu, and Bo Zheng. 2025. UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (Toronto ON...

  19. [19]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. doi:...

  20. [20]

    Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post...

  21. [21]

    Chandra Prabha, Lynn Silipigni Connaway, Lawrence Olszewski, and Lillie R Jenkins. 2007. What is enough? Satisficing information needs. Journal of Documentation 63, 1 (2007), 74–89.

  22. [22]

    Libo Qin, Xiao Xu, Wanxiang Che, and Ting Liu. 2020. AGIF: An Adaptive Graph-Interactive Framework for Joint Multiple Intent Detection and Slot Filling. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1807–1816. doi:10.18653/v1/2020.fi...

  23. [23]

    Silvia Quarteroni. 2010. Personalized Question Answering. In Traitement Automatique des Langues, Volume 51, Numéro 1 : Varia [Varia], Béatrice Daille, Éric Villemonte de la Clergerie, Yves Lepage, and François Yvon (Eds.). ATALA (Association pour le Traitement Automatique des Langues), France, 97–123. https://aclanthology.org/2010.tal-1.4/

  24. [24]

    Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Shichao Sun, Huanyu Li, Zizhao Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. 2025. RAGCHECKER: a fine-grained framework for diagnosing retrieval-augmented generation. In Proceedings of the 38th Inter...

  25. [25]

    Alireza Salemi, Julian Killingback, and Hamed Zamani. 2025. ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna,...

  26. [26]

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2023. LaMP: When large language models meet personalization. arXiv preprint arXiv:2304.11406 (2023).

  27. [27]

    Alireza Salemi and Hamed Zamani. 2025. LaMP-QA: A Benchmark for Personalized Long-form Question Answering. arXiv:2506.00137 [cs.CL] https://arxiv.org/abs/2506.00137

  28. [28]

    Kwangwook Seo, Donguk Kwon, and Dongha Lee. 2025. MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehv...

  29. [29]

    Herbert A Simon. 1955. A behavioral model of rational choice. The Quarterly Journal of Economics (1955), 99–118.

  30. [30]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025).

  31. [31]

    Yejin Yoon, Jungyeon Lee, Kangsan Kim, Chanhee Park, and Taeuk Kim. 2024. BlendX: Complex Multi-Intent Detection with Blended Patterns. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sa...

  32. [32]

    Hanlei Zhang, Xiaoteng Li, Hua Xu, Panpan Zhang, Kang Zhao, and Kai Gao

  33. [33]

    TEXTOIR: An Integrated and Visualized Platform for Text Open Intent Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Heng Ji, Jong C. Park, and Rui Xia (Eds.). Association for Computational Linguistics, O...

  34. [34]

    Hanlei Zhang, Hua Xu, and Ting-En Lin. 2021. Deep open intent classification with adaptive decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14374–14382.

  35. [35]

    Hanlei Zhang, Hua Xu, Shaojie Zhao, and Qianrui Zhou. 2023. Learning Discriminative Representations and Decision Boundaries for Open Intent Detection. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 31 (April 2023), 1611–1623. doi:10.1109/TASLP.2023.3265203

  36. [36]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL] https://arxiv.org/abs/1904.09675

  37. [37]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025).

A Benchmark Details: Table 7 provides detailed statistics of instance distribution across fine-grai...