MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
Pith reviewed 2026-05-20 05:06 UTC · model grok-4.3
The pith
LLMs fail to attend to implicit cues in reasoning tasks despite explicit instructions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models fail to attend to subtle yet important contextual cues under explicit task instructions. This is shown by the MixRea benchmark, where the best model among twenty-one tested reaches only 42.8 percent consistency, indicating widespread inattentional blindness rooted in training corpora. Potential Relation Completion Prompting improves performance by recovering overlooked causal relations, yet the limitation continues across diverse multi-source reasoning tasks.
What carries the argument
The MixRea benchmark of 2,246 multiple-choice questions across nine reasoning types that vary the distribution of explicit and implicit information to measure reasoning consistency
Load-bearing premise
The MixRea questions accurately capture real-world cases where implicit information is both present and decision-critical, and low consistency reflects a general attentional bias rather than task-specific artifacts
What would settle it
Showing that models reach high consistency on MixRea questions while retaining strong performance on unrelated benchmarks would indicate the low scores do not reflect a general limitation
Figures
read the original abstract
Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattentional blindness} in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: \emph{failing to attend to subtle yet important contextual cues under explicit task instructions}. To evaluate this, we introduce the task of \textbf{explicit-implicit reasoning} and present \textbf{MixRea}, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8\% consistency, revealing widespread inattentional blindness. To mitigate this, we propose \textbf{Potential Relation Completion Prompting (PRCP)}, a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying explicit and implicit information distributions, to test whether LLMs exhibit inattentional blindness by failing to attend to subtle but decision-critical implicit cues under explicit instructions. Evaluation of 21 LLMs shows the best model (Gemini 2.5 Pro) reaches only 42.8% consistency; the authors propose Potential Relation Completion Prompting (PRCP) to recover overlooked relations and report that the limitation persists across multi-source tasks.
Significance. If the benchmark construction and evaluation controls can be shown to isolate attentional failure rather than general integration load, the result would usefully document a systematic limitation in current LLMs with direct relevance to high-stakes applications. The PRCP prompting method supplies a concrete, immediately testable mitigation; the benchmark itself could become a reusable diagnostic if human baselines and ablations are added.
major comments (2)
- [Methods / Benchmark Construction] Benchmark construction (Methods section): the claim that low consistency specifically reflects inattentional blindness rather than task-specific integration difficulty rests on the unverified assumption that implicit facts were inserted without confounding increases in overall complexity or lexical overlap. No explicit-vs-implicit ablations, controls for reasoning depth, or human performance baselines are reported, so the 42.8% figure for Gemini 2.5 Pro cannot yet be attributed to attentional bias.
- [Evaluation / Results] Evaluation protocol: the abstract and results state the 42.8% consistency without accompanying inter-annotator agreement, question validation statistics, prompt-sensitivity controls, or significance tests. These omissions make it impossible to assess whether the reported gap is robust or an artifact of the particular question set and prompting regime.
minor comments (1)
- [Abstract] The abstract refers to 'varying distributions of explicit and implicit information' across the 9 reasoning types but does not define how these distributions are measured or balanced; a short table or paragraph quantifying the explicit/implicit token ratios per type would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the methodological rigor of the manuscript. We address each major comment below and indicate the revisions made to the manuscript.
read point-by-point responses
-
Referee: [Methods / Benchmark Construction] Benchmark construction (Methods section): the claim that low consistency specifically reflects inattentional blindness rather than task-specific integration difficulty rests on the unverified assumption that implicit facts were inserted without confounding increases in overall complexity or lexical overlap. No explicit-vs-implicit ablations, controls for reasoning depth, or human performance baselines are reported, so the 42.8% figure for Gemini 2.5 Pro cannot yet be attributed to attentional bias.
Authors: We appreciate this observation and agree that stronger isolation of attentional effects from general integration load would improve the attribution. Our construction process (Section 3.1) deliberately kept surface features (sentence length, lexical diversity, and syntactic complexity) matched between explicit-only and mixed conditions by inserting implicit cues via minimal paraphrasing rather than added clauses. Nevertheless, we acknowledge the absence of explicit ablations in the original submission. In the revised manuscript we have added (i) a matched-pair ablation comparing the same questions in explicit-only versus mixed form, (ii) a reasoning-depth control that bins items by number of required inference steps, and (iii) a small-scale human baseline (n=48 participants) showing 84% consistency. These results are reported in a new subsection 4.3 and support that the observed drop is driven by the implicit component rather than overall difficulty. revision: yes
-
Referee: [Evaluation / Results] Evaluation protocol: the abstract and results state the 42.8% consistency without accompanying inter-annotator agreement, question validation statistics, prompt-sensitivity controls, or significance tests. These omissions make it impossible to assess whether the reported gap is robust or an artifact of the particular question set and prompting regime.
Authors: We agree that these statistics are necessary for assessing robustness. The original dataset construction included three-way annotation by domain experts; we have now computed and reported inter-annotator agreement (Fleiss’ κ = 0.89) together with question-validation pass rates in Section 3.3. To address prompt sensitivity we added an appendix (Appendix C) that evaluates five prompt templates and shows the consistency gap remains stable. Finally, we include paired statistical tests (Wilcoxon signed-rank) comparing model consistency scores against chance and against each other, with p-values and effect sizes now appearing in Table 2 and the results section. These additions directly address the concern about potential artifacts. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation is self-contained
full rationale
The paper introduces the MixRea benchmark of 2,246 multiple-choice questions across 9 reasoning types to test explicit-implicit reasoning in LLMs, drawing inspiration from human inattentional blindness theory. It reports empirical results on 21 external LLMs (e.g., Gemini 2.5 Pro at 42.8% consistency) and proposes Potential Relation Completion Prompting (PRCP) as a mitigation. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central claims rest on direct evaluation of independent models against the newly constructed benchmark without any reduction to inputs by construction, self-citation load-bearing premises, or renaming of known results. This is a standard empirical contribution whose validity can be assessed against external benchmarks and human baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs trained on human-preferred corpora embed attentional biases analogous to inattentional blindness
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PRCP prompting method that improves reasoning by recovering overlooked causal relations.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Clancey, William J. Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83)
-
[2]
Classification Problem Solving
Clancey, William J. Classification Problem Solving. Proceedings of the Fourth National Conference on Artificial Intelligence
- [3]
-
[4]
New Ways to Make Microcircuits Smaller---Duplicate Entry
Robinson, Arthur L. New Ways to Make Microcircuits Smaller---Duplicate Entry. Science
-
[5]
Clancey and Glenn Rennels , abstract =
Diane Warner Hasling and William J. Clancey and Glenn Rennels , abstract =. Strategic explanations for a diagnostic consultation system , journal =. 1984 , issn =. doi:https://doi.org/10.1016/S0020-7373(84)80003-6 , url =
-
[6]
Hasling, Diane Warner and Clancey, William J. and Rennels, Glenn R. and Test, Thomas. Strategic Explanations in Consultation---Duplicate. The International Journal of Man-Machine Studies
-
[7]
Poligon: A System for Parallel Problem Solving
Rice, James. Poligon: A System for Parallel Problem Solving
-
[8]
Transfer of Rule-Based Expertise through a Tutorial Dialogue
Clancey, William J. Transfer of Rule-Based Expertise through a Tutorial Dialogue
-
[9]
The Engineering of Qualitative Models
Clancey, William J. The Engineering of Qualitative Models
- [10]
- [11]
-
[12]
International Conference on Learning Representations (ICLR) , year=
React: Synergizing reasoning and acting in language models , author=. International Conference on Learning Representations (ICLR) , year=
-
[13]
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[14]
Finemedlm-o1: Enhancing the medical reasoning ability of llm from supervised fine-tuning to test-time training , author=. arXiv preprint arXiv:2501.09213 , year=
-
[15]
Machine learning for healthcare conference , pages=
Are large language models ready for healthcare? a comparative study on clinical language understanding , author=. Machine learning for healthcare conference , pages=. 2023 , organization=
work page 2023
-
[16]
ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classification Tasks , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[17]
Autonomous chemical research with large language models , author=. Nature , volume=. 2023 , publisher=
work page 2023
-
[18]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Deepresearcher: Scaling deep research via reinforcement learning in real-world environments , author=. arXiv preprint arXiv:2504.03160 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Gorillas in Our Midst: Sustained Inattentional Blindness for Dynamic Events , author=. Perception , year=
-
[20]
The Eleventh International Conference on Learning Representations , year=
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis , author=. The Eleventh International Conference on Learning Representations , year=
-
[21]
D oc LLM : A Layout-Aware Generative Language Model for Multimodal Document Understanding
Wang, Dongsheng and Raman, Natraj and Sibue, Mathieu and Ma, Zhiqiang and Babkin, Petr and Kaur, Simerjot and Pei, Yulong and Nourbakhsh, Armineh and Liu, Xiaomo. D oc LLM : A Layout-Aware Generative Language Model for Multimodal Document Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...
-
[22]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Liao, Wenhui and Wang, Jiapeng and Li, Hongliang and Wang, Chengyu and Huang, Jun and Jin, Lianwen , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =
work page 2025
-
[23]
Forty-second International Conference on Machine Learning , year=
Compositional Condition Question Answering in Tabular Understanding , author=. Forty-second International Conference on Machine Learning , year=
-
[24]
Interpretable Table Question Answering via Plans of Atomic Table Transformations , author=. 2024 , url=
work page 2024
-
[25]
Samuel Holt and Max Ruiz Luyten and Mihaela van der Schaar , booktitle=. L2. 2024 , url=
work page 2024
-
[26]
Advances in neural information processing systems , volume=
Attention is all you need , author=. Advances in neural information processing systems , volume=
-
[27]
L oo GLE : Can Long-Context Language Models Understand Long Contexts?
Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan. L oo GLE : Can Long-Context Language Models Understand Long Contexts?. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.859
-
[28]
doi: 10.18653/v1/2024.acl-long.172
Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi. L ong B ench: A Bilingual, Multitask Benchmark for Long Context Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computation...
-
[29]
L ong A lign: A Recipe for Long Context Alignment of Large Language Models
Bai, Yushi and Lv, Xin and Zhang, Jiajie and He, Yuze and Qi, Ji and Hou, Lei and Tang, Jie and Dong, Yuxiao and Li, Juanzi. L ong A lign: A Recipe for Long Context Alignment of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.74
-
[30]
Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=
work page 2021
-
[31]
Advances in neural information processing systems , volume=
Redpajama: an open dataset for training large language models , author=. Advances in neural information processing systems , volume=
-
[32]
arXiv preprint arXiv:2402.00159 , year=
Dolma: An open corpus of three trillion tokens for language model pretraining research , author=. arXiv preprint arXiv:2402.00159 , year=
- [33]
-
[34]
and Forbes, Maxwell and Choi, Yejin
Emelin, Denis and Le Bras, Ronan and Hwang, Jena D. and Forbes, Maxwell and Choi, Yejin. Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.54
-
[35]
STAR: A Benchmark for Situated Reasoning in Real-World Videos , url =
Wu, Bo and Yu, Shoubin and Chen, Zhenfang and Tenenbaum, Josh and Gan, Chuang , booktitle =. STAR: A Benchmark for Situated Reasoning in Real-World Videos , url =
-
[36]
Li, Qintong and Cui, Leyang and Zhao, Xueliang and Kong, Lingpeng and Bi, Wei. GSM -Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLM s as Mathematical Problem Solvers. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.163
-
[37]
Fan, Lizhou and Hua, Wenyue and Li, Lingyao and Ling, Haoyang and Zhang, Yongfeng. NPH ard E val: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.225
-
[38]
Sun, Jiaxing and Huang, Weiquan and Wu, Jiang and Gu, Chenya and Li, Wei and Zhang, Songyang and Yan, Hang and He, Conghui. Benchmarking C hinese Commonsense Reasoning of LLM s: From C hinese-Specifics to Reasoning-Memorization Correlations. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 20...
-
[39]
S port QA : A Benchmark for Sports Understanding in Large Language Models
Xia, Haotian and Yang, Zhengbang and Wang, Yuqing and Tracy, Rhys and Zhao, Yun and Huang, Dongdong and Chen, Zezhi and Zhu, Yan and Wang, Yuan-fang and Shen, Weining. S port QA : A Benchmark for Sports Understanding in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:...
-
[40]
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
Mostafazadeh, Nasrin and Chambers, Nathanael and He, Xiaodong and Parikh, Devi and Batra, Dhruv and Vanderwende, Lucy and Kohli, Pushmeet and Allen, James. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human ...
-
[41]
First Conference on Language Modeling , year=
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey , author=. First Conference on Language Modeling , year=
-
[42]
The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning
Cui, Shaobo and Jin, Zhijing and Sch. The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.932
-
[43]
Chung, Jiwan and Lee, Sungjae and Kim, Minseo and Han, Seungju and Yousefpour, Ashkan and Hessel, Jack and Yu, Youngjae. Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.143
-
[44]
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
Mondorf, Philipp and Plank, Barbara. Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.404
-
[45]
Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLM s and MLLM s
Deng, Naihao and Sun, Zhenjie and He, Ruiqi and Sikka, Aman and Chen, Yulong and Ma, Lin and Zhang, Yue and Mihalcea, Rada. Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLM s and MLLM s. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.23
-
[46]
Pezeshkpour, Pouya and Hruschka, Estevam. Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.130
-
[47]
Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
Ashida, Mana and Sugawara, Saku. Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios. Proceedings of the 29th International Conference on Computational Linguistics. 2022
work page 2022
-
[48]
L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...
-
[49]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =
Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , booktitle =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =
-
[50]
Large Language Models are Zero-Shot Reasoners , url =
Kojima, Takeshi and Gu, Shixiang (Shane) and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , booktitle =. Large Language Models are Zero-Shot Reasoners , url =
- [51]
- [52]
-
[53]
LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=
work page 2023
-
[54]
Baichuan 2: Open Large-scale Language Models , author=. 2023 , eprint=
work page 2023
- [55]
- [56]
-
[57]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , author=. 2024 , eprint=
work page 2024
-
[58]
Language Models are Few-Shot Learners , url =
Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...
- [59]
- [60]
-
[61]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , author=. 2024 , eprint=
work page 2024
-
[62]
FR o G : Evaluating Fuzzy Reasoning of Generalized Quantifiers in LLM s
Li, Yiyuan and Sun, Shichao and Liu, Pengfei. FR o G : Evaluating Fuzzy Reasoning of Generalized Quantifiers in LLM s. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.411
-
[63]
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding , url =
Zhang, Tao and Li, Xiangtai and Fei, Hao and Yuan, Haobo and Wu, Shengqiong and Ji, Shunping and Loy, Chen Change and Yan, Shuicheng , booktitle =. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding , url =
-
[64]
Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks
Yu, Fangyi and Quartey, Lee and Schilder, Frank. Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.858
-
[65]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. 2023 , eprint=
work page 2023
-
[66]
Large Language Models for Mathematical Reasoning: Progresses and Challenges , author=. 2024 , eprint=
work page 2024
-
[67]
A Survey of Reasoning with Foundation Models , author=. 2024 , eprint=
work page 2024
-
[68]
2024 , month = jun, institution =
Anthropic , title =. 2024 , month = jun, institution =
work page 2024
- [69]
-
[70]
Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=
work page 2024
-
[71]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. 2025 , eprint=
work page 2025
-
[72]
Marathon: A Race Through the Realm of Long Context with Large Language Models
Zhang, Lei and Li, Yunshui and Liu, Ziqiang and Yang, Jiaxi and Liu, Junhao and Chen, Longze and Luo, Run and Yang, Min. Marathon: A Race Through the Realm of Long Context with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.284
-
[73]
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack , url =
Kuratov, Yuri and Bulatov, Aydar and Anokhin, Petr and Rodkin, Ivan and Sorokin, Dmitry and Sorokin, Artyom and Burtsev, Mikhail , booktitle =. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack , url =
-
[74]
F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Jiang, Yuxin and Wang, Yufei and Zeng, Xingshan and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Shang, Lifeng and Jiang, Xin and Liu, Qun and Wang, Wei. F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...
-
[75]
Benchmarking Complex Instruction-Following with Multiple Constraints Composition , url =
Wen, Bosi and Ke, Pei and Gu, Xiaotao and Wu, Lindong and Huang, Hao and Zhou, Jinfeng and Li, Wenchuang and Hu, Binxin and Gao, Wendy and Xu, Jiaxin and Liu, Yiming and Tang, Jie and Wang, Hongning and Huang, Minlie , booktitle =. Benchmarking Complex Instruction-Following with Multiple Constraints Composition , url =
-
[76]
Chen, Xinyi and Liao, Baohao and Qi, Jirui and Eustratiadis, Panagiotis and Monz, Christof and Bisazza, Arianna and de Rijke, Maarten. The SIF o Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.92
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.