Recognition: 2 theorem links
· Lean Theorem
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
Pith reviewed 2026-05-11 02:30 UTC · model grok-4.3
The pith
Converting multiple-choice questions into 2-order logical judgments exposes 31-56% accuracy drops in frontier LLMs, attributable to combinatorial reasoning gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By deterministically transforming 0-order selection into 2-order judgment through combinatorial hardening, ranking items via 9-dimensional analysis of model traces, and applying Item Response Theory for adaptive difficulty control, the work establishes that frontier LLMs exhibit large, consistent performance degradation on the hardened items. The degradation arises specifically from a combinatorial reasoning gap and a completeness-verification deficit rather than from missing knowledge, as evidenced by the absence of comparable failures in humans and by zero-shot, validity-preserving transfer to other benchmarks.
What carries the argument
The LogiHard framework, which performs deterministic combinatorial transformation of 0-order selection into 2-order logical judgment while integrating Item Response Theory for precise difficulty control.
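The transformation can be illustrated with a hypothetical sketch. The function name, statement wording, and item mix below are invented for illustration; per the abstract, the actual rules build judgment statements from exactness, disjunction, and negation operators, with each statement's truth value fixed by the original answer key (which is what makes the mapping validity-preserving by construction).

```python
# Hypothetical sketch of a 0-order -> 2-order hardening step.
# p_i means "option i is correct"; ground truth follows from the answer key.
from itertools import combinations

def harden(options, correct_idx):
    """Turn a single-answer MCQ into judgment items (statement, truth)."""
    n = len(options)
    truth = [i == correct_idx for i in range(n)]  # p_i
    items = []
    # Exactness: EXACT_i = p_i AND (for all j != i, NOT p_j).
    for i in range(n):
        others_false = not any(truth[j] for j in range(n) if j != i)
        items.append((f"Exactly option {i} is correct", truth[i] and others_false))
    # Disjunction: p_i OR p_j for each pair of options.
    for i, j in combinations(range(n), 2):
        items.append((f"Option {i} or option {j} is correct", truth[i] or truth[j]))
    # Negation: NOT p_i.
    for i in range(n):
        items.append((f"Option {i} is not correct", not truth[i]))
    return items

items = harden(["A", "B", "C", "D"], correct_idx=1)
```

A model is then scored on judging each compound statement true or false, rather than on picking one option, which forces it to verify completeness over all options instead of exiting after the first plausible one.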
If this is right
- LLMs exhibit multi-select failure and early exit bias that human test-takers avoid on the same items.
- Zero-shot transfer produces a 47% accuracy drop on MMLU (89.84% to 42.86%) while preserving validity.
- The aggregate degeneration remains consistent and domain-agnostic across tested benchmarks.
- Performance collapse traces to a combinatorial reasoning gap and completeness-verification deficit induced by training rather than knowledge shortfalls.
Where Pith is reading between the lines
- The framework could be applied to additional high-stakes exam domains to map the scope of the reasoning gap without new data collection.
- Model training that emphasizes explicit verification steps might reduce the observed early exit and multi-select patterns.
- Static benchmarks risk underestimating limitations if they remain at 0-order selection without such hardening.
Load-bearing premise
The transformation from 0-order selection to 2-order judgment preserves logical validity without introducing artifacts or new knowledge demands, and the accuracy drops reflect a specific reasoning gap rather than surface changes or evaluation biases.
What would settle it
A direct comparison in which the same models maintain original accuracy levels on the combinatorially hardened 2-order versions or in which degradation correlates with domain-specific knowledge gaps instead of reasoning structure.
read the original abstract
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LogiHard, a deterministic framework that transforms 0-order multiple-choice selection questions into 2-order logical judgment tasks to increase reasoning overhead and steps. It combines this with Item Response Theory (IRT) and computerized adaptive testing (CAT) for difficulty control, constructs the LogiHard-2k dataset by ranking high-stakes exam items via 9-dimensional analysis of model thinking traces, and evaluates twelve frontier LLMs. The paper reports accuracy degradations of 31-56% on the hardened items, identifies LLM-specific failures (multi-select and early-exit bias) absent in humans, and shows a 47% drop (89.84% to 42.86%) under zero-shot MMLU transfer, attributing the consistent degeneration to a combinatorial reasoning gap and training-induced completeness-verification deficit rather than knowledge shortfalls.
Significance. If the transformations preserve logical validity and isolate increased reasoning demands without introducing format or length artifacts, the results would provide evidence of a fundamental compositional limitation in current LLMs that is distinct from knowledge deficits and not mitigated by scale. The integration of IRT/CAT for efficient, controlled evaluation and the 9-dimensional trace analysis for item selection represent methodological strengths over purely ad-hoc hardening approaches. The cross-domain MMLU transfer adds weight to the domain-agnostic claim.
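The IRT/CAT machinery referred to here can be made concrete. The following is an illustrative sketch of the standard 3PL model with Fisher-information item selection, the usual CAT step; all parameter values are invented, and nothing here is taken from the paper's implementation.

```python
# Illustrative IRT 3PL + Fisher-information CAT step (invented parameters).
import math

def p_correct(theta, a, b, c):
    """3PL: probability a testee of ability theta answers the item correctly.
    a = discrimination, b = difficulty, c = guessing floor."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Standard 3PL item-information function at ability theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2

def pick_next_item(theta, item_bank):
    """CAT step: administer the item most informative at the current theta."""
    return max(item_bank, key=lambda it: fisher_info(theta, *it))

# Three hypothetical items as (a, b, c) triples.
bank = [(1.2, -0.5, 0.2), (0.8, 0.0, 0.25), (1.5, 1.0, 0.2)]
best = pick_next_item(0.9, bank)
```

Because information peaks near theta = b, a high-ability testee is steered toward harder items, which is how CAT achieves precise difficulty control with fewer questions than a static benchmark.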
major comments (3)
- [Abstract] The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence-verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than to surface artifacts or evaluation biases.
- [Abstract, MMLU transfer paragraph] The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.
- [Abstract, evaluation paragraph] The accuracy degradations (31% to 56%) and the identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or an ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.
minor comments (2)
- [Abstract] The terms '0-order selection' and '2-order logical judgment' are used without a concise formal definition or example in the abstract; a short illustrative example would improve accessibility.
- [Abstract] No reference is made to prior IRT applications in LLM evaluation or to existing work on logical nesting in reasoning benchmarks; adding 2-3 targeted citations would strengthen the positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment point-by-point below, outlining specific revisions to the manuscript that will incorporate the suggested improvements while preserving the core contributions of LogiHard.
read point-by-point responses
- Referee: [Abstract] The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence-verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than to surface artifacts or evaluation biases.
Authors: We agree that the abstract's brevity leaves the validity preservation claim underspecified. Section 3 of the full manuscript defines the deterministic transformation rules (converting 0-order selection to 2-order judgment via logical equivalence, where the model must judge whether a candidate satisfies the original condition), but we will revise the abstract to include a concise description of these rules and add an appendix with formal equivalence proofs, example transformations, and verification procedures. To address controls, we will incorporate new ablations in the revision that match prompt length, token count, and format across conditions, demonstrating that the observed degradations persist under these controls and are not attributable to surface artifacts. revision: yes
- Referee: [Abstract, MMLU transfer paragraph] The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.
Authors: The MMLU transfer used the identical LogiHard transformation rules and IRT-based item selection on a subset of 100 MMLU items. We will revise the abstract and methods section to explicitly confirm this, report per-item sample details, and include error bars (e.g., standard deviation across items and bootstrap confidence intervals). These additions will clarify that the 47% drop isolates the combinatorial reasoning factor rather than format biases. revision: yes
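The bootstrap confidence intervals proposed in this response could be computed, for example, as a percentile bootstrap over per-item accuracy drops. This is a minimal sketch; the drop values and resampling settings below are invented for illustration.

```python
# Percentile bootstrap CI for the mean per-item accuracy drop (invented data).
import random

def bootstrap_ci(drops, n_boot=10000, alpha=0.05, seed=0):
    """Resample per-item drops with replacement; return the (alpha/2,
    1 - alpha/2) percentile interval of the bootstrap means."""
    rng = random.Random(seed)
    n = len(drops)
    means = sorted(
        sum(rng.choice(drops) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

drops = [0.31, 0.40, 0.47, 0.52, 0.56, 0.44, 0.38, 0.50]  # hypothetical
lo, hi = bootstrap_ci(drops)
```

Reporting such an interval alongside the headline 47% figure would address the referee's concern about per-item variability directly.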
- Referee: [Abstract, evaluation paragraph] The accuracy degradations (31% to 56%) and the identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or an ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.
Authors: We will add statistical controls (paired significance tests across models and items) to the evaluation section. The manuscript already notes that these biases are absent in human testees based on pilot observations; we will expand this with quantitative human performance metrics on the hardened items where available. For isolating the combinatorial element, we will include ablations comparing against length- and format-matched controls in the revision. These changes will strengthen the attribution to the completeness-verification deficit. revision: partial
Circularity Check
No significant circularity; core claims rest on direct empirical measurements
full rationale
The paper's central results consist of measured accuracy degradations (31-56% on LogiHard-2k items and 47% on MMLU zero-shot transfer) obtained by applying the described combinatorial transformation to selected questions and evaluating frontier models. IRT/CAT and 9-dimensional trace analysis serve only for item selection and difficulty ranking; they do not define or derive the reported failure modes (multi-select failure, early-exit bias, completeness-verification deficit) by construction. No equations reduce the performance gap to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and the validity-preservation assertion is presented as an independent property of the deterministic LogiHard mapping rather than a redefinition of the observed drops. The derivation therefore remains self-contained against external model evaluations and human baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- 9-dimensional analysis criteria
axioms (1)
- domain assumption: Combinatorial transformation preserves logical validity of original questions
invented entities (1)
- 2-order logical judgment (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment... validity-by-construction... propositional logic tasks via exactness (EXACT_i ≡ p_i ∧ ⋀_{j≠i} ¬p_j), Disjunction (p_i ∨ p_j), and Negation"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "IRT 3PL model... Gold Score via weighted linear combination of 9 cognitive metrics... CAT engine... Fisher information"
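For readers unfamiliar with the exactness operator quoted in the first passage, a hypothetical Lean 4 rendering is given below; this is an illustration of the formula, not a definition taken from the cited Recognition files.

```lean
-- Hypothetical sketch: "exactly option i is correct",
-- i.e. EXACT_i ≡ p_i ∧ ⋀_{j≠i} ¬p_j, over n options.
def EXACT {n : Nat} (p : Fin n → Prop) (i : Fin n) : Prop :=
  p i ∧ ∀ j, j ≠ i → ¬ p j
```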
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026.
- [3] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025.
- [4] Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Con...
- [5] Simin Chen, Pranav Pusarla, and Baishakhi Ray. Dynamic benchmarking of reasoning capabilities in code large language models under data contamination. In Proceedings of the 42nd International Conference on Machine Learning (ICML). PMLR, 2025.
- [6] Google DeepMind. Gemini 3.1 pro model card, February 2026.
- [7] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
- [8] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.
- [9] GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Luce... Glm-5: from vibe coding to agentic engineering, 2026.
- [10] Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, and Megan Ung. Changing answer order can decrease MMLU accuracy. arXiv preprint arXiv:2406.19470, 2024.
- [11] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [12] Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, and Noah A. Smith. Fluid language model benchmarking. In Second Conference on Language Modeling, 2025.
- [13] Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Yuanzhu Peter Chen, et al. Big-bench extra hard. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26473–26501, 2025.
- [14] Bogdan Kostić, Conor Fallon, Julian Risch, and Alexander Löser. Same meaning, different scores: Lexical and syntactic sensitivity in llm evaluation. arXiv preprint arXiv:2602.17316, 2026.
- [15] Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, and Xiaoming Zhai. Comparative evaluation of openai o1 and human performance in higher order cognition. Scientific Reports, 2025.
- [16] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.
- [17] Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. Okbench: Democratizing llm evaluation with fully automated, on-demand, open knowledge benchmarking, 2025.
- [18] Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023.
- [19] Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. Logicot: Logical chain-of-thought instruction tuning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2908–2921, 2023.
- [20] F.M. Lord. Applications of Item Response Theory to Practical Testing Problems. L. Erlbaum Associates, 1980.
- [21] Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to not answer? investigating abstention abilities of large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9329–9345, 2025.
- [22] Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, and Csaba Szepesvári. Frontier llms still struggle with simple reasoning tasks, 2025.
- [23] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024.
- [24] Luke Moffett and Bhuwan Dhingra. Close or cloze? assessing the robustness of large language models to adversarial perturbations via word recovery. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 6999–701...
- [25] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025.
- [26] NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav M...
- [27] Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, and Yong Jae Lee. Contamination detection for vlms using multi-modal semantic perturbation. International Conference on Learning Representations, 2026.
- [28] Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, 2024.
- [29] Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025.
- [30] Eva Sánchez Salido, Julio Gonzalo, and Guillermo Marco. None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks. arXiv preprint arXiv:2502.12896, 2025.
- [31] Parshin Shojaee*, Iman Mirzadeh*, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. In NeurIPS, 2025.
- [32] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...
- [33] Yifan Sun, Han Wang, Dongbai Li, Gang Wang, and Huan Zhang. The emperor's new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination. arXiv preprint arXiv:2503.16402, 2025.
- [34] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- [35] Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less of the right: parallel patterns in human and llm performance on multi-choice questions answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20112–20134, 2025.
- [36] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [37] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp, 2021.
- [38] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. In The...
- [39] Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, and William Yang Wang. AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the ...
- [40] Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors, Proceedin...
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [42] Gulsum Yigit and Mehmet Fatih Amasyali. Adversarial distractor generation for mcqa: Leveraging in-context learning and rule-based approaches. Natural Language Processing Journal, 13:100186, 2025.
- [43] Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei. Mmlu-cf: A contamination-free multi-task language understanding benchmark, 2024.
Appendix A (Implementation Details, spilled into the extraction): All models were accessed through their respective official public APIs: GPT-5.4 and o3 (OpenAI), Claude-Opus-4...
Appendix excerpts (spilled into the reference extraction):
- Prompt template: analyze step by step, explaining the reasoning basis for each step; evaluate each option (correct, or reason for elimination); finally provide a definitive answer. Please begin. Strict mode: "Please solve the following logical reasoning problem. Before giving the final answer, show your complete thinking process: {base} Please begin your detailed reasoning:". Minimal mode: "{base} Please reason in detail before giving your answer:". Human evaluators received the following instruction: "...
- Worked example: premises are "Old Zhang wins OR Old Yan wins" (P ∨ Q), "If Old Zhang wins, the overseas project is damaged" (P → R), and "If Old Yan wins, the domestic project is paused" (Q → S). Evaluating Statement I ("The company's overseas project might not be damaged, and the domestic product development project won't be paused"): from P ∨ Q with P → R and Q → S we get R ∨ S (overseas damaged OR domestic paused), so it is not possible that...
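The propositional entailment used in the worked example above (from P ∨ Q, P → R, and Q → S, infer R ∨ S) can be checked exhaustively. This is a minimal illustrative script, not part of the paper's tooling.

```python
# Truth-table check: (P or Q), (P -> R), (Q -> S) together entail (R or S).
from itertools import product

def entails():
    """Return True iff no valuation satisfies the premises but not R or S."""
    for p, q, r, s in product([False, True], repeat=4):
        premises = (p or q) and ((not p) or r) and ((not q) or s)
        if premises and not (r or s):
            return False  # counterexample found
    return True
```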