IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

Erchin Serpedin; Hasan Kurban; Samir Abdaljalil

arxiv: 2607.01431 · v1 · pith:WHGR3J2Gnew · submitted 2026-07-01 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-03 21:11 UTC · model grok-4.3

keywords LLM evaluation · chain-of-thought · science benchmarks · knowledge retrieval · isomorphic problems · reasoning vs knowledge · cross-domain evaluation

The pith

Isomorphic science problem pairs show that 91 percent of chain-of-thought gains depend on domain knowledge rather than shared logical structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ISOSCI, a benchmark built from problem pairs that match in logical structure but draw on different scientific domains. This setup isolates whether performance lifts from reasoning modes come from better use of structure or from retrieving the right facts. Results across multiple model families show that nearly all such lifts vanish when knowledge requirements change, even though the underlying logic stays fixed. The finding questions whether chain-of-thought prompting supplies a general reasoning advantage on short procedural science tasks. The relative standing of a reasoning-specialized model and its standard counterpart also flips between GPQA Diamond and this benchmark, showing that conclusions about reasoning depend on the test chosen.

Core claim

Across five model pairs from four families, 91.3 percent of reasoning-mode accuracy gains (63 of 69) prove knowledge-dependent rather than structure-invariant, with a Wilson 95 percent confidence interval of 82.3 to 96.0 percent. Reasoning toggles on highly capable models add less than five percentage points of accuracy in every domain, and the reasoning-specialized o3-mini, which beats its standard counterpart by 19.2 points on GPQA Diamond, trails it by 24.7 points on ISOSCI.
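The reported interval can be checked from the two counts alone. The snippet below is a minimal sketch of the standard Wilson score interval for a binomial proportion; the function name `wilson_interval` is illustrative and is not taken from the paper or its released code.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts quoted in the core claim: 63 knowledge-dependent gains out of 69.
low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # 91.3% [82.3%, 96.0%]
```

With only 69 gains in the sample, the interval is wide enough that the lower bound sits nine points below the headline figure.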

What carries the argument

The ISOSCI benchmark of isomorphic cross-domain science problem pairs that share identical logical structure while requiring distinct domain-specific knowledge.

If this is right

Chain-of-thought prompting supplies little general improvement on short-horizon procedural science tasks once knowledge is controlled.
Conclusions about reasoning utility in LLMs can reverse when the benchmark changes.
Reasoning-specialized models can underperform standard models when the test holds logical structure fixed and varies the required knowledge.
Controlled separation of structure and knowledge is required to attribute performance gains accurately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations that do not control for domain knowledge will continue to overstate the benefits of reasoning techniques.
Training focused only on reasoning patterns may show limited transfer to new science domains without parallel knowledge gains.
Expanding the benchmark to additional domains could test whether the knowledge dependence holds beyond the current set.

Load-bearing premise

The benchmark problem pairs truly share identical logical structure and differ only in the domain knowledge they require.

What would settle it

Finding a collection of problem pairs where a majority of reasoning-mode gains remain stable across domains would falsify the claim that gains are mostly knowledge-dependent.

Original abstract

We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at https://huggingface.co/datasets/isosci/isosci

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new benchmark shows most CoT gains on science problems track domain knowledge rather than fixed structure, but the result stands or falls on whether the problem pairs are truly isomorphic.

The core finding is that across several models, turning on reasoning modes produced gains in 69 cases, but 63 of those gains went away when the domain changed even though the logical steps stayed the same. That 91.3% figure with the Wilson interval is the number anyone will remember.
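One way to picture the counting behind that figure is sketched below. The text here does not give the paper's operational definition, so this assumes that a "gain" is a problem the reasoning mode answers correctly and the standard mode does not, and that a gain is "structure-invariant" only if it appears on both members of an isomorphic pair. The names `PairResult`, `classify`, and `knowledge_dependent_share` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Correctness of one model pair on one isomorphic problem pair (assumed schema)."""
    standard_a: bool   # standard mode, domain-A problem
    reasoning_a: bool  # reasoning mode, domain-A problem
    standard_b: bool   # standard mode, domain-B problem
    reasoning_b: bool  # reasoning mode, domain-B problem


def is_gain(standard: bool, reasoning: bool) -> bool:
    """A gain: the reasoning mode is right where the standard mode is wrong."""
    return reasoning and not standard


def classify(result: PairResult) -> str:
    gain_a = is_gain(result.standard_a, result.reasoning_a)
    gain_b = is_gain(result.standard_b, result.reasoning_b)
    if gain_a and gain_b:
        return "structure-invariant"   # gain survives the domain swap
    if gain_a or gain_b:
        return "knowledge-dependent"   # gain appears in one domain only
    return "no-gain"


def knowledge_dependent_share(results: list[PairResult]) -> tuple[int, int]:
    """Return (knowledge-dependent gains, all gains) over a list of results."""
    labels = [classify(r) for r in results]
    dependent = labels.count("knowledge-dependent")
    invariant = labels.count("structure-invariant")
    return dependent, dependent + invariant
```

Under this assumed definition, the reported result corresponds to `knowledge_dependent_share` returning `(63, 69)` on the full evaluation data.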

They do two things cleanly. First, they release the full set of pairs on Hugging Face so others can inspect the construction. Second, they run the same toggle on multiple model families and show that a model strong on GPQA Diamond drops on their set, which makes the point that benchmark choice can flip the story about reasoning utility.

The soft spot is the isomorphism premise itself. The attribution of gains to knowledge rather than structure requires that each pair really shares the same step count, variable mapping, and inference chain. The abstract gives no numbers on how they verified that equivalence: no inter-annotator scores, no step-by-step mapping table, no counter-example search. If even a few pairs have small structural drift, the 63/69 count moves. That is the load-bearing step, and it is not yet visible in the summary.

The work is aimed at people who design or critique LLM reasoning benchmarks for short scientific tasks. It gives them a concrete template for holding structure fixed while varying knowledge. A serious editor should send it to review because the empirical contrast is sharp and the dataset is public; the isomorphism check is the obvious place for referees to press, but the paper is coherent enough on its own terms to deserve that check rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The paper introduces ISOSCI, a benchmark of cross-domain science problem pairs designed to be isomorphic in logical structure (identical step counts, variable mappings, and inference relations) while differing only in domain-specific facts. It evaluates five model pairs across four families and reports that 91.3% (63/69) of accuracy gains from reasoning modes are knowledge-dependent rather than structure-invariant (Wilson 95% CI [82.3%, 96.0%]), with additional findings that reasoning toggles yield <5pp gains on capable models and that o3-mini underperforms its base on ISOSCI despite gains on GPQA Diamond. The dataset is released publicly.

Significance. If the isomorphism claim holds, the work supplies a controlled empirical instrument for separating reasoning from knowledge retrieval in LLM evaluation, directly testing assumptions underlying CoT prompting in procedural science tasks. The concrete percentages, confidence intervals, multi-model coverage, and public release constitute reproducible assets that could influence benchmark design and claims about reasoning utility.

major comments (1)

[Benchmark construction] Benchmark construction section: the central attribution that 91.3% of gains are knowledge-dependent rests on the unverified premise that problem pairs share identical logical structure. No formal equivalence check (step-by-step mapping, inter-annotator agreement on structure, or counter-example search) is reported, leaving open the possibility that undetected structural divergences inflate the knowledge-dependent count.
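As an illustration of the inter-annotator agreement check the referee asks for, the sketch below computes Cohen's kappa for two annotators who each label every pair as structurally equivalent or not. The paper reports no such labels; the function name `cohens_kappa` and the toy labels are assumptions made for this example only.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (not from the paper): "iso" = judged structurally equivalent.
annotator_1 = ["iso", "iso", "iso", "not", "not", "iso", "iso", "iso", "iso", "not"]
annotator_2 = ["iso", "iso", "iso", "not", "iso", "iso", "iso", "iso", "iso", "not"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for chance, which matters here because most pairs are expected to be labelled equivalent and raw agreement would look high even for careless annotators.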

minor comments (1)

[Abstract] Abstract and results section: the phrase 'five model pairs spanning four model families' is used without an accompanying table listing the exact pairs and families, which would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on benchmark construction. The concern about verifying logical isomorphism is central to the paper's claims, and we address it directly below with a commitment to revision.

Referee: [Benchmark construction] Benchmark construction section: the central attribution that 91.3% of gains are knowledge-dependent rests on the unverified premise that problem pairs share identical logical structure. No formal equivalence check (step-by-step mapping, inter-annotator agreement on structure, or counter-example search) is reported, leaving open the possibility that undetected structural divergences inflate the knowledge-dependent count.

Authors: We agree that the manuscript does not report a formal equivalence verification process. Problem pairs were aligned during construction via manual step-by-step mapping to enforce identical step counts, variable mappings, and inference relations, with any detected divergences corrected prior to inclusion. However, no inter-annotator agreement metrics or systematic counter-example search were performed or documented. To address this, we will revise the benchmark construction section to include: (1) a detailed protocol description, (2) concrete examples of the logical mappings for representative pairs, and (3) an explicit statement of the verification limitations. This change will allow readers to evaluate the isomorphism claim more rigorously while preserving the reported results.

revision: yes
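A step-by-step mapping of the kind described in the rebuttal could be made mechanically checkable. The sketch below assumes each problem is written as an ordered list of inference steps and that a variable mapping between the two domains is supplied by hand. The representation, the name `is_isomorphic`, and the example problems are hypothetical and are not drawn from the ISOSCI dataset.

```python
# One inference step: (operation, input variables, output variable).
Step = tuple[str, tuple[str, ...], str]


def is_isomorphic(steps_a: list[Step], steps_b: list[Step], mapping: dict[str, str]) -> bool:
    """Check that renaming problem A's variables via `mapping` reproduces problem B."""
    if len(steps_a) != len(steps_b):
        return False  # step counts differ
    if len(set(mapping.values())) != len(mapping):
        return False  # mapping must be one-to-one
    for (op_a, in_a, out_a), step_b in zip(steps_a, steps_b):
        try:
            renamed = (op_a, tuple(mapping[v] for v in in_a), mapping[out_a])
        except KeyError:
            return False  # a variable in A has no counterpart in B
        if renamed != step_b:
            return False  # operation or inference relation differs
    return True


# Toy pair: two chained multiplications in mechanics and in solution chemistry.
mechanics = [
    ("multiply", ("mass", "acceleration"), "force"),
    ("multiply", ("force", "distance"), "work"),
]
chemistry = [
    ("multiply", ("molarity", "volume"), "moles"),
    ("multiply", ("moles", "molar_mass"), "grams"),
]
mapping = {
    "mass": "molarity", "acceleration": "volume", "force": "moles",
    "distance": "molar_mass", "work": "grams",
}
print(is_isomorphic(mechanics, chemistry, mapping))  # True
```

This check is deliberately strict: it compares steps in order and treats input order as significant, so a real protocol would need to decide whether commutative operations and reordered independent steps still count as the same structure.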

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces a benchmark dataset of problem pairs and reports empirical accuracy counts (63/69 gains) computed from LLM evaluations on the released data. No equations, fitted parameters, predictions, or self-citations appear in the provided text that reduce any result to its own inputs by construction. The isomorphism of pairs is a design premise whose verification is external to the reported statistics; the central claim does not derive from any internal reduction or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that problem pairs are isomorphic in logical structure. No free parameters or invented entities are introduced.

axioms (1)

domain assumption: Problem pairs share identical logical structure but require different domain-specific knowledge.
This premise enables the controlled attribution of reasoning-mode gains and is invoked in the benchmark construction described in the abstract.
