AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

Christopher Pal; Gaurav Sahu; Jose Dolz; Laurent Charlin; Marco Pedersoli; Shambhavi Mishra

arxiv: 2510.05432 · v2 · submitted 2025-10-06 · 💻 cs.AI

Pith reviewed 2026-05-18 09:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords large language models · parametric knowledge · research problem solving · iterative refinement · AI research · knowledge boundaries · solution generation

The pith

Large language models solve over seventy percent of AI research problems using only their parametric knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can generate and refine solutions to real AI research problems using nothing beyond the knowledge already stored in their parameters. The authors introduce the AInstein framework, which runs the model through repeated cycles of proposing a solution and critiquing its own output. They first validate an automated scoring system against twenty human experts on held-out problems, then apply it to more than twelve hundred recent papers. The key result is that models address the stated problem more than seventy percent of the time yet match the exact published approach less than nineteen percent of the time, which the authors read as evidence of genuine problem-solving inside familiar areas. They also locate a clear limit: performance collapses when the needed idea requires drawing an analogy from a distant field.

Core claim

Using only parametric memory, large language models generate solutions to AI research problems that address the stated goals in over seventy percent of cases across more than one thousand papers. At the same time, these solutions match the approach taken in the actual published paper less than nineteen percent of the time. This gap indicates that the models are not merely retrieving memorized answers but are engaging in problem-solving processes. The framework reveals a parametric knowledge boundary beyond which models cannot transfer ideas across domains without additional support.

What carries the argument

AInstein, a framework that runs an LLM through iterative critique loops to generate, evaluate, and refine research solutions from parametric knowledge alone.
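
The generate, critique, and refine cycle can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `refine`, `toy_propose`, `toy_critique`, and `max_rounds` are hypothetical names, a single critic stands in for the internal and external critique the paper describes, and the toy functions replace real LLM calls.

```python
from typing import Callable, Tuple

def refine(problem: str,
           propose: Callable[[str, str], str],
           critique: Callable[[str, str], Tuple[bool, str]],
           max_rounds: int = 3) -> str:
    """Draft a solution, then revise it until the critic accepts or rounds run out."""
    feedback = ""                              # no feedback before the first draft
    solution = propose(problem, feedback)
    for _ in range(max_rounds):
        accepted, feedback = critique(problem, solution)
        if accepted:
            break
        solution = propose(problem, feedback)  # revise using the critic's feedback
    return solution

# Toy stand-ins for model calls (assumptions, for illustration only).
def toy_propose(problem: str, feedback: str) -> str:
    return f"draft for {problem}" + (" (revised)" if feedback else "")

def toy_critique(problem: str, solution: str) -> Tuple[bool, str]:
    ok = solution.endswith("(revised)")
    return ok, ("" if ok else "add detail")

if __name__ == "__main__":
    print(refine("P", toy_propose, toy_critique))
```

Capping the number of rounds keeps the loop bounded even when the critic never accepts; the paper's actual stopping rule is not stated on this page.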

Load-bearing premise

The automated LLM judge, after validation against twenty human experts on a small set of held-out problems, continues to measure success and rediscovery accurately when scaled to the full set of recent papers.

What would settle it

A new blind study in which human experts rate a random sample of the generated solutions and find that their success and rediscovery judgments differ substantially from the automated scores.
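
Such a check comes down to comparing expert labels with judge labels on the same sample. The sketch below uses Cohen's kappa as the agreement statistic, which is an assumption: the page does not say which statistic the paper reports. `cohen_kappa` and the toy labels are illustrative and are not the authors' data.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:          # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

if __name__ == "__main__":
    human = [1, 1, 0, 0]   # hypothetical expert verdicts
    judge = [1, 0, 0, 0]   # hypothetical LLM-judge verdicts
    print(cohen_kappa(human, judge))
```

A kappa near zero on a fresh random sample would indicate that the automated scores agree with experts no better than chance, which is the outcome that would undercut the headline percentages.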

Figures

Figures reproduced from arXiv: 2510.05432 by Christopher Pal, Gaurav Sahu, Jose Dolz, Laurent Charlin, Marco Pedersoli, Shambhavi Mishra.

**Figure 1.** The AINSTEIN framework. An input scientific abstract (A) is first derived into a generalized problem (P) by the Generalizer agent (G). The Solver agent (S) then attempts to derive a technical solution (Z), using the problem statement P. Both phases employ an iterative refinement loop with internal (Mi) and external (Me) critique. Note: The transition A → P follows the same iterative refinement mechanism, w…

**Figure 2.** Correlation matrix for Generalizer qual…

**Figure 3.** Performance comparison of internal models across ICLR paper tiers (Oral, Spotlight, Poster), averaging…

**Figure 4.** Visual analysis of the 11 identified research clusters. (a) Top keywords provide a thematic summary…

read the original abstract

Can large language models solve AI research problems using only their parametric knowledge, without fine-tuning, retrieval, or other external aids? We introduce AInstein, a framework for testing whether LLM agents can generate and refine solutions to research problems through iterative critique loops. A blind study with 20 domain experts on held-out ICLR 2026 problems validates our automated metrics, which we then scale to 1,214 ICLR 2025 papers using an LLM-as-a-judge paradigm. Two metrics capture complementary aspects of performance: Success Rate (does the solution address the problem?) and Rediscovery (does it match the published approach?). LLMs succeed on over 70% of problems, yet strictly rediscover the published solution less than 19% of the time, suggesting genuine problem-solving rather than associative recall. However, this ability has clear limits: models handle familiar methodological territory well but fail when solutions require cross-domain analogical transfer, a pattern we call the parametric knowledge boundary. On the ResearchPlanGen benchmark (2,645 problems), our training-free iterative refinement strategy matches RL finetuning, and a criteria-coverage analysis pins down the ceiling of what test-time refinement alone can achieve. Together, these findings map both the capabilities and the limits of LLMs as autonomous scientific problem-solvers.
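
The two metrics can be read as simple proportions over per-problem judge verdicts. The sketch below assumes both rates are computed over all evaluated problems, which the abstract does not state; `Verdict`, `rates`, and the toy data are hypothetical and do not reproduce the paper's results.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Verdict:
    success: bool       # judge says the solution addresses the problem
    rediscovery: bool   # judge says it matches the published approach

def rates(verdicts: List[Verdict]) -> Tuple[float, float]:
    """Return (success rate, rediscovery rate) over all verdicts."""
    n = len(verdicts)
    if n == 0:
        raise ValueError("no verdicts to aggregate")
    success_rate = sum(v.success for v in verdicts) / n
    rediscovery_rate = sum(v.rediscovery for v in verdicts) / n
    return success_rate, rediscovery_rate

if __name__ == "__main__":
    # Toy data: 10 problems, 7 addressed, 2 of those matching the published approach.
    toy = ([Verdict(True, True)] * 2
           + [Verdict(True, False)] * 5
           + [Verdict(False, False)] * 3)
    print(rates(toy))
```

Keeping the two flags separate per problem is what makes the gap between them measurable: a solution can count as a success without counting as a rediscovery.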

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs solve over 70% of these research problems from parametric memory with refinement but rediscover the published solution less than 19% of the time, though the LLM judge's reliability at scale is the main open question.

This paper's key finding is that LLMs solve over 70% of AI research problems using only parametric knowledge and iterative refinement, but they strictly rediscover the published solution in under 19% of cases. That gap points to actual problem-solving rather than simple recall, though with clear limits on cross-domain transfer.

What stands out as new is the AInstein framework itself, which combines critique loops with dual metrics for success and rediscovery. The authors validate those metrics through a blind expert study on held-out ICLR 2026 problems before scaling the LLM judge to 1,214 ICLR 2025 papers. The comparison on ResearchPlanGen, where their training-free method matches RL finetuning, adds a useful benchmark angle. The identification of a parametric knowledge boundary is a clean way to frame where these models hit walls.

The work does well on ambition and scope. Running this at the scale of over a thousand papers gives a broader picture than smaller studies, and separating success from rediscovery helps clarify what the models are actually doing.

The main soft spot is the judge's reliability when moving from 20 expert validations to the full set. The stress test raises a fair point about potential misclassification on partial methodological overlaps or how strict matching gets defined in prompts. If the paper includes inter-rater stats or detailed examples of the judge's decisions, that would shore it up; otherwise it leaves some room for doubt on the exact percentages.

This is for researchers focused on AI for scientific discovery and LLM agent design. Anyone evaluating how far current models can go without external tools will find the results relevant. It deserves a serious referee because the empirical scale and the novel metric distinction make it worth detailed scrutiny, even if some methodological details need tightening. I would send this to peer review. The central claims are grounded enough in the large evaluation to merit expert feedback.

Referee Report

3 major / 2 minor

Summary. The paper introduces the AInstein framework to test whether LLMs can solve AI research problems using only parametric knowledge, without fine-tuning or retrieval, via iterative critique and refinement loops. A blind validation study with 20 domain experts on held-out ICLR 2026 problems is used to validate automated Success Rate and Rediscovery metrics, which are then scaled via an LLM-as-a-judge to 1,214 ICLR 2025 papers. Key results include >70% success rate but <19% strict rediscovery of published solutions (interpreted as genuine problem-solving rather than recall), identification of a 'parametric knowledge boundary' for cross-domain transfer, and a finding that training-free refinement on the ResearchPlanGen benchmark (2,645 problems) matches RL finetuning performance.

Significance. If the metrics prove reliable at scale, the work would offer a substantial empirical characterization of LLM limits and capabilities as autonomous scientific agents, with the large-scale evaluation across 1,200+ papers and the direct comparison of test-time refinement to RL finetuning providing concrete benchmarks for future agent research.

major comments (3)

[§4 (Blind Validation Study)] The validation relies on 20 domain experts for the automated metrics, but the manuscript provides no inter-rater agreement statistics (e.g., Cohen's or Fleiss' kappa), confusion matrices, or explicit operationalization of 'strict' rediscovery matching in the LLM judge prompts. This directly undermines the load-bearing interpretation that the <19% rediscovery rate demonstrates genuine parametric problem-solving rather than associative recall or metric error.
[§5 (Scaling to ICLR 2025)] When the same LLM-as-a-judge is applied to the full 1,214-paper corpus, there is no reported analysis of partial methodological overlap cases or sensitivity to prompt variations for rediscovery detection. Systematic misclassification here would inflate the gap between the 70% Success Rate and low Rediscovery figures, weakening the central claim.
[§6 (Parametric Knowledge Boundary)] The claim that failures occur specifically on cross-domain analogical transfer (as opposed to other factors such as iterative loop design or prompt sensitivity) lacks supporting ablations or quantitative breakdowns; this distinction is central to the paper's mapping of LLM limits.
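
The inter-rater statistic requested in the first major comment could be computed along these lines. This is a sketch of the standard Fleiss' kappa formula, assuming every item is rated by the same number of raters (at least two); whether the expert study has that design is not stated here. The function name and the toy count matrices are illustrative.

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa. counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    total = n_items * n_raters
    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:                      # all ratings in a single category
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)

if __name__ == "__main__":
    # Three raters, two items, two categories (e.g. "rediscovery" vs "not").
    print(fleiss_kappa([[3, 0], [0, 3]]))   # raters agree on every item
    print(fleiss_kappa([[2, 1], [1, 2]]))   # raters split on every item
```

Reporting this value separately for the Success Rate and Rediscovery judgments would show whether the experts themselves agree on what counts as a strict match.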

minor comments (2)

[Abstract] The abstract references a 'criteria-coverage analysis' that pins down the ceiling of test-time refinement; this should be given a dedicated subsection with explicit metrics and results for clarity.
[§3 (Metrics)] Notation for the two core metrics (Success Rate and Rediscovery) could be formalized earlier, e.g., with explicit definitions or pseudocode, to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, proposing targeted revisions to improve the rigor and transparency of our validation and analyses.

Referee: [§4 (Blind Validation Study)] The validation relies on 20 domain experts for the automated metrics, but the manuscript provides no inter-rater agreement statistics (e.g., Cohen's or Fleiss' kappa), confusion matrices, or explicit operationalization of 'strict' rediscovery matching in the LLM judge prompts. This directly undermines the load-bearing interpretation that the <19% rediscovery rate demonstrates genuine parametric problem-solving rather than associative recall or metric error.

Authors: We appreciate this observation and agree that inter-rater reliability metrics are essential for validating our automated measures. In the revised manuscript, we will compute and report Fleiss' kappa for the agreement among the 20 domain experts on both the Success Rate and Rediscovery judgments. We will also include confusion matrices that compare the expert annotations with the LLM-as-a-judge outputs. Furthermore, we will provide the full LLM judge prompt in the appendix and explicitly operationalize 'strict' rediscovery as requiring that the generated solution matches the published work in its primary methodological contribution and key innovations, rather than superficial similarities. These changes will bolster confidence in the <19% rediscovery rate as evidence of genuine problem-solving. revision: yes
Referee: [§5 (Scaling to ICLR 2025)] When the same LLM-as-a-judge is applied to the full 1,214-paper corpus, there is no reported analysis of partial methodological overlap cases or sensitivity to prompt variations for rediscovery detection. Systematic misclassification here would inflate the gap between the 70% Success Rate and low Rediscovery figures, weakening the central claim.

Authors: We concur that examining partial overlaps and prompt sensitivity is crucial to rule out systematic biases in the scaling. Accordingly, we will revise Section 5 to include an analysis of partial methodological overlap cases, providing examples and quantifying their frequency and impact on the Rediscovery metric. Additionally, we will perform a sensitivity analysis by testing multiple variations of the rediscovery detection prompt and report the resulting range of Rediscovery rates, demonstrating that the low rediscovery figure remains stable. This will address concerns about potential misclassification inflating the observed gap. revision: yes
Referee: [§6 (Parametric Knowledge Boundary)] The claim that failures occur specifically on cross-domain analogical transfer (as opposed to other factors such as iterative loop design or prompt sensitivity) lacks supporting ablations or quantitative breakdowns; this distinction is central to the paper's mapping of LLM limits.

Authors: To provide stronger evidence for this distinction, we will add new ablations and quantitative breakdowns in the revised Section 6. Specifically, we will categorize a sample of failure cases according to whether they primarily stem from cross-domain analogical transfer requirements, issues with the iterative critique loop design, or sensitivity to prompt phrasing. We will report the proportions for each category based on expert review of a subset of cases, to test whether cross-domain transfer accounts for the majority of failures, as our characterization of the parametric knowledge boundary implies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metrics grounded by external expert validation

The paper validates its automated Success Rate and Rediscovery metrics via a blind study with 20 domain experts on held-out ICLR 2026 problems before scaling the LLM-as-a-judge to the 1,214 ICLR 2025 papers. This provides independent external grounding against human benchmarks on a separate set, so the central claim (70%+ success with <19% strict rediscovery indicating genuine parametric problem-solving) does not reduce to self-definition, fitted inputs, or self-citation chains by construction. The derivation chain remains self-contained with no load-bearing steps that equate outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the domain assumption that ICLR problems are representative of AI research tasks solvable from parametric knowledge and that expert-validated automated metrics generalize to the larger paper set.

axioms (1)

domain assumption: ICLR 2025 and 2026 papers represent typical AI research problems whose solutions can be evaluated for success and rediscovery from parametric knowledge alone.
The evaluation framework assumes these conference problems are suitable proxies for testing LLM autonomous problem-solving ability.

invented entities (1)

parametric knowledge boundary (no independent evidence)
purpose: To describe the observed performance drop when solutions require cross-domain analogical transfer.
This term is introduced to characterize a specific failure pattern in the experimental results.

pith-pipeline@v0.9.0 · 5778 in / 1575 out tokens · 57604 ms · 2026-05-18T09:26:36.077952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

We introduce AInstein, a framework for testing whether LLM agents can generate and refine solutions to research problems through iterative critique loops... Success Rate... Rediscovery... Novelty

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
cs.LG 2026-05 unverdicted novelty 6.0

FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

