
arXiv: 2605.21491 · v1 · pith:DC67C3DA · new · submitted 2026-04-06 · 💻 cs.LG · cs.AI · cs.CL

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Pith reviewed 2026-05-22 01:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords language models · research ideas · success forecasting · comparative evaluation · supervised fine-tuning · reinforcement learning · benchmark performance · scientific discovery

The pith

Fine-tuned language models predict which research idea will succeed on benchmarks with 77 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks if language models can forecast the success of research ideas by comparing pairs of them against known benchmark results. The authors build a dataset of 11,488 such pairs from PapersWithCode and find that base models guess correctly only about 30 percent of the time. After supervised fine-tuning, an 8B model reaches 77.1 percent accuracy, which is higher than GPT-5 at 61.1 percent. They also train models with reinforcement learning to produce step-by-step reasoning for their choices. The results hold up on tests that check for reliance on superficial patterns and on data from different domains or later times.

Core claim

The central claim is that language models can be trained to forecast empirical success of research ideas through comparative evaluation of idea pairs drawn from objective benchmark outcomes. Supervised fine-tuning on 11,488 pairs allows 8B models to achieve 77.1% accuracy in selecting the better idea, outperforming GPT-5. Reinforcement Learning with Verifiable Rewards enables the models to find latent reasoning paths and generate interpretable justifications at 71.35% accuracy. Ablations confirm the models resist surface heuristics and the performance transfers to cross-domain and time-split test sets.

What carries the argument

The comparative idea evaluation task on pairs grounded in PapersWithCode benchmark results, trained first with supervised fine-tuning and then with Reinforcement Learning with Verifiable Rewards to elicit reasoning.
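As a rough illustration of the task format, the sketch below scores pairwise predictions against benchmark-derived labels. The `IdeaPair` fields, the "A"/"B" label convention, and the binary reward are assumptions made for this example; the page does not specify the paper's data schema or its exact reward function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdeaPair:
    """One comparison item (hypothetical schema, not the paper's)."""
    goal: str    # benchmark-specific research goal
    idea_a: str  # description of the first idea
    idea_b: str  # description of the second idea
    label: str   # "A" or "B": which idea scored higher on the benchmark


def verifiable_reward(predicted: str, pair: IdeaPair) -> float:
    """Binary reward: 1.0 if the predicted winner matches the label, else 0.0."""
    return 1.0 if predicted.strip().upper() == pair.label else 0.0


def pairwise_accuracy(predictions: list[str], pairs: list[IdeaPair]) -> float:
    """Fraction of pairs on which the predicted winner matches the label."""
    if len(predictions) != len(pairs):
        raise ValueError("predictions and pairs must have the same length")
    if not pairs:
        raise ValueError("at least one pair is required")
    total = sum(verifiable_reward(p, q) for p, q in zip(predictions, pairs))
    return total / len(pairs)
```

Because the reward depends only on the final choice, any reasoning text produced before the answer is left unscored in this sketch.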

If this is right

  • Small language models become effective objective verifiers for screening research ideas without running experiments.
  • The training approach generalizes across domains and time periods in held-out test sets.
  • Models can supply interpretable justifications for their forecasts of success.
  • Compute-efficient models offer a scalable way to support autonomous filtering in scientific discovery pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the approach works broadly, AI systems could use these verifiers to prioritize which generated ideas to test first in a discovery loop.
  • Similar comparative training could be applied in other fields where historical experiment outcomes are available as labels.
  • Future work might combine this evaluation with idea generation models to create closed-loop research automation.

Load-bearing premise

The 11,488 idea pairs from PapersWithCode supply unbiased labels that reflect true empirical success and extend to new ideas outside the dataset.

What would settle it

Running experiments on a fresh set of research ideas and finding that the model's predicted winners do not actually achieve higher benchmark scores than the predicted losers.
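One way to make this check concrete is sketched below: count how often the predicted winner actually scores higher, then ask how likely that count would be under coin-flip guessing. The function names, the higher-is-better score convention, and the choice of a one-sided exact binomial test are assumptions for illustration, not a procedure taken from the paper.

```python
from math import comb


def winner_rate(score_pairs: list[tuple[float, float]]) -> tuple[int, int]:
    """Count decisive comparisons won by the predicted winner.

    Each tuple is (predicted_winner_score, predicted_loser_score),
    with higher scores assumed to be better. Ties are dropped.
    """
    decisive = [(w, l) for w, l in score_pairs if w != l]
    wins = sum(1 for w, l in decisive if w > l)
    return wins, len(decisive)


def binomial_p_value_at_least(wins: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))
```

A win rate near one half, with a large p-value, would be the outcome that undermines the forecasting claim.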

Figures

Figures reproduced from arXiv: 2605.21491 by Aniketh Garikaparthi, Manasi Patwardhan, Srujan P Mule.

Figure 1. We explore various methods to fine-tune 8B-parameter language models using our constructed dataset.
Figure 2. Dataset Construction Pipeline. We use raw entries from 1,918 NLP leaderboards to construct statistically grounded idea pairs with a benchmark-specific research goal, while difficulty stratification ensures robust evaluation across diverse research goals.
  Excerpt from the surrounding text: …rors are discarded, resulting in 5,695 RR and 832 Original markdown papers. Research Goal and Idea Extraction. For each one of the 1,918 leaderboards, we…
Figure 3. Differential Analysis (∆ based) with Bootstrap statistical significance tests across different Difficulty Subsets (σ) and Overall Performance. ∗∗: p < 0.01; ∗: p < 0.05.
  Excerpt from the surrounding text: Reason-SFT-DAPO and Synthetic-Reason-SFT-DAPO produce consistent and coherent reasoning traces prior to the final answer, while being resilient to the form of reward hacking seen in Reason-DAPO, showcasing that it is possible to induce i…
Figure 4. Consistency (%) across different stages and training paradigm of Qwen3 Model.
  Excerpt from the surrounding text: 7.4 Robustness analysis. We test the robustness of the trained models for some features they might be exploiting: (i) Length: categorize the idea pairs based on cases where longer idea is better and otherwise, (ii) Recency: categorize the idea pairs based on cases where the newer idea (published later) is better or worse, (iii) …
Figure 6. Distribution of the ideas/methods across the …
Figure 7. Distribution of the ideas/methods across the …
Figure 8. The average rewards through the training …
Figure 10. Overall Accuracy (%) vs. Mean Number of tokens generated during reasoning.
  Excerpt from the surrounding text: …els across different stress tests and difficulty levels of idea pair comparison. B.7.1 Difficulty vs. Length Sensitivity. Reason-DAPO: There is a strong inverse correlation between task difficulty and length sensitivity. As the difficulty increases (moving from 3-σ to 1-σ), the performance gap between “Longer” and “Shorter” inpu…
Figure 11. Distribution of consistency rate (%) across different research goals/leaderboards for the cross-domain test set.
Figure 12. Distribution of RMSE across different research goals/leaderboards for the cross-domain test set.
  Excerpt from the surrounding text: …2 different ranks predicted. This might still not be enough since we see high Top-1 accuracy for base models when they actually don’t do very well. Reason-SFT-DAPO and Reason-SFT-DrGRPO achieve better RMSE than GPT-5 (across all reasoning efforts) on CD test, hence showing the potential of such models in fil…
Figure 13. Distribution of consistency rate (%) across different research goals/leaderboards for the in-domain test set.
Figure 14. Distribution of RMSE across different research goals/leaderboards for the in-domain test set.
  Table fragment from the surrounding text (truncated in the source):

  Family  Model                      DB ↓    DE ↓    DM ↓
  Qwen    Base                       0.2507  0.0937  0.2832
          Base (Reason)              0.2635  0.1392  0.3276
          Direct-SFT                 0.1676  0.1390  0.1915
          Reason-DAPO                0.1755  0.1584  0.5475
          Reason-SFT-DAPO            0.1941  0.1688  0.3112
          Reason-SFT-DrGRPO          0.1970  0.1734  0.3641
          Synthetic-Reason-SFT-DAPO  0.2186  0.2009  0.4320
  Llama   Base                       0.2685  0.1552  0.3275
          B…
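The Figure 3 caption mentions bootstrap significance tests on accuracy differences (∆) between models. The page does not describe the exact procedure, so the following is only a minimal sketch of one common variant, a paired bootstrap over test items. The function name, the resample count, and the one-sided p-value definition are assumptions for this example.

```python
import random


def paired_bootstrap_delta(
    correct_a: list[int],
    correct_b: list[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Paired bootstrap for the accuracy difference of model A over model B.

    correct_a[i] and correct_b[i] are 1 if the model got item i right, else 0.
    Returns (observed_delta, p_value), where p_value is the share of
    resamples in which A does not beat B (delta <= 0).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if delta <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

Resampling items rather than models keeps both systems evaluated on the same draws, which matters when some idea pairs are much harder than others.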
Original abstract

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a method for training language models to predict which of two research ideas will perform better on a specific benchmark by constructing a dataset of 11,488 comparative idea pairs from PapersWithCode. It demonstrates that supervised fine-tuning improves an 8B model's accuracy from 30% to 77.1%, surpassing GPT-5's 61.1%, and that RLVR further enables interpretable reasoning at 71.35% accuracy. The paper includes ablations and out-of-distribution evaluations to support the robustness of the approach.

Significance. If the central claims hold, this work could provide a valuable tool for filtering promising research ideas generated by AI systems, reducing the need for exhaustive experimentation and supporting more autonomous scientific discovery pipelines. The grounding in objective benchmark outcomes rather than subjective assessments is a notable strength, as is the exploration of both SFT and RLVR approaches with reported OOD generalization.

major comments (2)
  1. [Dataset construction] Dataset construction (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.
  2. [Results and OOD evaluation] Results and OOD evaluation sections: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.
minor comments (1)
  1. [Abstract] Abstract: 'GPT-5' is referenced without clarification; confirm whether this refers to a specific model version or if it is a placeholder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating where we have revised the paper to better acknowledge limitations while defending the core contributions on substantive grounds.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.

    Authors: We appreciate the referee's concern about selection bias. However, every pair in our dataset consists of two ideas that both resulted in published papers with reported benchmark results; the label is determined solely by which idea achieved strictly superior performance on the target benchmark. The supervision signal is therefore relative empirical success between two viable, published ideas rather than a binary publishable/non-publishable distinction. This design reduces the risk that the model is merely learning generic publishability correlates. Our ablations further show that accuracy degrades when surface features such as idea length, lexical complexity, or publication year are explicitly controlled or masked, indicating the model exploits more substantive content. We have added a dedicated paragraph in the revised Limitations section discussing the scope of the published-literature distribution and the fact that time-split and cross-domain OOD tests remain within it. We do not claim the model would perform identically on entirely unpublished or failed ideas. revision: partial

  2. Referee: [Results and OOD evaluation] Results and OOD evaluation sections: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.

    Authors: We agree that all reported accuracies and robustness claims are conditioned on ground-truth labels derived from published benchmark outcomes. The 77.1% SFT and 71.35% RLVR figures therefore reflect performance in forecasting relative success among ideas that reached the stage of public benchmark reporting. Our ablation studies were designed precisely to test whether the model relies on surface heuristics (e.g., temporal trends, verbosity) rather than idea content; performance remains substantially above chance even after these controls. In the revised manuscript we have tempered language in the Results and OOD sections to state that transfer is demonstrated to new domains and later time periods within the published literature, and we explicitly caution that extrapolation to ideas never submitted to benchmarks remains untested. These clarifications appear in both the main text and the new Limitations subsection. revision: partial
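The length check discussed in the responses above can be pictured with a small sketch: split the test pairs by whether the longer idea is the labeled winner, then compare accuracy across the two buckets. The dictionary keys, the whitespace word count as a length measure, and the bucket names are assumptions for illustration and are not taken from the paper.

```python
def accuracy_by_length_bucket(examples: list[dict]) -> dict:
    """Accuracy split by whether the longer idea is the labeled winner.

    Each example is a dict with hypothetical keys:
    "idea_a", "idea_b" (text), "label" and "prediction" ("A" or "B").
    Pairs of equal length are skipped. Empty buckets map to None.
    """
    buckets = {"longer_is_better": [0, 0], "shorter_is_better": [0, 0]}
    for ex in examples:
        len_a = len(ex["idea_a"].split())
        len_b = len(ex["idea_b"].split())
        if len_a == len_b:
            continue
        longer = "A" if len_a > len_b else "B"
        key = "longer_is_better" if ex["label"] == longer else "shorter_is_better"
        buckets[key][0] += int(ex["prediction"] == ex["label"])  # correct count
        buckets[key][1] += 1                                     # bucket size
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}
```

A model that simply favors the longer description would score well in the first bucket and poorly in the second, so a large gap between the two is the signal to look for.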

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper constructs an external dataset of 11,488 idea pairs whose labels derive from reported benchmark outcomes on PapersWithCode. It then applies standard supervised fine-tuning and RLVR (with verifiable rewards tied to those same external labels) and reports accuracy on held-out time-split, cross-domain, and independently constructed test sets. No equations, self-definitions, or self-citations are invoked to force the reported performance numbers; the metrics are measured directly against the independent ground-truth labels rather than against quantities defined by the model's own parameters. The methodology is therefore self-contained and externally benchmarked.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that historical benchmark scores constitute unbiased labels for future idea quality; no free parameters or invented entities are declared in the abstract.

axioms (1)
  • domain assumption: Benchmark performance on PapersWithCode is an objective and generalizable proxy for research idea success
    Used to create the 11,488 labeled pairs that serve as training and evaluation targets.

pith-pipeline@v0.9.0 · 5747 in / 1294 out tokens · 48715 ms · 2026-05-22T01:40:03.518560+00:00 · methodology


