Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
Pith reviewed 2026-05-22 01:40 UTC · model grok-4.3
The pith
Fine-tuned 8B language models pick which of two research ideas will score higher on a benchmark with 77.1 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that language models can be trained to forecast the empirical success of research ideas through comparative evaluation of idea pairs labeled by objective benchmark outcomes. Using a dataset of 11,488 pairs, supervised fine-tuning lifts 8B models to 77.1% accuracy in selecting the better idea, outperforming GPT-5 (61.1%). Reinforcement Learning with Verifiable Rewards trains the models to find latent reasoning paths and generate interpretable justifications, at 71.35% accuracy. Ablations and out-of-distribution tests indicate robustness to surface-level heuristics and transfer to a cross-domain time-split test set and an independently constructed test set.
What carries the argument
The comparative idea evaluation task on pairs grounded in PapersWithCode benchmark results, trained first with supervised fine-tuning and then with Reinforcement Learning with Verifiable Rewards to elicit reasoning.
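A minimal sketch of how such comparative pairs and a verifiable reward could be represented, assuming each idea carries one reported benchmark score where higher is better. The names `Idea`, `build_pairs`, and `verifiable_reward` are hypothetical illustrations; the paper's actual pipeline is not shown on this page.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Idea:
    name: str
    description: str
    score: float  # assumption: reported benchmark score, higher is better


def build_pairs(ideas):
    """Return (idea_a, idea_b, label) triples, where label is "A" or "B".

    Ties are skipped, so every label reflects strictly better performance.
    """
    pairs = []
    for a, b in combinations(ideas, 2):
        if a.score == b.score:
            continue  # no strict winner, so no usable label
        pairs.append((a, b, "A" if a.score > b.score else "B"))
    return pairs


def verifiable_reward(model_answer, label):
    """Binary reward: 1.0 if the model's choice matches the benchmark label."""
    return 1.0 if model_answer.strip().upper() == label else 0.0
```

Because the reward depends only on the recorded benchmark outcome, it can be checked without a learned judge, which is the property the "verifiable" in RLVR refers to.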
If this is right
- Small language models become effective objective verifiers for screening research ideas without running experiments.
- The training approach generalizes across domains and time periods in held-out test sets.
- Models can supply interpretable justifications for their forecasts of success.
- Compute-efficient models offer a scalable way to support autonomous filtering in scientific discovery pipelines.
Where Pith is reading between the lines
- If the approach works broadly, AI systems could use these verifiers to prioritize which generated ideas to test first in a discovery loop.
- Similar comparative training could be applied in other fields where historical experiment outcomes are available as labels.
- Future work might combine this evaluation with idea generation models to create closed-loop research automation.
Load-bearing premise
The 11,488 idea pairs from PapersWithCode supply unbiased labels that reflect true empirical success and extend to new ideas outside the dataset.
What would settle it
Running experiments on a fresh set of research ideas and finding that the model's predicted winners do not actually achieve higher benchmark scores than the predicted losers.
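One way such a check could be scored, as a sketch under stated assumptions: each trial records the measured benchmark score of the model's predicted winner and predicted loser, higher is better, and the hit rate is compared with coin-flip chance using an exact one-sided binomial tail. The function names and the choice of test are illustrative and do not come from the paper.

```python
from math import comb


def winner_hit_rate(trials):
    """Count trials where the predicted winner really scored higher.

    trials: list of (predicted_winner_score, predicted_loser_score).
    Exact ties are dropped because they decide nothing.
    """
    decided = [(w, l) for w, l in trials if w != l]
    hits = sum(1 for w, l in decided if w > l)
    return hits, len(decided)


def binomial_p_value(hits, n, p=0.5):
    """One-sided tail P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))
```

A hit rate near 50% with a large p-value on a fresh set of ideas would count against the forecasting claim; a rate close to the reported accuracy would support it.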
Original abstract
As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a method for training language models to predict which of two research ideas will perform better on a specific benchmark, using a dataset of 11,488 comparative idea pairs built from PapersWithCode. It reports that supervised fine-tuning raises an 8B model's accuracy from 30% to 77.1%, surpassing GPT-5's 61.1%, and that RLVR yields interpretable reasoning at 71.35% accuracy, below the SFT figure. The paper includes ablations and out-of-distribution evaluations to support the robustness of the approach.
Significance. If the central claims hold, this work could provide a valuable tool for filtering promising research ideas generated by AI systems, reducing the need for exhaustive experimentation and supporting more autonomous scientific discovery pipelines. The grounding in objective benchmark outcomes rather than subjective assessments is a notable strength, as is the exploration of both SFT and RLVR approaches with reported OOD generalization.
major comments (2)
- [Dataset construction] Dataset construction (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.
- [Results and OOD evaluation] Results and OOD evaluation sections: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.
minor comments (1)
- [Abstract] Abstract: 'GPT-5' is referenced without clarification; confirm whether this refers to a specific model version or if it is a placeholder.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating where we have revised the paper to better acknowledge limitations while defending the core contributions on substantive grounds.
Point-by-point responses
- Referee: [Dataset construction] Dataset construction (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.
  Authors: We appreciate the referee's concern about selection bias. However, every pair in our dataset consists of two ideas that both resulted in published papers with reported benchmark results; the label is determined solely by which idea achieved strictly superior performance on the target benchmark. The supervision signal is therefore relative empirical success between two viable, published ideas rather than a binary publishable/non-publishable distinction. This design reduces the risk that the model is merely learning generic publishability correlates. Our ablations further show that accuracy degrades when surface features such as idea length, lexical complexity, or publication year are explicitly controlled or masked, indicating the model exploits more substantive content. We have added a dedicated paragraph in the revised Limitations section discussing the scope of the published-literature distribution and the fact that time-split and cross-domain OOD tests remain within it. We do not claim the model would perform identically on entirely unpublished or failed ideas.
  Revision: partial
- Referee: [Results and OOD evaluation] Results and OOD evaluation sections: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.
  Authors: We agree that all reported accuracies and robustness claims are conditioned on ground-truth labels derived from published benchmark outcomes. The 77.1% SFT and 71.35% RLVR figures therefore reflect performance in forecasting relative success among ideas that reached the stage of public benchmark reporting. Our ablation studies were designed precisely to test whether the model relies on surface heuristics (e.g., temporal trends, verbosity) rather than idea content; performance remains substantially above chance even after these controls. In the revised manuscript we have tempered language in the Results and OOD sections to state that transfer is demonstrated to new domains and later time periods within the published literature, and we explicitly caution that extrapolation to ideas never submitted to benchmarks remains untested. These clarifications appear in both the main text and the new Limitations subsection.
  Revision: partial
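The surface-heuristic concern in the exchange above can be made concrete with two simple probes, sketched here under assumptions: pairs are `(text_a, text_b, label)` tuples with label `"A"` or `"B"`, and `predict` is any callable returning `"A"` or `"B"`. These helpers are hypothetical and are not the paper's ablation code.

```python
def heuristic_accuracy(pairs, pick):
    """Accuracy of a content-free rule `pick(a, b)` against the labels."""
    correct = sum(1 for a, b, label in pairs if pick(a, b) == label)
    return correct / len(pairs)


def pick_longer(a, b):
    """Length heuristic: always favor the longer idea description."""
    return "A" if len(a) >= len(b) else "B"


def swap_consistency(pairs, predict):
    """Fraction of pairs where the same idea wins after A and B are swapped.

    A position-insensitive predictor answers "A" then "B", or "B" then "A".
    """
    consistent = 0
    for a, b, _ in pairs:
        first, second = predict(a, b), predict(b, a)
        if (first, second) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(pairs)
```

If `pick_longer` scored close to the trained model, or if swap consistency were low, the reported accuracy would be easier to explain by length or position bias than by idea content.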
Circularity Check
No circularity in derivation or evaluation chain
The paper constructs an external dataset of 11,488 idea pairs whose labels derive from reported benchmark outcomes on PapersWithCode. It then applies standard supervised fine-tuning and RLVR (with verifiable rewards tied to those same external labels) and reports accuracy on held-out time-split, cross-domain, and independently constructed test sets. No equations, self-definitions, or self-citations are invoked to force the reported performance numbers; the metrics are measured directly against the independent ground-truth labels rather than against quantities defined by the model's own parameters. The methodology is therefore self-contained and externally benchmarked.
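A time-split evaluation of the kind mentioned above could be set up roughly as follows. This is a sketch, assuming each pair records a publication year for both ideas under the hypothetical keys `year_a` and `year_b`; the cutoff rule is an illustration, not the paper's documented procedure.

```python
def time_split(pairs, cutoff_year):
    """Split pairs so the test set only contains ideas from cutoff_year onward.

    Pairs that straddle the cutoff are dropped, so no test-period idea
    appears in training.
    """
    train = [p for p in pairs if max(p["year_a"], p["year_b"]) < cutoff_year]
    test = [p for p in pairs if min(p["year_a"], p["year_b"]) >= cutoff_year]
    return train, test
```

Dropping straddling pairs costs data but keeps the held-out period cleanly separated from anything the model was trained on.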
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Benchmark performance on PapersWithCode is an objective and generalizable proxy for research idea success.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode... SFT dramatically boosts performance to 77.1%... Reinforcement Learning with Verifiable Rewards (RLVR)"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Direct-SFT yields dramatic improvements. Qwen3 reaches 77.10% accuracy... robustness to stress tests on paraphrasing and recency, length and position bias"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.