Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Aniketh Garikaparthi; Manasi Patwardhan; Srujan P Mule

arxiv: 2605.21491 · v1 · pith:DC67C3DAnew · submitted 2026-04-06 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-22 01:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords language models · research ideas · success forecasting · comparative evaluation · supervised fine-tuning · reinforcement learning · benchmark performance · scientific discovery

The pith

Fine-tuned language models predict which research idea will succeed on benchmarks with 77 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks if language models can forecast the success of research ideas by comparing pairs of them against known benchmark results. The authors build a dataset of 11,488 such pairs from PapersWithCode and find that base models guess correctly only about 30 percent of the time. After supervised fine-tuning, an 8B model reaches 77.1 percent accuracy, which is higher than GPT-5 at 61.1 percent. They also train models with reinforcement learning to produce step-by-step reasoning for their choices. The results hold up on tests that check for reliance on superficial patterns and on data from different domains or later times.

Core claim

The central claim is that language models can be trained to forecast empirical success of research ideas through comparative evaluation of idea pairs drawn from objective benchmark outcomes. Supervised fine-tuning on 11,488 pairs allows 8B models to achieve 77.1% accuracy in selecting the better idea, outperforming GPT-5. Reinforcement Learning with Verifiable Rewards enables the models to find latent reasoning paths and generate interpretable justifications at 71.35% accuracy. Ablations confirm the models resist surface heuristics and the performance transfers to cross-domain and time-split test sets.
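The comparative task in the core claim can be illustrated with a small sketch. It assumes a leaderboard is a list of ideas with reported scores where higher is better, and that ties are skipped because the label needs a strictly better idea. The names `Entry`, `Pair`, `build_pairs`, and `pairwise_accuracy` are hypothetical, and the toy rows are invented for illustration; this is not the paper's pipeline.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass(frozen=True)
class Entry:
    """One leaderboard row: an idea description and its reported score."""
    idea: str
    score: float  # assumption: higher is better on this benchmark


@dataclass(frozen=True)
class Pair:
    goal: str
    idea_a: str
    idea_b: str
    label: str  # "A" or "B": the idea with the strictly higher score


def build_pairs(goal: str, entries: List[Entry]) -> List[Pair]:
    """Turn every two entries with different scores into a labeled pair."""
    pairs = []
    for first, second in combinations(entries, 2):
        if first.score == second.score:
            continue  # ties carry no preference signal, so skip them
        label = "A" if first.score > second.score else "B"
        pairs.append(Pair(goal, first.idea, second.idea, label))
    return pairs


def pairwise_accuracy(pairs: List[Pair], predict: Callable[[Pair], str]) -> float:
    """Share of pairs where the predictor names the higher-scoring idea."""
    if not pairs:
        raise ValueError("no pairs to score")
    correct = sum(predict(pair) == pair.label for pair in pairs)
    return correct / len(pairs)


# Invented toy leaderboard, listed from best to worst score.
entries = [
    Entry("retrieval-augmented encoder", 81.2),
    Entry("plain fine-tuned encoder", 78.9),
    Entry("bag-of-words baseline", 64.0),
]
pairs = build_pairs("toy sentiment benchmark", entries)
always_a = lambda pair: "A"  # a position-only predictor, for illustration
print(len(pairs), pairwise_accuracy(pairs, always_a))
```

Because the toy rows are sorted by score, a predictor that always answers "A" looks perfect here, which is one reason position in the prompt is worth randomizing before any accuracy is reported.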

What carries the argument

The argument rests on the comparative idea evaluation task: idea pairs are grounded in PapersWithCode benchmark results, and models are trained first with supervised fine-tuning and then with Reinforcement Learning with Verifiable Rewards to elicit reasoning.
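A verifiable reward of the kind RLVR relies on can be sketched as a function that checks the model's final choice against the benchmark-derived label. The `<answer>…</answer>` tag format and the function name `verifiable_reward` are assumptions made for this sketch; the paper's actual output format and reward shaping are not given on this page.

```python
import re

# Hypothetical output format: the final choice is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*(idea_[AB])\s*</answer>")


def verifiable_reward(completion: str, gold: str) -> float:
    """Return 1.0 only when exactly one well-formed answer matches the gold label."""
    answers = ANSWER_PATTERN.findall(completion)
    if len(answers) != 1:
        return 0.0  # a missing or repeated answer earns nothing
    return 1.0 if answers[0] == gold else 0.0


sample = "idea_A reuses pretrained features, so it should win. <answer>idea_A</answer>"
print(verifiable_reward(sample, "idea_A"))
```

Rejecting completions with several answer tags is a design choice in this sketch: it closes off the simple exploit of naming both ideas and collecting the reward either way.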

If this is right

Small language models become effective objective verifiers for screening research ideas without running experiments.
The training approach generalizes across domains and time periods in held-out test sets.
Models can supply interpretable justifications for their forecasts of success.
Compute-efficient models offer a scalable way to support autonomous filtering in scientific discovery pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach works broadly, AI systems could use these verifiers to prioritize which generated ideas to test first in a discovery loop.
Similar comparative training could be applied in other fields where historical experiment outcomes are available as labels.
Future work might combine this evaluation with idea generation models to create closed-loop research automation.

Load-bearing premise

The 11,488 idea pairs from PapersWithCode supply unbiased labels that reflect true empirical success and extend to new ideas outside the dataset.

What would settle it

Running experiments on a fresh set of research ideas and finding that the model's predicted winners do not actually achieve higher benchmark scores than the predicted losers.
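The settling experiment described above reduces to a simple count once fresh ideas have actually been run. The sketch below assumes each record holds the measured score of the model's predicted winner and of its predicted loser, and it compares the hit count with a coin-flip null of 0.5; both the record layout and the choice of null are assumptions, and the scores are invented.

```python
from math import comb
from typing import List, Tuple


def winner_agreement(results: List[Tuple[float, float]]) -> Tuple[int, int]:
    """Count pairs where the predicted winner's measured score is strictly higher.

    Each tuple is (score_of_predicted_winner, score_of_predicted_loser).
    Ties are dropped because they support neither outcome.
    """
    decided = [(win, lose) for win, lose in results if win != lose]
    hits = sum(win > lose for win, lose in decided)
    return hits, len(decided)


def binomial_tail(hits: int, n: int, p: float = 0.5) -> float:
    """Exact probability of at least `hits` successes in `n` trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


# Invented measurements for five fresh idea pairs (one is a tie).
measured = [(71.0, 69.5), (55.2, 58.1), (90.3, 88.0), (62.0, 62.0), (47.9, 44.1)]
hits, n = winner_agreement(measured)
print(hits, n, binomial_tail(hits, n))
```

With so few pairs the tail probability stays large even when most predictions are right, so a real version of this check would need many more fresh ideas than the toy list shows.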

Figures

Figures reproduced from arXiv: 2605.21491 by Aniketh Garikaparthi, Manasi Patwardhan, Srujan P Mule.

**Figure 2.** Dataset Construction Pipeline. We use raw entries from 1,918 NLP leaderboards to construct statistically grounded idea pairs with a benchmark-specific research goal, while difficulty stratification ensures robust evaluation across diverse research goals.

… errors are discarded, resulting in 5,695 RR and 832 Original markdown papers. Research Goal and Idea Extraction. For each one of the 1,918 leaderboards, we…

**Figure 3.** Differential Analysis (∆ based) with Bootstrap statistical significance tests across different Difficulty Subsets (σ) and Overall Performance. ∗∗: p < 0.01; ∗: p < 0.05.

Reason-SFT-DAPO and Synthetic-Reason-SFT-DAPO produce consistent and coherent reasoning traces prior to the final answer, while being resilient to the form of reward hacking seen in Reason-DAPO, showcasing that it is possible to induce i…

**Figure 4.** Consistency (%) across different stages and training paradigms of the Qwen3 model.

7.4 Robustness analysis. We test the robustness of the trained models for some features they might be exploiting: (i) Length: categorize the idea pairs based on cases where the longer idea is better and otherwise, (ii) Recency: categorize the idea pairs based on cases where the newer idea (published later) is better or worse, (iii) …

**Figure 6.** Distribution of the ideas/methods across the …

**Figure 7.** Distribution of the ideas/methods across the …

**Figure 8.** The average rewards through the training …

**Figure 10.** Overall Accuracy (%) vs. Mean Number of tokens generated during reasoning.

…els across different stress tests and difficulty levels of idea pair comparison. B.7.1 Difficulty vs. Length Sensitivity. • Reason-DAPO: There is a strong inverse correlation between task difficulty and length sensitivity. As the difficulty increases (moving from 3-σ to 1-σ), the performance gap between “Longer” and “Shorter” inpu…

**Figure 11.** Distribution of consistency rate (%) across different research goals/leaderboards for the cross-domain test set.

**Figure 12.** Distribution of RMSE across different research goals/leaderboards for the cross-domain test set.

…2 different ranks predicted. This might still not be enough since we see high Top-1 accuracy for base models when they actually don’t do very well. • Reason-SFT-DAPO and Reason-SFT-DrGRPO achieve better RMSE than GPT-5 (across all reasoning efforts) on CD test, hence showing the potential of such models in fil…

**Figure 13.** Distribution of consistency rate (%) across different research goals/leaderboards for the in-domain test set.

**Figure 14.** Distribution of RMSE across different research goals/leaderboards for the in-domain test set.

| Family | Model | DB ↓ | DE ↓ | DM ↓ |
|---|---|---|---|---|
| Qwen | Base | 0.2507 | 0.0937 | 0.2832 |
| Qwen | Base (Reason) | 0.2635 | 0.1392 | 0.3276 |
| Qwen | Direct-SFT | 0.1676 | 0.1390 | 0.1915 |
| Qwen | Reason-DAPO | 0.1755 | 0.1584 | 0.5475 |
| Qwen | Reason-SFT-DAPO | 0.1941 | 0.1688 | 0.3112 |
| Qwen | Reason-SFT-DrGRPO | 0.1970 | 0.1734 | 0.3641 |
| Qwen | Synthetic-Reason-SFT-DAPO | 0.2186 | 0.2009 | 0.4320 |
| Llama | Base | 0.2685 | 0.1552 | 0.3275 |
| | B… | | | |
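The Figure 3 caption mentions bootstrap significance tests on accuracy differences. One common way to run such a test is a paired bootstrap over the shared test pairs, sketched below. Whether the paper uses this exact variant is not stated on this page, and the function name `paired_bootstrap_p`, the resample count, and the seed are assumptions.

```python
import random
from typing import List


def paired_bootstrap_p(correct_a: List[int], correct_b: List[int],
                       n_resamples: int = 2000, seed: int = 0) -> float:
    """Share of resamples in which model A is not more accurate than model B.

    Inputs are 0/1 correctness flags on the same test pairs, in the same order.
    A small value suggests A's accuracy advantage is stable under resampling.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equally long, non-empty flag lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample test pairs with replacement, keeping A and B flags paired.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:  # the resampled accuracy gap is zero or negative
            not_better += 1
    return not_better / n_resamples


# Invented correctness flags for two models on eight shared pairs.
flags_a = [1, 1, 1, 0, 1, 1, 0, 1]
flags_b = [0, 1, 0, 0, 1, 0, 0, 1]
p_value = paired_bootstrap_p(flags_a, flags_b)
print(p_value)
```

Resampling the per-pair differences, rather than each model's flags separately, keeps the pairing intact, which matters because both models are scored on the same idea pairs.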

Original abstract

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Small LMs trained on PapersWithCode idea pairs reach 77% accuracy at forecasting which idea wins on benchmarks, but the labels likely carry publication bias that the current tests don't fully rule out.

The key takeaway is that fine-tuning an 8B model on comparative idea pairs from PapersWithCode lets it predict the better-performing idea 77% of the time, beating GPT-5, and that the RLVR variant reaches 71% while attaching a justification to each choice. What stands out is the construction of over 11,000 pairs where labels come directly from reported benchmark results rather than human judgments or model self-scores. The time-split and cross-domain tests, along with checks against surface heuristics, suggest the model isn't just latching onto obvious patterns in the training data.

The main limitation is in the data source. PapersWithCode only includes published work, so the winning ideas are those that made it through review and reporting. This could mean the model learns to spot publishable-looking ideas instead of ones that would truly win in a fair test. The out-of-distribution tests stay within similar published distributions, so they don't fully address whether it generalizes to raw, unfiltered new ideas.

The methods seem solid enough on the surface with the ablations mentioned, but without seeing the exact pair construction and error breakdowns it's hard to be sure there isn't some leakage or selection effect inflating the numbers.

This kind of work is useful for teams trying to scale up AI-driven research idea generation and filtering. Readers interested in practical verifiers for hypotheses or in RL for reasoning will get concrete numbers and a clear task definition to build on.

I think it deserves peer review. The experiment is well-posed and the results are reported with enough controls to make discussion worthwhile, even if the bias question will need more attention in revisions.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a method for training language models to predict which of two research ideas will perform better on a specific benchmark by constructing a dataset of 11,488 comparative idea pairs from PapersWithCode. It demonstrates that supervised fine-tuning improves an 8B model's accuracy from 30% to 77.1%, surpassing GPT-5's 61.1%, and that RLVR further enables interpretable reasoning at 71.35% accuracy. The paper includes ablations and out-of-distribution evaluations to support the robustness of the approach.

Significance. If the central claims hold, this work could provide a valuable tool for filtering promising research ideas generated by AI systems, reducing the need for exhaustive experimentation and supporting more autonomous scientific discovery pipelines. The grounding in objective benchmark outcomes rather than subjective assessments is a notable strength, as is the exploration of both SFT and RLVR approaches with reported OOD generalization.

major comments (2)

[Dataset construction] Dataset construction (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.
[Results and OOD evaluation] Results and OOD evaluation sections: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.
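One way to probe the concern about temporal trends and complexity signals is to score content-blind heuristics on the same labeled pairs and compare them with the trained model. The sketch below assumes each pair carries the two idea texts, their publication years, and the benchmark-derived label. `LabeledPair`, `pick_newer`, `pick_longer`, and `heuristic_report` are hypothetical names, and the toy pairs are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LabeledPair:
    text_a: str
    text_b: str
    year_a: int
    year_b: int
    label: str  # "A" or "B", taken from reported benchmark results


def pick_newer(pair: LabeledPair) -> str:
    """Surface heuristic: prefer the later-published idea (ties go to A)."""
    return "B" if pair.year_b > pair.year_a else "A"


def pick_longer(pair: LabeledPair) -> str:
    """Surface heuristic: prefer the longer description (ties go to A)."""
    return "B" if len(pair.text_b) > len(pair.text_a) else "A"


def heuristic_report(pairs: List[LabeledPair]) -> Dict[str, float]:
    """Accuracy of each content-blind heuristic on the labeled pairs."""
    if not pairs:
        raise ValueError("no pairs to score")
    rules: Dict[str, Callable[[LabeledPair], str]] = {
        "newer": pick_newer,
        "longer": pick_longer,
    }
    return {
        name: sum(rule(pair) == pair.label for pair in pairs) / len(pairs)
        for name, rule in rules.items()
    }


# Invented toy pairs; real pairs would come from leaderboard entries.
toy_pairs = [
    LabeledPair("short idea", "a much longer idea description", 2019, 2021, "B"),
    LabeledPair("an elaborate multi-stage method", "tiny tweak", 2020, 2022, "B"),
    LabeledPair("compact approach", "verbose but weaker approach text", 2022, 2020, "A"),
    LabeledPair("older idea that still wins here", "newer", 2018, 2023, "A"),
]
report = heuristic_report(toy_pairs)
print(report)
```

If a heuristic such as "pick the newer idea" came close to the fine-tuned model's accuracy on real data, that would support the referee's worry; a large gap would weaken it.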

minor comments (1)

[Abstract] Abstract: 'GPT-5' is referenced without clarification; confirm whether this refers to a specific model version or if it is a placeholder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating where we have revised the paper to better acknowledge limitations while defending the core contributions on substantive grounds.

Referee: [Dataset construction] (implied in methods and abstract): The 11,488 idea pairs are derived exclusively from reported results of published papers on PapersWithCode. This introduces systematic selection bias, as only completed and accepted work appears in the source, under-sampling failed ideas. High accuracy may therefore reflect learning of publishability correlates (e.g., complexity signals or temporal trends) rather than intrinsic forecasting of empirical success. The time-split and cross-domain OOD tests remain inside the same publication-filtered distribution and do not rule out this bias.

Authors: We appreciate the referee's concern about selection bias. However, every pair in our dataset consists of two ideas that both resulted in published papers with reported benchmark results; the label is determined solely by which idea achieved strictly superior performance on the target benchmark. The supervision signal is therefore relative empirical success between two viable, published ideas rather than a binary publishable/non-publishable distinction. This design reduces the risk that the model is merely learning generic publishability correlates. Our ablations further show that accuracy remains well above chance when surface features such as idea length, lexical complexity, or publication year are explicitly controlled or masked, indicating the model exploits more substantive content. We have added a dedicated paragraph in the revised Limitations section discussing the scope of the published-literature distribution and the fact that time-split and cross-domain OOD tests remain within it. We do not claim the model would perform identically on entirely unpublished or failed ideas.

Revision: partial

Referee: [Results and OOD evaluation]: The claim of robustness to surface-level heuristics and transfer to new domains relies on the assumption that the benchmark outcomes provide unbiased ground truth. If publication bias is present, the reported 77.1% SFT accuracy and 71.35% RLVR accuracy may not generalize to truly novel or unpublished ideas outside the PapersWithCode distribution.

Authors: We agree that all reported accuracies and robustness claims are conditioned on ground-truth labels derived from published benchmark outcomes. The 77.1% SFT and 71.35% RLVR figures therefore reflect performance in forecasting relative success among ideas that reached the stage of public benchmark reporting. Our ablation studies were designed precisely to test whether the model relies on surface heuristics (e.g., temporal trends, verbosity) rather than idea content; performance remains substantially above chance even after these controls. In the revised manuscript we have tempered language in the Results and OOD sections to state that transfer is demonstrated to new domains and later time periods within the published literature, and we explicitly caution that extrapolation to ideas never submitted to benchmarks remains untested. These clarifications appear in both the main text and the new Limitations subsection.

Revision: partial
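The length control discussed in the responses can be made concrete by splitting accuracy into the subset where the longer idea is the true winner and the subset where it is not, as the robustness analysis excerpt under Figure 4 describes. The tuple layout and the names `longer_idea_wins` and `accuracy_by_subset` are assumptions for this sketch, and the toy data is invented.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str, str]  # (idea_a_text, idea_b_text, gold_label)


def longer_idea_wins(pair: Pair) -> str:
    """Name the subset a pair belongs to for the length stress test."""
    text_a, text_b, gold = pair
    longer = "A" if len(text_a) >= len(text_b) else "B"
    return "longer_better" if longer == gold else "shorter_better"


def accuracy_by_subset(pairs: List[Pair], predictions: List[str],
                       subset_of: Callable[[Pair], str]) -> Dict[str, float]:
    """Accuracy computed separately inside each named subset."""
    if len(pairs) != len(predictions):
        raise ValueError("one prediction per pair is required")
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for pair, predicted in zip(pairs, predictions):
        name = subset_of(pair)
        totals[name] += 1
        hits[name] += int(predicted == pair[2])
    return {name: hits[name] / totals[name] for name in totals}


# Invented toy pairs and model predictions.
toy = [("aaaa", "bb", "A"), ("aa", "bbbb", "B"), ("aaaa", "bb", "B"), ("a", "bbb", "A")]
preds = ["A", "B", "A", "A"]
by_length = accuracy_by_subset(toy, preds, longer_idea_wins)
print(by_length)
```

A model that only followed length would score near 100% on `longer_better` and near 0% on `shorter_better`, so a small gap between the two subsets is what "resisting the length heuristic" would look like in this breakdown.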

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper constructs an external dataset of 11,488 idea pairs whose labels derive from reported benchmark outcomes on PapersWithCode. It then applies standard supervised fine-tuning and RLVR (with verifiable rewards tied to those same external labels) and reports accuracy on held-out time-split, cross-domain, and independently constructed test sets. No equations, self-definitions, or self-citations are invoked to force the reported performance numbers; the metrics are measured directly against the independent ground-truth labels rather than against quantities defined by the model's own parameters. The methodology is therefore self-contained and externally benchmarked.
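The held-out time split mentioned in the rationale can be sketched as a cutoff rule on publication dates. The version below trains only on pairs whose ideas both predate the cutoff, tests only on pairs whose ideas both fall on or after it, and drops pairs that straddle it. The paper's actual split rule is not given on this page, so the rule, `DatedPair`, and `time_split` are assumptions, and the dates are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class DatedPair:
    pair_id: str
    published_a: date
    published_b: date


def time_split(pairs: List[DatedPair], cutoff: date) -> Tuple[List[DatedPair], List[DatedPair]]:
    """Split pairs so that no test idea was published before the cutoff."""
    train, test = [], []
    for pair in pairs:
        oldest = min(pair.published_a, pair.published_b)
        newest = max(pair.published_a, pair.published_b)
        if newest < cutoff:
            train.append(pair)  # both ideas predate the cutoff
        elif oldest >= cutoff:
            test.append(pair)  # both ideas are on or after the cutoff
        # Pairs that straddle the cutoff are dropped to avoid leakage.
    return train, test


# Invented publication dates for three pairs.
dated = [
    DatedPair("p1", date(2021, 3, 1), date(2022, 6, 1)),
    DatedPair("p2", date(2023, 5, 1), date(2024, 2, 1)),
    DatedPair("p3", date(2022, 11, 1), date(2023, 8, 1)),
]
train_pairs, test_pairs = time_split(dated, date(2023, 1, 1))
print([p.pair_id for p in train_pairs], [p.pair_id for p in test_pairs])
```

Dropping straddling pairs costs data but keeps any idea seen in training out of the test set, which is the property a time-split evaluation is meant to guarantee.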

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that historical benchmark scores constitute unbiased labels for future idea quality; no free parameters or invented entities are declared in the abstract.

axioms (1)

domain assumption: Benchmark performance on PapersWithCode is an objective and generalizable proxy for research idea success. Used to create the 11,488 labeled pairs that serve as training and evaluation targets.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode... SFT dramatically boosts performance to 77.1%... Reinforcement Learning with Verifiable Rewards (RLVR)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Direct-SFT yields dramatic improvements. Qwen3 reaches 77.10% accuracy... robustness to stress tests on paraphrasing and recency, length and position bias"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

