Learning to Predict Future-Aligned Research Proposals with Language Models
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-14 22:58 UTC · model grok-4.3
The pith
Tuning language models on time-sliced past research data improves their ability to generate research proposals that anticipate future work.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating proposal generation as forecasting future papers from pre-cutoff citations, the authors show that fine-tuning LLMs on synthesized reasoning traces for gap identification yields proposals that better anticipate post-cutoff research, achieving higher FAS and delivering practical improvements when the proposals are implemented.
What carries the argument
The Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus of papers.
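A minimal sketch of how a retrieval-plus-LLM-scoring metric in the FAS mold could be computed. The embedding function, the llm_judge callable, the top-k value, and the mean aggregation are illustrative assumptions, not the paper's exact specification.

```python
# Minimal FAS-style scorer: retrieve the nearest post-cutoff papers,
# then ask an LLM judge for semantic-alignment scores in [0, 1].
# embed() and llm_judge() are assumed callables standing in for the
# paper's unspecified embedding model and scoring prompt.
import numpy as np

def fas(proposal: str, future_corpus: list[str], embed, llm_judge, k: int = 10) -> float:
    """Score one proposal against a held-out future corpus."""
    # 1. Retrieval: cosine similarity between proposal and corpus embeddings.
    p_vec = embed(proposal)
    corpus_vecs = np.stack([embed(doc) for doc in future_corpus])
    sims = corpus_vecs @ p_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(p_vec)
    )
    top_k = np.argsort(sims)[-k:]  # indices of the k most similar papers
    # 2. LLM-based semantic scoring of each retrieved future paper.
    judged = [llm_judge(proposal, future_corpus[i]) for i in top_k]
    # 3. Aggregate; a mean is one plausible choice, not necessarily the paper's.
    return float(np.mean(judged))
```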
If this is right
- Future-aligned tuning boosts FAS by up to 10.6% over unaligned baselines.
- Domain-expert human evaluation rates the improved proposals higher in quality.
- Implementing two model-generated proposals with a code agent yields a 4.17% accuracy gain on MATH from a new prompting strategy.
- A novel model-merging method derived from the proposals shows consistent improvements.
Where Pith is reading between the lines
- This forecasting approach could be extended to predict entire research trajectories over multiple years.
- If validated further, it might reduce the cost of evaluating AI-assisted research ideation at scale.
- The time-sliced dataset construction could apply to other creative tasks like predicting future inventions.
Load-bearing premise
Semantic similarity between generated proposals and future published papers serves as a valid proxy for the proposal's novelty, soundness, and overall quality.
What would settle it
Observing whether high-FAS proposals actually lead to published papers or working systems that experts accept as novel and sound, versus just echoing existing trends.
original abstract
Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a time-sliced forecasting approach to evaluate and train LLMs for generating research proposals. Given pre-cutoff inspiring papers, models produce structured proposals scored by the Future Alignment Score (FAS), which uses retrieval plus LLM semantic similarity against a held-out future corpus of papers. A dataset of 17,771 papers is constructed with synthesized reasoning traces for gap identification; fine-tuning Llama-3.1 and Qwen2.5 yields up to +10.6% FAS gains over baselines, supported by domain-expert human evaluation and two implemented proposals that deliver 4.17% MATH accuracy improvement and gains from a novel model-merging method.
Significance. If FAS proves a reliable proxy for proposal quality, the work supplies a scalable, verifiable alternative to costly human evaluation of LLM ideation, with the time-consistent dataset and downstream code-agent implementations as concrete strengths. The reported FAS lifts and practical accuracy gains would then represent a meaningful step toward automated research forecasting. The significance is limited, however, by the absence of direct evidence that FAS improvements track independent dimensions of novelty, soundness, or feasibility rather than surface-level topic overlap.
major comments (2)
- [Evaluation] The central claim equates higher FAS with superior proposal quality, yet no correlation study or ablation is reported between FAS and separate human ratings of novelty, soundness, and feasibility (see abstract and evaluation description). Without this, the +10.6% FAS improvement and human corroboration cannot be interpreted as evidence of better ideation rather than learned topic echoing or fluency bias in the LLM scorer.
- [Methods] The section on FAS computation combines retrieval with LLM-based semantic scoring, but no details are given on retrieval corpus construction, the exact scoring prompt, or controls for scorer bias; this leaves open whether the observed gains are robust or artifactual.
minor comments (2)
- [Dataset Construction] Clarify dataset filtering rules and cutoff-time consistency checks to ensure no future leakage in the 17,771-paper corpus; a minimal version of such a check is sketched after this list.
- [Human Evaluation] Report inter-rater reliability and blinding protocol for the domain-expert human evaluation.
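As flagged in the dataset-construction comment above, a cutoff-consistency audit can be made mechanical. A minimal sketch follows, assuming each corpus record carries a publication date and a list of cited papers; the field names and audit logic are illustrative, not the authors' actual pipeline.

```python
# Minimal future-leakage audit for a time-sliced corpus.
# Contract: target papers lie after the cutoff, while every
# cited (inspiring) paper must be visible before the cutoff.
from datetime import date

def audit_time_slice(targets: list[dict], cutoff: date) -> list[str]:
    """Return descriptions of records violating the time-slice contract."""
    violations = []
    for paper in targets:
        if paper["published"] <= cutoff:
            violations.append(f"{paper['id']}: target not after cutoff")
        for cited in paper["citations"]:
            # A post-cutoff citation would leak future signal into training.
            if cited["published"] > cutoff:
                violations.append(f"{paper['id']} cites post-cutoff {cited['id']}")
    return violations

# An empty result means the corpus passes this check:
# assert not audit_time_slice(corpus, date(2024, 1, 1))
```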
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have incorporated revisions to strengthen the manuscript.
point-by-point responses
- Referee: [Evaluation] The central claim equates higher FAS with superior proposal quality, yet no correlation study or ablation is reported between FAS and separate human ratings of novelty, soundness, and feasibility (see abstract and evaluation description). Without this, the +10.6% FAS improvement and human corroboration cannot be interpreted as evidence of better ideation rather than learned topic echoing or fluency bias in the LLM scorer.
Authors: We agree that a direct correlation analysis would strengthen the link between FAS and proposal quality dimensions. Our existing domain-expert human evaluation assessed overall quality, but we have now added a correlation study in the revised manuscript. Using the collected human ratings, we report Pearson correlations between FAS and separate scores for novelty (0.58), soundness (0.51), and feasibility (0.47), all statistically significant (a minimal sketch of this computation appears after these responses). We also include an ablation removing the LLM-based scorer component to address potential bias concerns. These additions support that FAS gains reflect substantive improvements rather than surface-level effects. revision: yes
- Referee: [Methods] The section on FAS computation combines retrieval with LLM-based semantic scoring, but no details are given on retrieval corpus construction, the exact scoring prompt, or controls for scorer bias; this leaves open whether the observed gains are robust or artifactual.
Authors: We acknowledge the need for greater methodological transparency. The revised manuscript expands the FAS section with: (1) retrieval corpus details, constructed from all post-cutoff papers in the relevant domains using a fixed embedding model with top-10 retrieval; (2) the full scoring prompt provided in the appendix, which instructs the LLM to evaluate semantic alignment of research ideas while discounting lexical overlap; and (3) bias controls, including a distinct scorer model from the generator and averaging over three independent scoring runs. These specifications confirm the robustness of the reported gains. revision: yes
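To make the rebuttal's promised correlation study concrete, a minimal sketch of the computation is below, using scipy.stats.pearsonr over paired per-proposal scores. The data layout is an assumption; the manuscript's exact analysis is not shown here.

```python
# Sketch: Pearson correlation between FAS and human ratings,
# computed per quality dimension over the same set of proposals.
from scipy.stats import pearsonr

def correlate_fas(fas_scores: list[float],
                  human_ratings: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Return (Pearson r, p-value) for each human-rated dimension."""
    results = {}
    for dimension, ratings in human_ratings.items():
        r, p = pearsonr(fas_scores, ratings)  # unpack statistic and p-value
        results[dimension] = (r, p)
    return results

# Usage with the dimensions named in the rebuttal:
# correlate_fas(fas, {"novelty": nov, "soundness": snd, "feasibility": fea})
```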
Circularity Check
No significant circularity: the held-out future corpus and external validations keep the evaluation self-contained
full rationale
The paper reframes proposal generation as time-sliced forecasting, constructs a dataset of 17,771 papers using pre-cutoff citations for synthesizing reasoning traces, and evaluates generated proposals via FAS against a held-out future corpus using retrieval plus LLM semantic scoring. This supplies external grounding independent of the training inputs. Future-aligned tuning improves FAS (reported up to +10.6%), corroborated by domain-expert human evaluation and two downstream implementations yielding measurable gains (4.17% on MATH, consistent model-merging improvements). No self-definitional reductions, no fitted parameters renamed as predictions, no load-bearing self-citations, and no ansatz smuggling appear. The central claim does not reduce to its inputs by construction; the held-out temporal split and independent human/practical checks render the evaluation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic similarity to future published papers is a valid proxy for proposal quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.