{"paper":{"title":"Learning to Predict Future-Aligned Research Proposals with Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Tuning language models on past research data improves their ability to forecast future-aligned research proposals.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Haofei Yu, Heng Ji, Heng Wang, Jiashuo Sun, Jiawei Han, Pengcheng Jiang, Zhiyi Shi","submitted_at":"2026-03-28T05:41:15Z","abstract_excerpt":"Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That semantic similarity between a generated proposal and future published papers, measured via retrieval and LLM-based scoring, serves as a valid proxy for the proposal's novelty, soundness, and overall quality.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs fine-tuned on time-sliced paper data generate proposals with up to 10.6% higher Future Alignment Score against actual later publications, with human experts and real implementations confirming gains.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Tuning language models on past research data improves their ability to forecast future-aligned research proposals.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"de301f198b833a235497909a7f175d4319ad3333875495cc360e3176c2cebfad"},"source":{"id":"2603.27146","kind":"arxiv","version":3},"verdict":{"id":"e11188c2-b7b8-4aa3-8bcf-881160bfcb3e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:58:13.114798Z","strongest_claim":"Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.","one_line_summary":"LLMs fine-tuned on time-sliced paper data generate proposals with up to 10.6% higher Future Alignment Score against actual later publications, with human experts and real implementations confirming gains.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That semantic similarity between a generated proposal and future published papers, measured via retrieval and LLM-based scoring, serves as a valid proxy for the proposal's novelty, soundness, and overall quality.","pith_extraction_headline":"Tuning language models on past research data improves their ability to forecast future-aligned research proposals."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2603.27146/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"cfdb198fe15a5424f1e4a066ca804f9019bba4e98a9853799a01ac1079ca3b37"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}