GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 11roles
background 2polarities
background 2representative citing papers
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortion-compression trade-off.
SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
Small LMs reach 77.1% accuracy at comparative forecasting of research idea success on benchmarks after supervised fine-tuning, with RLVR yielding interpretable reasoning at 71.35%.
LLMs generate kitsch due to their training process, causing outputs to be perceived as kitschier than human-created works in controlled reader studies.
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
citing papers explorer
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
Measuring the Gap Between Human and LLM Research Ideas
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
-
Can AI Agents Synthesize Scientific Conclusions?
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
-
Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortion-compression trade-off.
-
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
-
Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
Small LMs reach 77.1% accuracy at comparative forecasting of research idea success on benchmarks after supervised fine-tuning, with RLVR yielding interpretable reasoning at 71.35%.
-
LLMs Generate Kitsch
LLMs generate kitsch due to their training process, causing outputs to be perceived as kitschier than human-created works in controlled reader studies.
-
AI for Auto-Research: Roadmap & User Guide
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.