The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Genome-bench: A scientific reasoning benchmark from real-world expert discussions. CoRR, abs/2505.19501
3 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: dataset 1
Citation-polarity summary
Verdicts: UNVERDICTED 3
Roles: dataset 1
Polarities: use dataset 1
Representative citing papers
Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
citing papers explorer
Evaluating Large Language Models in Scientific Discovery