The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Genome-bench: A scientific reasoning benchmark from real-world expert discussions.CoRR, abs/2505.19501
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
dataset 1polarities
use dataset 1representative citing papers
Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
citing papers explorer
-
How Post-Training Shapes Biological Reasoning Models
Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.