LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
hub
Democratizing ai scientists using tooluniverse
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, novelty, and insight than generic LLMs.
AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
TSAssistant decomposes target safety assessment report generation into research and synthesis subagents with tool-based evidence retrieval, hierarchical instructions, and interactive human refinement, reporting high reproducibility and grounding.
Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.
LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.
Introduces a scalable AI skill framework for autonomous microkinetics discovery that automates workflows and evaluates surrogate reliability.
citing papers explorer
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.