hub Canonical reference

Autonomous chemical research with large language models

Daniil A. Boiko, Robert MacKnight, Ben Kline, Gabe Gomes · 2023 · Nature · DOI 10.1038/s41586-023-06792-0

Canonical reference. 88% of citing Pith papers cite this work as background.

16 Pith papers citing it

734 external citations · Crossref

Background 88% of classified citations

open at publisher browse 16 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 7 dataset 1

citation-polarity summary

background 7 use dataset 1

representative citing papers

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

cs.AI · 2024-07-14 · accept · novelty 8.0

LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0 · 2 refs

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

quant-ph · 2026-04-28 · unverdicted · novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

Optimizing ground state preparation protocols with autoresearch

quant-ph · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.

Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.

MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

cond-mat.mtrl-sci · 2026-04-03 · conditional · novelty 7.0 · 2 refs

MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.

Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

physics.app-ph · 2026-02-24 · unverdicted · novelty 7.0

Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger general models.

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

cs.MA · 2026-05-21 · unverdicted · novelty 5.0

Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.

Bolek: A Multimodal Language Model for Molecular Reasoning

cs.LG · 2026-05-04 · unverdicted · novelty 5.0

Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.

From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

cs.CE · 2026-04-08 · unverdicted · novelty 5.0

A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

cs.DL · 2025-07-16 · unverdicted · novelty 4.0

The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.

ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents

cs.SE · 2026-03-10

citing papers explorer

Showing 16 of 16 citing papers.

Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 19
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
LAB-Bench: Measuring Capabilities of Language Models for Biology Research cs.AI · 2024-07-14 · accept · none · ref 5
LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 1 · 2 links
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding quant-ph · 2026-04-28 · unverdicted · none · ref 43
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
Optimizing ground state preparation protocols with autoresearch quant-ph · 2026-04-28 · unverdicted · none · ref 14 · 2 links
AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.
Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents cs.AI · 2026-04-10 · unverdicted · none · ref 6
CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.
MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration cond-mat.mtrl-sci · 2026-04-03 · conditional · none · ref 2 · 2 links
MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.
Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation physics.app-ph · 2026-02-24 · unverdicted · none · ref 13
Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger general models.
ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing cs.AI · 2026-05-15 · unverdicted · none · ref 52
ColPackAgent integrates a custom colpack Python package wrapping HOOMD-blue with MCP tools and an agent skill to enable reliable autonomous workflows for colloidal packing simulations across interactive, prompt-driven, and autoresearch modes.
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design cs.LG · 2026-05-14 · unverdicted · none · ref 2
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale cs.LG · 2026-05-07 · unverdicted · none · ref 12 · 2 links
An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 6
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
Bolek: A Multimodal Language Model for Molecular Reasoning cs.LG · 2026-05-04 · unverdicted · none · ref 7
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach cs.CE · 2026-04-08 · unverdicted · none · ref 24
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator cs.DL · 2025-07-16 · unverdicted · none · ref 11
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents cs.SE · 2026-03-10 · unreviewed · ref 15

Autonomous chemical research with large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer