MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
hub Mixed citations
arXiv preprint arXiv:2504.06196 (2025)
Mixed citation behavior. Most common role is background (57%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real expression data.
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.
Boltz-2 and fine-tuned DrugFormDTA lead ML-based binding prediction while GNINA leads docking tools on a cleaned antiviral dataset, with performance varying by viral protein.
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
ToxiEval-ZKP applies zero-knowledge proofs to enable private verification that generative AI molecules meet multidimensional toxicity repair criteria.
Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.
citing papers explorer
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
-
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.
-
OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real expression data.
-
The limits of bio-molecular modeling with large language models : a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
-
An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES
HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.
-
Benchmarking open-source tools for in silico antiviral drug discovery
Boltz-2 and fine-tuned DrugFormDTA lead ML-based binding prediction while GNINA leads docking tools on a cleaned antiviral dataset, with performance varying by viral protein.
-
Bolek: A Multimodal Language Model for Molecular Reasoning
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
-
ToxiEval-ZKP: A Structure-Private Verification Framework for Molecular Toxicity Repair Tasks
ToxiEval-ZKP applies zero-knowledge proofs to enable private verification that generative AI molecules meet multidimensional toxicity repair criteria.
-
From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.