MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
TxGemma: Efficient and agentic LLMs for therapeutics
Mixed citation behavior. Most common role is background (57%).
representative citing papers
VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real expression data.
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.
Boltz-2 and fine-tuned DrugFormDTA lead ML-based binding prediction while GNINA leads docking tools on a cleaned antiviral dataset, with performance varying by viral protein.
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
ToxiEval-ZKP applies zero-knowledge proofs to enable private verification that generative AI molecules meet multidimensional toxicity repair criteria.
Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.
citing papers explorer
- VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design