Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
hub Mixed citations
Chemberta: Large-scale self- supervised pretraining for molecular property prediction
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
Drug-blind cancer sensitivity prediction is limited by evaluation metric and training distribution rather than drug representation complexity.
PolyLM fine-tunes a 9B-parameter LLM on 185k papers to predict polymer properties from text alone, achieving median R² of 0.74 on 68k held-out samples.
Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.
NOSE aligns molecular, receptor, and linguistic modalities in a shared embedding space via tri-modal orthogonal contrastive learning and weak positive samples, achieving SOTA performance and zero-shot generalization on olfactory tasks.
FlexMS is a new flexible benchmarking framework that lets researchers dynamically combine deep learning architectures and evaluate their mass spectrum prediction performance on public metabolomics datasets using multiple metrics and retrieval tasks.
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
SmellNet supplies 828k gas-sensor time series across 50 substances plus 43 mixtures; ScentFormer reaches 63.3% top-1 accuracy on classification and 50.2% top-1@0.1 on mixture prediction.
ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
SPADE selects ligands more efficiently than deep learning or Bayesian optimization, needing fewer tests on average to identify high-quality drug candidates for novel proteins.
A benchmark across 156 comparisons finds classical ML models win 116 times while larger pretrained and LLM models win far fewer, showing predictive performance depends on model-task fit rather than scale.
Active learning for chemical reaction extraction frequently produces non-monotonic learning curves and fails to deliver stable gains over random sampling because of strong pretraining, structured CRF decoding, and label sparsity.
Lit2Vec delivers a documented, reproducible pipeline that extracts and annotates a large licensed chemistry paper corpus from S2ORC with paragraph embeddings and subfield labels.
Fine-tuned LLaMA 3 achieves regression performance on QM9 molecular properties and 28 materials properties from composition strings that rivals random forests but is 5-10x worse than specialized models using atomic coordinates.
A newly built database of proton-conductive MOFs supports descriptor and transformer machine learning models that predict conductivity with MAE 0.91.
citing papers explorer
-
Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
-
FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
-
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
-
Training distribution determines the ceiling of drug-blind cancer sensitivity prediction
Drug-blind cancer sensitivity prediction is limited by evaluation metric and training distribution rather than drug representation complexity.
-
Can LLMs Predict Polymer Physics Just by Reading Synthesis and Processing Prose?
PolyLM fine-tunes a 9B-parameter LLM on 185k papers to predict polymer properties from text alone, achieving median R² of 0.74 on 68k held-out samples.
-
Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces
Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.
-
NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning
NOSE aligns molecular, receptor, and linguistic modalities in a shared embedding space via tri-modal orthogonal contrastive learning and weak positive samples, achieving SOTA performance and zero-shot generalization on olfactory tasks.
-
FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
FlexMS is a new flexible benchmarking framework that lets researchers dynamically combine deep learning architectures and evaluate their mass spectrum prediction performance on public metabolomics datasets using multiple metrics and retrieval tasks.
-
Foundation Models for Discovery and Exploration in Chemical Space
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
-
SmellNet: A Large-scale Dataset for Real-world Smell Recognition
SmellNet supplies 828k gas-sensor time series across 50 substances plus 43 mixtures; ScentFormer reaches 63.3% top-1 accuracy on classification and 50.2% top-1@0.1 on mixture prediction.
-
ChemCrow: Augmenting large-language models with chemistry tools
ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
-
SPADE: Faster Drug Discovery by Learning from Sparse Data
SPADE selects ligands more efficiently than deep learning or Bayesian optimization, needing fewer tests on average to identify high-quality drug candidates for novel proteins.
-
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
A benchmark across 156 comparisons finds classical ML models win 116 times while larger pretrained and LLM models win far fewer, showing predictive performance depends on model-task fit rather than scale.
-
When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction
Active learning for chemical reaction extraction frequently produces non-monotonic learning curves and fails to deliver stable gains over random sampling because of strong pretraining, structured CRF decoding, and label sparsity.
-
Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
Lit2Vec delivers a documented, reproducible pipeline that extracts and annotates a large licensed chemistry paper corpus from S2ORC with paragraph embeddings and subfield labels.
-
Regression with Large Language Models for Materials and Molecular Property Prediction
Fine-tuned LLaMA 3 achieves regression performance on QM9 molecular properties and 28 materials properties from composition strings that rivals random forests but is 5-10x worse than specialized models using atomic coordinates.
-
Machine Learning Based Prediction of Proton Conductivity in Metal-Organic Frameworks
A newly built database of proton-conductive MOFs supports descriptor and transformer machine learning models that predict conductivity with MAE 0.91.