An automated rule-based parser plus LLM pipeline creates a 163k-pair molecular structure-language dataset validated at 98.6% precision on a 2,000-sample subset.
Specify their type (e.g., hydroxyl, amine, halogen, carbonyl), location, and bonding pattern relative to the molecular framework.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2026 (1)
verdicts: CONDITIONAL (1)
representative citing papers
A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method