An automated rule-based parser plus LLM pipeline creates a 163k-pair molecular structure-language dataset validated at 98.6% precision on a 2,000-sample subset.
N2 bears no hydrogen because it is substituted by the carbonyl carbon, by N1, and by a sulfonyl group (next step).
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1). Years: 2026 (1). Verdicts: CONDITIONAL (1).
Representative citing papers
A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method