LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (2)
representative citing papers
Hybrid OLaPh framework outperforms prior G2P baselines on WikiPron while enabling synthetic data for an LLM that generalizes well on out-of-vocabulary terms.
citing papers explorer
- How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
  Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
- OLaPh: Optimal Language Phonemizer
  Hybrid OLaPh framework outperforms prior G2P baselines on WikiPron while enabling synthetic data for an LLM that generalizes well on out-of-vocabulary terms.