pith. machine review for the scientific record.

arxiv: 2604.17105 · v1 · submitted 2026-04-18 · 💻 cs.CL

Recognition: unknown

How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords tokenization · phonological knowledge · language models · syllabification · IPA · fine-tuning · probing experiments · STAD metric

The pith

Subword tokenization systematically weakens language models' encoding of phonological features like rhyme and syllabification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard subword tokenization breaks words into units without reference to their sounds, which impairs how text-only language models internally represent phonological knowledge. Probing experiments reveal consistent weakening for both local features such as rhyme patterns and global features such as syllabification. The authors introduce the STAD metric to quantify how far token boundaries stray from natural syllable boundaries and demonstrate that larger distances predict weaker phonological representations. They then present a lightweight IPA-based fine-tuning procedure that adds phonological awareness to existing models. The result is improved performance on phonology tasks alongside only minor losses on math and general reasoning benchmarks.
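The probing setup carries most of this evidence, and the paper's exact probe architecture is not reproduced on this page. As rough orientation only, a minimal linear-probe pipeline of the standard kind looks like the sketch below: a hypothetical toy set of rhyme-labeled word pairs, GPT-2 standing in for the probed models, and a logistic-regression classifier trained on frozen hidden states.

```python
# Minimal linear-probe sketch: does a frozen LM layer encode rhyme?
# Toy data only: (word_a, word_b, label) with label 1 if the pair rhymes.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def word_vector(word: str, layer: int = 6) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, n_tokens, d)
    return hidden.mean(dim=1).squeeze(0)

rhyme_pairs = [("cat", "hat", 1), ("cat", "dog", 0),
               ("light", "night", 1), ("light", "table", 0)]  # hypothetical toy data

X = torch.stack([torch.cat([word_vector(a), word_vector(b)])
                 for a, b, _ in rhyme_pairs]).numpy()
y = [label for _, _, label in rhyme_pairs]

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=2)  # chance-level scores would suggest no rhyme encoding
print(scores.mean())
```

The design choice that matters is that the classifier stays linear and the LM weights stay frozen, so any above-chance accuracy has to come from information already present in the representations rather than from the probe itself.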

Core claim

Probing experiments show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features in text-only language models. The syllabification-tokenization alignment distance (STAD) quantifies the misalignment between a model's tokenization and natural syllable boundaries, with higher misalignment correlating with poorer phonological representations. A lightweight IPA-based fine-tuning method infuses phonological awareness, yielding consistent improvements on three phonology-related tasks with only 1.1% and 0.9% drops on GSM8K and MMLU, respectively.

What carries the argument

The STAD metric measuring misalignment between token boundaries and syllable boundaries as a diagnostic tool, paired with the IPA-based fine-tuning method that adds sound-level information during adaptation.
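The paper's formal definition of STAD is not quoted on this page, so the following is only an illustrative sketch of the underlying idea: compare the character offsets where the tokenizer splits a word with the offsets of its syllable boundaries, and score how far apart the two boundary sets sit. The syllabifications below are hand-supplied toy examples, and the averaging rule is an assumption, not the paper's formula.

```python
# Illustrative sketch of a syllabification-tokenization alignment distance.
# Not the paper's exact formula: here, the score is the average gap (in characters)
# from each syllable boundary to the nearest token boundary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def token_boundaries(word: str) -> set[int]:
    """Character offsets where the tokenizer splits the word (internal boundaries only)."""
    enc = tokenizer(word, add_special_tokens=False, return_offsets_mapping=True)
    return {end for _, end in enc["offset_mapping"][:-1]}

def syllable_boundaries(syllables: list[str]) -> set[int]:
    """Character offsets of syllable breaks, e.g. ["mu", "si", "cal"] -> {2, 4}."""
    offsets, pos = set(), 0
    for syl in syllables[:-1]:
        pos += len(syl)
        offsets.add(pos)
    return offsets

def stad_like(word: str, syllables: list[str]) -> float:
    syl = syllable_boundaries(syllables)
    tok = token_boundaries(word) | {0, len(word)}  # word edges always count as boundaries
    if not syl:
        return 0.0
    return sum(min(abs(s - t) for t in tok) for s in syl) / len(syl)

print(stad_like("musical", ["mu", "si", "cal"]))      # hand-coded syllabification
print(stad_like("tokenizer", ["to", "ken", "i", "zer"]))
```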

If this is right

  • Higher STAD scores directly predict weaker results on both rhyme and syllabification probing tasks.
  • IPA fine-tuning produces gains on phonology tasks without requiring changes to the base tokenizer or full retraining.
  • The same method leaves math and general reasoning performance nearly unchanged.
  • Tokenization misalignment harms local phonological features and global ones in comparable ways.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Phonology-aware models could improve downstream uses such as poetry generation or pronunciation prediction.
  • STAD could serve as a quick test for whether a new tokenizer preserves other linguistic structures beyond syllables.
  • Directly redesigning tokenizers around phonetic units might outperform post-training fixes.
  • The same misalignment pattern may appear in models handling languages with richer sound systems.

Load-bearing premise

The probing experiments accurately measure the models' phonological knowledge representations and the STAD metric validly quantifies the misalignment that affects those representations.

What would settle it

Retrain or adapt a model using a tokenizer whose boundaries are forced to align with syllable breaks, then check whether phonological task scores rise in proportion to the reduction in STAD while math and reasoning scores stay flat.
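A minimal sketch of how that check could be scored, assuming per-item STAD values and probe scores were collected from both the original and the syllable-aligned runs. The arrays below are placeholder values for illustration, not results from the paper, and what counts as a per-item probe score depends on the task.

```python
# Sketch of the settling test: does the drop in STAD track the gain on phonology probes?
# All values here are synthetic placeholders standing in for the hypothetical experiment.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 500
stad_before = rng.uniform(0.5, 3.0, n_items)               # original tokenizer
stad_after = stad_before * rng.uniform(0.1, 0.6, n_items)   # syllable-aligned tokenizer
probe_before = rng.uniform(0.4, 0.7, n_items)
probe_after = np.clip(probe_before + 0.2 * (stad_before - stad_after) / 3.0, 0, 1)

stad_reduction = stad_before - stad_after
probe_gain = probe_after - probe_before

rho, p_value = spearmanr(stad_reduction, probe_gain)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
# A strong positive rho, with math/reasoning benchmarks flat, would support the causal reading.
```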

Figures

Figures reproduced from arXiv: 2604.17105 by Disen Liao, Freda Shi.

Figure 1: Illustrations of two key issues in LMs’ phonological …
Figure 2: Average number of CogNet-related entries for token …
Figure 3: Performance evaluated on three phonology-related tasks.
Figure 4: CogNet entries related to “musical” across multi…
Figure 6: Examples of our question templates and some ex…
Figure 7: Control-label sanity check. Probing performance with random targets for GPT-2 (upper block) and Llama-3.1-8B (lower block). Left to right: (i) rhyming awareness, accuracy; (ii) G2P, R²; (iii) syllable counting, R². Solid lines reproduce the original probes, dashed lines the corresponding control probes. All curves collapse to chance (accuracy ≈ 0.5) or sub-chance (R² ≤ 0), demonstrating that the linear p…
Original abstract

Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
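The abstract names the IPA-based fine-tuning method but not its recipe, so the sketch below only illustrates the general shape such a lightweight infusion step could take: pair words with IPA transcriptions and adapt a causal LM with low-rank updates, leaving the base weights and tokenizer untouched. The `ipa_lexicon`, the prompt template, and GPT-2 as the base model are all assumptions made for illustration, not the paper's setup.

```python
# Hedged sketch of a lightweight "IPA infusion" step via LoRA adapters.
# The data format and training recipe are illustrative, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical grapheme-to-IPA lookup; in practice this would come from a
# pronunciation dictionary or a G2P system.
ipa_lexicon = {"musical": "ˈmjuːzɪkəl", "cat": "kæt", "night": "naɪt"}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Low-rank adapters keep the update lightweight; the base weights and tokenizer stay intact.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

texts = [f"The word '{w}' is pronounced /{ipa}/." for w, ipa in ipa_lexicon.items()]
batch = tokenizer(texts, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**batch, labels=labels).loss  # one illustrative step, not a full training run
loss.backward()
optimizer.step()
print(float(loss))
```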

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that subword tokenization systematically weakens language models' encoding of phonological features (both local, e.g., rhyme, and global, e.g., syllabification). It introduces the syllabification-tokenization alignment distance (STAD) metric to quantify misalignment between BPE token boundaries and natural syllable boundaries, reports a correlation between higher STAD and lower accuracy on phonological probing tasks, and proposes a lightweight IPA-based fine-tuning method that improves performance on three phonology-related tasks while causing only small drops (1.1% on GSM8K, 0.9% on MMLU) on math and general reasoning benchmarks.

Significance. If the results hold after addressing confounds, the work would be significant for identifying a concrete limitation of standard tokenization in representing phonological structure and for providing both a diagnostic metric (STAD) and a practical mitigation via IPA fine-tuning that largely preserves general capabilities. It could inform tokenization design for tasks involving pronunciation or prosody. The multi-task evaluation and emphasis on minimal side effects are positive features.

major comments (1)
  1. [Probing experiments and STAD correlation] The central diagnostic claim—that higher STAD directly indicates weaker phonological representations—is load-bearing but rests on an uncontrolled correlation. Words with elevated STAD tend to be longer, lower-frequency, or morphologically complex; each factor independently reduces probe performance in LMs. The manuscript does not report regression controls for these variables or results on length/frequency-matched subsets, so the correlation cannot isolate tokenization misalignment as the causal factor (see the probing experiments and STAD correlation analysis).
minor comments (1)
  1. [Abstract] The abstract reports specific percentage drops on GSM8K and MMLU but does not specify the exact phonology tasks, baseline models, or number of runs, making it difficult to assess the magnitude of improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The concern about uncontrolled confounds in the STAD correlation is well-taken and points to a genuine limitation in the current analysis. We address it directly below and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: The central diagnostic claim—that higher STAD directly indicates weaker phonological representations—is load-bearing but rests on an uncontrolled correlation. Words with elevated STAD tend to be longer, lower-frequency, or morphologically complex; each factor independently reduces probe performance in LMs. The manuscript does not report regression controls for these variables or results on length/frequency-matched subsets, so the correlation cannot isolate tokenization misalignment as the causal factor (see the probing experiments and STAD correlation analysis).

    Authors: We agree that the reported correlation between STAD and probe accuracy does not yet isolate tokenization misalignment from the listed confounds. The manuscript presents only the raw correlation and does not include regression controls or matched-subset analyses. In the revision we will add (i) a multivariate regression of probe accuracy on STAD while controlling for word length, log-frequency, and morphological complexity (number of morphemes), and (ii) results on length- and frequency-matched word subsets. These additions will either confirm that the STAD effect remains significant after controls or will lead us to qualify the causal interpretation. The revised manuscript will report both the original and the controlled results. revision: yes
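For concreteness, the controlled analysis the rebuttal commits to could take roughly the following form. The column names and rows below are hypothetical placeholders standing in for the paper's per-word probing table, not data from the study.

```python
# Sketch of the promised confound-controlled analysis: regress probe score on STAD
# while holding word length, log-frequency, and morpheme count fixed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                                       # placeholder rows only
    "probe_score": [0.91, 0.74, 0.62, 0.85, 0.58, 0.79, 0.66, 0.88],
    "stad":        [0.2, 1.1, 2.3, 0.5, 2.8, 0.9, 1.9, 0.3],
    "length":      [4, 7, 11, 5, 12, 8, 10, 6],
    "log_freq":    [9.1, 6.2, 3.5, 8.0, 2.9, 5.4, 4.1, 8.7],
    "n_morphemes": [1, 2, 3, 1, 4, 2, 3, 1],
})

fit = smf.ols("probe_score ~ stad + length + log_freq + n_morphemes", data=df).fit()
print(fit.summary())
# If the STAD coefficient stays significantly negative after the controls, the
# misalignment story survives; if it collapses, length and frequency were doing the work.
```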

Circularity Check

0 steps flagged

No circularity: STAD metric and probing results are independently derived from linguistic structure

Full rationale

The paper defines STAD directly from external syllable boundaries and BPE token boundaries without reference to the probing accuracies or fine-tuning outcomes. The reported correlation is an empirical observation between two separately measured quantities, not a fitted parameter renamed as a prediction. IPA fine-tuning is introduced as an intervention motivated by the observed misalignment, without any self-definitional loop or load-bearing self-citation that reduces the central claim to its own inputs. The derivation chain remains self-contained against external linguistic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper relies on standard assumptions from interpretability research in NLP and linguistics; no free parameters or invented entities beyond the new metric are described in the abstract.

axioms (2)
  • domain assumption Probing experiments can reliably reveal specific types of knowledge encoded in language model representations.
    Standard assumption in model interpretability studies.
  • domain assumption Phonological features such as rhyme and syllabification are meaningful and measurable aspects of language understanding.
    Drawn from linguistic theory and applied to model evaluation.
invented entities (1)
  • STAD (syllabification-tokenization alignment distance) · no independent evidence
    purpose: To quantify misalignment between a model's token boundaries and natural syllable boundaries in words.
    Newly introduced metric in the paper.

pith-pipeline@v0.9.0 · 5479 in / 1459 out tokens · 53770 ms · 2026-05-10T06:44:06.187079+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 9 internal anchors

  1. [1]

    Guillaume Alain and Yoshua Bengio. 2018. Understanding intermediate layers using linear classifier probes, 2017. URL https://openreview.net/forum

  2. [2]

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. https://arxiv.org/abs/2311.16867 The falcon series of open language models. Preprint, ...

  3. [3]

    Morris Alper and Hadar Averbuch-Elor. 2024. Kiki or bouba? sound symbolism in vision-and-language models. Advances in Neural Information Processing Systems, 36

  4. [4]

    Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734

  5. [5]

    Khuyagbaatar Batsuren, Gábor Bella, Fausto Giunchiglia, et al. 2019. CogNet: A large-scale cognate database. In ACL 2019 The 57th Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference, pages 3136--3145. Association for Computational Linguistics

  6. [6]

    Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. https://doi.org/10.5281/zenodo.5297715 GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow . If you use this software, please cite it using these metadata

  7. [7]

    Euan Bonner, Ryan Lege, and Erin Frazier. 2023. Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching. Teaching English with Technology, 23(1):23--41

  8. [8]

    Bastian Bunzeck, Daniel Duran, Leonie Schade, and Sina Zarrieß. 2024. http://arxiv.org/abs/2410.01487 Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv preprint. ArXiv:2410.01487

  9. [9]

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827

  10. [10]

    Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT press

  11. [11]

    Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, and Michael Auli. 2023. Toward joint language modeling for speech units and text. In The 2023 Conference on Empirical Methods in Natural Language Processing

  12. [12]

    Jonathan H Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73--91

  13. [13]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  14. [14]

    Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023. Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205

  15. [15]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171--4186

  16. [16]

    Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, and Tommaso Di Noia. 2025. https://doi.org/10.18653/v1/2025.acl-long.306 LLaMAs have feelings too: Unveiling sentiment and emotion representations in LLaMA models through probing. In Proceedings of the 63rd Annual Meeting of the Association for Computati...

  17. [17]

    Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP , pages 134--139

  18. [18]

    Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23--38

  19. [19]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  20. [20]

    Wes Gurnee and Max Tegmark. 2023. Language models represent space and time. arXiv preprint arXiv:2310.02207

  21. [21]

    Vita A Hamaniuk. 2021. The potential of large language models in language education. Educational Dimension, 5:208--210

  22. [22]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://arxiv.org/abs/2009.03300 Measuring massive multitask language understanding . Preprint, arXiv:2009.03300

  23. [23]

    John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733--2743

  24. [24]

    John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129--4138

  25. [25]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. https://arxiv.org/abs/2106.09685 LoRA: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685

  26. [26]

    Jennifer Hu and Roger Levy. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.306 Prompting is not a substitute for probability measurements in large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 5040--5060, Singapore. Association for Computational Linguistics

  27. [27]

    International Phonetic Association . 1999. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press

  28. [28]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/ab...

  29. [29]

    Guy Kaplan, Matanel Oren, Yuval Reif, and Roy Schwartz. 2025. https://openreview.net/forum?id=328vch6tRs From tokens to words: On the inner lexicon of LLMs. In The Thirteenth International Conference on Learning Representations

  30. [30]

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521--3526

  31. [31]

    Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959

  32. [32]

    Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226

  33. [33]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  34. [34]

    Junteng Liu, Shiqi Chen, Yu Cheng, and Junxian He. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1012 On the universal truthfulness hyperplane inside LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18199--18224, Miami, Florida, USA. Association for Computational Linguistics

  35. [35]

    Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295

  36. [36]

    Jerry Ngo and Yoon Kim. 2024. What do language models hear? probing for auditory representations in language models. arXiv preprint arXiv:2402.16998

  37. [37]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825--2830

  38. [38]

    Mahta Fetrat Qharabagh, Zahra Dehghanian, and Hamid R. Rabiee. 2024. http://arxiv.org/abs/2409.08554 LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study. arXiv preprint. ArXiv:2409.08554

  39. [39]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog

  40. [40]

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909

  41. [41]

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.346 AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222--4235, On...

  42. [42]

    June E Shoup. 1980. Phonological aspects of speech recognition. Trends in speech recognition, pages 125--138

  43. [43]

    Aaditya K Singh and DJ Strouse. 2024. Tokenization counts: the impact of tokenization on arithmetic in frontier llms. arXiv preprint arXiv:2402.14903

  44. [44]

    Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Wang, and Zhiheng Huang. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.887 Tokenization consistency matters for generative models on extractive NLP tasks . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13300--13310, Singapore. Association for Computational Linguistics

  45. [45]

    Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. 2020. Adv-bert: Bert is not robust on misspellings! generating nature adversarial samples on bert. arXiv preprint arXiv:2003.04985

  46. [46]

    Mukuntha Narayanan Sundararaman, Ayush Kumar, and Jithendra Vepa. 2021. Phoneme-bert: Joint language modelling of phoneme sequence and asr transcript. arXiv preprint arXiv:2102.00804

  47. [47]

    Ashima Suvarna, Harshita Khandelwal, and Nanyun Peng. 2024. http://arxiv.org/abs/2404.02456 PhonologyBench: Evaluating Phonological Skills of Large Language Models. arXiv preprint. ArXiv:2404.02456

  48. [48]

    Teknium. 2023. https://huggingface.co/datasets/teknium/OpenHermes-2.5 Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants

  49. [49]

    Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183--196

  50. [50]

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. Byt5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291--306

  51. [51]

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. 2024. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652

  52. [52]

    Chengyue Yu, Lei Zang, Jiaotuan Wang, Chenyi Zhuang, and Jinjie Gu. 2024. Charpoet: A chinese classical poetry generation system based on token-free llm. arXiv preprint arXiv:2401.03512

  53. [53]

    Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, et al. 2022. Mixed-phoneme bert: Improving bert with mixed phoneme and sup-phoneme representations for text to speech. arXiv preprint arXiv:2203.17190

  54. [54]

    Ran Zhang and Steffen Eger. 2024. Llm-based multi-agent poetry generation in non-cooperative environments. arXiv preprint arXiv:2409.03659

  55. [55]

    Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, and Noah A Smith. 2025. Broken tokens? your language model can secretly handle non-canonical tokenizations. arXiv preprint arXiv:2506.19004
