pith. sign in

arxiv: 2606.03165 · v1 · pith:6MBPGITCnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

Pith reviewed 2026-06-28 10:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords lexical alignmentpreference learninglarge language modelsmisalignmentscientific Englishautomated evaluationPubMed abstracts
0
0 comments X

The pith

Two new metrics detect lexical overuse in LLMs and quantify its link to preference learning stages without manual curation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Lexical Alignment Score to flag words that appear more often in model-generated text than in reference human text, and the Triangulated Preference Shift to measure how much of those differences trace to the human-preference training stage. It generates continuations from PubMed abstracts, applies windowed document prevalence across six model families, and surfaces items such as suggest, additionally, and strategy as overused. The metrics run without hand-selected word lists, stay stable under changes in window size and random seeds, and match earlier manual findings on scientific English. A sympathetic reader would care because the approach removes the curation bottleneck that has limited prior work to small-scale or domain-specific studies.

Core claim

The Lexical Alignment Score measures overuse via windowed document prevalence of tokens in model continuations relative to human references, while the Triangulated Preference Shift compares prevalence differences across model checkpoints before and after preference learning to isolate the contribution of that stage.

What carries the argument

Lexical Alignment Score and Triangulated Preference Shift, which quantify token overuse and attribute portions of it to preference learning by comparing generated text prevalence across training stages.

If this is right

  • The metrics can be applied at scale to non-scientific domains and additional languages without new manual curation.
  • Overused lexical items can be tracked across successive model releases to monitor changes in alignment behavior.
  • The procedure replicates earlier manual observations on scientific English while remaining stable across parameter settings and random seeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the metrics generalize, they could be used to test whether preference tuning in one domain produces lexical shifts that affect performance in unrelated domains.
  • Comparing the Triangulated Preference Shift values across model families might reveal whether certain architectures are more or less sensitive to preference-induced lexical changes.

Load-bearing premise

That windowed document prevalence on generated continuations isolates lexical alignment effects and that observed shifts can be attributed specifically to preference learning rather than other model or data factors.

What would settle it

Running the same procedure on base models before any preference tuning and finding the same set of overused words at comparable rates would falsify the attribution to preference stages.

Figures

Figures reproduced from arXiv: 2606.03165 by Jose A. Hernandez, Thomas Stephan Juzek, Xiaoyang Ming.

Figure 1
Figure 1. Figure 1: cLAS by model family (base vs instruct); lower: closer to human. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: cTPS ratios (base vs instruct). tries occur in our list. From the 291 items in Kobak et al. (2024), 240 (83%) occur in our data. Varying K. We observe that scores vary slightly across different window sizes. Here, we choose K = 40 (smaller windows disregard a larger portion of the data), 50, and 60 (window sizes larger than that approach document length and undermine the windowing logic). The scores averag… view at source ↗
read the original abstract

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces two curation-free metrics—the Lexical Alignment Score, which uses windowed document prevalence to detect lexical overuse in LLM-generated continuations of PubMed abstracts, and the Triangulated Preference Shift, which quantifies attribution of such overuses to human preference learning stages. Applied across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi), the method identifies items such as 'suggest', 'additionally', and 'strategy' without manual curation, claims to replicate prior findings on preference-linked divergences in scientific English, and reports stability across random seeds, parameter settings, and additional data.

Significance. If the Triangulated Preference Shift validly isolates preference-stage effects, the automated, scalable metrics would enable systematic study of lexical misalignment beyond scientific English and across languages, supporting improved alignment research. The reported replication of prior work and stability across seeds and settings are explicit strengths that enhance reproducibility.

major comments (2)
  1. [Triangulated Preference Shift] Triangulated Preference Shift section: the attribution of lexical overuses (e.g., 'suggest') specifically to preference learning stages lacks isolation from model-family confounds; no ablations are reported that hold base model, pretraining corpus, or architecture fixed while varying only the preference stage, so observed shifts could arise from tokenizer effects or non-preference differences across the six families. This is load-bearing for the central claim that the metric 'estimates their link to preference learning'.
  2. [Lexical Alignment Score and evaluation] Lexical Alignment Score and evaluation sections: the abstract and results report replication and stability but provide no derivation details, ground-truth validation, or error analysis for the windowed prevalence metric, leaving open whether the identified overuses accurately reflect alignment effects rather than generation artifacts.
minor comments (2)
  1. [Abstract and Triangulated Preference Shift] The explicit triangulation formula is not shown in the abstract and should be stated with all terms defined in the main text for clarity.
  2. [Introduction] References to the manual-curation methods being replaced should be expanded to include specific prior studies on Scientific English divergences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the manuscript requires clarification or additional discussion, and outline planned revisions.

read point-by-point responses
  1. Referee: [Triangulated Preference Shift] Triangulated Preference Shift section: the attribution of lexical overuses (e.g., 'suggest') specifically to preference learning stages lacks isolation from model-family confounds; no ablations are reported that hold base model, pretraining corpus, or architecture fixed while varying only the preference stage, so observed shifts could arise from tokenizer effects or non-preference differences across the six families. This is load-bearing for the central claim that the metric 'estimates their link to preference learning'.

    Authors: We agree that the analysis does not include ablations holding base model, pretraining corpus, or architecture fixed while varying only the preference stage. The six model families differ in multiple respects, so tokenizer effects and other non-preference differences cannot be ruled out as contributors to the observed shifts. The consistency across families and replication of prior findings provide supporting evidence but do not substitute for controlled isolation. In the revised manuscript we will add an explicit limitations subsection moderating the language on attribution to preference stages and clarifying the scope of the central claim. revision: yes

  2. Referee: [Lexical Alignment Score and evaluation] Lexical Alignment Score and evaluation sections: the abstract and results report replication and stability but provide no derivation details, ground-truth validation, or error analysis for the windowed prevalence metric, leaving open whether the identified overuses accurately reflect alignment effects rather than generation artifacts.

    Authors: We agree that additional methodological detail would strengthen the presentation. The current text defines the Lexical Alignment Score via windowed document prevalence but does not supply a formal derivation, error analysis, or explicit ground-truth validation. In revision we will expand the methods section with the mathematical formulation and parameter rationale, add an error-analysis subsection addressing potential generation artifacts, and discuss indirect validation through alignment with manually identified overuses reported in prior literature on scientific English. These changes will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No circularity: metrics defined and applied empirically without reduction to inputs

full rationale

The paper introduces Lexical Alignment Score and Triangulated Preference Shift as curation-free metrics based on windowed document prevalence in generated continuations from six model families. It applies them to PubMed abstracts to identify overused lexical items and estimate links to preference learning, replicating prior work with stability checks. No equations, definitions, or derivation steps are presented that equate a claimed prediction or attribution to a fitted parameter or self-citation by construction. The central claims rest on direct measurement rather than self-referential loops, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Metrics rest on domain assumptions about how lexical overuse is measured and attributed; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Windowed document prevalence on model continuations versus human abstracts measures lexical alignment without manual curation
    Core definition of Lexical Alignment Score
  • domain assumption Observed lexical shifts can be triangulated to human preference learning stages
    Core definition of Triangulated Preference Shift

pith-pipeline@v0.9.1-grok · 5753 in / 1211 out tokens · 27025 ms · 2026-06-28T10:08:21.882197+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 10 canonical work pages · 7 internal anchors

  1. [1]

    Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

    Introduction Usage of digital assistants based on Large Lan- guage Models (LLMs), such as ChatGPT, is in- creasing fast, and these artificial intelligence (AI) tools are now widely used for programming, lan- guage editing, and information finding (Stack Over- flow, 2024; Coffey, 2024; O’Brien and Sanders, 2025; Sidoti and McClain, 2025). They perform stro...

  2. [2]

    delve”, “furthermore

    Related Work 2.1. Lexical Overuse in LLMs After the release of ChatGPT, a sudden spike in the usage of a small set of words (e.g., “delve”, “furthermore”, “intricate”) in academic writing was noted(Matsui,2024;Kobaketal.,2024;Liangetal., 2024; Liu and Bu, 2024). As many of these words are also overused in AI-generated texts (Matsui, 2024),themostplausible...

  3. [3]

    Reply only with the continuation; do not repeat the user text; no preface

    Task and Data We address the gaps identified in Section 2 by introducing curation-free, assumption-light met- rics that cover both thewhatandwhyof lexical (mis)alignment. Our data comprise PubMed ab- stracts, which enables a comparison with the liter- ature on lexical overuse in LLMs, which largely fo- cuses on this very domain (Scientific English). We sa...

  4. [4]

    Thisisasentencewith seven words

    Evaluation Metrics Both metrics introduced below, theLexical Align- ment Score(LAS) and theTriangulated Preference Shift(TPS), operate on length-controlled windows of paired human-model continuations, using win- dowed document prevalence to ensure robustness against few-document spikes. Intuitively, the two metrics approach the follow- ing questions. LAS ...

  5. [5]

    Table 2b links to prior work by limiting analysis to content words (NOUN/VERB/ADJ/ADV) and aggregating over instruct models

    Results Table 2a reports the top 20 lemma-level Lex- ical Alignment Scores aggregated over all in- struct models; results for the base models can be found in Table 4 in Appendix B. Table 2b links to prior work by limiting analysis to content words (NOUN/VERB/ADJ/ADV) and aggregating over instruct models. The top 20 items ranked by the preference-stage met...

  6. [6]

    (i) we scored an additional, unseen 20% of the data

    Validation To further validate the metrics, we conducted four checks. (i) we scored an additional, unseen 20% of the data. (ii) We compare our overuse list to the literature. (iii) We ran the metrics with different window sizes (K = 40, 50, 60). (iv) We reran the K= 50configuration for four more random seeds. The following is done for the LAS metric. Addi...

  7. [7]

    However, focus- ing on content words (NOUN/VERB/ADJ/ADV), for all models this trend either reverses (4/6 families in Figure 1 bottom) or at least weakens

    Discussion At the corpus level, instruct variants generally align betterwith human usage than their base counter- parts (5/6 families; Figure 1 top). However, focus- ing on content words (NOUN/VERB/ADJ/ADV), for all models this trend either reverses (4/6 families in Figure 1 bottom) or at least weakens. These are precisely the items emphasised in prior re...

  8. [8]

    Critically, the pro- posed methods are curation-free and assumption- light

    Conclusion The goal of this paper was to introduce evaluation metricsformeasuringlexicalalignmentandthecon- tribution of preference learning. Critically, the pro- posed methods are curation-free and assumption- light. Key to it is a scalable design in which model continuations are compared against matched hu- man continuations from the same source docu- m...

  9. [9]

    Cer- tainly, here is

    Meta/AI persona: e.g., “Cer- tainly, here is ...”, “as an AI model”, apologies, instructions, tool/safety notes

  10. [10]

    Conversation turns & scaf- folding (only if truly dia- logic): pseudo-dialogue mark- ers <|user|>, <|assistant|>, </|user|>, </|assistant|> ONLY IF followed by chat-like material (greeting, instruction, apology, question to reader); delete the marker and that span; otherwise delete markers only

  11. [11]

    Obvious repetition/loops: re- move verbatim or near-verbatim repeats; KEEP one copy

  12. [12]

    I/me/my” or di- rect address “you/your

    First/second-person META sen- tences using “I/me/my” or di- rect address “you/your”. Do NOT delete “we/our/us”. PRESERVE: - Keep phrases like “In conclu- sion” / “concluding” when embed- ded in a normal sentence. - All scientific content and phrasing (incl. “we/our/us”). - Angle-bracketed tokens in gen- eral (operators, tags, gene sym- bols, XML-like mark...

  13. [13]

    GPT-4 Technical Report

    Gpt-4 technical report.arXiv preprint arXiv:2303.08774. Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hos- sein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. 2023. Self-consuming gen- erative models go mad. arXiv preprint. Gregory C. Allen. 2025. Deepseek: A deep dive. Congressional Testimony, CSIS. Ebtesam...

  14. [14]

    Hui Bai, Jan G

    Modelmisalignmentandlanguagechange: Traces of ai-associated language in unscripted spoken english.arXiv preprint arXiv:2508.00238. Hui Bai, Jan G. Voelkel, Shane Muldowney, Jo- hannes C. Eichstaedt, and Robb Willer. 2025. LLM-generated messages can persuade hu- mans on policy issues.Nature Communications, 16(6037). Emily M Bender, Timnit Gebru, Angelina M...

  15. [15]

    InProceedings of the 24th conference on com- putational natural language learning, pages 609– 619

    Cloze distillation: Improving neural lan- guage models with human next-word prediction. InProceedings of the 24th conference on com- putational natural language learning, pages 609– 619. Sarah Fitterer, Dominik Gangl, and Jannes Ulbrich

  16. [16]

    InACL 2025 Student Research Workshop

    Testing english news articles for lexical ho- mogenizationduetowidespreaduseoflargelan- guage models. InACL 2025 Student Research Workshop. Iason Gabriel. 2020. Artificial intelligence, values, and alignment.Minds and Machines, 30(3):411– 437. Riley Galpin, Bryce Anderson, and Tom S Juzek

  17. [17]

    contamination

    Exploring the structure of ai-induced lan- guage change in scientific english.arXiv preprint arXiv:2506.21817. Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical de- tection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Sys- tem Demonstrations...

  18. [18]

    Understanding the Effects of RLHF on LLM Generalisation and Diversity

    Understanding the effects of rlhf on llm generalisation and diversity.arXiv preprint arXiv:2310.06452. Dmitry Kobak, Rita González Márquez, Emőke- Ágnes Horvát, and Jan Lause. 2024. Delv- ing into chatgpt usage in academic writing through excess vocabulary.arXiv preprint arXiv:2406.07016. HadasKotek,RikkerDockum,andDavidSun.2023. Gender bias and stereotyp...

  19. [19]

    Can AI-Generated Text be Reliably Detected?

    Research priorities for robust and beneficial artificial intelligence.AI magazine, 36(4):105–114. Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can ai-generated text be reliably detected?arXiv preprint arXiv:2303.11156. Sigal Samuel and Jordan Crook. 2025. Here’s how deepseek censorship actually works—a...

  20. [20]

    Classification of human- and AI-generated texts for different languages and domains.Inter- national Journal of Speech Technology, 27:935– 956. M. Sharma et al. 2023. Towards under- standing sycophancy in language models. arXiv:2310.13548. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, YarinGal, NicolasPapernot, andRossAnderson

  21. [21]

    2 OLMo 2 Furious

    The curse of recursion: Training on gener- ated data makes models forget. arXiv preprint. Olivia Sidoti and Colleen McClain. 2025. 34% of U.S. adults have used ChatGPT, about double the share in 2023. Pew Research Center. Stack Overflow. 2024. Ai — 2024 stack overflow developer survey. Stack Overflow. Chris Stokel-Walker. 2025. We tried out deepseek. it w...

  22. [22]

    Finetuned Language Models Are Zero-Shot Learners

    Llama 2: Open foundation and fine-tuned chat models.arXiv. Debora Weber-Wulff, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Folt `ynek, Jean Guerrero-Dib, Olumide Popoola, Petr Šigut, and Lorna Waddington. 2023. Testing of detection tools for ai-generated text.International Journal for Educational Integrity, 19(1):1–39. JasonWei,MaartenBosma,VincentYZhao,...

  23. [23]

    From lists to emojis: How format bias affects model alignment.arXiv preprint arXiv:2409.11704