arxiv: 2602.12005 · v3 · pith:UUIAAANHnew · submitted 2026-02-12 · 💻 cs.CL

LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Pith reviewed 2026-05-21 12:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language models · delegation · loss signal · grammatical signals · factuality · pretraining · delegation token · cascaded generation

The pith

Deciding which tokens small language models should learn versus delegate requires grammatical signals beyond loss alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that token loss during pretraining predicts mismatches with ground truth but fails to flag which mismatches would produce factually or semantically invalid text. In domains like Wikipedia, high-loss tokens sometimes represent acceptable alternative continuations that should not trigger delegation. Adding lightweight grammatical information extracted by a parser improves the model's ability to choose between learning a token and inserting a special delegation token. This leads to small models that generate with higher factuality when cascaded with a larger model. The approach is presented as simpler and cheaper than alternatives that rely on external judges or different training signals.

Core claim

LaCy augments the standard language modeling loss with factuality signals derived from grammatical information to determine which tokens an SLM should predict itself and which ones should be replaced by a <CALL> token during pretraining. This combination allows the model to learn appropriate delegation behavior, resulting in higher FactScores during generation in a cascade with a larger model and outperforming SLMs trained with Rho or LLM-judge methods.

What carries the argument

The LaCy pretraining objective, which combines per-token loss with grammatical factuality signals to decide between self-prediction and delegation via a <CALL> token.
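
A minimal sketch of what such a labeling rule could look like, assuming a simple conjunction of a loss threshold and a part-of-speech filter. The abstract does not state the actual combination rule, so `label_targets`, `loss_threshold`, and the tag set `FACTUAL_POS` are hypothetical. POS tags are passed in precomputed (for example from a spaCy pipeline) so the example stays self-contained.

```python
from typing import List

CALL = "<CALL>"
# Hypothetical choice: POS tags treated as likely fact-bearing.
FACTUAL_POS = {"PROPN", "NUM"}


def label_targets(
    tokens: List[str],
    losses: List[float],
    pos_tags: List[str],
    loss_threshold: float = 3.0,  # hypothetical value, not from the paper
) -> List[str]:
    """Return training targets: the original token, or <CALL> when delegating."""
    if not (len(tokens) == len(losses) == len(pos_tags)):
        raise ValueError("tokens, losses, and pos_tags must have equal length")
    targets = []
    for token, loss, pos in zip(tokens, losses, pos_tags):
        # Delegate only when the token is both hard (high loss) and fact-bearing.
        if loss > loss_threshold and pos in FACTUAL_POS:
            targets.append(CALL)
        else:
            targets.append(token)
    return targets


# Illustrative inputs; the loss values are made up.
tokens = ["Mozart", "was", "born", "in", "1756"]
losses = [1.2, 0.3, 4.1, 0.2, 5.0]
pos_tags = ["PROPN", "AUX", "VERB", "ADP", "NUM"]
print(label_targets(tokens, losses, pos_tags))
# ['Mozart', 'was', 'born', 'in', '<CALL>']
```

In this toy input, "born" has high loss but is kept as a learning target because a verb is assumed to admit acceptable alternatives, while the high-loss number is replaced by `<CALL>`.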

If this is right

  • SLMs trained this way produce higher FactScores when generating in a cascade with a larger model.
  • Delegation decisions become more accurate by distinguishing acceptable high-loss alternatives from invalid ones.
  • The method requires no external judges during training and remains simpler and cheaper than alternatives.
  • The model learns to insert the delegation token at appropriate points rather than always defaulting to loss-based triggers.
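
The cascade referred to in these points can be sketched as a decoding loop in which the small model hands control to the larger one whenever it emits the delegation token. This is an illustration under stated assumptions, not the paper's implementation: `cascade_generate`, the `<EOS>` marker, and the choice to delegate exactly one token per `<CALL>` are hypothetical, and both models are stand-in callables.

```python
from typing import Callable, List

CALL = "<CALL>"
EOS = "<EOS>"  # hypothetical end-of-sequence marker

# A "model" here is any callable mapping the current context to the next token.
NextToken = Callable[[List[str]], str]


def cascade_generate(
    small: NextToken,
    large: NextToken,
    prompt: List[str],
    max_new_tokens: int = 32,
) -> List[str]:
    """Generate with the small model, delegating one token per <CALL>."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = small(context)
        if token == CALL:
            # Assumption: the large model supplies a single token in place of <CALL>.
            token = large(context)
        if token == EOS:
            break
        context.append(token)
    return context[len(prompt):]


# Toy stand-ins so the loop can be run end to end.
def toy_small(context: List[str]) -> str:
    script = {2: "was", 3: "born", 4: "in", 5: CALL}
    return script.get(len(context), EOS)


def toy_large(context: List[str]) -> str:
    return "1756"


print(cascade_generate(toy_small, toy_large, ["Wolfgang", "Mozart"]))
# ['was', 'born', 'in', '1756']
```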

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pretraining for capacity-limited models may need domain-adapted validity signals rather than uniform loss objectives.
  • Similar augmentation could be tested in non-Wikipedia text by substituting other lightweight structural cues for the grammatical parser.
  • If the approach generalizes, it suggests delegation can be learned as part of the core pretraining objective without separate post-training stages.

Load-bearing premise

Grammatical information reliably signals whether a token continuation is factually or semantically valid in the target domain.

What would settle it

Training an SLM with only the loss signal and measuring whether its FactScores in a cascade match or exceed those of a LaCy-trained model on the same data.

Original abstract

Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. Especially the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, it is insufficient for identifying which predictions would actually lead to factual or semantically invalid continuations. Some high-loss tokens correspond to acceptable alternative continuations of a pretraining document and therefore should not trigger a <CALL>. This suggests that learnability cannot be characterized from loss alone, but requires additional domain-specific signals about the role of a token in the sentence. In Wikipedia-like domains, we show that augmenting the loss signal with lightweight grammatical information from a spaCy parser substantially improves delegation decisions. Based on this insight, we propose LaCy, a novel pretraining method that combines loss with factuality signals to decide which tokens an SLM should learn. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and when to call for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that deciding which tokens small language models should learn during pretraining versus delegate via a <CALL> token cannot be based on loss alone, as loss predicts token mismatch but not whether a substitution yields factually or semantically invalid text. It proposes LaCy, which augments loss with grammatical signals (POS, dependencies) from a spaCy parser in Wikipedia-like domains to improve delegation, yielding higher FactScores in cascades with larger models and outperforming Rho/LLM-judge baselines.

Significance. If the empirical results hold, LaCy offers a lightweight, parser-based alternative to pure loss or LLM-judge delegation for SLMs, potentially improving efficiency and factuality without heavy external resources; the reproducible training procedure and falsifiable delegation policy are strengths.

major comments (2)
  1. [Abstract and method description] The central claim that spaCy grammatical features reliably proxy for factuality or semantic validity of high-loss token continuations is load-bearing but unsupported by direct evidence; no correlation analysis between spaCy tags and human factuality judgments on mismatched tokens is described, leaving open that the parser may capture only surface acceptability.
  2. [Pretraining procedure] The <CALL> token insertion during pretraining is presented as non-interfering, yet no ablation or analysis shows it preserves the core language modeling objective or ensures the learned policy transfers to normal inference without distribution shift.
minor comments (2)
  1. [Experiments] Clarify the exact combination rule for loss and spaCy features (e.g., weighting, feature vector construction) and report statistical significance of FactScore gains.
  2. [Evaluation] Add details on data splits, exact baselines (Rho, LLM-judge), and domain specificity of Wikipedia-like text to allow replication.
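
The significance reporting requested in minor comment 1 could be done with a paired bootstrap over per-prompt FactScores. The sketch below is one common way to do this and is not taken from the paper; `paired_bootstrap_p`, the resample count, and the example scores are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_p(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which system A does not beat system B.

    scores_a[i] and scores_b[i] must be scores of the two systems on the same prompt.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Made-up per-prompt scores, purely to exercise the function.
system_a = [0.62, 0.70, 0.55, 0.81, 0.66]
system_b = [0.60, 0.64, 0.57, 0.75, 0.61]
print(paired_bootstrap_p(system_a, system_b))
```

A small returned value indicates that the advantage of system A is stable under resampling of the evaluation prompts; with only a handful of prompts, as in this toy input, the estimate is coarse.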

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, outlining how we will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and method description] The central claim that spaCy grammatical features reliably proxy for factuality or semantic validity of high-loss token continuations is load-bearing but unsupported by direct evidence; no correlation analysis between spaCy tags and human factuality judgments on mismatched tokens is described, leaving open that the parser may capture only surface acceptability.

    Authors: We agree that a direct correlation analysis would provide stronger support for using spaCy signals as a proxy. The current results rely on downstream FactScore improvements in cascades as indirect evidence that the grammatical signals help identify tokens whose substitution affects semantic validity. In the revised manuscript we will add a new subsection with human annotations on a sample of high-loss mismatched tokens, reporting correlation coefficients between spaCy tags (POS, dependencies) and human factuality/semantic validity judgments to address this gap. revision: yes

  2. Referee: [Pretraining procedure] The <CALL> token insertion during pretraining is presented as non-interfering, yet no ablation or analysis shows it preserves the core language modeling objective or ensures the learned policy transfers to normal inference without distribution shift.

    Authors: This is a fair observation; the manuscript does not contain explicit ablations on this point. We will add an ablation study comparing standard language modeling metrics (perplexity, next-token loss) on held-out Wikipedia data for models trained with and without <CALL> insertion. We will also include inference-time experiments measuring generation quality and delegation frequency under standard prompting (no explicit <CALL> cues) to demonstrate that the learned policy transfers without substantial distribution shift. revision: yes
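
The correlation analysis promised in the first response could, for binary variables, be reported as a phi coefficient. The sketch below assumes each high-loss mismatched token carries two binary labels: whether its tag falls in some fact-bearing set, and whether human annotators judged the model's substitution invalid. Both encodings and the function name `phi_coefficient` are hypothetical.

```python
import math
from typing import Sequence


def phi_coefficient(x: Sequence[int], y: Sequence[int]) -> float:
    """Phi coefficient between two binary (0/1) sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have equal length")
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(x, y):
        if a and b:
            n11 += 1
        elif a and not b:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either variable is constant.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Made-up labels: 1 = tag is in the fact-bearing set / substitution judged invalid.
tag_is_factual = [1, 1, 0, 0, 1, 0]
human_invalid = [1, 1, 0, 0, 0, 0]
print(phi_coefficient(tag_is_factual, human_invalid))
```

A value near 1 would indicate that the grammatical tags track human validity judgments closely; a value near 0 would support the referee's concern that the parser captures only surface acceptability.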

Circularity Check

0 steps flagged

No significant circularity in empirical method

Full rationale

The paper describes an empirical pretraining procedure (LaCy) that augments per-token loss with lightweight grammatical features extracted by an external spaCy parser to improve delegation decisions via a <CALL> token. No equations, derivations, or fitted parameters are presented that reduce the claimed improvement in FactScore or delegation accuracy to a quantity defined in terms of itself. The central insight, that loss alone is insufficient and must be supplemented by domain-specific signals, is justified by experimental comparisons rather than by construction from the inputs or by load-bearing self-citations. The method is evaluated against external measures, namely FactScore and comparisons to Rho or LLM-judge baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of the <CALL> token and reliance on an external parser.



