TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung; Jaehyung Kim

arxiv: 2510.04682 · v3 · pith:LGCOSZSUnew · submitted 2025-10-06 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 10:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LoRA transplantation · token-level knowledge transfer · contrastive excess · parameter-efficient fine-tuning · synthetic data filtering · LLM adaptation · knowledge distillation

The pith

TiTok transfers LoRA across different model backbones by using token-wise contrastive excess to filter synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TiTok as a way to move fine-tuned LoRA adaptations from one large language model to another with a different backbone. It computes a contrastive excess at each token: the difference between how the source model behaves with the LoRA weights and how it behaves without them. This difference singles out the tokens that carry the most task-specific information. The method then uses those tokens to select the most informative parts of the synthetic data used to train the target model. This removes the need for an extra discriminator model, and the paper reports average gains of 4 to 10 percent over baselines on three benchmarks.

Core claim

The discovery is that the token-wise contrastive excess between a source model with LoRA and the same model without it effectively captures task-relevant information, allowing selective filtering of synthetic data to achieve successful knowledge transfer and LoRA transplantation to target models with different architectures, all without additional training overhead or extra models.

What carries the argument

The token-wise contrastive excess: the difference in token-level outputs or probabilities between the source model equipped with LoRA and its base version without it. It highlights informative tokens for data filtering in the transfer process.
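A minimal sketch of this quantity, under a stated assumption: the text above says only "outputs or probabilities", so taking the excess as a per-token log-probability difference is an illustrative choice, not the paper's confirmed definition. The function name `contrastive_excess` and the toy numbers are hypothetical.

```python
from typing import List


def contrastive_excess(logp_with_lora: List[float],
                       logp_base: List[float]) -> List[float]:
    """Per-token excess, assumed here to be
    log p(token | source + LoRA) - log p(token | source base).

    Both inputs are the log-probabilities the two models assign to the
    same token sequence, aligned position by position.
    """
    if len(logp_with_lora) != len(logp_base):
        raise ValueError("both models must score the same token sequence")
    return [a - b for a, b in zip(logp_with_lora, logp_base)]


# Toy log-probabilities (placeholders, not values from the paper).
excess = contrastive_excess([-0.2, -1.5, -0.1], [-0.3, -1.4, -2.1])
# The third token gets the largest excess: the LoRA model finds it far
# more likely than the base model does.
```

Under this assumed form, a positive value means the adapter raised the token's likelihood, and a value near zero means the adapter made little difference at that position.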

If this is right

TiTok achieves average performance gains of 4 to 10 percent over baselines in LoRA transfer settings.
The approach operates without requiring any additional models such as discriminators for data selection.
It supports effective transplantation across multiple different backbone architectures.
Selective filtering using the contrastive excess preserves the most relevant synthetic data for the target task.
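The selective filtering named in the last point can be sketched as follows. This is a hedged illustration: the summary does not say whether selection happens per token, per sample, or both, so the two helpers below (`top_fraction_mask`, `filter_samples`) and the `keep_ratio` and `keep_n` parameters are assumptions made for the example, not the paper's procedure.

```python
import math
from typing import List, Tuple


def top_fraction_mask(excess: List[float], keep_ratio: float) -> List[bool]:
    """Mark the keep_ratio fraction of tokens with the highest excess."""
    if not excess:
        return []
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(excess) * keep_ratio))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(excess))]


def filter_samples(samples: List[Tuple[str, List[float]]],
                   keep_n: int) -> List[str]:
    """Keep the keep_n synthetic samples with the highest mean token excess.

    Each sample is (text, per-token excess); empty samples are skipped.
    """
    scored = [(sum(ex) / len(ex), text) for text, ex in samples if ex]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep_n]]


# Toy inputs (placeholders, not data from the paper).
mask = top_fraction_mask([0.1, -0.1, 2.0, 0.5], keep_ratio=0.5)
kept = filter_samples(
    [("a", [0.1, 0.2]), ("b", [1.0, 2.0]), ("c", [-1.0, 0.0])], keep_n=2
)
```

A mask of this kind could be used to restrict the training loss on the target model to the selected tokens; that use is likewise an assumption about how the signal is consumed.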

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the excess signal proves architecture-independent, it could apply to transferring knowledge between entirely different model families.
Future work might explore using this excess for other forms of model compression or pruning.
Applying TiTok to instruction-tuned models on complex reasoning tasks could test its limits on more challenging transfers.

Load-bearing premise

The contrastive excess between the source model with and without LoRA reliably identifies tokens whose knowledge transfers effectively to a different target backbone architecture.

What would settle it

A clear falsifier would be if target models using data filtered by TiTok's contrastive excess show no significant performance advantage over those using randomly selected or unfiltered synthetic data in cross-backbone transfer experiments.
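One hedged way to frame that comparison in code: score the target model once with excess-filtered data and once with a control (random or unfiltered data) in each transfer setting, then look at the paired differences. The function name `paired_advantage` and the numbers in the usage line are placeholders for illustration; they are not results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def paired_advantage(filtered: List[float],
                     control: List[float]) -> Tuple[float, float, int]:
    """Summarize per-setting score differences (filtered - control).

    Returns the mean difference, its sample standard deviation, and the
    number of settings in which the filtered run scored higher.
    """
    if len(filtered) != len(control):
        raise ValueError("need one control score per filtered score")
    if len(filtered) < 2:
        raise ValueError("need at least two settings to estimate spread")
    diffs = [f - c for f, c in zip(filtered, control)]
    wins = sum(d > 0 for d in diffs)
    return mean(diffs), stdev(diffs), wins


# Placeholder scores for three hypothetical transfer settings.
avg_gain, spread, wins = paired_advantage([0.5, 0.75, 1.0], [0.25, 0.5, 0.75])
```

A mean difference that is small relative to its spread, or a win count near half the settings, would be the pattern the falsifier describes.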

Original abstract

Large Language Models (LLMs) are widely applied in real world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of +4~10% compared to baselines overall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TiTok, a framework for transplanting LoRA adapters across different LLM backbones. It computes a token-wise contrastive excess between a source model with and without LoRA to identify informative tokens, which are then used to selectively filter synthetic data for knowledge transfer to a target model. This is claimed to work without training additional models and yields average gains of +4-10% over baselines on three benchmarks across multiple transfer settings.

Significance. If the cross-architecture transfer holds, the method would offer a lightweight alternative to approaches like TransLoRA that require a separate discriminator, reducing overhead in PEFT adaptation. The token-level contrastive excess is presented as a derived quantity that highlights task-relevant information without extra parameters.

major comments (2)

[Abstract / Experiments] The central transfer claim—that source-model contrastive excess reliably identifies tokens carrying architecture-agnostic knowledge suitable for a different target backbone—rests on an untested assumption. No ablation recomputes the excess on the target model or reports correlation between source excess ranks and target fine-tuning gains on held-out tokens (see the description of experiments across multiple transfer settings).
[Abstract] Abstract states performance gains of +4~10% on three benchmarks but supplies no experimental details, baselines, number of runs, variance, or ablation results. This prevents verification of the central claim that the method is consistently effective.
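The rank-correlation check requested in the first major comment could be sketched as below. This is an illustration only: it assumes one excess value and one measured target-side gain per held-out token, and computes Spearman's rank correlation in plain Python. The helper names `average_ranks` and `spearman` are hypothetical, and how a per-token "gain" would be measured is left open, as it is in the report.

```python
from math import sqrt
from statistics import mean
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

A correlation near zero between source excess and target gains would undercut the architecture-agnostic reading; a clearly positive one would support it.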

minor comments (2)

[Abstract] The abstract refers to 'three benchmarks' without naming them or describing the transfer settings (e.g., which source and target model pairs are used).
[Method] Notation for the contrastive excess is introduced without an explicit equation or algorithmic pseudocode in the provided text, making the precise definition of the 'excess' quantity unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our claims and experimental evidence.

Referee: [Abstract / Experiments] The central transfer claim—that source-model contrastive excess reliably identifies tokens carrying architecture-agnostic knowledge suitable for a different target backbone—rests on an untested assumption. No ablation recomputes the excess on the target model or reports correlation between source excess ranks and target fine-tuning gains on held-out tokens (see the description of experiments across multiple transfer settings).

Authors: We acknowledge that an explicit ablation recomputing the contrastive excess on the target model and reporting rank correlations with target fine-tuning gains would provide more direct evidence for the architecture-agnostic property. While our experiments already demonstrate consistent gains across multiple source-to-target transfer settings, we did not include this specific analysis. In the revised manuscript we will add the requested ablation study using held-out tokens to address this point.
revision: yes

Referee: [Abstract] Abstract states performance gains of +4~10% on three benchmarks but supplies no experimental details, baselines, number of runs, variance, or ablation results. This prevents verification of the central claim that the method is consistently effective.

Authors: The abstract follows standard length constraints and therefore summarizes results at a high level. Detailed descriptions of benchmarks, baselines, number of runs, variance, and ablations are provided in the Experiments section. To improve verifiability we will revise the abstract to briefly reference the main baselines and the range of transfer settings while directing readers to the full experimental details in the body of the paper.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; TiTok defines contrastive excess directly from model outputs

The paper defines the token-wise contrastive excess as the difference in token-level outputs between the source model with and without LoRA, then uses this quantity to filter synthetic data. This construction is presented as a new derived signal rather than an algebraic reduction of previously fitted parameters or a self-citation chain. No equations or steps in the provided description reduce the central claim to its inputs by definition. The transfer effectiveness is supported by experiments on three benchmarks rather than forced by construction. This is the common case of an independent empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that token-wise differences induced by LoRA encode transferable task knowledge that can be isolated without additional learned components.

axioms (1)

domain assumption: Token-wise contrastive excess between source model with and without LoRA isolates task-relevant information transferable across backbones.
This premise is invoked to justify selective filtering of synthetic data in the abstract.
