Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Longbing Cao; Xu Zhang; Yongmin Yoo

arxiv: 2510.05431 · v4 · pith:7BGAKTQ7new · submitted 2025-10-06 · 💻 cs.CL

Yongmin Yoo, Xu Zhang, Longbing Cao

Pith reviewed 2026-05-21 20:59 UTC · model grok-4.3

classification 💻 cs.CL

keywords: patent classification, knowledge distillation, large language models, trust indicators, self-filtered distillation, CPC taxonomy, USPTO patents

The pith

Self-Filtered Distillation treats LLM rationales as trust indicators to weight training data and raise patent classification reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Self-Filtered Distillation (SFD) to keep the logical errors and misalignments of large language models from being absorbed when their knowledge is distilled into smaller student models for patent classification. Rather than using generated rationales as fixed ground truth, SFD computes a trust score from three unsupervised signals and uses it to scale each training example's influence during learning. On the USPTO-2M collection of more than two million patents, this yields up to 38.7 percent relative improvement in Macro-F1 across four student architectures. The resulting trust scores also track expert human judgments (r = 0.685), which supports more transparent and auditable classification pipelines for large patent collections.

Core claim

Self-Filtered Distillation (SFD) reinterprets LLM-generated rationales as trust indicators rather than ground-truth supervision. It combines Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring into a single trust score that dynamically modulates the contribution of each training instance. On the USPTO-2M benchmark this yields up to 38.7 percent relative improvement in Macro-F1 across four student architectures, while the trust scores correlate with expert judgments at r = 0.685 and supply decomposable confidence semantics for auditable outcomes.

What carries the argument

Unified trust score formed from Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring that scales the weight of each training instance during distillation.
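As a minimal sketch of what "scaling the weight of each training instance" can look like, the Python below multiplies each example's loss by its trust score before summing, following the form L = Σ CTS(x) · Loss(fθ(x), y) quoted from the paper further down this page. Cross-entropy as the per-example loss, trust values in [0, 1], and the function names are illustrative assumptions, not the authors' code.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability the student assigns to the target class."""
    return -math.log(probs[label])


def trust_weighted_loss(student_probs, labels, trust):
    """Sum of per-example losses, each scaled by its trust score.

    Assumes each trust value lies in [0, 1]: a value of 0 drops the example
    and a value of 1 keeps its full contribution.
    """
    return sum(
        t * cross_entropy(p, y)
        for p, y, t in zip(student_probs, labels, trust)
    )


# Toy batch (made-up numbers): two examples over two classes.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 0]
# The second example has trust 0, so only the first contributes to the loss.
print(trust_weighted_loss(probs, labels, trust=[1.0, 0.0]))
```

Under this reading, a low-trust rationale is never discarded outright; it simply pulls less on the student's gradients than a high-trust one.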

If this is right

Student models avoid absorbing logical errors, label mismatches, and taxonomy misalignments from LLM rationales.
Classification outputs carry decomposable confidence scores that support auditing and self-documentation.
Performance gains appear across four student architectures on a benchmark exceeding two million patents.
Large-scale patent knowledge organization gains reliability without post-hoc error correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trust-scoring mechanism could be tested on other large-scale text classification tasks such as legal document routing or scientific literature tagging.
Low-trust instances could be routed to human reviewers instead of being down-weighted, creating hybrid human-AI pipelines.
Combining the unsupervised signals with occasional human feedback might further improve the reliability of the trust estimates over time.

Load-bearing premise

The three unsupervised signals serve as reliable proxies for rationale quality and do not introduce systematic biases that would degrade the student models.

What would settle it

A direct comparison on a held-out set of patents with human-annotated rationale quality: if the computed trust scores show no correlation with those annotations or if training only on low-trust examples improves rather than harms performance, the central claim would be falsified.
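The first half of that test reduces to a correlation between computed trust scores and human ratings of rationale quality. A minimal sketch follows, assuming Pearson's r is the statistic of interest (the page reports r = 0.685 against expert judgments). The sample values and variable names are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up held-out sample: computed trust scores vs. human quality ratings.
trust = [0.91, 0.42, 0.77, 0.15, 0.63]
human = [5, 2, 4, 1, 4]
print(round(pearson_r(trust, human), 3))
```

A value near zero on real annotations would count against the central claim; the second half of the test (training only on low-trust examples) needs a full training run and is not sketched here.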

Original abstract

Organizing large-scale patent corpora according to classification schemes is a core information management task that determines the accuracy and efficiency of prior art retrieval, technology knowledge discovery, and intellectual property decision-making. Recent approaches distill natural language rationales generated by large language models (LLMs) into compact student models, yet logical errors, label mismatches, and taxonomy misalignments inherent in these rationales are indiscriminately absorbed during training, undermining classification reliability and propagating errors throughout downstream information processes. Rather than correcting such errors post-hoc, we propose Self-Filtered Distillation (SFD), which embeds quality assurance directly into the learning process by reinterpreting LLM-generated rationales as trust indicators rather than ground-truth supervision. SFD integrates three unsupervised signals into a unified trust score that dynamically modulates each training instance's contribution: Self-Consistency, which quantifies agreement among independently generated rationales; Class Entailment Alignment, which evaluates semantic coherence between a rationale and its assigned CPC class definition; and LLM Agreement Scoring, which assesses external plausibility through an independent verifier. On the USPTO-2M benchmark comprising over two million patents, SFD achieves up to 38.7% relative improvement in Macro-F1 across four student architectures, and the strong correlation between trust scores and expert judgments (r = 0.685) confirms that the framework provides not only accurate predictions but also decomposable confidence semantics that enable auditable and self-documenting classification outcomes for large-scale patent knowledge organization.
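To make the first of the three signals concrete, here is one possible way to score "agreement among independently generated rationales". The abstract does not say how agreement is measured, so the mean pairwise Jaccard overlap of token sets used below is purely an assumption chosen for simplicity; an embedding-based or label-based agreement measure would fit the description equally well.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def self_consistency(rationales):
    """Mean pairwise Jaccard similarity over lowercased whitespace tokens.

    This token-overlap proxy is an assumption for illustration; the paper's
    actual agreement measure is not given on this page.
    """
    token_sets = [set(r.lower().split()) for r in rationales]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        raise ValueError("need at least two rationales")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rationales sampled independently for the same patent.
samples = [
    "claims describe a battery electrode coating",
    "the claims describe an electrode coating for a battery",
    "describes a wireless antenna array",
]
print(round(self_consistency(samples), 3))
```

The score lies in [0, 1] by construction, which makes it easy to combine with the other two signals.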

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SFD gets measurable gains on USPTO-2M by weighting distillation with three unsupervised signals, but those signals still need checks against actual rationale quality rather than surface proxies.

The main point is that this paper shows a practical way to filter noisy LLM rationales during distillation for patent classification, and the reported numbers on a two-million-example benchmark are worth noticing. The authors combine self-consistency, class entailment alignment, and an independent verifier into a single trust score that down-weights suspect training instances. That produces up to 38.7% relative Macro-F1 improvement across four student models and a 0.685 correlation between the trust scores and separate expert judgments. The scale of the USPTO-2M benchmark and the focus on CPC taxonomy alignment make the setup more grounded than many distillation papers that stay in generic text domains. Credit is due for keeping the trust signals unsupervised and for tying the evaluation to both performance and human correlation rather than accuracy alone.

The central claim holds up on the numbers given, though the paper would be stronger with more explicit formulas for the combined score and clearer baseline details. The soft spot is the lack of direct tests on whether the three signals actually track rationale quality or simply correlate with easier-to-classify patents, longer outputs, or frequent CPC terms. Without class-wise bias diagnostics or correlation against human error annotations on the rationales, it remains possible that the weighting favors subpopulations and inflates Macro-F1 without improving reliability across the board. That concern is real but not fatal given the independent expert correlation they do report.

This work is aimed at researchers and practitioners who need more auditable classification in technical domains like prior-art search or technology monitoring. A reader already working on distillation or domain-specific reliability would find the concrete integration and the large-scale results useful. The paper shows clear thinking about the problem and honest use of an external benchmark, so it deserves a serious referee even if the signal validation needs tightening. I would send it to peer review with a request for those robustness checks on the trust signals.

Referee Report

2 major / 2 minor

Summary. The paper proposes Self-Filtered Distillation (SFD) to improve reliability in distilling LLM-generated rationales into student models for patent classification. Instead of treating rationales as ground truth, SFD computes a unified trust score from three unsupervised signals (Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring) and uses this score to dynamically modulate the contribution of each training instance. On the USPTO-2M benchmark (>2M patents), the method reports relative Macro-F1 gains of up to 38.7% across four student architectures and a Pearson correlation of r=0.685 between trust scores and separate expert judgments.

Significance. If the trust signals function as unbiased proxies for rationale quality, the approach would provide a practical mechanism for embedding quality control into distillation pipelines for large-scale, taxonomy-driven classification tasks. The scale of the USPTO-2M evaluation and the reported correlation with expert judgments are notable strengths that could support more auditable patent knowledge organization systems.

major comments (2)

[Method] The exact mathematical definition of the unified trust score (how the three signals are aggregated, normalized, or weighted) is not specified. Because the central mechanism is dynamic instance weighting driven by this score, the absence of the formula prevents reproduction and makes it impossible to assess sensitivity to any single signal or to potential scale differences among them.
[Experiments] No diagnostic is reported that tests whether any of the three signals correlates with non-quality covariates such as rationale length, lexical overlap with CPC definitions, or class frequency. Such an analysis is load-bearing: if the signals primarily capture surface statistics, the observed Macro-F1 gains on USPTO-2M could result from systematic re-weighting of easier subpopulations rather than genuine filtering of logical or taxonomic errors in the rationales.

minor comments (2)

[Abstract] The abstract states 'up to 38.7% relative improvement' but does not indicate which of the four student architectures achieves this figure or report the corresponding absolute Macro-F1 values; adding these details would strengthen the results presentation.
[Method] The paper would benefit from an explicit statement of the prompts used for rationale generation and for the independent verifier, as well as any hyper-parameter choices for the trust-score computation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and have revised the manuscript to strengthen the presentation and add requested analyses.

Referee: [Method] The exact mathematical definition of the unified trust score (how the three signals are aggregated, normalized, or weighted) is not specified. Because the central mechanism is dynamic instance weighting driven by this score, the absence of the formula prevents reproduction and makes it impossible to assess sensitivity to any single signal or to potential scale differences among them.

Authors: We agree that the aggregation step requires an explicit mathematical formulation for reproducibility. In the revised manuscript we have added a new subsection (Section 3.4) that defines the unified trust score as a normalized convex combination T_i = (α·SC_i + β·CEA_i + γ·LAS_i) / (α + β + γ), where SC, CEA and LAS are min-max normalized to [0,1], and the weights (α,β,γ) are selected via grid search on a held-out validation split. We also include the corresponding training objective with instance weighting and report a sensitivity study over alternative weightings in Appendix C.
revision: yes
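The aggregation described in this simulated response can be sketched in a few lines. The code below min-max normalizes each signal over the dataset and then forms T_i = (α·SC_i + β·CEA_i + γ·LAS_i) / (α + β + γ). The default weights, the handling of a constant signal, and the function names are assumptions for illustration; the grid search over (α, β, γ) is not shown.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant signal maps to all zeros (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def trust_scores(sc, cea, las, alpha=1.0, beta=1.0, gamma=1.0):
    """Normalized convex combination of three min-max-normalized signals.

    sc, cea, las: per-instance raw signal values, all the same length.
    alpha, beta, gamma: non-negative weights with a positive sum.
    """
    total = alpha + beta + gamma
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    sc_n, cea_n, las_n = min_max(sc), min_max(cea), min_max(las)
    return [
        (alpha * s + beta * c + gamma * l) / total
        for s, c, l in zip(sc_n, cea_n, las_n)
    ]


# Made-up raw signals for three training instances on different scales.
print(trust_scores(sc=[0.2, 0.9, 0.5], cea=[10, 40, 25], las=[1, 5, 3]))
```

With α = β = γ the expression reduces to the equal-weight average CTS(x) = 1/3 (SC(x) + CEA(x) + LAS(x)) quoted from the paper later on this page.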
Referee: [Experiments] No diagnostic is reported that tests whether any of the three signals correlates with non-quality covariates such as rationale length, lexical overlap with CPC definitions, or class frequency. Such an analysis is load-bearing: if the signals primarily capture surface statistics, the observed Macro-F1 gains on USPTO-2M could result from systematic re-weighting of easier subpopulations rather than genuine filtering of logical or taxonomic errors in the rationales.

Authors: We acknowledge the importance of ruling out confounding with surface statistics. In the revised version we have added a new diagnostic subsection (Section 5.3) that reports Pearson correlations between each trust signal and rationale length (r < 0.15), lexical overlap with CPC definitions (r < 0.10), and class frequency (r < 0.08). We further stratify Macro-F1 gains by class-frequency quartiles and show that relative improvements remain consistent (28–41%) across all bins, indicating that the gains are not explained by preferential weighting of easier subpopulations.
revision: yes
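A stratified check of the kind described here could be organized as follows. The sketch ranks classes by frequency, splits them into four bins, and reports the relative change in mean per-class F1 (Macro-F1 within the bin) between a baseline and a trust-weighted student. The binning rule, the dictionary inputs, and all numbers are hypothetical.

```python
def relative_gain_by_quartile(class_freq, f1_base, f1_new):
    """Relative Macro-F1 gain within each class-frequency quartile.

    class_freq, f1_base, f1_new: dicts keyed by class label.
    Returns {quartile_index: relative_gain}; quartile 0 holds the rarest classes.
    Assumes the baseline Macro-F1 in every bin is greater than zero.
    """
    ranked = sorted(class_freq, key=class_freq.get)
    n = len(ranked)
    bins = {}
    for rank, cls in enumerate(ranked):
        # Rank-based split into four nearly equal bins.
        bins.setdefault(rank * 4 // n, []).append(cls)
    gains = {}
    for q, classes in sorted(bins.items()):
        base = sum(f1_base[c] for c in classes) / len(classes)
        new = sum(f1_new[c] for c in classes) / len(classes)
        gains[q] = (new - base) / base
    return gains


# Made-up per-class numbers for four CPC-like classes.
freq = {"A": 10, "B": 20, "C": 30, "D": 40}
base = {"A": 0.5, "B": 0.5, "C": 0.5, "D": 0.5}
new = {"A": 0.6, "B": 0.7, "C": 0.5, "D": 1.0}
print(relative_gain_by_quartile(freq, base, new))
```

If the gains concentrated in the most frequent bin, that would support the referee's worry about re-weighting toward easier subpopulations; roughly even gains across bins would argue against it.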

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The trust score is assembled from three explicitly unsupervised signals (Self-Consistency, Class Entailment Alignment, LLM Agreement Scoring) whose definitions do not reference the target CPC labels or fitted parameters. Performance is measured on the independent USPTO-2M benchmark and the reported correlation (r = 0.685) is computed against separate expert judgments. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled via prior work. The central claim therefore rests on externally verifiable signals and an external benchmark rather than tautological re-labeling of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method rests on the premise that LLM rationales contain recoverable signal that can be isolated through internal consistency and taxonomy alignment without external labeled quality data.

axioms (2)

domain assumption: LLM-generated rationales contain useful but noisy information whose quality can be assessed via internal consistency and semantic alignment with class definitions.
Invoked when reinterpreting rationales as trust indicators rather than ground-truth labels.
domain assumption: An independent verifier LLM provides an external plausibility signal that correlates with actual rationale correctness.
Used to define the LLM Agreement Scoring component.

invented entities (1)

Unified trust score (no independent evidence)
purpose: Dynamically weights each training instance during distillation according to rationale quality.
Composite construct assembled from the three signals; no independent falsifiable prediction is stated beyond the reported expert correlation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

CTS(x) = 1/3 (SC(x) + CEA(x) + LAS(x)) ... L = Σ CTS(x) · Loss(fθ(x), y)

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

three unsupervised trust metrics ... Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning
cs.CL 2026-05 unverdicted novelty 7.0

PHAGE encodes patent claim hierarchies as heterogeneous graphs inside Transformers and outperforms baselines on classification, retrieval, and clustering by treating intra-patent topology as a stronger signal than int...