
arxiv: 2605.24297 · v2 · pith:NZTD54ARnew · submitted 2026-05-22 · 💻 cs.IR · cs.AI

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

Pith reviewed 2026-06-30 14:03 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords patent embeddings · fine-tuning recipes · information retrieval · text classification · clustering · cross-domain evaluation · embedding models

The pith

Optimal fine-tuning for patent embeddings depends on the target task and training landscape

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether one fine-tuning recipe can serve all uses of patent embeddings. It evaluates 22 models ranging from 22M to 12B parameters on retrieval, classification, and clustering using 113,148 WIPO assistive technology patents plus an external DAPFAM dataset. Cross-sectional alignment improves retrieval most while a combined signal recipe improves classification and clustering most. Fine-tuning on a single patent landscape degrades retrieval performance on other landscapes for five of eight model-recipe pairs. Title, abstract, and claims together form the best input view for all models.

Core claim

The optimal fine-tuning recipe depends on the downstream task: cross-sectional alignment (recipe R3) provides the largest improvements to retrieval performance (+7.1% nDCG@10), whereas a combined signal recipe (recipe R4) is better suited to classification (+7.1 F1) and clustering (+10.9 V-measure); single-landscape fine-tuning significantly degrades cross-domain retrieval for 5 of 8 model-recipe combinations on the DAPFAM corpus.

What carries the argument

Multi-task benchmark comparing four fine-tuning recipes (including cross-sectional alignment R3 and combined signal R4) across retrieval, classification, and clustering on two patent corpora

If this is right

  • Practitioners should select cross-sectional alignment fine-tuning when the goal is information retrieval from patents.
  • Practitioners should select combined signal fine-tuning when the goal is classification or clustering of patents.
  • Models fine-tuned on one patent landscape cannot be assumed to retain retrieval performance on other landscapes.
  • Title plus abstract plus claims is the preferred text view for patent embedding models.
  • Hybrid BM25-dense fusion does not close the 55-65% in-domain versus out-of-domain performance gap.
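
The hybrid BM25-dense fusion named in the last point can be pictured as a weighted interpolation of two score lists. The sketch below is a minimal illustration, not the paper's implementation: Figure 3 indicates only that a dense weight α is swept and that α = 1.0 corresponds to dense-only scoring. The min-max normalization, the treatment of missing documents as zero, and the names `min_max`, `hybrid_scores`, and `rank` are all assumptions made for this example.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant inputs map to 0.0.

    Assumption: min-max normalization is used so that BM25 and cosine
    scores are comparable. The paper may normalize differently.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: ((s - lo) / span if span else 0.0) for doc, s in scores.items()}


def hybrid_scores(dense, bm25, alpha):
    """Return alpha * dense + (1 - alpha) * bm25 on normalized scores.

    alpha = 1.0 is dense-only and alpha = 0.0 is BM25-only. Documents
    missing from one list contribute 0.0 for that component (assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d, b = min_max(dense), min_max(bm25)
    return {
        doc: alpha * d.get(doc, 0.0) + (1.0 - alpha) * b.get(doc, 0.0)
        for doc in set(d) | set(b)
    }


def rank(scores):
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Sweeping `alpha` over a grid and scoring each resulting ranking would reproduce the shape of the experiment in Figure 3, under the assumptions stated above.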

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Patent search systems may need separate embedding models for retrieval versus classification or clustering uses.
  • Training data drawn from multiple patent landscapes could reduce the observed cross-domain degradation.
  • The consistent within-family scaling but erratic cross-family scaling suggests that parameter scaling on patent data warrants further testing.

Load-bearing premise

The WIPO assistive technology patents and DAPFAM dataset together with the chosen tasks and metrics are representative of patent embedding behavior in general.

What would settle it

Finding that a single fine-tuning recipe ranks first on retrieval, classification, and clustering across multiple independent patent datasets would falsify the claim that the optimal recipe depends on the task.

Figures

Figures reproduced from arXiv: 2605.24297 by Amirhossein Yousefiramandi, Ciaran Cooney.

Figure 1: Scaling trend: model size vs task performance. Each panel shows parameter count against nDCG@10
Figure 2: Domain generalization gap: IN-domain vs OUT-of-domain nDCG@10 on the TAC view. Percentages indicate relative degradation.
  …icant (p < 0.001), including the modest +0.0021 improvement for Llama-Nemotron-8B. Furthermore, after fusion, Octen-8B+BM25 and Qwen3-8B+BM25 become statistically indistinguishable (p = 0.43), despite their dense-only scores differing significantly, illustrating how BM25 interpolation…
Figure 3: Hybrid BM25-dense interpolation: nDCG@10 as a function of the dense weight α on the TAC view. Stars mark dense-only scores (α = 1.0)
Figure 4: Section ablation heatmaps: nDCG@10 for each query-section …
Figure 5: Query section comparison on the TAC corpus
Figure 6: Fine-tuning retrieval lift: nDCG@10 improvement over zero-shot (R0) for R3 (multi-view) and R4 (combined) recipes across four models. Error bars (PATEMBED-BASE, QWEN3-EMBEDDING-0.6B only) are ±1 std across seeds {42, 7, 13} from the rerun campaign; BGE-M3 and EMBEDDINGGEMMA-300M bars are single-seed point estimates (§7).
Figure 7: Fine-tuning impact: percentage change from …
Figure 10: Model ranking across all tasks: average rank over retrieval (nDCG@10), classification (F1), and clustering (V-measure). Lower rank indicates better performance.
  4.11 Embedding Dimension Truncation. To assess storage and inference efficiency trade-offs, we evaluate five representative models at reduced embedding dimensions by truncating and L2-renormalizing their full-dimension embeddings, the Matryoshka …
Figure 11: Embedding dimension truncation: (left) absolute nDCG@10 at each dimension; (right) percentage of …
Figure 12: DAPFAM external validation: nDCG@10 for R0 (zero-shot) vs R3 (multi-view) and R4 (combined) fine-tuned recipes across four base models and text views.
  5 Discussion. Scale Helps Within Families, but Task Rankings Diverge. Within the Qwen3 and Llama-Nemotron families, scale predicts performance monotonically; cross-family the relationship is noisier (KaLM-Gemma3-12B ranks 8th on TAC …
Figure 13: Per-label F1 scores for the WIPO Conven…
Figure 14: UMAP embedding space visualization (test split, TAC view) for four representative models. Colors …
Figure 15: DWPI advantage (nDCG@10) by query section and model. Values show the difference between DWPI-Full and the corresponding non-DWPI corpus view
Figure 16: DWPI advantage for classification: macro …
Figure 17: DWPI advantage for clustering: V-measure …
Figure 18: profiles six representative models across retrieval, classification, clustering, domain robustness, and recall; …
Figure 19: Spearman rank correlation across tasks and …
Figure 21: Query difficulty distribution for Llama…
Original abstract

Two questions regarding practitioners' use of patent embeddings arise: (i) Does one fine-tuning recipe suffice for all downstream applications? (ii) Is fine-tuning on one patent landscape sufficient for downstream application on other landscapes? By evaluating 22 pre-trained embedding models (ranging from 22M to 12B parameters) on three tasks -- information retrieval, classification, and clustering -- on 113,148 WIPO patents for assistive technology (46,069 citation queries) and on an external DAPFAM dataset, we find that two results cast doubt on the prevailing wisdom. (i) The optimal fine-tuning recipe depends on the downstream task: cross-sectional alignment (recipe R3) provides the largest improvements to retrieval performance (+7.1% nDCG@10), whereas a combined signal recipe (recipe R4) is better suited to classification (+7.1 F1) and clustering (+10.9 V-measure); a matched data control confirms that differences in training dataset size are not a contributing factor. (ii) Single-landscape fine-tuning hampers cross-landscape information retrieval: fine-tuning on one landscape significantly degrades cross-domain retrieval for 5 of 8 model-recipe combinations on the DAPFAM corpus, with the stronger zero-shot models suffering most. While within-family scaling is consistent (Qwen3 0.6B->4B->8B; Llama-Nemotron 1B->8B), cross-family scaling is erratic; the 12B KaLM-Gemma3 is ranked 8th on TAC retrieval performance, following prefix modification. Title+Abstract+Claims is the ubiquitous best text view, and all models suffer from a 55-65% gap between IN and OUT-of-domain performance which cannot be mitigated by hybrid BM25-dense fusion. Code and evaluation framework are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates 22 pre-trained embedding models (22M to 12B parameters) on retrieval, classification, and clustering using 113,148 WIPO assistive technology patents (46,069 citation queries) and the external DAPFAM dataset. It reports that the optimal fine-tuning recipe is task-dependent: cross-sectional alignment (R3) yields the largest retrieval gains (+7.1% nDCG@10), while the combined-signal recipe (R4) is best for classification (+7.1 F1) and clustering (+10.9 V-measure); a matched-data control rules out training-set size as a confound. Single-landscape fine-tuning degrades cross-domain retrieval for 5 of 8 model-recipe pairs on DAPFAM, with stronger zero-shot models affected most. Title+Abstract+Claims is the best text view; within-family scaling is consistent but cross-family scaling is erratic; a 55-65% in/out-domain gap persists despite BM25-dense fusion. Code and framework are public.

Significance. If the empirical results hold, the work supplies actionable, task-specific guidance for patent embedding fine-tuning and cautions against single-landscape training, challenging prevailing assumptions in patent IR. The matched-data control and public code release are clear strengths that aid reproducibility and verification. The findings could shape practitioner choices and future benchmarking. The narrow technological focus of both corpora, however, constrains how far the task-dependence and cross-landscape degradation claims can be generalized.

major comments (2)
  1. [Abstract] The reported improvements (+7.1% nDCG@10, +7.1 F1, +10.9 V-measure) are presented without statistical significance tests, confidence intervals, or variance estimates across the 46k queries. This information is required to substantiate that recipe optimality truly varies by task rather than reflecting sampling variation.
  2. [Abstract] The claim that single-landscape fine-tuning significantly degrades cross-landscape retrieval (and thus casts doubt on prevailing wisdom) rests on the WIPO assistive-technology corpus and DAPFAM. Both cover narrow domains with potentially atypical citation graphs and IPC distributions; without replication on additional domains (e.g., chemistry or software patents), the general warning is not yet load-bearing for broad conclusions.
minor comments (1)
  1. The abstract mentions 'prefix modification' for the 12B KaLM-Gemma3 model without defining the modification or quantifying its effect on the reported ranking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. Where revisions are needed to strengthen the manuscript, we indicate our plans explicitly.

Point-by-point responses
  1. Referee: [Abstract] The reported improvements (+7.1% nDCG@10, +7.1 F1, +10.9 V-measure) are presented without statistical significance tests, confidence intervals, or variance estimates across the 46k queries. This information is required to substantiate that recipe optimality truly varies by task rather than reflecting sampling variation.

    Authors: We agree that statistical significance testing is necessary to substantiate the reported gains and the claim of task-dependent optimality. In the revised version we will add bootstrap confidence intervals (1,000 resamples) and paired t-tests (or Wilcoxon signed-rank where normality assumptions fail) for all headline deltas in both the abstract and the results tables. These will be computed over the 46,069 citation queries and reported alongside the point estimates. revision: yes
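
    A bootstrap interval of the kind promised here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses a paired percentile bootstrap over per-query score differences, and the function name `paired_bootstrap_ci`, its parameters, and the fixed seed are hypothetical choices for this example.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap CI for the mean per-query difference (A - B).

    scores_a and scores_b hold one metric value per query (for example
    nDCG@10), aligned so that index i refers to the same query in both.
    Returns (mean_delta, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample queries with replacement and record the mean delta each time.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    ci_low = means[int(tail * (n_resamples - 1))]
    ci_high = means[int((1.0 - tail) * (n_resamples - 1))]
    return sum(deltas) / n, ci_low, ci_high
```

    If the interval returned for a fine-tuned recipe against its zero-shot baseline excludes zero, the headline delta is unlikely to be an artifact of query sampling alone.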

  2. Referee: [Abstract] The claim that single-landscape fine-tuning significantly degrades cross-landscape retrieval (and thus casts doubt on prevailing wisdom) rests on the WIPO assistive-technology corpus and DAPFAM. Both cover narrow domains with potentially atypical citation graphs and IPC distributions; without replication on additional domains (e.g., chemistry or software patents), the general warning is not yet load-bearing for broad conclusions.

    Authors: We acknowledge the narrow technological scope of the two corpora and agree that broader replication would strengthen the cross-landscape degradation claim. The current evidence rests on (a) consistent degradation across 5 of 8 model-recipe pairs on the external DAPFAM set and (b) the 55-65% in/out-domain gap that persists even under hybrid fusion. In revision we will explicitly qualify the generalizability statement in the abstract and discussion, framing the result as a cautionary finding for the assistive-technology and related domains rather than a universal claim. We also note that the matched-data control and the public code release already allow other researchers to test the same recipes on additional patent corpora. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking on external held-out datasets


The paper reports direct performance measurements of 22 models on retrieval (nDCG@10), classification (F1), and clustering (V-measure) tasks using the WIPO assistive-technology patent corpus (113,148 patents, 46,069 queries) and the external DAPFAM dataset. Task-dependent recipe optimality and cross-landscape degradation are stated as observed outcomes of these evaluations, with an explicit matched-data control for training-set size. No equations, predictions, or central claims are shown to reduce by construction to fitted parameters, self-definitions, or self-citation chains. The derivation chain consists entirely of standard empirical benchmarking steps against independent external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard IR and clustering metrics plus pre-existing datasets and models; no new free parameters, axioms beyond domain conventions, or invented entities are introduced.

axioms (1)
  • Domain assumption: nDCG@10, F1, and V-measure are appropriate and sufficient metrics for the retrieval, classification, and clustering tasks on patent data.
    Standard practice in the information retrieval and machine learning communities for these task types.
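
For readers less familiar with the retrieval metric in this ledger, nDCG@10 can be computed as in the sketch below. It assumes linear gains and a log2 rank discount, which is one common formulation; the page does not state which variant the paper uses, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank 1 first)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the system returned them.
    relevance: {doc_id: gain} for judged-relevant documents, for example
    1.0 for each cited patent (binary citation relevance is an assumption).
    """
    gains = [relevance.get(doc, 0.0) for doc in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

Averaging `ndcg_at_k` over all queries gives the corpus-level score that the reported deltas are differences of.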

pith-pipeline@v0.9.1-grok · 5880 in / 1333 out tokens · 54240 ms · 2026-06-30T14:03:12.293883+00:00 · methodology

