Citation-Driven Multi-View Training for Patent Embeddings: QaECTER and Sophia-Bench

Younes Djemmal; You Zuo (ALMAnaCH); Kim Gerdes (LISN, Qatent); Kirian Guiller

arxiv: 2604.22897 · v1 · submitted 2026-04-24 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-08 10:16 UTC · model grok-4.3


keywords patent retrieval · text embeddings · citation graphs · retrieval benchmark · multi-view training · NDCG evaluation · Sophia-Bench · QaECTER


The pith

A compact 344M-parameter patent embedding model trained on citation graphs outperforms a 23x larger model and all prior patent models on retrieval tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Patent retrieval has lacked benchmarks that capture the variety of real queries, domains, and jurisdictions, limiting progress in systems that support innovation and IP decisions. The paper introduces Sophia-Bench, a large-scale dataset of 10,000 queries and 75,000 documents across 12 query types, eight technology sections, and twelve jurisdictions, with relevance judged by citations plus a new InScope domain-relevance metric. It also presents QaECTER, a model trained through citation-driven multi-view self-alignment on patent citation graphs. On Sophia-Bench, this model sets new state-of-the-art results, beating the top-ranked model on the English retrieval text embedding benchmark (RTEB), which is 23x larger. The abstract adds that the results hold on an independent external benchmark without task-specific prompts.

Core claim

QaECTER establishes a new state of the art for patent retrieval. It outperforms the #1 model on the English retrieval text embedding benchmark (RTEB), a model 23x larger, as well as all existing patent-specific models across every query type, IPC section, and jurisdiction on Sophia-bench, with gains of up to 7.2% average NDCG@10 over the next-best model. These results hold on an independent external benchmark without task-specific prompts.
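Since the headline number is an NDCG@10 gain, it may help to see what the metric rewards. The following is a minimal sketch, assuming binary relevance (a document counts as relevant if it is cited); the paper's exact gain values and tie handling are not given in this summary, and the document IDs are made up for illustration.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """NDCG@k with binary relevance: 1.0 if a document is cited, else 0.0 (assumption)."""
    gains = [1.0 if doc_id in relevant_doc_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every relevant document at the top.
    ideal = [1.0] * min(len(relevant_doc_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking: the two cited patents appear at ranks 2 and 4.
ranking = ["p7", "p2", "p9", "p4"]
cited = {"p2", "p4"}
score = ndcg_at_k(ranking, cited, k=10)
```

Because the discount is logarithmic in rank, moving a cited patent from rank 4 to rank 1 changes the score far more than moving it from rank 9 to rank 6, which is why the metric suits prior-art search where only the top of the list gets read.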

What carries the argument

Citation-driven multi-view self-alignment training on patent citation graphs, which creates aligned embeddings from multiple document views for retrieval.
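To make the idea concrete, here is a rough sketch of one way such alignment is commonly trained: an in-batch contrastive (InfoNCE) loss that pulls matched pairs together and pushes other pairs apart. This is a stand-in, not the paper's objective, which this summary does not specify; the pairing rule, the temperature value, and the function names are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    anchors[i] and positives[i] embed a matched pair, for example two views
    of the same patent or a patent and one it cites (an assumption here).
    Every positives[j] with j != i serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)

# Toy 2-d embeddings: the loss is small when pairs line up, large when swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
```

In a multi-view setup, the same loss could be applied to several view pairs per patent (abstract against claims, summary against full text), which is one plausible reading of "self-alignment".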

If this is right

Patent search systems can achieve higher accuracy using smaller models that run more efficiently at scale.
Retrieval quality can now be measured consistently across diverse query formats, IPC sections, and filing jurisdictions.
Embedding models for patents generalize to new benchmarks without requiring custom instruction prompts.
Large-scale patent search infrastructure becomes more practical to deploy with compact high-performing embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The citation-graph training approach may transfer to other domains rich in citation or reference data, such as academic literature or legal documents.
Wider use of Sophia-Bench could standardize evaluation practices and accelerate progress in specialized retrieval tasks.
The performance edge indicates that domain-specific signals from citations can sometimes outweigh sheer model scale in technical retrieval.

Load-bearing premise

Citation links, even when adjusted by the InScope metric, give an accurate, unbiased measure of relevance for every query type, technology domain, and jurisdiction.

What would settle it

A controlled study in which patent examiners or experts rate the relevance of retrieved documents and find that non-cited but semantically close patents are systematically preferred over citation-based ground truth for multiple query types would undermine the benchmark scores and model comparisons.

Figures

Figures reproduced from arXiv: 2604.22897 by Younes Djemmal, You Zuo (ALMAnaCH), Kim Gerdes (LISN, Qatent), Kirian Guiller.

**Figure 1.** Corpus design in Sophia-bench, comprising 10,000 patent queries and a 75,000-document corpus stratified across three dimensions: temporal coverage (2016–2025, with 1,000 queries per year), jurisdiction distribution (Chinese, English, Japanese, Korean, German, and other jurisdictions), and technology coverage (all eight IPC sections A–H).

**Figure 2.** Evaluation tasks in Sophia-bench: citation-based retrieval evaluates novelty and prior-art search using XY (cited by examiner) and A (cited by applicant) citations, while InScope tasks measure domain relevance at three IPC granularities from coarse Section-level to fine Subgroup-level classification.

**Figure 3.** The twelve query representations in Sophia-bench, comprising seven structured patent-document fields (abstract, claims, description, etc.) and five AI-generated textual summaries (claim summary, objective, advantages, abstract summary, and features). Each corpus document is represented by the concatenation of its title, abstract, claims, and description (tacd).

**Figure 4.** Recall@k curves on Sophia-bench for the tacd query type (overall). QaECTER (blue) leads at every cutoff from k = 1 to k = 1,000. QaECTER maintains the lead on both citation types across all general-purpose and patent embedding models. The results are consistent across all query types.

**Figure 5.** NDCG@10 by filing jurisdiction on Sophia-bench, averaged across all 12 query types. Jurisdictions are sorted by QaECTER performance (descending). Four jurisdictions with fewer than five queries are excluded. QaECTER exhibits lower performance variation across jurisdictions than competing models.

**Figure 6.** NDCG@10 by publication year on Sophia-bench for the tacd query type. All models show stable performance across publication years, with no significant temporal degradation.

**Figure 7.** NDCG@10 by IPC section on Sophia-bench, averaged across all 12 query types. Sections are sorted by QaECTER performance (descending). DAPFAM evaluates cross-view retrieval across six query–corpus configurations, combining title+abstract and title+abstract+claims as queries against three corpus representations. We report the average NDCG@100 across all six DAPFAM.ALL subtasks.
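Figure 2 describes InScope as a domain-relevance check at three IPC granularities, from Section down to Subgroup. The sketch below shows how IPC symbols nest and how an overlap test at a chosen level might look. The caption does not name the middle granularity, so subclass is assumed here, and the binary "shares at least one code" rule is also an assumption rather than the paper's scoring.

```python
import re

# IPC symbol layout: section letter, two-digit class, subclass letter,
# main group number, slash, subgroup number (for example "H04L 9/32").
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")

def ipc_levels(code):
    """Split an IPC symbol such as 'H04L 9/32' into hierarchical prefixes."""
    match = IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized IPC symbol: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return {
        "section": section,                       # e.g. "H"
        "subclass": f"{section}{cls}{subclass}",  # e.g. "H04L" (assumed middle level)
        "subgroup": f"{section}{cls}{subclass} {int(group)}/{subgroup}",
    }

def in_scope(query_codes, doc_codes, level):
    """True if query and document share at least one IPC prefix at `level`."""
    query_keys = {ipc_levels(code)[level] for code in query_codes}
    doc_keys = {ipc_levels(code)[level] for code in doc_codes}
    return bool(query_keys & doc_keys)

# Two patents in the same subclass (H04L) but different subgroups.
same_subclass = in_scope(["H04L 9/32"], ["H04L 12/28"], "subclass")
same_subgroup = in_scope(["H04L 9/32"], ["H04L 12/28"], "subgroup")
```

Under this reading, a retrieved document that was never cited can still count as in scope at the Section level while failing at the Subgroup level, which is what lets the metric grade near misses.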

Original abstract

Patent retrieval underpins critical decisions in innovation, examination, and IP strategy, yet progress has been hampered by the absence of benchmarks that reflect the diversity of real-world search scenarios. We address this gap with two contributions. First, we introduce Sophiabench, a large-scale patent retrieval benchmark comprising 10,000 queries and 75,000 corpus documents stratified across ten years, eight IPC technology sections, and twelve filing jurisdictions. Unlike prior benchmarks, Sophia-bench tests retrieval using 12 different query types, from structured patent fields to AI-generated summaries, and evaluates results against citation-based ground truth enhanced with a novel domain-relevance metric (InScope). Together, these enable systematic measurement of how well models perform across query types, technology domains, and jurisdictions. Second, we introduce QaECTER, a 344M-parameter embedding model trained on patent citation graphs and multi-view self-alignment. Despite its compact size, QaECTER establishes a new state of the art for patent retrieval. It outperforms the #1 model on the English retrieval text embedding benchmark (RTEB), a model 23x larger, as well as all existing patent-specific models across every query type, IPC section, and jurisdiction on Sophia-bench, with gains of up to 7.2% average NDCG@10 over the next-best model. These results are confirmed on an independent external benchmark, where QaECTER surpasses all prior models without requiring task-specific instruction prompts. Both the benchmark and the model are designed for practical deployment in large-scale patent search systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

QaECTER and Sophia-Bench add a stratified patent retrieval test set and a compact model with reported gains, but the shared citation signal between training and primary evaluation leaves the semantic improvements unclear.


The paper's core offering is Sophia-Bench, a 10k-query patent retrieval set stratified by year, IPC section, jurisdiction, and 12 query types, plus the InScope metric layered on citation labels. QaECTER is a 344M-parameter model trained on citation graphs with multi-view self-alignment. On Sophia-Bench it reports beating prior patent models and the top-ranked RTEB model, which is 23x larger, with NDCG@10 lifts of up to 7.2 percent, and the abstract reports confirmation on an external set without task prompts. These elements are new, and the numbers are concrete enough to notice for anyone working on domain-specific embeddings or patent search systems. The benchmark construction itself looks like a step forward for measuring performance across realistic slices of the patent space.

The soft spot is the overlap between training signal and evaluation labels. Both rely heavily on citations, so the gains could come from better modeling of citation structure rather than broader semantic similarity. Patents are frequently cited for legal or examiner reasons that do not match exhaustive content overlap, especially across jurisdictions or in emerging areas. InScope is presented as an improvement, but the abstract does not show how independent it is from the citation graph or whether it was validated against other relevance signals. The external benchmark result helps, yet the detailed per-query, per-IPC, and per-jurisdiction claims rest on Sophia-Bench.

This work is aimed at patent IR researchers and teams building production search tools in the domain. The benchmark resource and the compact model size make it worth a serious referee's time, though the evaluation design will need scrutiny and likely some additional checks before the superiority claims can be taken at face value.

Referee Report

3 major / 2 minor

Summary. The paper introduces Sophia-bench, a large-scale patent retrieval benchmark with 10,000 queries and 75,000 corpus documents stratified across ten years, eight IPC sections, and twelve jurisdictions. Queries span 12 types (structured fields to AI-generated summaries) and relevance is defined via citation-based ground truth augmented by a novel InScope domain-relevance metric. The second contribution is QaECTER, a 344M-parameter embedding model trained on patent citation graphs plus multi-view self-alignment; the authors claim it sets a new SOTA for patent retrieval by outperforming the top RTEB model (23x larger) and all prior patent-specific models on every query type, IPC section, and jurisdiction in Sophia-bench (gains up to 7.2% average NDCG@10), with confirmation on an independent external benchmark.

Significance. If the evaluation is shown to be independent of citation signals and the gains are statistically robust, the work would be significant: it supplies a diverse, large-scale benchmark that better reflects real-world patent search variability than prior resources, and demonstrates that a compact citation-trained model can surpass much larger general-purpose embedders on patent retrieval. The practical orientation toward deployment in large-scale patent systems is a further strength.

major comments (3)

[Abstract / Sophia-bench description] Abstract and Sophia-bench section: training QaECTER explicitly on citation graphs while defining ground truth via citations augmented by InScope creates a circularity risk. The manuscript must demonstrate that InScope is not merely a re-expression of citation proximity (e.g., via correlation analysis, ablation removing InScope, or results on a citation-free semantic relevance subset); without this, the 7.2% NDCG@10 margins and cross-query/IPC/jurisdiction superiority may reflect improved citation prediction rather than semantic retrieval quality.
[Results / Experimental setup] Results and experimental details: the reported outperformance numbers lack error bars, statistical significance tests, or ablation studies isolating the contribution of multi-view self-alignment versus pure citation-graph training. This undermines confidence in the claim that QaECTER is superior across all 12 query types, eight IPC sections, and twelve jurisdictions.
[Abstract / Evaluation] Independent external benchmark paragraph: while the abstract states confirmation on an external benchmark without task-specific prompts, no quantitative results, dataset description, or comparison to Sophia-bench are supplied, leaving the primary Sophia-bench claims without sufficient external validation.
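The second major comment asks for error bars and significance tests. One common way to supply both is a paired bootstrap over per-query scores, sketched below under stated assumptions: the per-query NDCG@10 values are invented for illustration, and the function name and defaults are not taken from the paper.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (A minus B)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical per-query NDCG@10 for two models on the same eight queries.
model_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
model_b = [0.58, 0.50, 0.69, 0.47, 0.60, 0.55, 0.70, 0.49]
mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
```

An interval that excludes zero would support a claim of superiority on that slice; with slices as small as some jurisdictions in Figure 5, the interval is likely to be wide, which is the referee's underlying concern.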

minor comments (2)

[Abstract] Notation consistency: 'Sophiabench' and 'Sophia-bench' appear interchangeably; standardize throughout.
[Model description] The manuscript should clarify the exact parameter count and training data scale for QaECTER relative to the 23x larger RTEB model to make the size-efficiency claim fully transparent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested clarifications, analyses, and details.


Referee: [Abstract / Sophia-bench description] Abstract and Sophia-bench section: training QaECTER explicitly on citation graphs while defining ground truth via citations augmented by InScope creates a circularity risk. The manuscript must demonstrate that InScope is not merely a re-expression of citation proximity (e.g., via correlation analysis, ablation removing InScope, or results on a citation-free semantic relevance subset); without this, the 7.2% NDCG@10 margins and cross-query/IPC/jurisdiction superiority may reflect improved citation prediction rather than semantic retrieval quality.

Authors: We appreciate the referee highlighting this important methodological concern. QaECTER is trained using patent citation graphs to capture co-citation and contextual signals for embedding learning, while Sophia-bench ground truth starts from citation links but augments them with InScope, a domain-relevance metric based on IPC section overlap, technological keyword similarity, and jurisdictional factors that are computed independently of the citation graph. This design is intended to evaluate broader semantic retrieval rather than pure citation prediction. To directly address the circularity risk, we have added to the revised manuscript: (1) a correlation analysis between InScope scores and citation-proximity measures (e.g., shared citations and co-citation strength), (2) an ablation that removes InScope and uses citation-only ground truth, and (3) results on a citation-free semantic relevance subset. These additions appear in the updated Sophia-bench description and results sections.

Revision: yes

Referee: [Results / Experimental setup] Results and experimental details: the reported outperformance numbers lack error bars, statistical significance tests, or ablation studies isolating the contribution of multi-view self-alignment versus pure citation-graph training. This undermines confidence in the claim that QaECTER is superior across all 12 query types, eight IPC sections, and twelve jurisdictions.

Authors: We agree that the absence of error bars, significance testing, and targeted ablations weakens the claims. In the revised manuscript we have added bootstrap-derived error bars to all NDCG@10 results, paired statistical significance tests (t-tests) across every query type, IPC section, and jurisdiction, and a dedicated ablation study that isolates the multi-view self-alignment component from the base citation-graph training objective. The ablation demonstrates the incremental contribution of the self-alignment loss. These updates are now included in the results section and the experimental details appendix.

Revision: yes

Referee: [Abstract / Evaluation] Independent external benchmark paragraph: while the abstract states confirmation on an external benchmark without task-specific prompts, no quantitative results, dataset description, or comparison to Sophia-bench are supplied, leaving the primary Sophia-bench claims without sufficient external validation.

Authors: We apologize for the insufficient detail on the external benchmark in the original submission. We have revised the manuscript to include a full description of the external benchmark (a USPTO-derived patent retrieval collection using citation-based relevance judgments and no InScope augmentation), a new results table with quantitative NDCG@10 scores for QaECTER and all baselines, and a direct comparison of relative gains versus Sophia-bench. The external results continue to show QaECTER outperforming prior models without task-specific prompts, providing the requested independent validation.

Revision: yes
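The correlation analysis mentioned in the first response could be as simple as a rank correlation between InScope scores and a citation-proximity measure over the same query-document pairs. The following sketch implements Spearman's rho in plain Python; the choice of Spearman, the variable names, and the toy values are assumptions for illustration, not details from the manuscript.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for five query-document pairs.
inscope_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
co_citation_counts = [3, 0, 1, 2, 0]
rho = spearman(inscope_scores, co_citation_counts)
```

A rho close to 1 would suggest that InScope largely restates citation proximity, which is the circularity the referee worries about; a low rho would support the claim that it adds an independent signal.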

Circularity Check

0 steps flagged

No significant circularity: empirical results on held-out benchmark data with independent external confirmation


The paper trains QaECTER on citation graphs using multi-view self-alignment and evaluates retrieval performance on Sophia-bench using citation-augmented ground truth. This follows standard supervised learning practice with proxy relevance labels and held-out test queries and documents; performance is not forced by construction but is an empirical outcome that could have failed to generalize. The central SOTA claim is additionally supported by results on an independent external benchmark, obtained without task-specific prompts, and the strongest baseline on Sophia-bench is the top-ranked RTEB model, which is 23x larger. No equation, definition, or self-citation reduces the reported gains to the training inputs by tautology. The InScope metric is presented as an enhancement but does not alter the non-circular empirical nature of the evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the premise that patent citations reliably indicate relevance for retrieval; this domain assumption drives both model training and benchmark construction. No explicit free parameters are named in the abstract, though training hyperparameters are implicitly present. The InScope metric and the QaECTER model itself are introduced as new elements without external validation.

axioms (1)

domain assumption: Patent citations serve as reliable indicators of relevance for retrieval tasks.
This premise underpins both the citation-driven training of QaECTER and the citation-based ground truth in Sophia-Bench.

invented entities (2)

InScope metric (no independent evidence)
purpose: Novel domain-relevance metric to enhance citation-based ground truth
Introduced as a new component but no external validation or comparison to existing relevance measures is provided.

QaECTER embedding model (no independent evidence)
purpose: Compact model achieving SOTA patent retrieval via citation-driven multi-view training
The model is the primary proposed artifact whose performance is the central claim.



