Patent Representation Learning via Self-supervision

Benoît Sagot (ALMAnaCH); Eric Villemonte de La Clergerie (ALMAnaCH); Kim Gerdes (LISN); You Zuo (ALMAnaCH)

arxiv: 2511.10657 · v2 · pith:FASYYNGVnew · submitted 2025-11-03 · cs.CL · cs.AI · cs.LG

You Zuo (ALMAnaCH), Kim Gerdes (LISN), Eric Villemonte de La Clergerie (ALMAnaCH), Benoît Sagot (ALMAnaCH)

classification cs.CL · cs.AI · cs.LG

keywords patent · retrieval · same · dropout · learning · positives · title--abstract · view

We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title--abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout--section positives, a patent-specific view construction strategy in which the anchor is the title--abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, citations, or relevance annotations. We evaluate on graded EPO search-report retrieval, DAPFAM, a recently proposed family-level patent retrieval benchmark, and IPC subclass classification. Section-based positives improve over calibrated dropout-only and generic title--abstract augmentation baselines, are competitive with citation-informed patent encoders and a general-purpose embedding model, and perform strongly on the out-of-domain split of DAPFAM. Additional cross-section alignment diagnostics show that section-pair training improves compatibility among abstracts, claims, and descriptions of the same invention. These results indicate that patent sections provide effective self-supervised positive views for learning dense patent representations.
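The view construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mixing probability `p_section`, the dictionary keys for patent sections, the default temperature, and the use of an in-batch InfoNCE loss are all assumptions made for illustration, since the abstract only says "contrastive objectives" and names temperature and dropout rate as the tuned hyperparameters.

```python
import math
import random

# Section names taken from the abstract; the dictionary keys are hypothetical.
SECTIONS = ["claims", "summary", "background", "drawings", "description"]


def sample_positive_view(patent, p_section=0.5, rng=random):
    """Pick the positive view for a title--abstract anchor.

    With probability p_section (an assumed hyperparameter), return another
    non-empty section of the same patent. Otherwise return the anchor text
    itself, which would then be re-encoded under a fresh dropout mask.
    """
    available = [name for name in SECTIONS if patent.get(name)]
    if available and rng.random() < p_section:
        name = rng.choice(available)
        return name, patent[name]
    return "title_abstract", patent["title_abstract"]


def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE over cosine similarities (assumed loss form).

    positives[i] is the positive for anchors[i]; all other rows of
    `positives` act as negatives for that anchor.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(a)
```

In this sketch, setting `p_section=0.0` recovers the dropout-only baseline and `p_section=1.0` uses section positives whenever another section is present. The embeddings passed to `info_nce` would come from an encoder that is not shown here.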

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning
cs.CL 2026-05 unverdicted novelty 7.0

PHAGE encodes patent claim hierarchies as heterogeneous graphs inside Transformers and outperforms baselines on classification, retrieval, and clustering by treating intra-patent topology as a stronger signal than int...

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning
cs.CL 2026-05 unverdicted novelty 6.0

PHAGE improves patent classification, retrieval, and clustering by modeling heterogeneous claim dependencies with a typed graph, connectivity mask, and dual-granularity contrastive learning.

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering
cs.IR 2026-05 unverdicted novelty 5.0

Multi-task evaluation of 22 patent embedding models finds task-specific fine-tuning benefits and significant cross-landscape retrieval degradation that cannot be fixed by hybrid fusion.