DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

Denis Cavallucci (ICube); Hicham Chibane (ICube); Iliass Ayaou (ICube)

arxiv: 2506.22141 · v2 · pith:EBO4TVBBnew · submitted 2025-06-27 · cs.CL · cs.IR

keywords: retrieval, dapfam, patent, dataset, domain, out-domain, benchmark, bm25

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families, aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document-level and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain performance across all configurations. Passage-level retrieval consistently outperforms document-level retrieval, and dense methods provide modest gains over BM25, but none closes the OUT-domain gap. Document-level RRF yields strong effectiveness-efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
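
The hybrid fusion step mentioned in the abstract, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. It assumes the standard RRF formulation, in which each document receives the sum of 1 / (k + rank) over the input rankings. The constant k = 60 is the value commonly used in the RRF literature, not a setting confirmed by the abstract, and the function name and family identifiers below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs with standard RRF.

    rankings: list of ranked lists, best result first.
    k: smoothing constant (60 is a common default; assumed here,
       not taken from the DAPFAM paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical family IDs returned by a lexical run and a dense run.
bm25_run = ["F1", "F2", "F3"]
dense_run = ["F3", "F1", "F4"]

fused = reciprocal_rank_fusion([bm25_run, dense_run])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization between the lexical and dense backends, which is one reason it adds little overhead on top of the two underlying runs.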

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline
cs.AI 2026-04 unverdicted novelty 8.0 partial

A Lean 4 system models patent claims as DAGs with match scores in a verified complete lattice and supplies kernel-checked certificates for coverage calculations and five IP use cases, conditional on unverified ML inputs.

Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
cs.CL 2026-05 unverdicted novelty 7.0

Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...

Citation-Driven Multi-View Training for Patent Embeddings: QaECTER and Sophia-Bench
cs.IR 2026-04 unverdicted novelty 7.0

QaECTER sets new state-of-the-art patent retrieval performance on the new Sophia-Bench benchmark and an external test, outperforming a 23x larger general model and all prior patent-specific models using citation-drive...