
arXiv: 2604.19219 · v2 · pith:TKADPU2Knew · submitted 2026-04-21 · 💻 cs.CR · cs.AI · cs.DC · cs.LG

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.DC · cs.LG
keywords privacy-preserving entity alignment · private set union · vertical federated learning · multi-party computation · noisy matching · intersection privacy · entity resolution

The pith

A multi-party private set union protocol aligns entities for vertical federated learning while hiding which records are shared.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a protocol that lets multiple organizations align their datasets on a common index without revealing which samples they have in common. It generalizes earlier two-party methods to handle more participants and adds support for both exact matches and matches tolerant to typos or formatting differences in identifiers. The approach uses private set union instead of intersection to avoid exposing sensitive relationships between datasets. Proofs of correctness and privacy are given along with complexity analysis for communication and computation. This setup targets practical vertical federated learning tasks such as joint disease modeling or fraud detection across institutions.

Core claim

The Sherpa.ai multi-party PSU protocol for VFL provides privacy-preserving entity alignment by operating on the union of identifiers rather than their intersection, which conceals intersection membership. It offers an order-preserving variant for exact alignment and an unordered variant that tolerates typographical and formatting noise in identifiers. It comes with formal proofs of correctness and privacy and a universal index mapping from local records to a shared index space.

What carries the argument

The Sherpa.ai multi-party private set union protocol, which aligns records on the union of identifiers across parties while keeping intersection membership hidden and supporting both exact and approximate matching.
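
To make the idea of aligning on the union concrete, the following plaintext sketch builds a shared index over the union of identifiers and maps each party's local rows into it. It is an illustration only, under loud assumptions: it performs no cryptography, so unlike the protocol it would expose identifiers to whoever runs it, and the names `build_union_index` and `map_local_to_shared` are hypothetical rather than taken from the paper.

```python
from typing import Dict, List


def build_union_index(party_ids: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign one shared index to every distinct identifier across all parties."""
    union = sorted({ident for ids in party_ids.values() for ident in ids})
    return {ident: position for position, ident in enumerate(union)}


def map_local_to_shared(local_ids: List[str], shared: Dict[str, int]) -> List[int]:
    """Map each local row (by position) to its slot in the shared index space."""
    return [shared[ident] for ident in local_ids]


# Hypothetical toy data: three parties with partially overlapping identifiers.
parties = {
    "A": ["id3", "id1", "id7"],
    "B": ["id1", "id9"],
    "C": ["id7", "id3", "id4"],
}

# Plaintext stand-in for the alignment step (no privacy protection here).
shared_index = build_union_index(parties)
mappings = {name: map_local_to_shared(ids, shared_index) for name, ids in parties.items()}

print(len(shared_index))  # number of slots in the union index space
print(mappings)           # per-party map from local row to shared slot
```

The shared index space has one slot per identifier in the union. In the protocol the paper describes, a mapping of this kind would be derived from protected identifiers rather than from the raw values used here.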

If this is right

  • Vertical federated learning becomes feasible across multiple organizations without exposing shared sample relationships.
  • Alignment works for noisy real-world identifiers such as misspelled names or inconsistent address formats.
  • Communication scales to more than two parties with lower overhead than running pairwise protocols.
  • Formal privacy and correctness guarantees apply to both the exact and noisy-matching variants.
  • Applications include cross-institution healthcare modeling and collaborative fraud detection without central data sharing.
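
As a rough picture of what "noisy identifiers" means in practice, the sketch below canonicalizes formatting differences and then scores the remaining typos with a character-bigram Dice coefficient. This is generic record-linkage preprocessing, not the mechanism of the paper's unordered variant, which this page does not spell out; the function names and sample strings are assumptions made for illustration.

```python
import re
import unicodedata
from typing import Set


def normalize_identifier(raw: str) -> str:
    """Canonicalize case, accents, punctuation, and whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = without_accents.casefold()
    alnum_only = re.sub(r"[^a-z0-9]+", " ", lowered)
    return " ".join(alnum_only.split())


def bigrams(text: str) -> Set[str]:
    """Character bigrams of a string padded with boundary markers."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice_similarity(left: str, right: str) -> float:
    """Dice coefficient over character bigrams; 1.0 means identical bigram sets."""
    a, b = bigrams(left), bigrams(right)
    return 2 * len(a & b) / (len(a) + len(b))


# Formatting noise disappears after normalization.
print(normalize_identifier("  José  O'Neil-Smith ") == normalize_identifier("JOSE O NEIL SMITH"))

# A typo survives normalization but still scores as a near match.
print(dice_similarity("jonathan", "jonathon"))
print(dice_similarity("jonathan", "maria"))
```

Whether a similarity step like this can run alongside a private protocol without adding leakage is exactly the kind of question the privacy proof has to answer; the sketch says nothing about that.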

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations could adopt this alignment step before training joint models, reducing reliance on a trusted intermediary.
  • The unordered variant might extend to other approximate record-linkage settings beyond the paper's examples.
  • Integration with existing vertical federated learning frameworks would require only the index-mapping step described.
  • Testing on real multi-institutional datasets with controlled noise levels would quantify the practical privacy gain.

Load-bearing premise

The protocol can be realized securely under standard multi-party cryptographic assumptions and the unordered variant introduces no new leakage when identifiers contain noise.

What would settle it

An attack recovering intersection membership from protocol messages or outputs with success probability noticeably above random guessing.
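
One way to phrase this test, offered as a hedged formalization rather than the paper's own definition, is as a distinguishing advantage. The adversary $\mathcal{A}$, its protocol view $\mathsf{View}$, the candidate identifier $x$, and the intersection $I$ are notation introduced here for illustration.

```latex
\mathrm{Adv}(\mathcal{A}) \;=\;
\Bigl|\,
\Pr\bigl[\mathcal{A}(\mathsf{View}, x) = 1 \mid x \in I\bigr]
\;-\;
\Pr\bigl[\mathcal{A}(\mathsf{View}, x) = 1 \mid x \notin I\bigr]
\Bigr|
```

Under this reading, the privacy claim would be refuted by an efficient $\mathcal{A}$ whose advantage is non-negligible in the security parameter.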

Figures

Figures reproduced from arXiv: 2604.19219 by Daniel M. Jimenez-Gutierrez, Dario Pighin, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria.

Figure 1: Illustrative example of entity alignment in VFL: based on the ID, Parties A and B perform private …
Figure 2: Illustration of the PSI protocol. Only the common identifiers (IDs) between the two parties are …
Figure 3: Illustration of the PSU protocol. All unique IDs across parties form the union dataset used for …
Figure 4: Pipeline of the proposed PSU protocol for multi-party VFL.
Figure 5: Scheme describing the main steps of the first part of the Diffie-Hellman protocol employed for PSU.
Figure 6: Scheme describing the main steps of the second part of the Diffie-Hellman protocol employed for …
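
The captions for Figures 5 and 6 mention a Diffie-Hellman protocol employed for PSU, but this page does not reproduce its message flow. The sketch below shows only the commutative-exponentiation property that Diffie-Hellman-style set protocols typically rely on: blinding a hashed identifier with two secret exponents gives the same value in either order. Everything here is an assumption made for illustration, including the toy modulus `P`, the helper names, and the sample identifiers; it is neither secure nor the paper's construction.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Not suitable for real deployments.
P = 2**127 - 1


def hash_to_group(identifier: str) -> int:
    """Hash an identifier to a nonzero element modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent() -> int:
    """Sample a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
        if math.gcd(k, P - 1) == 1:
            return k


def blind(value: int, secret: int) -> int:
    """Raise a group element to a party's secret exponent."""
    return pow(value, secret, P)


secret_a, secret_b = random_exponent(), random_exponent()

# Party A blinds first and B second, then the other way round.
ab = blind(blind(hash_to_group("patient-001"), secret_a), secret_b)
ba = blind(blind(hash_to_group("patient-001"), secret_b), secret_a)
other = blind(blind(hash_to_group("patient-002"), secret_a), secret_b)

print(ab == ba)     # same identifier, blinded in either order
print(ab == other)  # a different identifier, same secrets
```

Because the doubly blinded values agree regardless of order, parties can compare identifiers without seeing each other's raw values. How the paper builds a union on top of this, while hiding which values matched, is what its two-part protocol in Figures 5 and 6 would specify.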
Original abstract

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces the Sherpa.ai multi-party private set union (PSU) protocol for privacy-preserving entity alignment (PPEA) in vertical federated learning (VFL). The protocol lets multiple parties align records on the union of identifiers without disclosing intersection membership. It supports exact matching through an order-preserving variant and noisy matching, tolerant to typographical discrepancies, through an unordered variant, and it generalizes prior two-party techniques with low communication overhead. The manuscript includes proofs of correctness and privacy, analyzes communication and exponentiation complexity, and formalizes a universal index mapping from local records to a shared index space. It also discusses applications in multi-institutional healthcare, bank-insurer risk modeling, and cross-domain fraud detection.

Significance. If the security definitions, proofs, and complexity bounds hold under standard assumptions such as the semi-honest model with common cryptographic primitives, the work provides a practical advance for multi-party VFL by solving the intersection-leakage problem of PSI while adding support for noisy identifiers. The low-overhead multi-party generalization and formal index mapping could enable scalable deployments in privacy-sensitive domains where existing two-party PSU methods fall short.

minor comments (3)
  1. Abstract: the claim of 'low communication overhead' is stated without a quantitative comparison to the two-party baselines generalized from; adding one sentence with asymptotic or concrete costs would improve context.
  2. Section on the unordered variant: the description of how typographical and formatting discrepancies are handled without creating new leakage channels could include a short worked example of identifier normalization to aid verification.
  3. Complexity analysis: the exponentiation count is reported but a small table juxtaposing the multi-party costs against the referenced two-party protocols would clarify the overhead scaling.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and significance assessment of our manuscript on the Sherpa.ai multi-party PSU protocol. We appreciate the recognition of the protocol's contributions to PPEA in VFL, including support for exact and noisy matching while hiding intersection membership. The report lists no major comments, so we provide no point-by-point responses below. We will incorporate the minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity in protocol construction

full rationale

The paper presents a cryptographic multi-party PSU protocol for PPEA that generalizes two-party methods, with explicit proofs of correctness/privacy, complexity analysis, and a formalized universal index mapping. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided claims or abstract. The derivation rests on standard semi-honest cryptographic assumptions and formal proofs rather than reducing to prior fitted results or self-referential inputs by construction. This is a self-contained construction paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The protocol description relies on standard cryptographic primitives for PSI/PSU and the assumption that noisy matching can be performed privately; no free parameters or new invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Standard cryptographic assumptions underlying private set union protocols hold for the multi-party case.
    The privacy and correctness proofs are stated to rest on these background primitives.

pith-pipeline@v0.9.0 · 5876 in / 1321 out tokens · 42797 ms · 2026-05-21T00:08:10.009434+00:00 · methodology


