Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers
Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3
The pith
A multi-party private set union protocol aligns entities for vertical federated learning while hiding which records are shared.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Sherpa.ai multi-party PSU protocol for VFL provides privacy-preserving entity alignment by operating on the union of identifiers rather than their intersection, which conceals which records the parties share. It offers an order-preserving variant for exact alignment and an unordered variant that tolerates typographical and formatting noise in identifiers. The paper supplies formal proofs of correctness and privacy, plus a universal index mapping from local records to a shared index space.
What carries the argument
The Sherpa.ai multi-party private set union protocol, which aligns records on the union of identifiers across parties while keeping intersection membership hidden and supporting both exact and approximate matching.
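To make "aligning on the union" concrete, here is a minimal plaintext sketch of a universal index mapping. It uses no cryptography and is not the paper's construction: the function name `universal_index`, the party names, and the sorted ordering are all assumptions chosen for illustration. A real protocol would never pool raw identifiers like this.

```python
from typing import Dict, List


def universal_index(parties: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Map each party's local identifiers to positions in one shared index space.

    Plaintext illustration only: identifiers are pooled in the clear here,
    which is exactly what a private protocol must avoid.
    """
    # The shared space covers the union, so a record held by one party still gets a slot.
    union = sorted(set().union(*parties.values()))
    shared = {identifier: position for position, identifier in enumerate(union)}
    return {
        name: {identifier: shared[identifier] for identifier in identifiers}
        for name, identifiers in parties.items()
    }


# Hypothetical parties and identifiers.
parties = {
    "hospital": ["u1", "u2"],
    "insurer": ["u2", "u3"],
    "bank": ["u3", "u4"],
}
mapping = universal_index(parties)
```

In this sketch every identifier in the union receives a position, so the size of the shared index does not by itself reveal how many records overlap. An intersection-based alignment would instead keep only identifiers held by all parties, which exposes that overlap directly.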
If this is right
- Vertical federated learning becomes feasible across multiple organizations without exposing shared sample relationships.
- Alignment works for noisy real-world identifiers such as misspelled names or inconsistent address formats.
- Communication scales to more than two parties with lower overhead than running pairwise protocols.
- Formal privacy and correctness guarantees apply to both the exact and noisy-matching variants.
- Applications include cross-institution healthcare modeling and collaborative fraud detection without central data sharing.
Where Pith is reading between the lines
- Organizations could adopt this alignment step before training joint models, reducing reliance on a trusted intermediary.
- The unordered variant might extend to other approximate record-linkage settings beyond the paper's examples.
- Integration with existing vertical federated learning frameworks would require only the index-mapping step described.
- Testing on real multi-institutional datasets with controlled noise levels would quantify the practical privacy gain.
Load-bearing premise
The protocol can be realized securely under standard multi-party cryptographic assumptions and the unordered variant introduces no new leakage when identifiers contain noise.
What would settle it
An attack recovering intersection membership from protocol messages or outputs with success probability noticeably above random guessing.
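One common way to make "noticeably above random guessing" precise is a membership-guessing game. The notation below (adversary $\mathcal{A}$, hidden bit $b$, guess $b'$, security parameter $\lambda$) is assumed here for illustration and is not taken from the paper.

```latex
\mathrm{Adv}(\mathcal{A}) \;=\; \left|\, \Pr[\, b' = b \,] - \tfrac{1}{2} \,\right|
```

Here $b$ is a uniformly random bit that decides whether a challenge record is placed in the intersection, and $b'$ is the adversary's guess after seeing the protocol messages and outputs. Under this reading, the privacy claim would be refuted by an efficient adversary whose advantage is non-negligible in $\lambda$.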
Original abstract
Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Sherpa.ai multi-party private set union (PSU) protocol for privacy-preserving entity alignment (PPEA) in vertical federated learning (VFL). The protocol lets multiple parties align records on the union of identifiers without disclosing intersection membership. It supports exact matching through an order-preserving variant and noisy matching, tolerant of typographical discrepancies, through an unordered variant, and it generalizes prior two-party techniques with low communication overhead. The manuscript includes proofs of correctness and privacy, analyzes communication and exponentiation complexity, and formalizes a universal index mapping from local records to a shared index space. Applications in multi-institutional healthcare, bank-insurer risk modeling, and cross-domain fraud detection are discussed.
Significance. If the security definitions, proofs, and complexity bounds hold under standard assumptions such as the semi-honest model with common cryptographic primitives, the work provides a practical advance for multi-party VFL by solving the intersection-leakage problem of PSI while adding support for noisy identifiers. The low-overhead multi-party generalization and formal index mapping could enable scalable deployments in privacy-sensitive domains where existing two-party PSU methods fall short.
minor comments (3)
- Abstract: the claim of 'low communication overhead' is stated without a quantitative comparison to the two-party baselines that the protocol generalizes; adding one sentence with asymptotic or concrete costs would improve context.
- Section on the unordered variant: the description of how typographical and formatting discrepancies are handled without creating new leakage channels could include a short worked example of identifier normalization to aid verification.
- Complexity analysis: the exponentiation count is reported but a small table juxtaposing the multi-party costs against the referenced two-party protocols would clarify the overhead scaling.
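The worked example of identifier normalization requested in the second comment might look like the sketch below. It is an assumption-laden illustration, not the paper's method: the helper names `normalize_identifier`, `char_bigrams`, and `dice_similarity` are hypothetical, and the normalization rules (accent stripping, case folding, dropping punctuation and whitespace) are choices made here. It also runs on plaintext, so it says nothing about leakage.

```python
import re
import unicodedata


def normalize_identifier(raw: str) -> str:
    """Strip accents, case, punctuation, and whitespace from an identifier."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "", without_accents.casefold())


def char_bigrams(text: str) -> set:
    """Return the set of character bigrams of the text, padded at both ends."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice_similarity(left: str, right: str) -> float:
    """Dice coefficient over bigram sets; 1.0 means identical bigram sets."""
    a, b = char_bigrams(left), char_bigrams(right)
    return 2 * len(a & b) / (len(a) + len(b))


# Formatting noise disappears after normalization.
same = normalize_identifier("  José  O'Neil-Smith ") == normalize_identifier("JOSE ONEIL SMITH")

# A one-letter typo survives normalization but still scores high.
typo_score = dice_similarity("jonathan", "jonathon")
```

Normalization handles formatting discrepancies deterministically, while the bigram score gives a graded signal for typos. Whether either step can be carried out under encryption without opening a new leakage channel is the question the comment raises, and this sketch does not answer it.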
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision for our manuscript on the Sherpa.ai multi-party PSU protocol. We appreciate the recognition of the protocol's contributions to PPEA in VFL, including support for exact and noisy matching while hiding intersection membership. No specific major comments were listed in the report, so we provide no point-by-point responses below. We will incorporate the minor suggestions during revision.
Circularity Check
No significant circularity in protocol construction
The paper presents a cryptographic multi-party PSU protocol for PPEA that generalizes two-party methods, with explicit proofs of correctness/privacy, complexity analysis, and a formalized universal index mapping. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided claims or abstract. The derivation rests on standard semi-honest cryptographic assumptions and formal proofs rather than reducing to prior fitted results or self-referential inputs by construction. This is a self-contained construction paper with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard cryptographic assumptions underlying private set union protocols hold for the multi-party case.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
We introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching... commutative encryption process based on the Diffie–Hellman key exchange principle
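The quoted passage mentions commutative encryption based on the Diffie-Hellman principle. A minimal sketch of that general idea, using modular exponentiation, is shown below. Everything here is an assumption for illustration: the modulus `P` is a toy Mersenne prime far too small for real use, `hash_to_group` is a naive mapping, and the paper's actual group, hashing, and message flow are not specified in this page.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Not secure; illustration only.
P = 2**127 - 1


def hash_to_group(identifier: str) -> int:
    """Map an identifier to an integer in [1, P - 1] via SHA-256."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def keygen() -> int:
    """Pick a secret exponent coprime to P - 1, so exponentiation is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def encrypt(value: int, key: int) -> int:
    """Exponentiation cipher: (x**a)**b == (x**b)**a mod P, so layers commute."""
    return pow(value, key, P)


key_a, key_b = keygen(), keygen()
hashed = hash_to_group("alice@example.com")  # hypothetical identifier

a_then_b = encrypt(encrypt(hashed, key_a), key_b)
b_then_a = encrypt(encrypt(hashed, key_b), key_a)
```

Because the two layers commute, parties that each apply their own secret exponent end up with the same doubly encrypted value for the same identifier, regardless of order. That lets encrypted identifiers be compared without either party seeing the other's plaintext. The cipher is deterministic, so equal inputs always produce equal outputs, which is the property alignment relies on and also the reason a full protocol needs further steps to hide membership.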
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.