Privacy-Preserving Semantic Segmentation without Key Management

Hitoshi Kiya; Mare Hirose; Shoko Imaizumi

arxiv: 2604.16523 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.CR

Pith reviewed 2026-05-10 11:55 UTC · model grok-4.3

keywords: privacy-preserving semantic segmentation · image encryption · independent keys · no key management · Cityscapes dataset · SETR vision transformer · encrypted training · encrypted inference

The pith

Semantic segmentation models can be trained and run on images encrypted with independent per-client keys, without any key management or sharing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method where the model creator and each client generate their own encryption keys locally and apply them to images before any data leaves their control. Both training the segmentation model and running inference happen entirely on these encrypted images, so no decryption key ever needs to be exchanged or stored centrally. To keep accuracy from collapsing, the same encryption step is applied to the training set as well as to the test images. Experiments using the SETR vision transformer on the Cityscapes urban-scene dataset show that usable segmentation performance is retained. The result is a practical way for multiple parties to collaborate on semantic segmentation while each controls its own encryption.

Core claim

The central claim is that semantic segmentation can be performed in a privacy-preserving manner by training and inferring directly on images that have each been encrypted with an independent key chosen locally by the client or model creator; applying the identical encryption process during training prevents the usual severe accuracy drop, and this is demonstrated to work on the Cityscapes dataset with the SETR model.

What carries the argument

The image encryption method applied uniformly to both the training set and the inference images, allowing each party to use its own locally generated key without coordination.
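The text above does not specify the encryption transform, so the following is only a minimal sketch under an assumed scheme: a key-seeded shuffle of pixel positions inside fixed-size blocks. The names `generate_key` and `encrypt_image`, the 128-bit seed, and the block size of 4 are hypothetical and are not taken from the paper. The sketch only illustrates that the model creator and each client can derive keys locally and apply the same procedure without exchanging anything.

```python
import random
import secrets


def generate_key() -> int:
    """Draw a fresh local key (assumption: a 128-bit integer seed)."""
    return secrets.randbits(128)


def encrypt_image(image, key, block=4):
    """Shuffle pixel positions inside each block x block tile.

    `image` is a list of rows (H x W) with H and W divisible by `block`.
    One key-derived permutation is applied to every tile of the image.
    Note: `random.Random` is not a cryptographic generator; it stands in
    for whatever keyed permutation a real scheme would use.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")
    size = block * block
    perm = list(range(size))
    random.Random(key).shuffle(perm)
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Flatten the tile row by row, then write it back permuted.
            tile = [image[top + i // block][left + i % block] for i in range(size)]
            for dst, src in enumerate(perm):
                out[top + dst // block][left + dst % block] = tile[src]
    return out


# Toy 8 x 8 single-channel image standing in for a real photograph.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

creator_key = generate_key()  # chosen by the model creator, never shared
client_key = generate_key()   # chosen by a client, never shared

train_encrypted = encrypt_image(image, creator_key)
test_encrypted = encrypt_image(image, client_key)
```

In this sketch a new key can be drawn per image simply by calling `generate_key()` again before each `encrypt_image` call; no party needs to learn another party's key.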

If this is right

Each client can choose a fresh key for every image without notifying or coordinating with other clients or the model owner.
The segmentation model never sees plaintext images, yet still produces per-pixel class labels that match the original scene content.
No central key server or key-distribution protocol is required at any stage of training or deployment.
The same encryption method can be used by multiple clients and across multiple images, each with its own key, while preserving enough visual structure for the transformer-based model to learn.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on other segmentation architectures or datasets to check whether the training-time encryption step generalizes beyond SETR and Cityscapes.
Because keys remain entirely local, the method removes a common bottleneck in multi-party vision pipelines where key management overhead otherwise grows with the number of participants.
If a stronger encryption scheme that still permits training were substituted, the same training-plus-inference workflow might yield higher final accuracy.

Load-bearing premise

That encrypting the training images with the same method used for test images keeps segmentation accuracy high enough to remain practically useful.

What would settle it

Measuring the mean intersection-over-union score on Cityscapes and finding it drops below a usable threshold such as 0.4 when the proposed encryption is applied to both training and inference.
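For reference, mean intersection-over-union can be computed as in the sketch below. The function name `mean_iou`, the ignore label of 255, and the nested-list label maps are assumptions made for illustration; a full Cityscapes evaluation would accumulate intersections and unions over the whole validation set rather than a single image pair.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IoU over the classes that occur in `pred` or `target`.

    `pred` and `target` are equally sized label maps (lists of rows).
    Pixels whose target equals `ignore_label` are skipped (assumption:
    255 marks unlabeled pixels).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_label:
                continue
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[t] += 1
                if 0 <= p < num_classes:
                    union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: class 0 has IoU 1/2 and class 1 has IoU 2/3.
pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))  # → 0.5833
```

A check against the threshold mentioned above would then compare the dataset-level score with 0.4.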

Figures

Figures reproduced from arXiv: 2604.16523 by Hitoshi Kiya, Mare Hirose, Shoko Imaizumi.

**Figure 1.** Overview of the proposed method. A model creator encrypts the training images using independent … Diagram labels recoverable from the extraction: model creator, pre-trained SETR, image encryption, fine-tuning, encrypted SETR, client, cloud server, test images, per-image keys, encrypted test images, segmentation. Funding note carried in the extracted caption: This work was supported in part by JSPS KAKENHI Grant Number 25K07750.

**Figure 3.** Examples of segmentation maps.

Original abstract

This paper proposes a novel privacy-preserving semantic segmentation method that can use independent keys for each client and image. In the proposed method, the model creator and each client encrypt images using locally generated keys, and model training and inference are conducted on the encrypted images. To mitigate performance degradation, an image encryption method is applied to model training in addition to the generation of test images. In experiments, the effectiveness of the proposed method is confirmed on the Cityscapes dataset under the use of a vision transformer-based model, called SETR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims independent-key encryption lets you train a segmentation model on creator-encrypted images and run it on client-encrypted images without key sharing, but the abstract supplies no numbers or method details to show the domain shift is actually handled.

The central claim is that semantic segmentation can run on images encrypted with completely separate keys for training and inference, with no key management required. They encrypt everything locally, train the model on the creator's encrypted Cityscapes images using SETR, and then test on client images encrypted with fresh keys. The encryption step during training is meant to keep accuracy from collapsing too far.

Referee Report

2 major / 2 minor

Summary. The paper proposes a privacy-preserving semantic segmentation method allowing independent per-client and per-image keys. The model creator and clients each encrypt images locally; both training and inference occur entirely on encrypted images. Encryption is also applied during training to mitigate accuracy loss. Effectiveness is asserted via experiments on Cityscapes using the SETR vision-transformer backbone.

Significance. If the central claim holds—that a single model trained on creator-key-encrypted images generalizes to inference on images encrypted with entirely independent client keys without key sharing or management—it would remove a major practical barrier in privacy-preserving computer vision. The approach would enable distributed, keyless deployment while preserving semantic segmentation utility, which is a meaningful contribution if the empirical results are robust.

major comments (2)

[Abstract, §4] Abstract and §4 (Experiments): the claim of 'effectiveness confirmed on the Cityscapes dataset' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. No mIoU, pixel accuracy, or comparison against non-encrypted SETR or prior privacy-preserving segmentation methods is supplied, making it impossible to assess whether performance degradation has been sufficiently mitigated.
[§3] §3 (Proposed Method): the encryption scheme is described as using locally generated keys for both training (creator) and inference (clients), yet no analysis, proof, or ablation demonstrates that the learned features are invariant to key choice. If the underlying transform (permutation, block cipher, or shuffling) produces statistically distinct encrypted domains for different keys, the single-model claim cannot hold; the manuscript provides no evidence that training simulates the client-key distribution or employs a key-invariant representation.
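The second major comment asks whether differently keyed encryptions produce statistically distinct domains. The sketch below shows what a key-invariant statistic looks like under one assumed transform, a key-seeded shuffle of pixels within fixed blocks: the multiset of pixel values in each block is identical for every key, even though pixel positions differ. The transform, the function names `shuffle_within_blocks` and `block_signatures`, and the block size are hypothetical; whether the manuscript's encryption has a comparable property is not stated in the material reviewed here.

```python
import random


def shuffle_within_blocks(image, key, block=4):
    """Assumed transform: permute pixel positions inside each tile."""
    size = block * block
    perm = list(range(size))
    random.Random(key).shuffle(perm)
    out = [row[:] for row in image]
    for top in range(0, len(image), block):
        for left in range(0, len(image[0]), block):
            tile = [image[top + i // block][left + i % block] for i in range(size)]
            for dst, src in enumerate(perm):
                out[top + dst // block][left + dst % block] = tile[src]
    return out


def block_signatures(image, block=4):
    """Sorted pixel values per tile; a within-tile shuffle cannot change them."""
    return [
        sorted(image[top + i // block][left + i % block] for i in range(block * block))
        for top in range(0, len(image), block)
        for left in range(0, len(image[0]), block)
    ]


# Toy 8 x 8 image; three independent keys stand in for three clients.
plain = [[(7 * r + 3 * c) % 256 for c in range(8)] for r in range(8)]
reference = block_signatures(plain)
for key in (11, 22, 33):
    encrypted = shuffle_within_blocks(plain, key)
    print(key, block_signatures(encrypted) == reference)
```

Each comparison holds by construction, because a permutation only moves values around inside a tile. An ablation of the kind the referee requests would go further and report segmentation accuracy across many independently drawn keys.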

minor comments (2)

[§3] Notation for the encryption function and key generation is introduced without a clear mathematical definition or pseudocode, making the exact procedure difficult to reproduce.
[Introduction] The abstract and introduction cite SETR but do not reference prior work on encrypted-domain semantic segmentation or key-management-free privacy methods, weakening the novelty positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and insightful comments. Below, we provide point-by-point responses to the major comments and outline the revisions we plan to incorporate into the manuscript.

Referee: [Abstract, §4] Abstract and §4 (Experiments): the claim of 'effectiveness confirmed on the Cityscapes dataset' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. No mIoU, pixel accuracy, or comparison against non-encrypted SETR or prior privacy-preserving segmentation methods is supplied, making it impossible to assess whether performance degradation has been sufficiently mitigated.

Authors: We acknowledge this limitation in the current manuscript. The experiments section does describe the use of the Cityscapes dataset with the SETR model, but quantitative results such as mIoU and pixel accuracy were not explicitly tabulated. In the revised version, we will expand §4 to include detailed performance metrics, baseline comparisons (including non-encrypted SETR and relevant privacy-preserving methods), ablation studies on the encryption parameters, and error analysis. The abstract will be updated to reference these specific results. This will allow readers to properly evaluate the effectiveness and the degree to which performance degradation is mitigated.

Revision: yes

Referee: [§3] §3 (Proposed Method): the encryption scheme is described as using locally generated keys for both training (creator) and inference (clients), yet no analysis, proof, or ablation demonstrates that the learned features are invariant to key choice. If the underlying transform (permutation, block cipher, or shuffling) produces statistically distinct encrypted domains for different keys, the single-model claim cannot hold; the manuscript provides no evidence that training simulates the client-key distribution or employs a key-invariant representation.

Authors: We appreciate this observation regarding the need for supporting analysis. The method applies encryption during training to help the model learn from encrypted images, aiming for generalization to client-encrypted images with independent keys. However, we agree that explicit evidence, such as an analysis of feature invariance or ablations across different keys, is not provided in the current version. In the revision, we will add to §3 a discussion of the encryption transform's properties, along with experimental ablations demonstrating performance consistency across varied keys. This will substantiate the claim that a single model can handle independent client keys without key management.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential equations

The paper describes an empirical privacy-preserving semantic segmentation approach using image encryption during both training and inference on the Cityscapes dataset with a SETR backbone. No equations, derivations, or mathematical claims are present in the provided abstract or description. The central claim reduces to experimental validation rather than any chain that could loop back to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or explicit assumptions; ledger left empty.

pith-pipeline@v0.9.0 · 5378 in / 1004 out tokens · 40286 ms · 2026-05-10T11:55:57.695244+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Y. Liu, H. Chen, and Z. Yang, "Enforcing End-to-end Security for Remote Conference Applications," in Proc. IEEE Symp. Secur. Priv., San Francisco, CA, USA, 2024, pp. 2630–2647.

[2] K. Madono, M. Tanaka, M. Onishi, and T. Ogawa, "Block-wise Scrambled Image Recognition Using Adaptation Network," in Proc. Workshop on Artif. Intell. Things (AAAI-WS), New York, NY, USA, 2020.

[3] M. Hirose, S. Imaizumi, and H. Kiya, "Learnable Image Encryption Without Key Management for Privacy-Preserving Vision Transformer," IEEE Access, vol. 13, pp. 201351–201362, 2025.

[4] H. Sueyoshi, K. Nishikawa, and H. Kiya, "A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique," in Proc. IEEE GCCE, Osaka, Japan, 2025, pp. 37–40.

[5] H. Kiya, T. Nagamori, S. Imaizumi, and S. Shiota, "Privacy-Preserving Semantic Segmentation Using Vision Transformer," J. Imaging, vol. 8, no. 9, 2022.

[6] S. Zheng et al., "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers," in Proc. CVPR, 2021, pp. 6881–6890.

[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.