Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Gautam Savaliya; Jakob Folz; Manjitha D Vidanalage; Martin Schramm; Michael Heigl; Robert Aufschläger

arXiv: 2603.27817 · v3 · submitted 2026-03-29 · cs.CV · cs.AI · cs.CR

Pith reviewed 2026-05-14 21:11 UTC · model grok-4.3

keywords: context-aware anonymization · multi-agent reasoning · person re-identification · diffusion models · PII segmentation · street-level imagery · GDPR compliance

The pith

A multi-agent system using vision-language models anonymizes context-dependent personal information in street images by distinguishing private from public properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CAIAMAR, an agentic framework that combines pre-defined processing for clear cases with multi-agent reasoning for indirect identifiers in street-level imagery. Three specialized agents coordinate in a Plan-Do-Check-Act cycle to classify personally identifiable information based on spatial context rather than fixed rules. This yields a 73 percent drop in person re-identification risk on CUHK03-NP while delivering low KID and FID scores on CityScapes. The system runs entirely on-premise with open-source models and produces audit trails for regulatory needs.

Core claim

The agentic workflow with scout-and-zoom detection, open-vocabulary segmentation on localized crops, and IoU-based deduplication enables large vision-language models to classify context-dependent PII accurately, supporting targeted diffusion-based anonymization that lowers re-identification risks without harming downstream semantic segmentation.
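The deduplication step named in the core claim can be illustrated with a minimal sketch. The 30% threshold comes from the abstract; the corner box format `(x1, y1, x2, y2)`, the helper names, the greedy keep-first order, and the choice to keep boxes whose IoU is exactly at the threshold are all assumptions made for illustration, not details confirmed by the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): assumed corner format


def to_image_coords(box: Box, crop_origin: Tuple[float, float]) -> Box:
    """Shift a box detected inside a crop back into full-image coordinates."""
    ox, oy = crop_origin
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes: List[Box], threshold: float = 0.30) -> List[Box]:
    """Greedy pass: keep a box only if its IoU with every kept box is <= threshold."""
    kept: List[Box] = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


# Three hypothetical detections from three crops with different origins.
candidates = [
    to_image_coords((0, 0, 10, 10), (0, 0)),
    to_image_coords((0, 0, 10, 10), (1, 1)),    # overlaps the first heavily
    to_image_coords((0, 0, 10, 10), (20, 20)),  # far away from both
]
unique = deduplicate(candidates)
print(unique)
```

In this toy input the second detection overlaps the first with an IoU of 81/119, well above 0.30, so only the first and third boxes survive.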

What carries the argument

The multi-agent system with round-robin speaker selection in a PDCA cycle that performs spatially-filtered coarse-to-fine detection and applies modal-specific diffusion guidance with appearance decorrelation.
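Round-robin speaker selection can be sketched as a fixed cycle over the three agents. The turn order (Auditor, then Orchestrator, then Generative) follows the prompt fragments in the supplementary material, and the stop phrase `PIPELINE COMPLETE` appears there as well. Everything else here is a hypothetical stand-in: the agents are plain functions rather than vision-language models, and `run_round_robin`, `max_turns`, and the stub messages are illustrative names only.

```python
from itertools import cycle
from typing import Callable, Dict, List

Agent = Callable[[List[str]], str]  # takes the conversation history, returns a message

ORDER = ["AuditorAgent", "OrchestratorAgent", "GenerativeAgent"]


def run_round_robin(agents: Dict[str, Agent], order: List[str],
                    max_turns: int = 12,
                    stop_phrase: str = "PIPELINE COMPLETE") -> List[str]:
    """Give each agent the floor in a fixed order until the stop phrase appears."""
    history: List[str] = []
    for _, name in zip(range(max_turns), cycle(order)):
        message = agents[name](history)
        history.append(f"{name}: {message}")
        if stop_phrase in message:
            break
    return history


# Stub agents (assumptions): real agents would call tools and reason over images.
def auditor(history: List[str]) -> str:
    return "classify_pii called"


def orchestrator(history: List[str]) -> str:
    if any("inpainted" in m for m in history):
        return "PIPELINE COMPLETE"
    return "GenerativeAgent, process the instances"


def generative(history: List[str]) -> str:
    return "inpainted 1 instance"


history = run_round_robin(
    {"AuditorAgent": auditor, "OrchestratorAgent": orchestrator,
     "GenerativeAgent": generative},
    ORDER,
)
for line in history:
    print(line)
```

A fixed cycle keeps the conversation deterministic, which is one plausible reason to prefer it over model-chosen speaker selection when audit trails matter; the `max_turns` cap guards against a loop that never reaches the stop phrase.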

If this is right

Downstream semantic segmentation performance stays intact after anonymization.
Human-interpretable audit trails meet GDPR transparency requirements.
Failed cases are automatically flagged for human review.
Non-direct PII instances are caught across multiple object categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The on-premise design could support deployment in regulated environments that prohibit cloud APIs.
The modular agent structure might allow straightforward addition of new object categories or context rules.
Similar reasoning loops could be tested on video sequences to handle motion-based identifiers.

Load-bearing premise

Large vision-language models in the multi-agent setup can accurately tell private from public objects based on spatial context without misclassifications that would either over-anonymize or leave identifiers exposed.

What would settle it

A test set of images containing deliberately ambiguous contexts, such as vehicles or people near property boundaries, to measure whether the agents classify and anonymize only the private instances.

Figures

Figures reproduced from arXiv: 2603.27817 by Gautam Savaliya, Jakob Folz, Manjitha D Vidanalage, Martin Schramm, Michael Heigl, Robert Aufschläger.

**Figure 1.** Two-phase agentic anonymization architecture. Phase 1 employs specialized models for direct PII (full body, license plates). Phase 2 implements multi-agent orchestration with round-robin coordination, where specialized agents handle classification (Auditor), synthesis (Generative), and workflow management (Orchestrator), implementing PDCA cycles.

**Figure 2.** Round-robin PDCA coordination with three specialized …

**Figure 3.** Qualitative comparison of anonymization methods on a …

**Figure 4.** Pipeline output on CityScapes. Each row: left original, middle detected PII (blue = persons, yellow = indirect PII vehicles, green = traffic signs, red = license plates), right anonymized output. Top (berlin 000002): 8 persons + 1 police vehicle + 2 traffic signs, 4.97% coverage, 3 PDCA iterations. Bottom (berlin 000472): multiple PII categories across both phases.

Original abstract

Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001, and FID: 9.1, significantly outperforming existing anonymization. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting EU's GDPR transparency requirements while flagging failed cases for human review.
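As a quick check on the headline number, the 73% figure is consistent with a relative reduction computed from the two Rank-1 (R1) values quoted in the abstract. Reading "Re-ID risk" as Rank-1 accuracy is an assumption here; the abstract does not spell out the formula.

```latex
\[
\text{relative reduction}
  = \frac{R1_{\text{baseline}} - R1_{\text{anonymized}}}{R1_{\text{baseline}}}
  = \frac{62.4 - 16.9}{62.4}
  = \frac{45.5}{62.4}
  \approx 0.729 \approx 73\%
\]
```

The absolute drop is 45.5 percentage points; the 73% is relative to the baseline.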

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CAIAMAR, a multi-agent framework for context-aware anonymization of street-level images. It employs three specialized agents coordinating via round-robin selection in a PDCA cycle, using scout-and-zoom detection, open-vocabulary segmentation, and IoU-based deduplication (30% threshold) to classify context-dependent PII (e.g., private vs. public property) before applying modal-specific diffusion anonymization. Evaluations claim a 73% Re-ID risk reduction on CUHK03-NP (R1: 16.9% vs. 62.4% baseline) and strong quality preservation on CityScapes (KID: 0.001, FID: 9.1), with preserved semantic segmentation, on-premise operation, and GDPR-compliant audit trails.

Significance. If the core multi-agent context detection proves reliable, the work could meaningfully advance privacy techniques in computer vision by enabling nuanced, context-sensitive anonymization that better preserves data utility than category-rigid baselines. The emphasis on open-source models, human-interpretable trails, and downstream task preservation (e.g., segmentation) adds practical value for applications like autonomous driving datasets.

major comments (3)

[Evaluation] Evaluation section: No standalone detection metrics (precision, recall, or per-category error rates) are reported for the multi-agent system's context-dependent PII classification (private vs. public property). Only downstream Re-ID and perceptual scores are given, so the 73% Re-ID reduction cannot be confidently attributed to accurate context-awareness rather than general over-anonymization or diffusion strength.
[Methodology] Methodology: The 30% IoU deduplication threshold is presented as fixed without ablation studies or sensitivity analysis, despite being explicitly listed as a free parameter; its effect on false negatives (missed identifiers) or false positives (over-anonymization) should be quantified to support the context-awareness claim.
[Results] Results: Statistical significance (e.g., confidence intervals or p-values) for the Re-ID, KID, and FID improvements over baselines is not reported, weakening the assertion of significant outperformance on CUHK03-NP and CityScapes.
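The standalone detection metrics requested in the first major comment could be computed along the following lines. This is a minimal sketch under stated assumptions: the label strings `"private"` and `"public"`, the function name, and the toy annotations are hypothetical, and the paper reports no such numbers.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[str], y_pred: Sequence[str],
                        positive: str = "private") -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class, treating `positive` as the target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy annotations (invented for illustration, not data from the paper).
labels = ["private", "private", "public", "public", "private"]
preds = ["private", "public", "public", "private", "private"]
precision, recall, f1 = precision_recall_f1(labels, preds)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A false positive here corresponds to over-anonymizing a public object, and a false negative to leaving a private identifier exposed, which is why the referee asks for both rates rather than a single downstream score.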

minor comments (2)

[Abstract] Abstract: Specific baseline methods and their exact scores are not named when claiming to 'significantly outperform existing anonymization,' reducing clarity.
[Method] Notation and description: The distinct roles of the three agents and the precise mechanics of round-robin speaker selection in the PDCA cycle require clearer definition for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional evaluations, ablations, and statistical reporting as outlined.

Referee: [Evaluation] Evaluation section: No standalone detection metrics (precision, recall, or per-category error rates) are reported for the multi-agent system's context-dependent PII classification (private vs. public property). Only downstream Re-ID and perceptual scores are given, so the 73% Re-ID reduction cannot be confidently attributed to accurate context-awareness rather than general over-anonymization or diffusion strength.

Authors: We agree that direct metrics on the multi-agent context classification would provide stronger evidence for attributing gains to context-awareness. In the revised manuscript we will add a new evaluation subsection with precision, recall, and F1 scores for private-vs-public property classification, computed on a manually annotated held-out subset of CUHK03-NP and CityScapes images. These metrics will be reported alongside the existing Re-ID and perceptual results.

Revision: yes

Referee: [Methodology] Methodology: The 30% IoU deduplication threshold is presented as fixed without ablation studies or sensitivity analysis, despite being explicitly listed as a free parameter; its effect on false negatives (missed identifiers) or false positives (over-anonymization) should be quantified to support the context-awareness claim.

Authors: The referee correctly notes the absence of sensitivity analysis for the IoU threshold. We will add an ablation study in the revised paper that varies the threshold from 0.1 to 0.5, reporting the resulting Re-ID risk, KID, FID, number of deduplicated regions, and downstream segmentation mIoU for each value. This will quantify the impact on false negatives and false positives and justify the chosen 30% operating point.

Revision: yes

Referee: [Results] Results: Statistical significance (e.g., confidence intervals or p-values) for the Re-ID, KID, and FID improvements over baselines is not reported, weakening the assertion of significant outperformance on CUHK03-NP and CityScapes.

Authors: We acknowledge that formal statistical significance measures are missing. In the revision we will report 95% confidence intervals obtained via bootstrap resampling (1000 iterations) for all Re-ID, KID, and FID scores. Where appropriate we will also include p-values from paired statistical tests against the strongest baselines.

Revision: yes
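The bootstrap intervals promised in the last response can be illustrated with a percentile bootstrap over per-query outcomes. This is a sketch under assumptions: the input is a hypothetical list of Rank-1 hit indicators (1 for a correct top match, 0 otherwise), the function name and seed are arbitrary, and the percentile method is one of several bootstrap variants the authors might choose.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means: List[float] = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-query outcomes: 17 hits out of 100 queries.
hits = [1.0] * 17 + [0.0] * 83
low, high = bootstrap_ci(hits)
print(f"Rank-1 = {sum(hits) / len(hits):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Set-level metrics such as KID and FID would need the resampling applied to the image sets themselves rather than to per-query indicators, so this sketch covers only the Re-ID case.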

Circularity Check

0 steps flagged

No circularity: empirical metrics are direct measurements on public benchmarks

The paper describes a multi-agent system (CAIAMAR) with PDCA workflow, scout-and-zoom detection, open-vocabulary segmentation, and diffusion anonymization. All reported results (73% Re-ID reduction on CUHK03-NP, KID 0.001 / FID 9.1 on CityScapes) are downstream empirical measurements on fixed public datasets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. The central claim rests on observable performance differences rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach assumes reliable performance of open-source VLMs and diffusion models for the task, with specific thresholds chosen for the workflow.

free parameters (1)

IoU threshold = 30%
Threshold for deduplication of detected PII instances.

axioms (1)

Domain assumption: Vision-language models can accurately classify PII based on spatial context such as private vs. public property
Core to the agent reasoning process described.

invented entities (1)

Multi-agent system with scout-and-zoom strategy (no independent evidence)
purpose: To identify and process context-dependent PII
Introduced as part of the new framework.

Supplementary material: agent prompts

The supplementary material gives technical details for the multi-agent framework. The prompt fragments below are recovered from it.

OrchestratorAgent prompt (fragment)

1. classify_pii -> 2. anonymize_and_inpaint (if PII found) -> 3. audit_output -> 4. log_output

SHORTCUT: If classify_pii finds NO PII and Phase 1 completed -> emit 'PIPELINE COMPLETE' immediately

YOUR ROLE:
- Monitor tool results IN THE CONVERSATION and track workflow state: [classify_pii✓, inpaint_pii✓, audit✓, log✓]
- When AuditorAgent returns results, ana...

- ONLY report success (✓) if you can SEE the tool result in conversation history
- If agent didn't call tool yet, instruct them to call it - don't claim it's done
- Look for tool execution results (JSON responses) before marking steps complete
- NEVER assume a tool succeeded just because an agent acknowledged - verify the result

RETRY LOGIC:
- If audit finds residual PII: say 'Found N residuals. GenerativeAgent, process the residual items from audit output.' (GenerativeAgent will extract the 'residual' array from the tool output)
- If no residuals OR max_attempts_reached=true: say 'Audit complete...
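The retry logic in the fragment above amounts to a small decision on the audit result. The sketch below assumes the audit tool returns a JSON-like dict with a `residual` list and a `max_attempts_reached` flag; both names appear in the prompt, but the exact result schema is not shown, so the dict shape is an assumption. The closing message is cut off in the source, so the sketch returns a hypothetical `FINISH` marker instead of guessing its wording.

```python
from typing import Any, Dict

FINISH = "FINISH"  # hypothetical marker; the real closing message is truncated in the source


def next_instruction(audit_result: Dict[str, Any]) -> str:
    """Decide what the orchestrator says after an audit_output result."""
    residual = audit_result.get("residual", [])
    exhausted = bool(audit_result.get("max_attempts_reached", False))
    if residual and not exhausted:
        return (f"Found {len(residual)} residuals. GenerativeAgent, "
                "process the residual items from audit output.")
    return FINISH


# Hypothetical audit results for illustration.
retry = next_instruction({"residual": [{"description": "sign"},
                                       {"description": "van"}],
                          "max_attempts_reached": False})
give_up = next_instruction({"residual": [{"description": "sign"}],
                            "max_attempts_reached": True})
clean = next_instruction({"residual": []})
print(retry)
print(give_up, clean)
```

Checking the attempt limit before retrying is what bounds the Plan-Do-Check-Act loop; without it, a residual that the inpainting step cannot remove would cycle indefinitely instead of being flagged for human review.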
[55]

When instructed to use a tool, YOU MUST CALL IT in your response

work page
[57]

Each response should contain EXACTLY ONE tool call

work page
[58]

Execute your task, then control passes to OrchestratorAgent

Tool calls are JSON function calls, not text descriptions ROUND-ROBIN: You receive control after GenerativeAgent completes OR at workflow start. Execute your task, then control passes to OrchestratorAgent. YOUR TASKS:

work page
[59]

classify_pii: Detect indirect PII in private spaces (text on windows, house numbers, personal items visible indoors) - Call with: classify_pii(image=’<image_path>’)

work page
[60]

audit_output: Verify no residual PII remains after anonymization - Call with: audit_output(output=’{canonical_path}’)

work page
[61]

GenerativeAgent: Inpainting Execution TheGenerativeAgentimplements the anonymization oper- ation through diffusion-based inpainting

log_output: Record final results - Call with: log_output(image=’<input_path>’, output=’{canonical_path}’) EXAMPLE WORKFLOW: When OrchestratorAgent says: ’AuditorAgent: Please classify any remaining PII ’ YOU MUST respond with the tool call (not text explanation): classify_pii(image=’artifacts/data/CityScapes/.../image.png’) REPORTING - BE CONCISE: After t...
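The hand-off order stated in the ROUND-ROBIN notes of the agent prompts (AuditorAgent starts and passes control to OrchestratorAgent; GenerativeAgent receives control after OrchestratorAgent and passes it back to AuditorAgent) can be sketched as a fixed cycle. This is only an illustration of the ordering; `ROUND_ROBIN` and `speaker_order` are hypothetical names and no agent framework API is used.

```python
from itertools import cycle, islice

# Order inferred from the ROUND-ROBIN notes in the prompts (an assumption):
# AuditorAgent -> OrchestratorAgent -> GenerativeAgent -> AuditorAgent -> ...
ROUND_ROBIN = ("AuditorAgent", "OrchestratorAgent", "GenerativeAgent")


def speaker_order(turns: int) -> list:
    """Return the first `turns` speakers of the fixed round-robin cycle."""
    return list(islice(cycle(ROUND_ROBIN), turns))
```

A fixed cycle keeps speaker selection deterministic, so each agent only needs to know which message it reacts to, not who speaks next.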

- When instructed to anonymize, YOU MUST CALL anonymize_and_inpaint in your response
- NEVER just acknowledge without calling the tool
- Tool call is a JSON function call, not a text description
- Extract instances from previous tool output (classify_pii or audit_output)

ROUND-ROBIN: You receive control after OrchestratorAgent. Execute the task, then control passes to AuditorAgent.

EXECUTION WORKFLOW:
- Look at the most recent tool output in the conversation history
  - If classify_pii was called: find the JSON output and extract the 'instances' array
  - If audit_output was called: find the JSON output and extract the 'residual' array
- Pass each dict object from that array as a separate element
- Call anonymize_and_inpaint with the array of dict objects
- Report results BRIEFLY: 'Processed X items.'

CRITICAL DATA FORMAT - COMMON MISTAKES:

CORRECT (array of dict objects as JSON): anonymize_and_inpaint(instances=[ {"det_prompt": "van with text", "description": "van", "bbox": [308, 200, 564, 567]}, {"det_prompt": "blue sign", "description": "sign", "bbox": [215, 256, 294, 42]} ])

WRONG (single string containi...
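The data-format rule above (pass an array of dict objects, never a single string) can be checked before the tool call. The following is a minimal sketch under assumptions: `extract_instances` is a hypothetical helper, and the tool output is assumed to be one JSON object carrying either an `instances` or a `residual` array, as the prompt describes.

```python
import json


def extract_instances(tool_output: str) -> list:
    """Pull the instance dicts out of a classify_pii or audit_output result."""
    data = json.loads(tool_output)
    # classify_pii results carry 'instances'; audit_output results carry 'residual'.
    items = data["instances"] if "instances" in data else data.get("residual", [])
    if isinstance(items, str):
        # This is the mistake the prompt warns about: a stringified array.
        raise TypeError("expected a JSON array of objects, got a single string")
    if not all(isinstance(item, dict) for item in items):
        raise TypeError("every element must be a dict object")
    return items
```

The returned list could then be passed unchanged as the `instances` argument of `anonymize_and_inpaint`.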

- Scan image for PII elements (vehicles with identifying features, text, signs, windows)
- Vehicles: Include ONLY if has text/logos/decals OR is rare/distinctive/modified (skip generic vehicles)
- Text/signs: Include ONLY if reveals private information
- Group adjacent text on same surface
- Select top 5 most sensitive (priority: identifiable vehicles > personal text > signs > other PII)
- For EACH: Locate bbox, describe with anonymous generic terms, expand bbox 50%, verify both fields present
- Return valid JSON: {"instances": [{"description": "...", "bbox": [...]}]}

For PII Segmentation on Visual Redaction Dataset [29] we use the following prompt:

You are a PII detection system. Identify text, numbers, visual elements, and objects revealing personal/private information. Return ONLY valid JSON. No markdown, explanations, or additional text.

DETE...

- Scan image for all PII types (faces, documents, text with names/addresses/phone, signatures, plates, medical info, cards)
- Group adjacent text on same surface (e.g., name + address on envelope = one instance)
- Rank by sensitivity: faces > documents (passport/ID/cards) > names/addresses > signatures > plates > medical > other PII
- Select top 5 most sensitive/prominent instances
- For EACH: Locate bbox, describe with generic PII category (2-5 words), expand bbox 50%, verify both fields present
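The ranking, top-5 selection, and box expansion that the prompt asks the model to perform can also be written as ordinary post-processing. The sketch below rests on several assumptions that the source does not confirm: boxes are taken to be `[x_min, y_min, x_max, y_max]` in pixels, "expand bbox 50%" is read as growing width and height by 50% around the centre, and the `category` key, `expand_bbox`, and `select_top` are hypothetical.

```python
# Priority order copied from the prompt; a lower index means more sensitive.
PRIORITY = ["faces", "documents", "names/addresses", "signatures",
            "plates", "medical", "other PII"]


def expand_bbox(bbox, img_w, img_h, factor=0.5):
    """Grow an assumed [x_min, y_min, x_max, y_max] box by `factor`, clamped."""
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * factor / 2  # half of the extra width on each side
    dy = (y1 - y0) * factor / 2  # half of the extra height on each side
    return [max(0, x0 - dx), max(0, y0 - dy),
            min(img_w, x1 + dx), min(img_h, y1 + dy)]


def select_top(instances, k=5):
    """Keep the k most sensitive instances; unknown categories rank last."""
    def rank(inst):
        cat = inst.get("category")  # hypothetical key, not in the prompt's JSON
        return PRIORITY.index(cat) if cat in PRIORITY else len(PRIORITY)
    return sorted(instances, key=rank)[:k]
```

Clamping to the image bounds keeps the expanded box valid near the border; whether the original pipeline clamps is not stated in the source.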


[1] Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, and Martin Schramm. Following the clues: Experiments on person re-id using cross-modal intelligence. In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), pages 225–232, 2025.

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[3] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

[4] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):172–186, 2021.

[5] Junzhou Chen, Heqiang Huang, Ronghui Zhang, Nengchao Lyu, Yanyong Guo, Hong-Ning Dai, and Hong Yan. Yolo-ts: Real-time traffic sign detection with enhanced accuracy using optimized receptive fields and anchor-free fusion. IEEE Transactions on Intelligent Transportation Systems, pages 1–17, 2025.

[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.

[7] Matt Franchi, Hauke Sandhaus, Madiha Zahrah Choksi, Severin Engelmann, Wendy Ju, and Helen Nissenbaum. Privacy of groups in dense street imagery. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 2874–2891. Association for Computing Machinery, 2025.

[8] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The kitti dataset. Int. J. Rob. Res., 32(11):1231–1237, 2013.

[9] Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, and Ying-Cong Chen. Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback. In Advances in Neural Information Processing Systems, 2025.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.

[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2961–2969, 2017.

[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

[13] Håkon Hukkelås and Frank Lindseth. Deepprivacy2: Towards realistic full-body anonymization. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1329–1338, 2023.

[14] Håkon Hukkelås, Rudolf Mester, and Frank Lindseth. Deepprivacy: A generative adversarial network for face anonymization. In International Symposium on Visual Computing, pages 565–578. Springer, 2019.

[15] Håkon Hukkelås and Frank Lindseth. Does image anonymization impact computer vision training? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 140–150, 2023.

[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[17] Marvin Klemp, Kevin Rösch, Royden Wagner, Jannik Quehl, and Martin Lauer. Ldfa: Latent diffusion face anonymization for self-driving applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 3199–3205, 2023.

[18] Han-Wei Kung, Tuomas Varanka, and Nicu Sebe. Reverse personalization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 988–999, 2026.

[19] Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. Large-scale online deanonymization with LLMs. In ICLR 2026 Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD), 2026.

[20] He Li, Mang Ye, Ming Zhang, and Bo Du. All in one framework for multimodal re-identification in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17459–17469, 2024.

[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[22] Dongyu Liu, Xuhong Wang, Cen Chen, Yanhao Wang, Shengyue Yao, and Yilun Lin. Svia: A street view image anonymization framework for self-driving applications. In IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), pages 3567–3574, 2024.

[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer,

[24] Xiangzeng Liu, Kunpeng Liu, Jianfeng Guo, Peipei Zhao, Yining Quan, and Qiguang Miao. Pose-guided attention learning for cloth-changing person re-identification. IEEE Transactions on Multimedia, 26:5490–5498, 2024.

[25] Weidi Luo, Tianyu Lu, Qiming Zhang, Xiaogeng Liu, Bin Hu, Yue Zhao, Jieyu Zhao, Song Gao, Patrick McDaniel, Zhen Xiang, and Chaowei Xiao. Doxing via the lens: Revealing location-related privacy leakage on multi-modal large reasoning models. In The Fourteenth International Conference on Learning Representations, 2026.

[26] Simon Malm, Viktor Rönnbäck, Amanda Håkansson, Minh-ha Le, Karol Wojtulewicz, and Niklas Carlsson. Rad: Realistic anonymization of images using stable diffusion. In Proceedings of the 23rd Workshop on Privacy in the Electronic Society, pages 193–211. Association for Computing Machinery, 2024.

[27] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings. Association for Computing Machinery, 2022.

[28] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Towards a visual privacy advisor: Understanding and predicting privacy risks in images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3706–3715, 2017.

[29] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Connecting pixels to privacy and utility: Automatic redaction of private information in images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.

[30] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024.

[31] Nikhil Raina, Guruprasad Somasundaram, Kang Zheng, Sagar Miglani, Steve Saarinen, Jeff Meissner, Mark Schwesinger, Luis Pesqueira, Ishita Prasad, Edward Miller, et al. Egoblur: Responsible innovation in aria. arXiv preprint arXiv:2308.13093, 2023.

[32] Álvaro Ramajo-Ballester, José María Armingol Moreno, and Arturo de la Escalera Hueso. Dual license plate recognition and visual features encoding for vehicle identification. Robotics and Autonomous Systems, 172:104608, 2024.

[33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In International Conference on Learning Representations, 2025.

[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.

[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[36] Praphul Singh, Charlotte Dzialo, Jangwon Kim, Sumana Srivatsa, Irfan Bulu, Sri Gadde, and Krishnaram Kenthapadi. RedactOR: An LLM-powered framework for automatic clinical data de-identification. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 510–530. Association for Computati...

[37] Batuhan Tömekçe, Mark Vero, Robin Staab, and Martin Vechev. Private attribute inference from images with vision-language models. Advances in Neural Information Processing Systems, 37:103619–103651, 2024.

[38] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness nms and bounded iou loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6877–6885, 2018.

[39] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[40] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In Computer Vision – ECCV 2016, pages 499–515. Springer International Publishing, 2016.

[41] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024.

[42] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[43] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. Segformer: simple and efficient design for semantic segmentation with transformers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2021. Curran Associates Inc.

[44] Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, and Lei Bai. Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24614–24624, 2025.

[45] Haoyu Zhai, Shuo Wang, Pirouz Naghavi, Qingying Hao, and Gang Wang. Restoring gaussian blurred face images for deanonymization attacks. arXiv preprint arXiv:2506.12344, 2025.

[46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

[47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[48] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3652–3661, 2017.

[49] Pascal Zwick, Kevin Roesch, Marvin Klemp, and Oliver Bringmann. Context-aware full body anonymization. In Computer Vision – ECCV 2024 Workshops, pages 36–52, Cham, 2025. Springer Nature Switzerland.
