arXiv: 2603.27817 · v3 · submitted 2026-03-29 · cs.CV · cs.AI · cs.CR

Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Pith reviewed 2026-05-14 21:11 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CR
keywords context-aware anonymization · multi-agent reasoning · person re-identification · diffusion models · PII segmentation · street-level imagery · GDPR compliance

The pith

A multi-agent system using vision-language models anonymizes context-dependent personal information in street images by distinguishing private from public properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CAIAMAR, an agentic framework that combines pre-defined processing for clear cases with multi-agent reasoning for indirect identifiers in street-level imagery. Three specialized agents coordinate in a Plan-Do-Check-Act cycle to classify personally identifiable information based on spatial context rather than fixed rules. This yields a 73 percent drop in person re-identification risk on CUHK03-NP while delivering low KID and FID scores on CityScapes. The system runs entirely on-premise with open-source models and produces audit trails for regulatory needs.

Core claim

The agentic workflow with scout-and-zoom detection, open-vocabulary segmentation on localized crops, and IoU-based deduplication enables large vision-language models to classify context-dependent PII accurately, supporting targeted diffusion-based anonymization that lowers re-identification risks without harming downstream semantic segmentation.
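
As a minimal sketch of the IoU-based deduplication step (not the authors' implementation), the snippet below keeps a candidate box only if its IoU with every already-kept box stays below the threshold. The 30% default is taken from the paper; the (x1, y1, x2, y2) box format, the names `iou` and `deduplicate`, and the greedy first-come order are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes: List[Box], threshold: float = 0.30) -> List[Box]:
    """Greedy deduplication: drop a box whose IoU with a kept box reaches the threshold."""
    kept: List[Box] = []
    for box in boxes:
        # Assumption: "at or above the threshold" counts as a duplicate.
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept
```

With this sketch, two heavily overlapping detections of the same object collapse into one region, while a detection elsewhere in the image is kept, so each region would be passed to the anonymization step only once.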

What carries the argument

The multi-agent system with round-robin speaker selection in a PDCA cycle that performs spatially-filtered coarse-to-fine detection and applies modal-specific diffusion guidance with appearance decorrelation.
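
The round-robin PDCA coordination can be pictured with a toy loop. This is a sketch under stated assumptions, not the paper's code: the agent functions are plain Python stand-ins for the vision-language agents, the speaker order (auditor, orchestrator, generative) is assumed, and the state keys `residual`, `done`, `log`, and `needs_review` are hypothetical names.

```python
from itertools import cycle
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def auditor(state: State) -> None:
    # Plan/Check: report how many items still need anonymization.
    state["log"].append(f"auditor: {len(state['residual'])} residual")
    state["done"] = not state["residual"]


def orchestrator(state: State) -> None:
    # Act: decide whether another cycle is needed.
    state["log"].append("orchestrator: stop" if state["done"] else "orchestrator: continue")


def generative(state: State) -> None:
    # Do: toy stand-in that anonymizes one pending item per turn.
    if state["residual"]:
        item = state["residual"].pop(0)
        state["log"].append(f"generative: processed {item}")


def run_round_robin(residual: List[str], max_turns: int = 30) -> State:
    """Cycle through the speakers in fixed order until the audit is clean."""
    state: State = {"residual": list(residual), "done": False, "log": []}
    speakers: List[Tuple[str, Callable[[State], None]]] = [
        ("auditor", auditor),
        ("orchestrator", orchestrator),
        ("generative", generative),
    ]
    for turn, (name, speak) in enumerate(cycle(speakers)):
        if turn >= max_turns:
            break
        speak(state)
        if name == "orchestrator" and state["done"]:
            break
    # Runs that exhaust the turn budget are flagged for human review.
    state["needs_review"] = not state["done"]
    return state
```

Calling `run_round_robin(["van with text", "house number"])` walks through three audit passes before the orchestrator stops the loop; a run cut short by `max_turns` ends with `needs_review` set, mirroring the paper's claim that failed cases are flagged for a human.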

If this is right

  • Downstream semantic segmentation performance stays intact after anonymization.
  • Human-interpretable audit trails meet GDPR transparency requirements.
  • Failed cases are automatically flagged for human review.
  • Non-direct PII instances are caught across multiple object categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The on-premise design could support deployment in regulated environments that prohibit cloud APIs.
  • The modular agent structure might allow straightforward addition of new object categories or context rules.
  • Similar reasoning loops could be tested on video sequences to handle motion-based identifiers.

Load-bearing premise

Large vision-language models in the multi-agent setup can accurately tell private from public objects based on spatial context without misclassifications that would either over-anonymize or leave identifiers exposed.

What would settle it

A test set of images containing deliberately ambiguous contexts, such as vehicles or people near property boundaries, to measure whether the agents classify and anonymize only the private instances.

Figures

Figures reproduced from arXiv: 2603.27817 by Gautam Savaliya, Jakob Folz, Manjitha D Vidanalage, Martin Schramm, Michael Heigl, Robert Aufschläger.

Figure 1: Two-phase agentic anonymization architecture. Phase 1 employs specialized models for direct PII (full body, license plates). Phase 2 implements multi-agent orchestration with round-robin coordination, where specialized agents handle classification (Auditor), synthesis (Generative), and workflow management (Orchestrator), implementing PDCA cycles.
Figure 2: Round-robin PDCA coordination with three specialized …
Figure 3: Qualitative comparison of anonymization methods on a …
Figure 4: Pipeline output on CityScapes. Each row: left original, middle detected PII (blue=persons, yellow=indirect PII vehicles, green=traffic signs, red=license plates), right anonymized output. Top (berlin 000002): 8 persons + 1 police vehicle + 2 traffic signs, 4.97% coverage, 3 PDCA iterations. Bottom (berlin 000472): multiple PII categories across both phases.
Original abstract

Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001, and FID: 9.1, significantly outperforming existing anonymization. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting EU's GDPR transparency requirements while flagging failed cases for human review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CAIAMAR, a multi-agent framework for context-aware anonymization of street-level images. It employs three specialized agents coordinating via round-robin selection in a PDCA cycle, using scout-and-zoom detection, open-vocabulary segmentation, and IoU-based deduplication (30% threshold) to classify context-dependent PII (e.g., private vs. public property) before applying modal-specific diffusion anonymization. Evaluations claim a 73% Re-ID risk reduction on CUHK03-NP (R1: 16.9% vs. 62.4% baseline) and strong quality preservation on CityScapes (KID: 0.001, FID: 9.1), with preserved semantic segmentation, on-premise operation, and GDPR-compliant audit trails.
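
The 73% figure is consistent with a relative drop in Rank-1 accuracy computed from the two reported values. The check below is a small worked example; reading the headline number as a relative reduction is an inference from the reported figures, and the function name is hypothetical.

```python
def relative_reduction(baseline: float, anonymized: float) -> float:
    """Fractional drop from the baseline score to the anonymized score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - anonymized) / baseline


# Rank-1 values as reported: 62.4% baseline vs. 16.9% after anonymization.
reduction = relative_reduction(62.4, 16.9)
print(f"{reduction:.1%}")  # prints 72.9%
```

The absolute drop is 45.5 percentage points; dividing by the 62.4% baseline gives roughly 72.9%, which rounds to the reported 73%.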

Significance. If the core multi-agent context detection proves reliable, the work could meaningfully advance privacy techniques in computer vision by enabling nuanced, context-sensitive anonymization that better preserves data utility than category-rigid baselines. The emphasis on open-source models, human-interpretable trails, and downstream task preservation (e.g., segmentation) adds practical value for applications like autonomous driving datasets.

major comments (3)
  1. [Evaluation] Evaluation section: No standalone detection metrics (precision, recall, or per-category error rates) are reported for the multi-agent system's context-dependent PII classification (private vs. public property). Only downstream Re-ID and perceptual scores are given, so the 73% Re-ID reduction cannot be confidently attributed to accurate context-awareness rather than general over-anonymization or diffusion strength.
  2. [Methodology] Methodology: The 30% IoU deduplication threshold is presented as fixed without ablation studies or sensitivity analysis, despite being explicitly listed as a free parameter; its effect on false negatives (missed identifiers) or false positives (over-anonymization) should be quantified to support the context-awareness claim.
  3. [Results] Results: Statistical significance (e.g., confidence intervals or p-values) for the Re-ID, KID, and FID improvements over baselines is not reported, weakening the assertion of significant outperformance on CUHK03-NP and CityScapes.
minor comments (2)
  1. [Abstract] Abstract: Specific baseline methods and their exact scores are not named when claiming to 'significantly outperform existing anonymization,' reducing clarity.
  2. [Method] Notation and description: The distinct roles of the three agents and the precise mechanics of round-robin speaker selection in the PDCA cycle require clearer definition for reproducibility.
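
The standalone detection metrics requested in major comment 1 could be computed along the following lines. This is an illustrative sketch only: it assumes each candidate object carries a binary ground-truth label (True meaning private and therefore to be anonymized) and a binary agent decision, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive ("private") class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # over-anonymized
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # identifier left exposed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

In this framing, low precision corresponds to over-anonymization of public objects and low recall to private identifiers left exposed, which are the two failure modes the referee asks the authors to separate.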

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional evaluations, ablations, and statistical reporting as outlined.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: No standalone detection metrics (precision, recall, or per-category error rates) are reported for the multi-agent system's context-dependent PII classification (private vs. public property). Only downstream Re-ID and perceptual scores are given, so the 73% Re-ID reduction cannot be confidently attributed to accurate context-awareness rather than general over-anonymization or diffusion strength.

    Authors: We agree that direct metrics on the multi-agent context classification would provide stronger evidence for attributing gains to context-awareness. In the revised manuscript we will add a new evaluation subsection with precision, recall, and F1 scores for private-vs-public property classification, computed on a manually annotated held-out subset of CUHK03-NP and CityScapes images. These metrics will be reported alongside the existing Re-ID and perceptual results. revision: yes

  2. Referee: [Methodology] Methodology: The 30% IoU deduplication threshold is presented as fixed without ablation studies or sensitivity analysis, despite being explicitly listed as a free parameter; its effect on false negatives (missed identifiers) or false positives (over-anonymization) should be quantified to support the context-awareness claim.

    Authors: The referee correctly notes the absence of sensitivity analysis for the IoU threshold. We will add an ablation study in the revised paper that varies the threshold from 0.1 to 0.5, reporting the resulting Re-ID risk, KID, FID, number of deduplicated regions, and downstream segmentation mIoU for each value. This will quantify the impact on false negatives and false positives and justify the chosen 30% operating point. revision: yes

  3. Referee: [Results] Results: Statistical significance (e.g., confidence intervals or p-values) for the Re-ID, KID, and FID improvements over baselines is not reported, weakening the assertion of significant outperformance on CUHK03-NP and CityScapes.

    Authors: We acknowledge that formal statistical significance measures are missing. In the revision we will report 95% confidence intervals obtained via bootstrap resampling (1000 iterations) for all Re-ID, KID, and FID scores. Where appropriate we will also include p-values from paired statistical tests against the strongest baselines. revision: yes
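
A percentile bootstrap of the kind described in the third response might look like the sketch below. It assumes the metric is a mean over per-query values, for example 0/1 Rank-1 hit indicators; set-level metrics such as KID and FID would instead need to be recomputed on each resampled image set. The 1000 iterations follow the rebuttal, while the function name, the fixed seed, and the percentile method are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    iterations: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(iterations))
    lower = means[int((alpha / 2) * iterations)]
    upper = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper
```

For instance, `bootstrap_ci([1.0] * 17 + [0.0] * 83)` returns an interval around a 17% hit rate for a hypothetical set of 100 queries; the width of that interval is what would show whether a reported gap over a baseline is larger than resampling noise.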

Circularity Check

0 steps flagged

No circularity: empirical metrics are direct measurements on public benchmarks

Full rationale

The paper describes a multi-agent system (CAIAMAR) with PDCA workflow, scout-and-zoom detection, open-vocabulary segmentation, and diffusion anonymization. All reported results (73% Re-ID reduction on CUHK03-NP, KID 0.001 / FID 9.1 on CityScapes) are downstream empirical measurements on fixed public datasets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. The central claim rests on observable performance differences rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach assumes reliable performance of open-source VLMs and diffusion models for the task, with specific thresholds chosen for the workflow.

free parameters (1)
  • IoU threshold = 30%
    Threshold for deduplication of detected PII instances.
axioms (1)
  • domain assumption: Vision-language models can accurately classify PII based on spatial context such as private vs. public property
    Core to the agent reasoning process described.
invented entities (1)
  • Multi-agent system with scout-and-zoom strategy (no independent evidence)
    purpose: To identify and process context-dependent PII
    Introduced as part of the new framework.

pith-pipeline@v0.9.0 · 5653 in / 1396 out tokens · 64340 ms · 2026-05-14T21:11:08.075853+00:00 · methodology

