
arxiv: 2605.16630 · v2 · pith:EEO673QNnew · submitted 2026-05-15 · 💻 cs.CR · cs.AI

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems

Pith reviewed 2026-05-20 16:12 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords privacy · disclosure control · hybrid agents · cloud language models · task scoping · data leakage prevention · information abstraction · agentic systems

The pith

Task-scoped disclosure control on device can prevent over-disclosure to cloud models in hybrid agents while preserving task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a local trusted governor can enforce task-scoped disclosure for hybrid agents that delegate to cloud language models. By extracting disclosure units, keeping sensitive identifiers local, and abstracting only the necessary minimal information, it reduces unnecessary exposure from persistent state and prior workflows. A reader would care if this holds because over-disclosure leads to profile leakage and higher re-identification success by attackers. If the approach works, agents can use rich local context without sending excess sensitive data to the cloud.

Core claim

PrivScope presents a trusted on-device payload governor that enforces task-scoped disclosure at the local-cloud language model boundary without requiring changes to the cloud models. The key idea is that sensitive information should reach the cloud only when required for the delegated subtask, and then only in the least revealing form that preserves utility. It extracts disclosure units from the assembled payload, keeps direct identifiers and account-linked values on device, and routes the rest through a cloud-necessity control that determines actual needs and abstracts to least-specific representations.
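
The flow described above (extract units, keep identifiers local, filter by necessity, abstract what remains) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the field names, the LOCAL_ONLY set, the NEEDED_BY_SUBTASK table, and the abstraction rules are all hypothetical, and a static lookup table stands in for the paper's cloud-necessity control.

```python
from typing import Callable

# All names below are hypothetical, chosen only for illustration.
LOCAL_ONLY = {"name", "email", "phone", "insurance_id"}  # direct identifiers stay on device
NEEDED_BY_SUBTASK = {"find_specialist": {"condition", "address", "dob"}}


def _birth_decade(dob: str) -> str:
    # "1984-03-12" -> "born in the 1980s"
    return f"born in the {dob[:3]}0s"


def _city_only(address: str) -> str:
    # "12 Elm St, Springfield" -> "Springfield"
    return address.split(",")[-1].strip()


def _condition_class(condition: str) -> str:
    # Toy mapping from a specific diagnosis to a coarser class.
    classes = {"type 2 diabetes": "endocrine condition"}
    return classes.get(condition.lower(), "medical condition")


ABSTRACTORS: dict[str, Callable[[str], str]] = {
    "dob": _birth_decade,
    "address": _city_only,
    "condition": _condition_class,
}


def govern(payload: dict[str, str], subtask: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split a payload into a cloud-visible part and a part kept on device."""
    needed = NEEDED_BY_SUBTASK.get(subtask, set())
    cloud: dict[str, str] = {}
    local: dict[str, str] = {}
    for kind, value in payload.items():
        # Keep a unit local if it is a direct identifier, is not needed for
        # this subtask, or has no abstraction rule (conservative default).
        if kind in LOCAL_ONLY or kind not in needed or kind not in ABSTRACTORS:
            local[kind] = value
        else:
            cloud[kind] = ABSTRACTORS[kind](value)
    return cloud, local


payload = {
    "name": "Jane Roe",
    "email": "jane@example.com",
    "dob": "1984-03-12",
    "address": "12 Elm St, Springfield",
    "condition": "type 2 diabetes",
    "past_order": "hiking boots",  # carryover from an unrelated workflow
}
cloud, local = govern(payload, "find_specialist")
print(cloud)
print(local)
```

In this toy run the name, email, and unrelated carryover stay on device, while the date of birth, address, and condition reach the cloud only in coarsened form. An unknown subtask sends nothing, which reflects a default-deny design choice made for the sketch.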

What carries the argument

The cloud-necessity control, which determines the minimal information required for each subtask and abstracts it to the least-specific representation sufficient for the task.

If this is right

  • Profile leakage drops to zero in the tested workflows compared to 17.7 percent without control
  • Attacker re-identification success is more than halved from 64.3 percent to 23.1 percent
  • Highest candidate recall is reached on every cloud language model tested
  • Task success stays close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash
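
The "more than halved" and "drops to zero" statements can be checked directly from the reported percentages. The short calculation below is a reader-side sanity check, not something taken from the paper; the function name relative_reduction is an assumption made for illustration.

```python
def relative_reduction(baseline: float, protected: float) -> float:
    """Fraction of the baseline rate removed by the protection."""
    return (baseline - protected) / baseline


# Reported rates in percent: unprotected baseline vs. PrivScope.
reid_drop = relative_reduction(64.3, 23.1)
leak_drop = relative_reduction(17.7, 0.0)

print(f"re-identification: {reid_drop:.1%} relative reduction")
print(f"profile leakage:   {leak_drop:.1%} relative reduction")
```

The re-identification rate falls by roughly 64 percent in relative terms, which is consistent with "more than halved", and the leakage rate falls by the full baseline amount.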

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary control could apply to other hybrid setups where local state is enriched before delegation to external services.
  • Abstraction rules might extend beyond text to handle structured records or other data formats in agent payloads.
  • Long-running agents with accumulating context could benefit from repeated scoping to limit cumulative exposure over multiple workflows.

Load-bearing premise

The premise is that the cloud-necessity control can reliably determine the minimal information required for each delegated subtask, and that abstraction to the least-specific representation still lets the cloud model complete the task without additional context.

What would settle it

Running the same medical-booking workflows with only the abstracted data and checking whether task completion rates stay close to the unprotected baseline or whether the models request extra context to succeed.

Figures

Figures reproduced from arXiv: 2605.16630 by Shafizur Rahman Seeam, Yidan Hu, Yimin (Ian) Chen, Zhengxiong Li, Zhiyuan Yu.

Figure 1: High-level overview of PrivScope. PrivScope mediates an over-inclusive LC→CLM payload, producing a task-sufficient cloud-visible version while keeping private context on device.
Figure 2: Hybrid local–cloud agent architecture. (Caption truncated in source.)
Figure 3: Overview of P… (Caption truncated in source.)
Figure 4: The extractor combines profile matching, structured-pattern … (Caption truncated in source.)
Figure 5: Role assignment partitions extracted disclosure units into … (Caption truncated in source.)
Figure 6: Task-sufficient abstraction over cloud-needed units. (Caption truncated in source.)
Figure 7: Sensitivity to the local model backbone. For each backbone, the same local model serves as both the LC controller and the sanitizer.
Figure 8: On-device sanitization latency of PrivScope across five local backbones, decomposed by pipeline stage and ordered by total runtime. Cloud-necessity analysis dominates latency; unit extraction contributes < 2% across all backbones.
Figure 9: Cloud API cost per 1,000 tasks across three commercial … (Caption truncated in source.)
read the original abstract

Hybrid local-cloud agents enrich user requests with context from persistent working state before delegating capability-intensive subtasks to a cloud language model (CLM). While this enrichment can improve task success, it also exposes unnecessary information in the cloud-bound payload, including task-irrelevant context, carryover from prior workflows, and overly specific sensitive details, resulting in over-disclosure. Existing solutions either isolate workflows to limit cross-workflow leakage or apply general-purpose sanitization that does not reason over LC-assembled payload scope. We present PrivScope, a trusted on-device payload governor that enforces task-scoped disclosure at the local-CLM boundary, without requiring cloud-side changes. Its key idea: sensitive information should reach the cloud only when required for the delegated subtask, and then only in the least revealing form preserving utility. PrivScope extracts disclosure units from the assembled payload and keeps direct identifiers and account-linked values on device. The remaining units pass through cloud-necessity control, which determines what is actually needed; units that must reach the cloud are abstracted to the least-specific representation sufficient for the task. On 100 medical-booking workflows across three commercial CLMs, PrivScope eliminates profile leakage (0.0% vs. 17.7%), more than halves attacker re-identification (23.1% vs. 64.3%), and achieves the highest candidate recall on every CLM tested while preserving task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash. Gains hold across five local backbones and add only seconds of on-device latency on commodity hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. PrivScope is an on-device payload governor for hybrid local-cloud agentic systems that enforces task-scoped disclosure: it extracts disclosure units from the assembled payload, retains direct identifiers and account-linked values locally, routes remaining units through a cloud-necessity control to decide what must reach the CLM, and abstracts those units to the least-specific representation that still permits task completion. The paper evaluates the system on 100 medical-booking workflows across three commercial CLMs (and five local backbones), reporting elimination of profile leakage (0.0% vs. 17.7%), more than halved attacker re-identification (23.1% vs. 64.3%), highest candidate recall on every CLM, and task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash, with only seconds of added on-device latency.

Significance. If the cloud-necessity control and abstraction steps are shown to be reliable, PrivScope would provide a concrete, deployable mechanism for reducing over-disclosure at the local-CLM boundary without requiring changes to cloud models. The empirical results on concrete leakage and re-identification metrics, together with the preservation of task utility, would constitute a useful data point for privacy engineering in agentic workflows.

major comments (2)
  1. [Cloud-necessity control and evaluation sections] The central privacy-utility claims rest on the accuracy of the cloud-necessity control and the utility of the subsequent abstraction step. The manuscript should report independent accuracy metrics for the control (e.g., precision/recall against ground-truth necessity labels) and ablation results showing how false-positive or false-negative decisions affect both leakage and task success; without these, the reported 0.0% leakage and halved re-identification cannot be confidently attributed to the mechanism rather than to the specific 100-workflow test set.
  2. [Evaluation] The evaluation protocol (data exclusion rules, exact definition of profile leakage and attacker re-identification, prompt templates for the three CLMs, and how task success is scored) is not described with sufficient detail to allow reproduction or to assess whether the 100 medical-booking workflows contain edge cases that would stress the necessity control.
minor comments (2)
  1. [Threat model] Clarify the exact threat model for the attacker re-identification metric (e.g., what auxiliary information the attacker is assumed to possess).
  2. [Discussion] Add a short discussion of failure modes when the abstraction step produces a representation that is still insufficient for the CLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested additions where they strengthen the work.

read point-by-point responses
  1. Referee: [Cloud-necessity control and evaluation sections] The central privacy-utility claims rest on the accuracy of the cloud-necessity control and the utility of the subsequent abstraction step. The manuscript should report independent accuracy metrics for the control (e.g., precision/recall against ground-truth necessity labels) and ablation results showing how false-positive or false-negative decisions affect both leakage and task success; without these, the reported 0.0% leakage and halved re-identification cannot be confidently attributed to the mechanism rather than to the specific 100-workflow test set.

    Authors: We agree that independent accuracy metrics and ablations would improve attribution of the observed privacy gains. In the revised manuscript we will add a new subsection reporting precision and recall of the cloud-necessity control against manually annotated ground-truth necessity labels on a held-out portion of the workflows. We will also include ablation results that inject controlled false-positive and false-negative decisions into the control and quantify the resulting changes in leakage and task-success metrics. These additions will make the causal link between the mechanism and the reported outcomes more explicit. revision: yes

  2. Referee: [Evaluation] The evaluation protocol (data exclusion rules, exact definition of profile leakage and attacker re-identification, prompt templates for the three CLMs, and how task success is scored) is not described with sufficient detail to allow reproduction or to assess whether the 100 medical-booking workflows contain edge cases that would stress the necessity control.

    Authors: We accept that the current Evaluation section omits several details required for reproducibility. The revised version will expand this section to specify the exact data exclusion rules, provide formal definitions of profile leakage and attacker re-identification, reproduce the prompt templates used with each CLM, and describe the task-success scoring procedure. We will also add a short analysis of workflow characteristics, highlighting any edge cases that could challenge the necessity control. revision: yes
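
The precision and recall measurement that the referee requests, and that the simulated authors agree to add, could be computed per workflow along the following lines. This is a minimal sketch under stated assumptions: the unit names and the ground-truth labels are invented for illustration, and the paper does not specify how such labels would be annotated.

```python
def precision_recall(predicted: set[str], truth: set[str]) -> tuple[float, float]:
    """Score the units a necessity control marked as cloud-needed against labels."""
    true_pos = len(predicted & truth)
    # An empty prediction or label set is scored as 1.0 to avoid division by zero.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical workflow: units the control sent vs. units annotators judged necessary.
predicted = {"condition", "address", "dob", "employer"}
truth = {"condition", "address", "dob", "insurance_plan"}

precision, recall = precision_recall(predicted, truth)
false_positives = predicted - truth  # sent but not needed: over-disclosure risk
false_negatives = truth - predicted  # needed but withheld: task-failure risk

print(precision, recall, false_positives, false_negatives)
```

The two error types map onto the two sides of the privacy-utility trade-off: false positives are where leakage would come from, and false negatives are where task success would suffer, which is why the referee asks for ablations on both.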

Circularity Check

0 steps flagged

No circularity: empirical results rest on external CLM evaluations

full rationale

The paper's central claims consist of measured performance differences (0.0% vs 17.7% profile leakage, 23.1% vs 64.3% re-identification) obtained by running PrivScope on 100 medical-booking workflows against three commercial CLMs and five local backbones. These outcomes are direct experimental observations rather than quantities derived from fitted parameters, self-citations, or equations that reduce to the inputs by construction. The cloud-necessity control is presented as an implemented component whose accuracy is assessed via the same external-task-success and leakage metrics; no load-bearing uniqueness theorem, ansatz, or renaming of known results is invoked. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review limits visibility into exact assumptions; the design implicitly relies on accurate unit extraction and necessity determination being feasible on-device.

axioms (1)
  • domain assumption: Local device can extract disclosure units and apply necessity control without external cloud assistance
    Core to the on-device governor design described in abstract.
invented entities (1)
  • cloud-necessity control (no independent evidence)
    purpose: Determines minimal information needed for delegated subtask
    New component introduced to decide what reaches the cloud




Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 6 internal anchors

  1. [1]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  2. [2]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

  3. [3]

    Introducing Operator,

    OpenAI, “Introducing Operator,” https://openai.com/index/introducing-operator/, 2025, [Online; accessed 25-Sep-2025]

  4. [4]

    Empower your digital tasks with AutoGPT,

    Autogpt, “Empower your digital tasks with AutoGPT,” https://agpt.co/, 2025, [Online; accessed 25-Sep-2025]

  5. [5]

    Automate Your Business with AgentGPT,

    AGENTGPT, “Automate Your Business with AgentGPT,” https://agentgpt.io/, 2024, [Online; accessed 25-Sep-2025]

  6. [6]

    Towards automating data access permissions in ai agents,

    Y. Wu, K. Yang, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “Towards automating data access permissions in ai agents,” arXiv preprint arXiv:2511.17959, 2025

  7. [7]

    Kaggle, “Agents,” https://www.kaggle.com/whitepaper-agents, 2025, [Online; accessed 25-Sep-2025]

  8. [8]

    Runtime permissions for privacy in proactive intelligent assistants,

    N. Malkin, D. Wagner, and S. Egelman, “Runtime permissions for privacy in proactive intelligent assistants,” in Eighteenth Symposium on Usable Privacy and Security (SOUPS 2022), 2022, pp. 633–651

  9. [9]

    Agentic plan caching: Test-time memory for fast and cost-efficient llm agents,

    Q. Zhang, M. Wornow, and K. Olukotun, “Agentic plan caching: Test-time memory for fast and cost-efficient llm agents,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems

  10. [10]

    Collaborative inference and learning between edge slms and cloud llms: A survey of algorithms, execution, and open challenges,

    S. Li, H. Wang, W. Xu, R. Zhang, S. Guo, J. Yuan, X. Zhong, T. Zhang, and R. Li, “Collaborative inference and learning between edge slms and cloud llms: A survey of algorithms, execution, and open challenges,” arXiv preprint arXiv:2507.16731, 2025

  11. [11]

    Beyond memorization: Violating privacy via inference with large language models,

    R. Staab, M. Vero, M. Balunovic, and M. Vechev, “Beyond memorization: Violating privacy via inference with large language models,” in The Twelfth International Conference on Learning Representations, 2023

  12. [12]

    Deprompt: Desensitization and evaluation of personal identifiable information in large language model prompts,

    X. Sun, G. Liu, Z. He, H. Li, and X. Li, “Deprompt: Desensitization and evaluation of personal identifiable information in large language model prompts,” arXiv preprint arXiv:2408.08930, 2024

  13. [13]

    Extracting training data from large language models,

    N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in 30th USENIX security symposium (USENIX Security 21), 2021, pp. 2633–2650

  14. [14]

    Sustainable ai: Environmental implications, challenges and opportunities,

    C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Bai et al., “Sustainable ai: Environmental implications, challenges and opportunities,” Proceedings of machine learning and systems, vol. 4, pp. 795–813, 2022

  15. [15]

    Splitreason: Learning to offload reasoning,

    Y. Akhauri, A. Fei, C.-C. Chang, A. F. AbouElhamayed, Y. Li, and M. S. Abdelfattah, “Splitreason: Learning to offload reasoning,” arXiv preprint arXiv:2504.16379, 2025

  16. [16]

    Cogenesis: A framework collaborating large and small language models for secure context-aware instruction following,

    K. Zhang, J. Wang, E. Hua, B. Qi, N. Ding, and B. Zhou, “Cogenesis: A framework collaborating large and small language models for secure context-aware instruction following,” arXiv preprint arXiv:2403.03129, 2024

  17. [17]

    Private Cloud Compute: A new frontier for AI privacy in the cloud,

    A. S. Engineering and A. (SEAR), “Private Cloud Compute: A new frontier for AI privacy in the cloud,” https://security.apple.com/documentation/private-cloud-compute, 2024, [Online; accessed 28-Oct-2025]

  18. [18]

    Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,

    Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Krishnamoorthi et al., “Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,” in Forty-first International Conference on Machine Learning, 2024

  19. [19]

    Pricing Flagship Model,

    OpenAI, “Pricing Flagship Model,” https://developers.openai.com/api/docs/pricing, 2025, [Online; accessed 25-Sep-2025]

  20. [20]

    Presidio: Data Protection and De-identification SDK,

    M. Presidio, “Presidio: Data Protection and De-identification SDK,” https://microsoft.github.io/presidio/, 2025, [Online; accessed 21-April-2026]

  21. [21]

    Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations,

    O. Feyisetan, B. Balle, T. Drake, and T. Diethe, “Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations,” in Proceedings of the 13th international conference on web search and data mining, 2020, pp. 178–186

  22. [22]

    Hide and seek (has): A lightweight framework for prompt privacy protection

    Y. Chen, T. Li, H. Liu, and Y. Yu, “Hide and seek (has): A lightweight framework for prompt privacy protection,” arXiv preprint arXiv:2309.03057, 2023

  23. [23]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried et al., “Webarena: A realistic web environment for building autonomous agents,” arXiv preprint arXiv:2307.13854, 2023

  24. [24]

    Agentdam: Privacy leakage evaluation for autonomous web agents,

    A. Zharmagambetov, C. Guo, I. Evtimov, M. Pavlova, R. Salakhutdinov, and K. Chaudhuri, “Agentdam: Privacy leakage evaluation for autonomous web agents,” arXiv preprint arXiv:2503.09780, 2025

  25. [25]

    Privacylens: Evaluating privacy norm awareness of language models in action,

    Y. Shao, T. Li, W. Shi, Y. Liu, and D. Yang, “Privacylens: Evaluating privacy norm awareness of language models in action,” Advances in Neural Information Processing Systems, vol. 37, pp. 89 373–89 407, 2024

  26. [26]

    Operationalizing contextual integrity in privacy-conscious assistants,

    S. Ghalebikesabi, E. Bagdasaryan, R. Yi, I. Yona, I. Shumailov, A. Pappu, C. Shi, L. Weidinger, R. Stanforth, L. Berrada et al., “Operationalizing contextual integrity in privacy-conscious assistants,” arXiv preprint arXiv:2408.02373, 2024

  27. [27]

    SecGPT: An Execution Isolation Architecture for LLM-Based Systems,

    Y. Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “Isolategpt: An execution isolation architecture for llm-based agentic systems,” arXiv preprint arXiv:2403.04960, 2024

  28. [28]

    Airgapagent: Protecting privacy-conscious conversational agents,

    E. Bagdasarian, R. Yi, S. Ghalebikesabi, P. Kairouz, M. Gruteser, S. Oh, B. Balle, and D. Ramage, “Airgapagent: Protecting privacy-conscious conversational agents,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 3868–3882

  29. [29]

    Alsa: Context-sensitive prompt privacy preservation in large language models,

    H. Ma, W. Lu, Y. Liang, T. Wang, Q. Zhang, Y. Zhu, and J. Si, “Alsa: Context-sensitive prompt privacy preservation in large language models,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 2042–2053

  30. [30]

    The fire thief is also the keeper: Balancing usability and privacy in prompts,

    Z. Shen, Z. Xi, Y. He, W. Tong, J. Hua, and S. Zhong, “The fire thief is also the keeper: Balancing usability and privacy in prompts,” arXiv preprint arXiv:2406.14318, 2024

  31. [31]

    Privacyrestore: Privacy-preserving inference in large language models via privacy removal and restoration,

    Z. Zeng, J. Wang, J. Yang, Z. Lu, H. Li, H. Zhuang, and C. Chen, “Privacyrestore: Privacy-preserving inference in large language models via privacy removal and restoration,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 10 821–10 855

  32. [32]

    Propile: Probing privacy leakage in large language models,

    S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, and S. J. Oh, “Propile: Probing privacy leakage in large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 20 750–20 762, 2023

  33. [33]

    Ciphergpt: Secure two-party gpt inference,

    X. Hou, J. Liu, J. Li, Y. Li, W.-j. Lu, C. Hong, and K. Ren, “Ciphergpt: Secure two-party gpt inference,” Cryptology ePrint Archive, 2023

  34. [34]

    Iron: Private inference on transformers,

    M. Hao, H. Li, H. Chen, P. Xing, G. Xu, and T. Zhang, “Iron: Private inference on transformers,” Advances in neural information processing systems, vol. 35, pp. 15 718–15 731, 2022

  35. [35]

    Anti-adversarial learning: Desensitizing prompts for large language models,

    X. Li, Z. Yin, X. Gu, and B. Shen, “Anti-adversarial learning: Desensitizing prompts for large language models,” arXiv preprint arXiv:2505.01273, 2025

  36. [36]

    Can llms keep a secret? testing privacy implications of language models via contextual integrity theory,

    N. Mireshghallah, H. Kim, X. Zhou, Y. Tsvetkov, M. Sap, R. Shokri, and Y. Choi, “Can llms keep a secret? testing privacy implications of language models via contextual integrity theory,” arXiv preprint arXiv:2310.17884, 2023

  37. [37]

    Privacy as contextual integrity,

    H. Nissenbaum, “Privacy as contextual integrity,” Wash. L. Rev., vol. 79, p. 119, 2004

  38. [38]

    Ci-bench: Benchmarking contextual integrity of ai assistants on synthetic data,

    Z. Cheng, D. Wan, M. Abueg, S. Ghalebikesabi, R. Yi, E. Bagdasarian, B. Balle, S. Mellem, and S. O’Banion, “Ci-bench: Benchmarking contextual integrity of ai assistants on synthetic data,” arXiv preprint arXiv:2409.13903, 2024

  39. [39]

    Industrial-Strength Natural Language Processing,

    SpaCy, “Industrial-Strength Natural Language Processing,” https://spacy.io/, 2025, [Online; accessed 25-Sep-2025]

  40. [40]

    A survey on in-context learning,

    Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., “A survey on in-context learning,” in Proceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 1107–1128

  41. [41]

    Can generalist foundation models outcompete special-purpose tuning? case study in medicine,

    H. Nori, Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu et al., “Can generalist foundation models outcompete special-purpose tuning? case study in medicine,” arXiv preprint arXiv:2311.16452, 2023

  42. [42]

    Confidence in the reasoning of large language models,

    Y. Pawitan and C. Holmes, “Confidence in the reasoning of large language models,” Harvard Data Science Review, vol. 7, no. 1, pp. 2644–2353, 2025

  43. [43]

    Ship agents that wow,

    LangChain, “Ship agents that wow,” https://www.langchain.com/, 2025, [Online; accessed 21-April-2026]

  44. [44]

    The easiest way to build with open models,

    Ollama, “The easiest way to build with open models,” https://ollama.com/, 2025, [Online; accessed 25-Sep-2025]

  45. [45]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  46. [46]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl et al., “Phi-3 technical report: A highly capable language model locally on your phone,” arXiv preprint arXiv:2404.14219, 2024

  47. [47]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023

  48. [48]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  49. [49]

    GPT-4o mini: advancing cost-efficient intelligence,

    OpenAi, “GPT-4o mini: advancing cost-efficient intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2025, [Online; accessed 6-May-2026]

  50. [50]

    Claude Haiku 4.5,

    Claude, “Claude Haiku 4.5,” https://www.anthropic.com/claude/haiku, 2025, [Online; accessed 6-May-2026]

  51. [51]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  52. [52]

    Pricing,

    Claude, “Pricing,” https://platform.claude.com/docs/en/about-claude/pricing, 2025, [Online; accessed 5-May-2026].

APPENDIX A. Methodology details

Algorithm 1 summarizes the end-to-end flow, Algorithm 2 expands the payload-mediation procedure, and Table III summarizes the notation used throughout this section. TABLE III: Notation used in the PRIVSCOPEde...