pith. sign in

arxiv: 2606.28061 · v1 · pith:LOFD2KDGnew · submitted 2026-06-26 · 💻 cs.CR · cs.AI

ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents

Pith reviewed 2026-06-29 03:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agentsprivacytool usebenchmarkpurpose-bound privacyinformation disclosuremulti-tool trajectoriesover-disclosure
0
0 comments X

The pith

Successful task completion in LLM agents does not guarantee appropriate privacy disclosure during tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToolPrivacyBench to evaluate whether tool-using LLM agents route private information only to tools authorized for a given purpose across full multi-step trajectories. It builds 2,150 test cases, each paired with a policy knowledge base that specifies allowed information flows, then compares an agent's recorded tool arguments and backend logs against those policies. Evaluation across nine agents shows that task success alone does not prevent over-disclosure of private data in intermediate calls. The benchmark therefore shifts focus from final outputs to purpose-bound boundaries enforced at every tool invocation.

Core claim

ToolPrivacyBench audits whether task-private atoms are routed only to authorized tools and downstream sinks, thereby evaluating both task completion and privacy over-disclosure during tool use. The benchmark contains 2,150 cases, including 1,150 fully synthetic privacy-sensitive business workflows and 1,000 cases adapted from existing multi-tool benchmarks. Each case is represented by a policy knowledge base. After an agent executes against mock business backends, the evaluator compares recorded tool arguments and backend audit logs with this policy knowledge base. The results show that successful tool execution does not imply appropriate privacy disclosure: an agent may complete a task whil

What carries the argument

Policy knowledge base that defines authorized information flows and purpose-bound boundaries for each tool, enabling comparison against recorded arguments and audit logs in full trajectories.

If this is right

  • Task-completion metrics alone are insufficient to certify privacy-safe agent behavior.
  • Trajectory-level auditing of tool arguments against purpose policies can surface over-disclosure invisible to final-response checks.
  • Existing function-calling benchmarks can be extended with policy knowledge bases to add privacy evaluation.
  • Agents must enforce need-to-know boundaries at each tool call rather than only at task end.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Runtime systems could embed similar policy checks to block unauthorized tool arguments before they are sent.
  • The same auditing approach could apply to non-LLM agents that chain external services.
  • Automated extraction of policy knowledge bases from task specifications would reduce manual setup cost for new domains.

Load-bearing premise

The policy knowledge base for each case accurately and completely defines the authorized information flows and purpose-bound boundaries for every tool in the trajectory.

What would settle it

An observed agent trajectory in which the benchmark reports over-disclosure but independent review of the policy and logs shows every transmitted atom was in fact authorized for that tool's purpose.

Figures

Figures reproduced from arXiv: 2606.28061 by Liang Liu, Shijing Hu, Zhicheng Zhao, Zhu Meng.

Figure 1
Figure 1. Figure 1: Motivating example of purpose-bound privacy in a multi-tool workflow. The same private atom may be necessary for one tool but unauthorized for [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the ToolPrivacyBench evaluation pipeline. A user task is executed by a tool-using LLM agent through an executable tool wrapper. Tool [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Purpose-bound policy representation and post-execution audit in ToolPrivacyBench. Private atoms are associated with tool purposes and information [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Source composition of the public-derived split. Blue bars report [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SMTC-based overall model ranking on ToolPrivacyBench. Mod [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Utility and privacy comparison across evaluated LLM agents on the public-derived and synthetic private splits. The x-axis reports TaskSuccess, and [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Privacy over-disclosure diagnostics. (a) FOR, SWLR, and FreeTextFOR by information sink. (b) FTSlotRate and FreeTextFOR by free-text field. All [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

Large language models (LLMs) have increasingly moved from standalone text generation systems to agents that invoke external tools, access environments, and execute multi-step tasks. However, conventional function-calling benchmarks mainly evaluate task completion and API correctness, while privacy evaluation benchmarks typically focus on final responses or privacy judgments. Neither perspective captures purpose-bound information flow across an executed multi-tool trajectory. Motivated by this limitation in current agent evaluation, ToolPrivacyBench audits whether task-private atoms are routed only to authorized tools and downstream sinks, thereby evaluating both task completion and privacy over-disclosure during tool use. The benchmark contains 2,150 cases, including 1,150 fully synthetic privacy-sensitive business workflows and 1,000 cases adapted from existing multi-tool and function-calling benchmarks. Each case is represented by a policy knowledge base. After an agent executes against mock business backends, the evaluator compares recorded tool arguments and backend audit logs with this policy knowledge base. The evaluation covers nine widely used agents to characterize purpose-bound privacy over-disclosure. The results show that successful tool execution does not imply appropriate privacy disclosure: an agent may complete a task while transmitting unnecessary private information through intermediate tool calls. ToolPrivacyBench therefore formalizes a need-to-know disclosure boundary, under which each tool should receive only the information necessary for its stated purpose, and uses trajectory-level auditing to identify privacy over-disclosure in multi-tool workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents ToolPrivacyBench, a benchmark of 2,150 cases (1,150 synthetic privacy-sensitive workflows and 1,000 adapted from existing multi-tool benchmarks) that evaluates purpose-bound privacy in LLM agents by auditing tool-call trajectories against per-case policy knowledge bases. After agents execute tasks on mock backends, the evaluator checks whether private information atoms are routed only to authorized tools and sinks. Evaluation of nine agents shows that task completion does not guarantee appropriate disclosure, as agents may transmit unnecessary private data via intermediate calls. The work formalizes a need-to-know boundary and trajectory-level auditing for over-disclosure.

Significance. If the policy knowledge bases accurately capture authorized flows, the benchmark would usefully extend agent evaluation beyond task success and final-response privacy to multi-step information-flow correctness. The combination of synthetic and adapted cases plus explicit trajectory auditing is a constructive addition to the literature on agent privacy.

major comments (2)
  1. [Benchmark Construction and Evaluation sections (around the description of the 2,150 cases and policy KB)] The construction, validation, and inter-annotator agreement for the policy knowledge bases (both the 1,150 synthetic workflows and the 1,000 adapted cases) receive no description, including any error rates or auditing procedure for the authorized flows. This is load-bearing for the central claim because over-disclosure detections are produced by comparing tool arguments and audit logs against these KBs; without evidence that the KBs are accurate and complete, the reported over-disclosure rates could reflect policy mis-specification rather than agent behavior.
  2. [Evaluation of nine agents] No details are supplied on how the nine agents were selected, prompted, or executed (including temperature, system prompts, or tool-calling harness), nor on the reproducibility of the 2,150 trajectories. This directly affects the reliability of the finding that successful execution does not imply appropriate disclosure.
minor comments (1)
  1. The abstract and introduction would benefit from a short table summarizing the nine agents, their base models, and the number of trajectories each produced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on ToolPrivacyBench. The points raised regarding benchmark construction details and evaluation reproducibility are valid and will be addressed through revisions to improve the manuscript's clarity and rigor.

read point-by-point responses
  1. Referee: [Benchmark Construction and Evaluation sections (around the description of the 2,150 cases and policy KB)] The construction, validation, and inter-annotator agreement for the policy knowledge bases (both the 1,150 synthetic workflows and the 1,000 adapted cases) receive no description, including any error rates or auditing procedure for the authorized flows. This is load-bearing for the central claim because over-disclosure detections are produced by comparing tool arguments and audit logs against these KBs; without evidence that the KBs are accurate and complete, the reported over-disclosure rates could reflect policy mis-specification rather than agent behavior.

    Authors: We acknowledge that the current manuscript lacks explicit details on the construction, validation, and inter-annotator agreement for the policy knowledge bases. In the revised manuscript, we will add a new subsection in the Benchmark Construction section that describes: (1) the generation process for the 1,150 synthetic workflows, including how authorized flows were defined based on task purposes; (2) the adaptation procedure for the 1,000 cases from existing benchmarks, including how policies were mapped; (3) any validation steps, auditing procedures, and error rates assessed for the KBs; and (4) inter-annotator agreement metrics where multiple annotators were involved. This addition will strengthen the central claim by providing evidence of KB accuracy. revision: yes

  2. Referee: [Evaluation of nine agents] No details are supplied on how the nine agents were selected, prompted, or executed (including temperature, system prompts, or tool-calling harness), nor on the reproducibility of the 2,150 trajectories. This directly affects the reliability of the finding that successful execution does not imply appropriate disclosure.

    Authors: We agree that the manuscript should provide more details on agent selection, prompting, execution, and reproducibility to support the reliability of the results. In the revised Evaluation section, we will include: the selection criteria for the nine agents (e.g., popularity, diversity in tool-calling capabilities); the full system prompts and tool-calling harness configurations used; execution parameters such as temperature settings; and measures for reproducibility, including any use of fixed seeds, logging of trajectories, or multiple runs with reported variance. These additions will allow readers to better assess the finding that task completion does not guarantee appropriate disclosure. revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark definition is independent of results

full rationale

The paper defines ToolPrivacyBench as a collection of 2,150 cases each equipped with a policy knowledge base, then audits agent trajectories by comparing recorded tool arguments against that base. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claim (successful execution does not imply appropriate disclosure) is established by direct comparison to the KB rather than by any self-referential reduction or self-citation chain. The accuracy of the KB is an external assumption, not a definitional loop or fitted input renamed as prediction. This satisfies the default expectation of no significant circularity for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; none can be identified.

pith-pipeline@v0.9.1-grok · 5787 in / 1062 out tokens · 39684 ms · 2026-06-29T03:41:07.565364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 15 canonical work pages

  1. [1]

    M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, Y . Li, API-Bank: A comprehensive bench- mark for tool-augmented LLMs, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, Association for Computational Lin- guistics, Singapore, 2023, pp. 3102–3116. URL:https: //aclanthology.org/2023.emnlp- main.187/...

  2. [2]

    S. Yao, N. Shinn, P. Razavi, K. Narasimhan,τ-bench: A benchmark for tool-agent-user interaction in real-world domains, in: International Conference on Learning Rep- resentations, 2025. URL:https://proceedings.iclr .cc/paper_files/paper/2025/hash/1b126cc38b 8638e07bef37e7b2bb72bf-Abstract-Conference. html

  3. [4]

    Carlini, F

    N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, C. Raffel, Extracting training data from large language models, in: 30th USENIX Se- curity Symposium (USENIX Security 21), USENIX As- sociation, 2021, pp. 2633–2650. URL:https://www.us enix.org/conference/usenixsecurity21/pres...

  4. [5]

    Y . Shao, T. Li, W. Shi, Y . Liu, D. Yang, PrivacyLens: Evaluating privacy norm awareness of language models in action, in: Advances in Neural Information Processing Systems, volume 37, 2024, pp. 89373–89407. URL:ht tps://proceedings.neurips.cc/paper_files/p aper/2024/hash/a2a7e58309d5190082390ff10ff 3b2b8-Abstract-Datasets_and_Benchmarks_Tra ck.html. doi...

  5. [6]

    W. Yang, L. Some, M. Bain, B. Kang, A comprehen- sive survey on integrating large language models with knowledge-based methods, Knowledge-Based Systems 318 (2025) 113503. URL:https://doi.org/10.101 6/j.knosys.2025.113503. doi:10.1016/j.knosys.2 025.113503

  6. [7]

    Z. Zeng, Q. Cheng, X. Hu, Y . Zhuang, X. Liu, K. He, Z. Liu, KoSEL: Knowledge subgraph enhanced large language model for medical question answering, Knowledge-Based Systems 309 (2025) 112837. URL:ht tps://doi.org/10.1016/j.knosys.2024.112837. doi:10.1016/j.knosys.2024.112837

  7. [8]

    Carlini, C

    N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, D. Song, The secret sharer: Evaluating and testing unintended mem- orization in neural networks, in: 28th USENIX Secu- rity Symposium (USENIX Security 19), USENIX As- sociation, Santa Clara, CA, 2019, pp. 267–284. URL: https://www.usenix.org/conference/usenix security19/presentation/carlini

  8. [9]

    H. Li, D. Guo, D. Li, W. Fan, Q. Hu, X. Liu, C. Chan, D. Yao, Y . Yao, Y . Song, PrivLM-Bench: A multi- level privacy evaluation benchmark for language mod- els, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), Association for Computational Lin- guistics, Bangkok, Thailand, 2024, pp. 54–...

  9. [10]

    Mireshghallah, H

    N. Mireshghallah, H. Kim, X. Zhou, Y . Tsvetkov, M. Sap, R. Shokri, Y . Choi, Can llms keep a secret? testing pri- vacy implications of language models via contextual in- tegrity theory, in: International Conference on Learning Representations, 2024. URL:https://proceedings. iclr.cc/paper_files/paper/2024/hash/08305d 8b2ddab98932c163ea73df065f-Abstract-Co...

  10. [11]

    El Yagoubi, G

    F. El Yagoubi, G. Badu-Marfo, R. Al Mallah, AgentLeak: A benchmark for internal-channel privacy leakage in multi-agent LLM systems, IEEE Access 14 (2026). URL: https://doi.org/10.1109/ACCESS.2026.3704541. doi:10.1109/ACCESS.2026.3704541

  11. [12]

    URL: https://www.dtaalliance.org/work/ai-agent s-privacy-and-the-importance-of-context-i n-data-regulation

    Data & Trusted AI Alliance, AI Agents, Privacy, and the Importance of Context in Data Regulation, 2025. URL: https://www.dtaalliance.org/work/ai-agent s-privacy-and-the-importance-of-context-i n-data-regulation

  12. [13]

    URL:https://cdn.openai.com/pdf /openai-privacy-hackathon-report-jan26.pdf

    OpenAI, OpenAI Privacy Hackathon Report, Report, OpenAI, 2026. URL:https://cdn.openai.com/pdf /openai-privacy-hackathon-report-jan26.pdf

  13. [14]

    S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, J. E. Gonzalez, The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large lan- guage models, in: Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofPro- ceedings of Machine Learning Research, PMLR, 2025, pp. 48371–48392. UR...

  14. [15]

    X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y . Su, H. Sun, M. Huang, Y . Dong, J. Tang, AgentBench: Evaluating LLMs as agents, in: International Conference on Learn- ing Representations, 2024. URL:https://proceeding s.iclr.cc/paper_files/paper/2024/hash...

  15. [16]

    J. Lu, T. Holleis, Y . Zhang, B. Aumayer, F. Nan, H. Bai, S. Ma, S. Ma, M. Li, G. Yin, Z. Wang, R. Pang, Tool- Sandbox: A stateful, conversational, interactive evalua- tion benchmark for LLM tool use capabilities, in: Find- ings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguis- tics, Albuquerque, New Mexi...

  16. [17]

    P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y . Xing, J. Tang, B. Du- moulin, TRAJECT-Bench: A trajectory-aware benchmark for evaluating agentic tool use, in: International Confer- ence on Learning Representations, 2026. URL:https: 22 //openreview.net/forum?id=TZWnWvsQ0X, ICLR 2026 Poster

  17. [18]

    Y . Ruan, H. Dong, A. Wang, S. Pitis, Y . Zhou, J. Ba, Y . Dubois, C. J. Maddison, T. Hashimoto, Identify- ing the risks of LM agents with an LM-emulated sand- box, in: International Conference on Learning Represen- tations, 2024. URL:https://proceedings.iclr.cc/ paper_files/paper/2024/hash/7274ed909a312d 4d869cc328ad1c5f04-Abstract-Conference.html

  18. [19]

    Debenedetti, J

    E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer- Kellner, M. Fischer, F. Tramèr, Agentdojo: A dynamic environment to evaluate prompt injection attacks and de- fenses for LLM agents, in: Advances in Neural In- formation Processing Systems, volume 37, 2024. URL: https://proceedings.neurips.cc/paper_files /paper/2024/hash/97091a5177d8dc64b1da8bf3e 1f6fb54-...

  19. [20]

    Q. Zhan, Z. Liang, Z. Ying, D. Kang, InjecAgent: Bench- marking indirect prompt injections in tool-integrated large language model agents, in: Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 10471–10506. URL:https://aclanthology.org /2024.findings-acl.624/. doi:10...

  20. [21]

    Zhang, S

    Z. Zhang, S. Cui, Y . Lu, J. Zhou, J. Yang, H. Wang, M. Huang, Agent-safetybench: Evaluating the safety of LLM agents, 2024. URL:https://arxiv.org/abs/24 12.14470.arXiv:2412.14470

  21. [22]

    Andriushchenko, A

    M. Andriushchenko, A. Souly, M. Dziemian, D. Due- nas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, Y . Gal, X. Davies, Agentharm: A bench- mark for measuring harmfulness of LLM agents, in: Inter- national Conference on Learning Representations, 2025. URL:https://proceedings.iclr.cc/paper_file s/paper/2025/hash/c493d23af93118975cdbc32c...

  22. [23]

    Zhang, J

    H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, Y . Zhang, Agent security bench (ASB): For- malizing and benchmarking attacks and defenses in LLM- based agents, in: International Conference on Learning Representations, 2025. URL:https://proceedings. iclr.cc/paper_files/paper/2025/hash/5750f9 1d8fb9d5c02bd8ad2c3b44456b-Abstract-Confere nce.html

  23. [24]

    P. Y . Zhong, S. Chen, R. Wang, M. McCall, B. L. Titzer, H. Miller, P. B. Gibbons, RTBAS: Defending LLM agents against prompt injection and privacy leakage,

  24. [25]

    arXiv:2502.08966

    URL:https://arxiv.org/abs/2502.08966. arXiv:2502.08966

  25. [26]

    Z. Deng, J. Gui, W. Zhang, From secure agentic AI to secure agentic web: Challenges, threats, and future direc- tions, 2026. URL:https://arxiv.org/abs/2603.0 1564.arXiv:2603.01564

  26. [27]

    H. Wang, C. M. Poskitt, J. Wei, J. Sun, ProbGuard: Probabilistic runtime monitoring for LLM agent safety,

  27. [28]

    arXiv:2508.00500

    URL:https://arxiv.org/abs/2508.00500. arXiv:2508.00500

  28. [29]

    Liang, S

    X. Liang, S. Niu, Z. Li, S. Zhang, H. Wang, F. Xiong, Z. Fan, B. Tang, J. Zhao, J. Yang, S. Song, M. Wang, SafeRAG: Benchmarking security in retrieval-augmented generation of large language model, in: Proceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (V olume 1: Long Papers), Association for Computational Linguistic...

  29. [31]

    Azzam, A

    R. Azzam, A. Musamih, S. Gebreab, K. Salah, M. Omar, Agentic LLM for anonymizing healthcare data with con- textual awareness, Knowledge-Based Systems 343 (2026) 116034. URL:https://doi.org/10.1016/j.knosys .2026.116034. doi:10.1016/j.knosys.2026.1160 34

  30. [32]

    Q. Han, Z. Yang, H. Lin, T. Qin, A plug-and-play knowledge-enhanced module for medical reports gener- ation, Knowledge-Based Systems 309 (2025) 112805. URL:https://doi.org/10.1016/j.knosys.202 4.112805. doi:10.1016/j.knosys.2024.112805

  31. [33]

    J. Liu, W. Hao, K. Cheng, G. Chen, X. Xie, CART: A traceable zero-shot planning framework for large lan- guage models with adaptive replanning, Knowledge- Based Systems 336 (2026) 115189. URL:https://do i.org/10.1016/j.knosys.2025.115189. doi:10.101 6/j.knosys.2025.115189

  32. [34]

    R. Hou, S. Chen, Y . Fan, G. Yu, L. Zhu, J. Sun, J. Liu, T. Ruan, MSDiagnosis: A benchmark and framework for evaluating large language models in multi-step clinical di- agnosis, Knowledge-Based Systems 330 (2025) 114524. URL:https://doi.org/10.1016/j.knosys.2025. 114524. doi:10.1016/j.knosys.2025.114524

  33. [35]

    Weis, Securing AI agents: Why traditional authoriza- tion isn’t enough, 2026

    O. Weis, Securing AI agents: Why traditional authoriza- tion isn’t enough, 2026. URL:https://www.permit.i o/blog/securing-ai-agents-why-traditional-a uthorization-isnt-enough. 23

  34. [36]

    S. Rose, O. Borchert, S. Mitchell, S. Connelly, Zero Trust Architecture, NIST Special Publication 800-207, National Institute of Standards and Technology, 2020. URL:http s://doi.org/10.6028/NIST.SP.800-207. doi:10.6 028/NIST.SP.800-207

  35. [37]

    URL:https://op enai.com/index/introducing-gpt-5-5/

    OpenAI, Introducing GPT-5.5, 2026. URL:https://op enai.com/index/introducing-gpt-5-5/

  36. [38]

    URL:ht tps://www.anthropic.com/news/claude-opus-4 -7

    Anthropic, Introducing Claude Opus 4.7, 2026. URL:ht tps://www.anthropic.com/news/claude-opus-4 -7

  37. [39]

    URL: https://api-docs.deepseek.com/news/news2604 24

    DeepSeek, DeepSeek V4 Preview Release, 2026. URL: https://api-docs.deepseek.com/news/news2604 24

  38. [40]

    URL:https://github .com/MoonshotAI/Kimi-K2.5

    Moonshot AI, Kimi-K2.5, 2026. URL:https://github .com/MoonshotAI/Kimi-K2.5

  39. [41]

    URL:https://docs .z.ai/guides/llm/glm-5.1

    Z.AI, GLM-5.1: Overview, 2026. URL:https://docs .z.ai/guides/llm/glm-5.1

  40. [42]

    Qwen Team, Qwen3.6-Plus: Towards real world agents,

  41. [43]

    URL:https://qwen.ai/blog?id=qwen3.6

  42. [44]

    URL: https://ai.google.dev/gemini-api/docs/mode ls/gemini-3.5-flash

    Google AI for Developers, Gemini 3.5 Flash, 2026. URL: https://ai.google.dev/gemini-api/docs/mode ls/gemini-3.5-flash

  43. [45]

    URL:https://seed .bytedance.com/en/seed2

    ByteDance Seed, Seed2.0, 2026. URL:https://seed .bytedance.com/en/seed2

  44. [46]

    URL:https://www

    MiniMax, MiniMax M2.7, 2026. URL:https://www. minimax.io/models/text/m27. 24