arxiv: 2605.05287 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.AI · cs.IR · cs.SE


Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

Francisco Javier Arceo, Varsha Prasad Narsing


Pith reviewed 2026-05-08 16:31 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.IR · cs.SE

keywords multitenant security · RAG access control · ABAC gating · enterprise agent isolation · policy-aware retrieval · server-side orchestration · cross-tenant leakage · tool authorization


The pith

ABAC gating in a layered server-side architecture prevents cross-tenant data leakage in enterprise RAG and agent systems while adding negligible overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise deployments of retrieval-augmented generation and agentic AI must handle multiple tenants with private data, strict access rules, and shared infrastructure costs. Current systems retrieve documents or invoke tools based on relevance scores, which allows a query from one tenant to surface another tenant's confidential information. The paper formalizes this authorization gap along with related risks from tool calls, multi-turn context, and client-side bypasses. It introduces a server-centered design that tags data with policies on ingestion, applies attribute-based checks at retrieval and tool execution time, and centralizes orchestration to create enforceable isolation points. This keeps client frameworks in control of composition while moving security decisions to the shared backend, enabling compliant multitenant operation without dedicated instances per tenant.
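The gap between relevance and authorization can be shown with a toy example. The sketch below is an illustration only, not the paper's OGX implementation: `Doc`, `Principal`, `rank_by_relevance`, and `gated_retrieve` are hypothetical names, the single `tenant` attribute stands in for a richer ABAC policy, and word overlap stands in for semantic or hybrid scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    tenant: str  # policy attribute attached at ingestion (hypothetical schema)


@dataclass(frozen=True)
class Principal:
    user: str
    tenant: str


def relevance(query: str, doc: Doc) -> int:
    """Toy relevance: number of distinct query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.text.lower().split()))


def rank_by_relevance(query: str, corpus: list[Doc], k: int) -> list[Doc]:
    """Ungated retrieval: ranks purely by relevance and ignores who is asking."""
    return sorted(corpus, key=lambda d: relevance(query, d), reverse=True)[:k]


def gated_retrieve(query: str, corpus: list[Doc], who: Principal, k: int) -> list[Doc]:
    """Gated retrieval: drop documents the caller may not read, then rank the rest."""
    allowed = [d for d in corpus if d.tenant == who.tenant]
    return rank_by_relevance(query, allowed, k)


corpus = [
    Doc("a1", "acme quarterly revenue forecast", tenant="acme"),
    Doc("g1", "globex quarterly revenue forecast and salary bands", tenant="globex"),
    Doc("g2", "globex onboarding checklist", tenant="globex"),
]
alice = Principal(user="alice", tenant="acme")
query = "revenue forecast salary bands"

ungated = rank_by_relevance(query, corpus, k=1)  # top hit belongs to globex
gated = gated_retrieve(query, corpus, alice, k=1)  # top hit belongs to acme
```

In this sketch the filter runs before ranking, so the `k` result slots are filled only from documents the caller is allowed to see. Whether the paper's gate runs before or after scoring is not stated in the text above.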

Core claim

The paper claims that existing RAG and agent architectures conflate relevance ranking with authorization, creating leakage paths in multitenant settings. A layered isolation architecture that combines policy-aware ingestion, retrieval-time ABAC gating, and server-side agentic orchestration centralizes security-critical operations such as tool authorization and state management. This produces natural enforcement points that eliminate unauthorized cross-tenant access while preserving client-side flexibility and maintaining near-zero added latency, as shown through an open-source OpenAI-compatible implementation.

What carries the argument

Layered isolation architecture that performs policy-aware ingestion, applies ABAC gating at retrieval and tool-use time, and enforces multitenant separation through centralized server-side orchestration.

If this is right

Multitenant enterprises can share retrieval and inference infrastructure without separate per-tenant deployments while meeting regulatory access requirements.
Agentic workflows remain isolated across conversation turns and tool invocations because gating occurs at each server-side step.
Client frameworks retain control over prompt construction and latency-sensitive decisions while security enforcement stays on the server.
Vendor-neutral, open implementations of the Responses API become viable for production use under strict compliance constraints.
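The idea of gating each server-side step can be sketched as follows. This is a minimal illustration under stated assumptions, not OGX code: `Session`, `TOOL_POLICY`, `TOOLS`, `run_step`, and the role names are all hypothetical, and a role check stands in for a full attribute-based policy.

```python
from dataclasses import dataclass, field


class ToolDenied(Exception):
    """Raised when a tool call is not permitted for the calling session."""


@dataclass
class Session:
    tenant: str
    roles: frozenset
    history: list = field(default_factory=list)  # state kept per session only


# Hypothetical policy table: tool name -> roles allowed to invoke it.
TOOL_POLICY = {
    "search_tickets": frozenset({"support", "admin"}),
    "export_payroll": frozenset({"admin"}),
}

# Hypothetical tools; each receives the tenant so it can scope its own work.
TOOLS = {
    "search_tickets": lambda tenant, arg: f"[{tenant}] tickets matching {arg!r}",
    "export_payroll": lambda tenant, arg: f"[{tenant}] payroll export for {arg!r}",
}


def run_step(session: Session, tool: str, arg: str) -> str:
    """Authorize and execute one tool call on the server, then record the result."""
    allowed_roles = TOOL_POLICY.get(tool)
    # Unknown tools and callers without a matching role are both rejected.
    if allowed_roles is None or not (session.roles & allowed_roles):
        raise ToolDenied(f"{tool} denied for tenant {session.tenant}")
    result = TOOLS[tool](session.tenant, arg)
    session.history.append(result)  # never written to any other session
    return result


support_session = Session(tenant="acme", roles=frozenset({"support"}))
first_result = run_step(support_session, "search_tickets", "login failure")
```

Because the check sits inside `run_step`, a client that composes its own agent loop still cannot reach a tool without passing through it, which is the property the second and third points above rely on.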

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gating pattern could be applied to non-retrieval generation tasks by tagging prompt components and checking attributes before context assembly.
Organizations could reduce hardware footprint by consolidating tenants on shared clusters once policy tagging pipelines are mature.
Standardized policy metadata formats would be needed to guarantee complete tagging when data arrives from heterogeneous enterprise sources.
Adversarial query testing that attempts to probe for hidden data through carefully crafted prompts would provide additional validation of the gating strength.

Load-bearing premise

All incoming data can be accurately and completely tagged with correct access policies during ingestion, and clients cannot bypass the server-side orchestration points that enforce those policies.
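One way to limit the damage when part of this premise fails is to deny by default. The sketch below is an assumption-laden illustration, not something the paper describes: `is_authorized` is a hypothetical helper, and policies are modeled as flat attribute dictionaries that must match the caller exactly.

```python
from typing import Optional


def is_authorized(doc_policy: Optional[dict], caller: dict) -> bool:
    """Return True only if the document carries a policy and the caller satisfies it.

    A missing or empty policy is treated as "deny" (fail closed).
    """
    if not doc_policy:
        return False
    # Every attribute named by the policy must equal the caller's attribute.
    return all(caller.get(attr) == required for attr, required in doc_policy.items())


caller = {"tenant": "acme", "department": "finance"}

tagged_ok = is_authorized({"tenant": "acme"}, caller)
other_tenant = is_authorized({"tenant": "globex"}, caller)
untagged = is_authorized(None, caller)
```

Failing closed covers documents with missing tags. It does nothing for documents that carry a wrong but well-formed tag, so the premise about accurate tagging still stands.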

What would settle it

A controlled test in which a query issued by one tenant returns or uses documents or tool outputs belonging to a second tenant whose access attributes differ, or a workload benchmark that measures overhead substantially above the reported negligible level.
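A controlled test of this kind needs a leakage metric. The Figure 4 caption mentions "CTLR" without a definition in the text available here, so the sketch below assumes one: the fraction of returned documents owned by a tenant other than the caller. The function name and the input shape are hypothetical.

```python
def cross_tenant_leakage_rate(results: list[tuple[str, list[str]]]) -> float:
    """Fraction of returned documents owned by a tenant other than the caller.

    `results` holds (querying_tenant, [owner_tenant_of_each_returned_doc]) pairs.
    """
    returned = sum(len(owners) for _, owners in results)
    if returned == 0:
        return 0.0
    leaked = sum(owner != caller for caller, owners in results for owner in owners)
    return leaked / returned


# Made-up runs for illustration only (not the paper's data).
ungated_run = [("acme", ["acme", "globex"]), ("globex", ["globex", "acme"])]
gated_run = [("acme", ["acme"]), ("globex", ["globex"])]

ungated_ctlr = cross_tenant_leakage_rate(ungated_run)  # 2 of 4 results leak
gated_ctlr = cross_tenant_leakage_rate(gated_run)  # no results leak
```

Any nonzero value on a gated run would count as the counterexample described above.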

Figures

Figures reproduced from arXiv: 2605.05287 by Francisco Javier Arceo, Varsha Prasad Narsing.

**Figure 1.** Layered isolation architecture with server-side or…

**Figure 2.** Server-side orchestration flow: every step runs inside…

**Figure 3.** OGX architecture for multitenant enterprise agentic…

**Figure 4.** Empirical evaluation results across five dimensions. (a) Security: ABAC gating eliminates cross-tenant leakage (CTLR…

**Figure 5.** Responses API surface area and its relation to other APIs, providers, and tools. The Responses API serves as the…

Original abstract

Retrieval-Augmented Generation (RAG) and agentic AI systems are increasingly prevalent in enterprise AI deployments. However, real enterprise environments introduce challenges largely absent from academic treatments and consumer-facing APIs: multiple tenants with heterogeneous data, strict access-control requirements, regulatory compliance, and cost pressures that demand shared infrastructure. A fundamental problem underlies existing RAG architectures in these settings: retrieval systems rank documents by relevance--whether through semantic similarity, keyword matching, or hybrid approaches--not by authorization, so a query from one tenant can surface another tenant's confidential data simply because it scores highest. We formalize this gap and analyze additional shortcomings--including tool-mediated disclosure, context accumulation across turns, and client-side orchestration bypass--that arise when agentic systems conflate relevance with authorization. To address these challenges, we introduce a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference, enforced through server-side agentic orchestration. This approach centralizes security-critical operations--tool execution authorization, state isolation, and policy enforcement--on the server, creating natural enforcement points for multitenant isolation while allowing client-side frameworks to retain control over agent composition and latency-sensitive operations. We validate the proposed architecture through an open-source implementation in OGX, a vendor-neutral framework that implements an OpenAI-compatible, open-source Responses API with server-side multi-turn orchestration. We evaluate it empirically and show that ABAC gating eliminates cross-tenant leakage while introducing negligible overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a workable layered architecture for multitenant RAG and agent security but supplies almost no concrete evidence that the approach actually works at scale.


The main takeaway is that this paper identifies the relevance-versus-authorization mismatch in shared RAG systems and proposes to fix it with policy-aware ingestion, ABAC gating at retrieval, and server-side orchestration for tools and state. The authors built the idea into an open-source OGX framework that exposes an OpenAI-compatible Responses API while keeping enforcement points on the server. That framing is useful for anyone who has to run agents across tenants without spinning up fully isolated stacks for each one. The design also calls out tool-mediated leaks and context buildup across turns, which are real issues once you move past single-shot retrieval. Credit to the authors for making the enforcement points explicit rather than leaving security to client-side libraries. The implementation choice to stay vendor-neutral is a practical plus.

The soft spots sit in the validation. The abstract states that ABAC gating removes cross-tenant leakage with negligible overhead, yet the text gives no experimental setup, no datasets, no latency or throughput numbers, and no baseline comparisons. Without those details it is impossible to judge whether the overhead stays small once you add real policy complexity or larger retrieval indexes. The architecture also rests on the assumption that every incoming document gets correctly and completely tagged during ingestion. If that step misses documents or applies wrong attributes, relevance-based retrieval can still surface unauthorized content before any gate runs. The paper treats ingestion as a solved prerequisite rather than showing how it handles heterogeneous sources or policy drift.

This work is aimed at practitioners who already face multitenant constraints in regulated environments. An engineer or architect trying to share infrastructure safely would find the design patterns worth reading and adapting.

It deserves a serious referee because the problem is concrete, the proposed checkpoints are clear, and the open-source artifact lets others test the claims. I would send it for review with a request for expanded experiments and a section on ingestion robustness.

Referee Report

2 major / 1 minor

Summary. The paper identifies a fundamental mismatch in enterprise RAG and agentic systems where relevance-based retrieval can leak cross-tenant data, along with related issues such as tool-mediated disclosure and context accumulation. It proposes a layered isolation architecture with policy-aware ingestion, retrieval-time ABAC gating, and server-side orchestration to enforce multitenant security while preserving client-side flexibility. The architecture is realized in the open-source OGX framework implementing an OpenAI-compatible Responses API with server-side multi-turn orchestration, and the authors claim empirical validation that ABAC gating eliminates leakage with negligible overhead.

Significance. If the central claims hold, the work offers a practical, vendor-neutral approach to a real deployment gap in secure enterprise AI that academic RAG literature largely overlooks. The open-source OGX implementation and emphasis on server-side enforcement points constitute reproducible artifacts that could aid adoption and further research in multitenant agent security.

major comments (2)

[Architecture description and evaluation] The claim that ABAC gating eliminates cross-tenant leakage (Abstract and architecture description) is load-bearing on the prerequisite that policy-aware ingestion correctly and completely tags every incoming document with accurate access policies. The manuscript provides no mechanism, detection procedure, or correction process for untagged or mis-tagged data from heterogeneous sources; if ingestion fails for any subset, relevance-based retrieval can still surface unauthorized content before gating is applied.
[Evaluation section] The empirical validation (Abstract and evaluation section) asserts that ABAC gating introduces negligible overhead and eliminates leakage, yet the provided text supplies no concrete experimental setup, metrics, datasets, baseline comparisons, or quantitative results. Without these details the overhead and leakage-elimination claims cannot be assessed.

minor comments (1)

[Abstract] The abstract states that the architecture is 'validated through an open-source implementation' but does not briefly summarize the key evaluation metrics or threat model; adding one sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript. We appreciate the points raised regarding the architecture's dependencies and the evaluation details. We respond to each major comment below.


Referee: The claim that ABAC gating eliminates cross-tenant leakage (Abstract and architecture description) is load-bearing on the prerequisite that policy-aware ingestion correctly and completely tags every incoming document with accurate access policies. The manuscript provides no mechanism, detection procedure, or correction process for untagged or mis-tagged data from heterogeneous sources; if ingestion fails for any subset, relevance-based retrieval can still surface unauthorized content before gating is applied.

Authors: We agree that the effectiveness of ABAC gating presupposes accurate policy tagging during ingestion. The manuscript describes policy-aware ingestion as a core layer but does not detail error-handling for mis-tagging. This is a valid observation. In the revised manuscript, we will expand the architecture description to include a mechanism for detecting and correcting mis-tagged data, such as periodic policy audits and flagging of documents with inconsistent or missing access policies for administrator review. This addition will make clear that the core retrieval gating prevents leakage only when tags are correct, and that robust ingestion safeguards are therefore still needed.
revision: yes

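The audit the authors promise could take roughly the following shape. This is a hedged sketch, not the revised manuscript's mechanism: `audit_policies`, the `policy` and `source_tenant` fields, and the three flagging rules are assumptions chosen to mirror "inconsistent or missing access policies".

```python
def audit_policies(docs: list[dict], known_tenants: set[str]) -> dict[str, str]:
    """Map document id -> reason, for each document that needs administrator review."""
    flagged = {}
    for doc in docs:
        policy = doc.get("policy") or {}
        tenant = policy.get("tenant")
        if tenant is None:
            flagged[doc["id"]] = "missing tenant attribute"
        elif tenant not in known_tenants:
            flagged[doc["id"]] = f"unknown tenant {tenant!r}"
        elif doc.get("source_tenant") not in (None, tenant):
            # The tag disagrees with where the document came from.
            flagged[doc["id"]] = "policy tenant differs from source tenant"
    return flagged


docs = [
    {"id": "d1", "policy": {"tenant": "acme"}, "source_tenant": "acme"},
    {"id": "d2", "policy": {}},
    {"id": "d3", "policy": {"tenant": "initech"}},
    {"id": "d4", "policy": {"tenant": "acme"}, "source_tenant": "globex"},
]
review_queue = audit_policies(docs, known_tenants={"acme", "globex"})
```

An audit like this catches missing and inconsistent tags, but a document tagged with the wrong tenant and no independent source record would pass unnoticed.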
Referee: The empirical validation (Abstract and evaluation section) asserts that ABAC gating introduces negligible overhead and eliminates leakage, yet the provided text supplies no concrete experimental setup, metrics, datasets, baseline comparisons, or quantitative results. Without these details the overhead and leakage-elimination claims cannot be assessed.

Authors: The referee correctly notes that the evaluation section in the current submission lacks the necessary specifics to fully substantiate the claims. We will revise the evaluation section to provide a complete description of the experimental setup, including the use of synthetic multitenant datasets with predefined access policies, metrics such as unauthorized retrieval rate (set to zero with gating) and latency overhead (measured as additional milliseconds per query), baseline comparisons against ungated vector retrieval, and quantitative results demonstrating negligible overhead (under 3% increase in query time) and complete elimination of cross-tenant leakage.
revision: yes
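The overhead figures named in this response can be computed in a few lines. The sketch below is illustrative only: the helper names are hypothetical, the nearest-rank percentile is one of several common definitions, and the latencies are made up rather than taken from the paper.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def relative_overhead(baseline_ms: list[float], gated_ms: list[float]) -> float:
    """Relative change in mean latency; 0.02 means the gated path is 2% slower."""
    base = statistics.mean(baseline_ms)
    return (statistics.mean(gated_ms) - base) / base


# Made-up per-query latencies in milliseconds, for illustration only.
ungated_ms = [100.0, 110.0, 120.0, 130.0, 140.0]
gated_ms = [102.0, 112.0, 122.0, 132.0, 142.0]

p50 = percentile(ungated_ms, 50)
p99 = percentile(ungated_ms, 99)
overhead = relative_overhead(ungated_ms, gated_ms)
```

Reporting p50 and p99 alongside the mean matters here because a gate that is cheap on average could still add tail latency under complex policies.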

Circularity Check

0 steps flagged

No significant circularity in architectural proposal


The paper presents a layered isolation architecture for multitenant enterprise RAG and agentic systems, relying on policy-aware ingestion, retrieval-time ABAC gating, and server-side orchestration. No mathematical derivations, equations, fitted parameters, or prediction steps exist that could reduce to inputs by construction. Validation occurs via open-source implementation (OGX) and empirical evaluation of overhead, which is independent of any self-referential loop. The central claim does not invoke uniqueness theorems, self-citations as load-bearing premises, or ansatzes smuggled from prior work. The noted dependency on complete policy tagging during ingestion is a correctness assumption, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The architecture depends on two domain assumptions that are not independently evidenced in the abstract: trusted server enforcement and complete policy tagging at ingestion time.

axioms (2)

domain assumption: Server-side orchestration points cannot be bypassed by client frameworks.
Central claim requires that security-critical operations remain under server control.

domain assumption: All data can be accurately and exhaustively tagged with access policies during ingestion.
Policy-aware ingestion is a prerequisite for correct gating.

pith-pipeline@v0.9.0 · 5572 in / 1219 out tokens · 72306 ms · 2026-05-08T16:31:07.048430+00:00 · methodology


Venue: ACM CAIS ’26, May 26–29, 2026, San Jose, CA, USA

Appendix A. Detailed Evaluation Tables

| Config | Orch. / Retr. | p50 | p99 | Mean |
|---|---|---|---|---|
| A | Client / Ungated | 3,600ms | 10,818ms | 4,208ms |
| B | Client / Gated | 3,4... | | |