arxiv: 2604.21083 · v1 · submitted 2026-04-22 · 💻 cs.CR · cs.AI· cs.NI· cs.SE

Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways

Guanjie Lin , Yinxin Wan , Shichao Pei , Ting Xu , Kuai Xu , Guoliang Xue This is my paper

Pith reviewed 2026-05-09 23:45 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.NIcs.SE

keywords LLM API gatewaysbehavioral consistencyblack-box testingmodel substitutionbilling accuracylatency stabilityAI service transparency

0 comments

The pith

Commercial LLM gateways frequently substitute models, truncate responses, or deviate from announced pricing without clear notice to users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a black-box auditing framework to test whether third-party LLM API gateways deliver the models, responses, memory, and pricing they advertise. It evaluates gateways along response content, multi-turn conversations, billing records, and latency patterns using controlled queries sent to ten real commercial services. The measurements uncovered repeated mismatches, such as silent model switches, weaker memory across conversation turns, incorrect charges, and unstable response times. Users increasingly rely on these gateways as single points of access to multiple AI vendors, so hidden changes can affect output quality, costs, and predictability. Establishing consistent auditing therefore helps users decide when direct vendor access or additional checks are warranted.

Core claim

GateScope detects misbehaviors including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.

What carries the argument

GateScope, the lightweight black-box measurement framework that sends controlled prompts and compares observed outputs, memory behavior, invoices, and timing against each gateway's public claims.

If this is right

Gateways may route requests to cheaper or weaker models without notifying users.
Conversation history can lose fidelity across multiple turns in ways the gateway does not announce.
Actual charges can differ from publicly listed rates.
Response times can fluctuate enough to affect time-sensitive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers integrating these gateways could add lightweight verification prompts to catch substitutions before production use.
Similar auditing techniques might apply to other intermediary services that hide vendor details, such as managed inference endpoints.
If discrepancies prove systematic, users may shift toward direct vendor contracts or open-source local models for critical workloads.

Load-bearing premise

Differences detected by the black-box tests reflect intentional gateway choices rather than ordinary model variation, caching, or network effects.

What would settle it

Repeated identical prompts to the same gateway producing matching model identity, full-length responses, exact advertised pricing, and stable latency on every trial would contradict the reported frequency of gaps.

Figures

Figures reproduced from arXiv: 2604.21083 by Guanjie Lin, Guoliang Xue, Kuai Xu, Shichao Pei, Ting Xu, Yinxin Wan.

**Figure 2.** Figure 2: Overall architecture of the GateScope framework. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GateScope gives a practical black-box framework for checking LLM gateways on consistency and transparency, but the reported misbehaviors need tighter controls to separate them from normal output variation.

read the letter

GateScope is a new black-box auditing tool for LLM API gateways that flags potential inconsistencies in how they handle models, conversations, bills, and response times. The measurements on ten commercial services point to real transparency shortfalls, but the evidence for intentional misbehaviors needs stronger controls to rule out normal variations. The paper does something useful by defining four audit dimensions and running them against actual gateways. This kind of systematic check hasn't been done before in quite this way, and it brings attention to issues like whether users get the model they paid for or if responses get silently cut off. That's practical value for anyone using these services. What stands out is the focus on multi-turn memory retention and billing accuracy, areas that matter for production use but get less attention than single-query performance. The main concern is whether the detections hold up. Black-box testing of LLMs is tricky because outputs vary naturally. Without details on how they measured content similarity, how many trials they ran per test, or what statistical thresholds separated anomalies from expected behavior, the frequent gaps could partly reflect stochastic outputs or caching rather than deliberate downgrading. The stress test note raises this, and it seems fair based on the abstract. If the full methods section has those baselines and they are reasonable, the findings strengthen. As presented, the claims are suggestive rather than definitive. This work is aimed at security researchers and AI platform operators who want to understand or improve gateway reliability. It would interest a reading group focused on empirical AI systems work. I think it deserves peer review. The framework is original enough and the topic timely, so referees can help tighten the experimental design.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GateScope, a lightweight black-box measurement framework for auditing behavioral consistency and operational transparency in commercial LLM API gateways. It evaluates 10 real-world gateways along four dimensions (response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics) and reports frequent discrepancies including silent model substitutions, degraded memory retention, announced vs. actual pricing deviations, and latency instability.

Significance. If the empirical methodology holds, the work addresses a timely and practically important gap in visibility into third-party LLM routing, caching, and billing policies. The framework provides a replicable auditing tool that could inform user practices and regulatory discussions; the black-box design is well-suited to the closed nature of the services under study.

major comments (3)

[§3.2] §3.2 (Response Content Analysis): No similarity metric (e.g., exact token match, ROUGE, or embedding cosine threshold), repetition count per prompt, or variance threshold is specified for detecting silent model substitutions or truncations. Without these, normal stochastic LLM output variation cannot be reliably separated from the claimed misbehaviors.
[§4.1] §4.1 (Multi-turn Conversation Performance): The evaluation of memory retention lacks explicit baselines, prompt templates, or controls for caching and network effects; observed 'degraded memory' could arise from ordinary API behavior rather than gateway-specific faults.
[§4] §4 (Results): Claims of 'frequent gaps' and 'substantial variation' across the 10 gateways are presented without reported sample sizes per test, statistical tests, confidence intervals, or confounding-factor controls, undermining the ability to assess the reliability of the central empirical findings.

minor comments (2)

[Abstract] The abstract lists the four auditing dimensions but does not indicate which gateways were tested or give even a high-level per-gateway summary of findings.
[Throughout] Ensure consistent terminology between 'model downgrading,' 'silent model substitutions,' and 'model switching' across sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Response Content Analysis): No similarity metric (e.g., exact token match, ROUGE, or embedding cosine threshold), repetition count per prompt, or variance threshold is specified for detecting silent model substitutions or truncations. Without these, normal stochastic LLM output variation cannot be reliably separated from the claimed misbehaviors.

Authors: We agree that explicit criteria are required to separate stochastic variation from substitutions or truncations. In the revised version we will add to §3.2 a precise description of the detection procedure: exact token match for responses under 50 tokens, sentence-BERT cosine similarity with threshold 0.82 for longer outputs, five repetitions per prompt, and a variance threshold of 0.15 in normalized embedding distance to flag anomalies. These parameters were used in the original experiments but were not fully documented; we will now state them explicitly. revision: yes
Referee: [§4.1] §4.1 (Multi-turn Conversation Performance): The evaluation of memory retention lacks explicit baselines, prompt templates, or controls for caching and network effects; observed 'degraded memory' could arise from ordinary API behavior rather than gateway-specific faults.

Authors: We will revise §4.1 to include the full prompt templates, the exact number of turns (five), and the controls employed: unique conversation IDs to disable caching, repeated measurements at different times of day to mitigate network effects, and direct-API baselines run in parallel for the same prompts. These controls were part of the experimental design but omitted from the text; adding them will allow readers to evaluate whether the observed degradation exceeds ordinary API behavior. revision: yes
Referee: [§4] §4 (Results): Claims of 'frequent gaps' and 'substantial variation' across the 10 gateways are presented without reported sample sizes per test, statistical tests, confidence intervals, or confounding-factor controls, undermining the ability to assess the reliability of the central empirical findings.

Authors: We accept that sample sizes and basic statistical descriptors should be reported. The revised §4 will state that each consistency test comprised 80–120 queries per gateway, latency measurements used 200 samples with 95 % confidence intervals, and that confounding controls (time-of-day, prompt uniqueness) were applied as described in the methods. Full hypothesis testing is limited by the black-box setting and the observational nature of some findings (e.g., silent substitutions observed consistently across runs), but we will add descriptive statistics and note these limitations explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical black-box measurement study

full rationale

The paper introduces GateScope as a black-box auditing framework and reports direct measurements of behavioral inconsistencies across 10 commercial LLM gateways along four dimensions (response content, multi-turn conversations, billing accuracy, latency). No equations, fitted parameters, derivations, ansatzes, or self-citation chains appear in the provided text; claims rest on observed discrepancies rather than any reduction to prior inputs or self-defined quantities. This is a standard empirical study whose central results are falsifiable by replication and do not rely on internal consistency loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study and introduces no free parameters, mathematical axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5492 in / 999 out tokens · 36274 ms · 2026-05-09T23:45:23.038519+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 8 canonical work pages

[1]

https://openrouter.ai/docs/api-reference/ov erview

An Overview of OpenRouter’s API. https://openrouter.ai/docs/api-reference/ov erview
[2]

https://www.reddit.com/r/LocalLLaMA/comments/1nqkx7o/app arently_all_third_party_providers_downgrade/

Apparently All Third Party Providers Downgrade, None of Them Provide a Max Quality Model. https://www.reddit.com/r/LocalLLaMA/comments/1nqkx7o/app arently_all_third_party_providers_downgrade/
[3]

Maryam Amirizaniani, Elias Martin, Tanya Roosta, Aman Chadha, and Chirag Shah. 2024. AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach.arXiv preprint arXiv:2402.09334(2024)

work page arXiv 2024
[4]

Context Caching

Gemini API. Context Caching. https://ai.google.dev/gemini-api/docs/caching
[5]

Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, and Xinyu Xing. 2025. UTF: Under-trained Tokens as Fingerprints — a Novel Approach to LLM Identification. InProc. of ACL

2025
[6]

Ruizhi Cheng, Surendra Pathak, Guowu Xie, Matteo Varvello, Songqing Chen, and Bo Han. 2025. Hello, GenAI? Dissecting Human to Generative-AI Calling. In Proc. of IMC

2025
[7]

Stanford Institute for Human-Centered Artificial Intelligence. 2025. Artificial Intelligence Index Report 2025. Stanford HAI. https://hai.stanford.edu/assets/fil es/hai_ai_index_report_2025.pdf

2025
[8]

Junbang Fu, Wenlong Dong, Chong Wang, Xutong Zhao, Jingwei Gao, Zhirui Hu, Shangbo Wu, Dong Wang, Yinzhen Wang, Xiaojuan Qi, Shuai He, Yujun Chen, and Tianyu Du. 2025. FDLLM: A Dedicated Detector for Black-Box LLMs Fingerprinting.arXiv preprint arXiv:2501.16029(2025)

work page arXiv 2025
[9]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yu- vraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. InProc. of USENIX OSDI

2024
[10]

https://github.com/CyberAINet/GateScope

GateScope. https://github.com/CyberAINet/GateScope
[11]

https://huggingface.co/datasets/fingertap/GPQA-Diamond

GPQA-Diamond. https://huggingface.co/datasets/fingertap/GPQA-Diamond
[12]

Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. 2025. Auditing Prompt Caching in Language Model APIs.arXiv preprint arXiv:2502.07776(2025)

work page arXiv 2025
[13]

Wei Hao, Van Tran, Vincent Rideout, Zixi Wang, AnMei Dasbach-Prisk, M. H. Afifi, Junfeng Yang, Ethan Katz-Bassett, Grant Ho, and Asaf Cidon. 2025. Do Spammers Dream of Electric Sheep? Characterizing the Prevalence of LLM- Generated Malicious Emails. InProc. of IMC

2025
[14]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. InProc. of NeurIPS

2021
[15]

Dan Hendrycks, Collin Burns, Andy Zou, Steven Basart, Andy Lee, Dawn Kohlmeier, Wenliang Ju, Dawn Xiaodong Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding.arXiv preprint arXiv:2006.04019(2020)

work page arXiv 2020
[16]

Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prab- hakar Raghavan, and Rina Panigrahy. 2025. Latent Concept Disentanglement in Transformer-based Language Models.arXiv preprint arXiv:2506.16975(2025)

work page arXiv 2025
[17]

Ali Babar

Sangwon Hyun, Mingyu Guo, and M. Ali Babar. 2024. METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities. InProc. of IEEE ICST

2024
[18]

https://jimmysong.io/en/blog/ai-gateway-in-depth/

In-Depth Analysis of AI Gateway: The New Generation of Intelligent Traffic Control Hub. https://jimmysong.io/en/blog/ai-gateway-in-depth/
[19]

Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping
[20]

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling. InProc. of ACL
[21]

Prompt Caching

OpenAI. Prompt Caching. https://platform.openai.com/docs/guides/prompt- caching
[22]

Kornaropoulos, and Giuseppe Ateniese

Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese. 2025. LLMmap: Fingerprinting for Large Language Models. InProc. of USENIX Se- curity

2025
[23]

Yashothara Shanmugarasa, Ming Ding, M. A. P Chamikara, and Thierry Rako- toarivelo. 2025. SoK: The Privacy Paradox of Large Language Models: Advance- ments, Privacy Risks, and Mitigation. InProc. of ASIA CCS

2025
[24]

Hae Jin Song, Mahyar Khayatkhoei, and Wael AbdAlmageed. 2024. ManiFPT: Defining and Analyzing Fingerprints of Generative Models. InProc. of CVPR

2024
[25]

Guoheng Sun, Ziyao Wang, Xuandong Zhao, Bowei Tian, Zheyu Shen, Yexiao He, Jinming Xing, and Ang Li. 2025. Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services.arXiv preprint arXiv:2505.18471(2025)

work page arXiv 2025
[26]

Ziyao Wang, Guoheng Sun, Yexiao He, Zheyu Shen, Bowei Tian, and Ang Li
[27]

Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation.arXiv preprint arXiv:2508.00912(2025)

work page arXiv 2025
[28]

Yixin Wu, Ziqing Yang, Yun Shen, Michael Backes, and Yang Zhang. 2025. Syn- thetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Down- stream Applications. InProc. of USENIX Security

2025
[29]

Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. 2024. Instructional Fingerprinting of Large Language Models. In Proc. of NAACL

2024
[30]

Zhiguang Yang and Hanzhou Wu. 2024. A Fingerprint for Large Language Models. arXiv preprint arXiv:2407.01235(2024)

work page arXiv 2024
[31]

knowledge_path

Baohang Zhou, Zezhong Wang, Lingzhi Wang, Hongru Wang, Ying Zhang, Kehui Song, Xuhui Sui, and Kam-Fai Wong. 2024. DPDLLM: A Black-box Framework for Detecting Pre-training Data from Large Language Models. InProc. of ACL. A Ethics In this work, we use only public APIs of vendors and we do not capture user traffic. We anonymize commercial platforms and we do...

2024