Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Ajraf Mannan; Hanlin Zhu; Jonathan Michala; Raluca Ada Popa; Sijun Tan; Tianyi Qiu; Willie Neiswanger; Xiaoyuan Zhu; Yaowen Ye

arXiv: 2506.06975 · v5 · submitted 2025-06-08 · cs.CR · cs.AI · cs.CL

Pith reviewed 2026-05-19 11:31 UTC · model grok-4.3

classification: cs.CR, cs.AI, cs.CL

keywords: black-box auditing, LLM API verification, uniformity test, model substitution detection, statistical power, adversarial robustness, query efficiency

The pith

A rank-based uniformity test verifies whether black-box LLM APIs serve the claimed authentic model or a substituted version.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a statistical procedure that lets users check if an API provider has quietly replaced the promised LLM with a quantized, fine-tuned, or entirely different model. Users run a local reference copy of the authentic model, issue the same prompts to both, and examine whether the ranks of a derived test statistic remain uniform. If the ranks deviate, the API output distribution has shifted, revealing the substitution. The approach requires few queries and produces no obvious patterns that an adversarial provider could use to reroute or mix responses.

Core claim

We introduce a rank-based uniformity test that confirms behavioral equality between a black-box LLM API and a locally deployed authentic reference model by testing whether the ranks of a chosen test statistic follow a uniform distribution under the null hypothesis of no substitution.

What carries the argument

The rank-based uniformity test: it converts each API output's test statistic into a rank relative to the reference model's outputs and checks whether those ranks are uniformly distributed. A departure from uniformity exposes a distributional shift caused by a model change.

If this is right

The test detects quantization, harmful fine-tuning, jailbreak prompts, and full model substitution with higher statistical power than earlier methods.
It operates under tight query budgets while remaining robust when providers attempt to detect and evade testing.
Users gain a practical way to confirm that an API delivers the exact model advertised without needing weights or logits.
The method extends to multiple threat models without requiring detectable query sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Service-level agreements could incorporate routine application of this test as an independent audit mechanism.
The same ranking idea might apply to verifying other black-box generative services whose internal distributions matter.
Combining the test with adaptive prompting could further reduce the number of queries needed for high-confidence decisions.

Load-bearing premise

Outputs from the authentic reference model must produce a test statistic whose ranks are uniformly distributed when no substitution has occurred.

What would settle it

Apply the test to a known authentic model and observe whether the ranks of the test statistic deviate from uniformity at a rate exceeding the expected false-positive threshold.

Original abstract

As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a rank-based uniformity test to audit black-box LLM APIs by checking whether their outputs match those of a locally deployed authentic reference model. The approach is presented as query-efficient, resistant to detectable query patterns, and robust against adversarial providers that might reroute or mix responses. Evaluations across threat models (quantization, harmful fine-tuning, jailbreak prompts, and full model substitution) claim superior statistical power relative to prior methods under limited query budgets.

Significance. If the uniformity assumption under the null holds and the power advantages generalize beyond the reported scenarios, the method could provide a practical, low-overhead tool for users to detect unauthorized model substitutions in commercial LLM APIs. This addresses a timely security and reliability concern in black-box deployments. The focus on adversarial robustness and constrained-query regimes is a positive contribution, though the lack of explicit validation for sampling-parameter fidelity limits the strength of the claims.

major comments (2)

The central claim rests on the assumption that, when the black-box API matches the local authentic model, the ranks of the chosen test statistic are uniformly distributed under the null (enabling reliable detection of distributional shifts). This assumption is load-bearing for type-I error control and the reported power advantages, yet the evaluations do not include ablations on mismatches in sampling parameters (temperature, top-p, seed) or unmodeled output dependencies that commonly arise in LLM generation. Without such checks, the superiority over prior methods under constrained budgets cannot be fully substantiated.
The threat-model evaluations (quantization, fine-tuning, jailbreaks, substitution) are relevant but do not stress-test the uniformity property when the local reference and API differ only in decoding hyperparameters. Adding a controlled experiment that varies only these parameters while keeping the underlying model fixed would directly address whether the rank transformation remains uniform and whether the test retains its claimed robustness.

minor comments (2)

Clarify the exact definition of the test statistic and the rank transformation procedure, including any dependence on prompt distribution or output length, to allow independent reproduction.
The abstract and introduction would benefit from a concise statement of the precise null and alternative hypotheses in statistical terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight an important aspect of validating the core uniformity assumption under the null hypothesis. We address each major comment below and describe the changes incorporated into the revised version.

Referee: The central claim rests on the assumption that, when the black-box API matches the local authentic model, the ranks of the chosen test statistic are uniformly distributed under the null (enabling reliable detection of distributional shifts). This assumption is load-bearing for type-I error control and the reported power advantages, yet the evaluations do not include ablations on mismatches in sampling parameters (temperature, top-p, seed) or unmodeled output dependencies that commonly arise in LLM generation. Without such checks, the superiority over prior methods under constrained budgets cannot be fully substantiated.

Authors: We agree that explicit validation of the uniformity property under sampling-parameter mismatches is necessary to fully support the type-I error guarantees and power claims. In the revised manuscript we have added a dedicated ablation study that fixes the underlying model weights while systematically varying only the decoding hyperparameters (temperature, top-p, and random seed) between the local reference and the simulated black-box API. The new results show that, when the hyperparameters match, the rank statistics remain uniformly distributed (p-value histograms are flat and Kolmogorov-Smirnov tests do not reject uniformity). When the hyperparameters differ, the test correctly rejects the null with power that scales with query budget, consistent with the behavioral-equivalence goal of the audit. We have also added a brief discussion of potential unmodeled output dependencies in the limitations section.
revision: yes

Referee: The threat-model evaluations (quantization, fine-tuning, jailbreaks, substitution) are relevant but do not stress-test the uniformity property when the local reference and API differ only in decoding hyperparameters. Adding a controlled experiment that varies only these parameters while keeping the underlying model fixed would directly address whether the rank transformation remains uniform and whether the test retains its claimed robustness.

Authors: We concur that isolating the effect of decoding hyperparameters provides a direct stress test of the uniformity assumption. The controlled experiment described in our response to the preceding comment has been inserted as a new subsection in the experimental evaluation. It demonstrates that the rank-based test preserves its statistical calibration when hyperparameters are identical and retains high power to detect mismatches, thereby strengthening the robustness claims under limited query budgets.
revision: yes
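
The matched-versus-mismatched ablation described in these responses can be imitated on a toy model. The sketch below is not the authors' experiment: `LOGITS`, the five-valued score, the temperatures, and the function names are all hypothetical, the reference distribution is assumed to be known exactly, and the uniformity check is a Kolmogorov-Smirnov distance compared against the asymptotic 5% critical value 1.358/sqrt(n).

```python
import math
import random

# Hypothetical toy "model": one logit per possible score value 0..4.
LOGITS = [2.0, 1.0, 0.5, 0.0, -1.0]


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    weights = [math.exp(z - top) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def randomized_ranks(scores, ref_pmf, rng):
    """Map each score s to F_ref(s^-) + U * P_ref(s), with U ~ Uniform[0, 1]."""
    cdf_below = [sum(ref_pmf[:s]) for s in range(len(ref_pmf))]
    return [cdf_below[s] + rng.random() * ref_pmf[s] for s in scores]


def ks_distance(ranks):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform[0, 1]."""
    n = len(ranks)
    ordered = sorted(ranks)
    return max(
        max(i / n - r, r - (i - 1) / n) for i, r in enumerate(ordered, start=1)
    )


def rejection_rate(api_temperature, ref_temperature=1.0, n_queries=200,
                   n_trials=300, seed=0):
    """Fraction of simulated audits that reject uniformity at the 5% level."""
    rng = random.Random(seed)
    ref_pmf = softmax(LOGITS, ref_temperature)
    api_pmf = softmax(LOGITS, api_temperature)
    critical = 1.358 / math.sqrt(n_queries)  # asymptotic 5% KS critical value
    rejections = 0
    for _ in range(n_trials):
        api_scores = rng.choices(range(len(LOGITS)), weights=api_pmf, k=n_queries)
        ranks = randomized_ranks(api_scores, ref_pmf, rng)
        if ks_distance(ranks) > critical:
            rejections += 1
    return rejections / n_trials


matched = rejection_rate(api_temperature=1.0)     # same temperature: null holds
mismatched = rejection_rate(api_temperature=2.0)  # only the temperature differs
print(f"matched: {matched:.3f}  mismatched: {mismatched:.3f}")
```

In this toy setting the matched rate should stay near the nominal 5% while the mismatched rate should be far higher. Whether the same holds for real LLM outputs is exactly what the requested ablation is meant to establish.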

Circularity Check

0 steps flagged

No circularity: rank-based uniformity test is a self-contained statistical procedure with independent empirical evaluation.

The paper introduces a rank-based uniformity test as a new auditing method for black-box LLM APIs, with the null hypothesis that ranks of the test statistic are uniform when the API matches the local reference model. This assumption is stated explicitly and evaluated empirically across threat scenarios (quantization, fine-tuning, jailbreaks, substitution) rather than derived from fitted parameters or prior self-referential definitions. No equations reduce by construction to inputs, no load-bearing self-citations justify uniqueness theorems, and no ansatz or renaming of known results is presented as a derivation. The method's power and type-I error claims rest on standard statistical properties of rank transforms under the stated null, which are externally verifiable and not forced by the paper's own data fits. This is the common case of an honest, non-circular proposal of a statistical procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the statistical assumption that the authentic model's response ranks are uniformly distributed under the null, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)

domain assumption: The rank statistic computed from responses of the authentic model follows a uniform distribution under the null hypothesis of behavioral equality.
This null hypothesis underpins the uniformity test and its power to detect substitutions.
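
A short argument shows why a randomized rank of this kind is uniform under the null. It is a sketch under a simplifying assumption that the score takes finitely many values. The symbols $S$, $F$, $p_k$, $q_k$, $I_k$, and $R$ are notation introduced here for illustration and are not taken from the paper.

```latex
Let $S = f(y, x)$ with $y \sim \pi_{\mathrm{ref}}$ take values $s_1 < \dots < s_K$
with $p_k = P(S = s_k) > 0$, and write $F(s^-) = P(S < s)$.
Define the randomized rank
\[
  R = F(S^-) + U \, P(S' = S \mid S), \qquad U \sim \mathrm{Uniform}[0,1] \text{ independent of } S,
\]
where $S'$ is an independent copy of $S$, so that $R = F(s_k^-) + U p_k$ when $S = s_k$.

Conditional on $S = s_k$, $R$ is uniform on $I_k = [F(s_k^-),\, F(s_k^-) + p_k]$,
an interval of length $p_k$. The intervals $I_1, \dots, I_K$ tile $[0,1]$ and
overlap only at endpoints, so for almost every $r \in [0,1]$
\[
  f_R(r) = \sum_{k=1}^{K} p_k \cdot \frac{\mathbf{1}\{r \in I_k\}}{p_k} = 1,
\]
hence $R \sim \mathrm{Uniform}[0,1]$.

If the API instead produces scores with $P(S = s_k) = q_k \neq p_k$ for some $k$,
the density on $I_k$ becomes $q_k / p_k \neq 1$, and $R$ is no longer uniform.
```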

pith-pipeline@v0.9.0 · 5725 in / 1260 out tokens · 49448 ms · 2026-05-19T11:31:21.628777+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Rank statistic: $r_{\mathrm{tgt}} := F_{\pi_{\mathrm{ref}}}(s_{\mathrm{tgt}}^{-}) + U \cdot P(f(y, x) = s_{\mathrm{tgt}})$, with $U \sim \mathrm{Uniform}[0, 1]$; apply the Cramér–von Mises test to $\{r_i\}$.
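
The quoted rank statistic can be approximated from samples when the reference distribution is not known in closed form. The sketch below is a minimal illustration, not the paper's implementation: `randomized_rank`, `cramer_von_mises_statistic`, and the four-valued toy score are hypothetical, and the reference CDF and point mass are replaced by empirical plug-in estimates from local samples. With finitely many reference samples the ranks are therefore only approximately uniform under the null.

```python
import random


def randomized_rank(s_tgt, ref_scores, u):
    """Empirical plug-in for F(s_tgt^-) + u * P(score == s_tgt).

    ref_scores: scores of responses sampled from the local reference model
    for the same prompt; u: a Uniform[0, 1] draw that breaks ties at random.
    """
    m = len(ref_scores)
    below = sum(1 for s in ref_scores if s < s_tgt)   # estimates F(s_tgt^-)
    equal = sum(1 for s in ref_scores if s == s_tgt)  # estimates the point mass
    return (below + u * equal) / m


def cramer_von_mises_statistic(ranks):
    """W^2 = 1/(12n) + sum_i (r_(i) - (2i - 1)/(2n))^2 against Uniform[0, 1]."""
    n = len(ranks)
    ordered = sorted(ranks)
    return 1.0 / (12 * n) + sum(
        (r - (2 * i - 1) / (2 * n)) ** 2 for i, r in enumerate(ordered, start=1)
    )


# Asymptotic 5% critical value of W^2 when the null distribution is fully specified.
CVM_CRITICAL_5PCT = 0.461

# Toy audit: the "API" draws scores from the same distribution as the reference.
rng = random.Random(0)
values, weights = [0, 1, 2, 3], [0.4, 0.3, 0.2, 0.1]  # hypothetical score distribution
ranks = []
for _ in range(40):  # one rank per prompt
    ref_scores = rng.choices(values, weights=weights, k=200)
    api_score = rng.choices(values, weights=weights, k=1)[0]
    ranks.append(randomized_rank(api_score, ref_scores, rng.random()))

w2 = cramer_von_mises_statistic(ranks)
print(f"W^2 = {w2:.3f}, reject at 5%: {w2 > CVM_CRITICAL_5PCT}")
```

Passing `u` in explicitly keeps the tie-breaking reproducible. In a real audit, each prompt would contribute one rank, computed against reference samples generated locally for that same prompt.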

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
cs.CR 2026-04 unverdicted novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.