Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Pith reviewed 2026-05-19 11:31 UTC · model grok-4.3
The pith
A rank-based uniformity test verifies whether black-box LLM APIs serve the claimed authentic model or a substituted version.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a rank-based uniformity test that confirms behavioral equality between a black-box LLM API and a locally deployed authentic reference model by testing whether the ranks of a chosen test statistic follow a uniform distribution under the null hypothesis of no substitution.
What carries the argument
The rank-based uniformity test, which converts output comparisons into ranked values and checks their distribution for uniformity to expose any distributional shift caused by model changes.
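The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `randomized_rank` and `rank_uniformity_pvalue` are hypothetical, the scalar score per response is left abstract, and the finite-sample trick of pooling the API score with the reference scores (the `+ 1` terms) is a choice made here so that the rank is exactly uniform when both sides share one distribution.

```python
import numpy as np
from scipy.stats import cramervonmises


def randomized_rank(s_tgt, ref_scores, rng):
    """Randomized rank of one API score among reference-model scores.

    Ties are broken with a Uniform[0, 1) draw so that the rank is
    continuous even when the score takes discrete values.
    """
    ref_scores = np.asarray(ref_scores)
    below = np.sum(ref_scores < s_tgt)
    ties = np.sum(ref_scores == s_tgt) + 1  # +1 counts the API sample itself
    return (below + rng.uniform() * ties) / (ref_scores.size + 1)


def rank_uniformity_pvalue(api_scores, ref_scores_per_prompt, seed=0):
    """Cramér-von Mises p-value for H0: the ranks are Uniform[0, 1]."""
    rng = np.random.default_rng(seed)
    ranks = np.array([
        randomized_rank(s, ref, rng)
        for s, ref in zip(api_scores, ref_scores_per_prompt)
    ])
    return cramervonmises(ranks, "uniform").pvalue
```

In this sketch each prompt contributes one API score and a batch of reference scores; a small p-value suggests the API's score distribution differs from the reference model's.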
If this is right
- The test detects quantization, harmful fine-tuning, jailbreak prompts, and full model substitution with higher statistical power than earlier methods.
- It operates under tight query budgets while remaining robust when providers attempt to detect and evade testing.
- Users gain a practical way to confirm that an API delivers the exact model advertised without needing weights or logits.
- The method extends to multiple threat models without requiring detectable query sequences.
Where Pith is reading between the lines
- Service-level agreements could incorporate routine application of this test as an independent audit mechanism.
- The same ranking idea might apply to verifying other black-box generative services whose internal distributions matter.
- Combining the test with adaptive prompting could further reduce the number of queries needed for high-confidence decisions.
Load-bearing premise
Outputs from the authentic reference model must produce a test statistic whose ranks are uniformly distributed when no substitution has occurred.
What would settle it
Apply the test to a known authentic model and observe whether the ranks of the test statistic deviate from uniformity at a rate exceeding the expected false-positive threshold.
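A calibration check of this kind can be simulated before touching a real API. The sketch below is a toy under loud assumptions: the "model" is replaced by a Poisson draw standing in for some discrete test statistic, the function name `null_rejection_rate` is hypothetical, and the trial counts are arbitrary. It repeats the audit many times with both sides drawn from the same distribution and reports how often the Cramér–von Mises test rejects at level `alpha`.

```python
import numpy as np
from scipy.stats import cramervonmises


def null_rejection_rate(n_trials=200, n_prompts=50, n_ref=30, alpha=0.05, seed=0):
    """Fraction of audits that reject when API and reference are identical."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        # Toy stand-in for a test statistic: both sides draw from one Poisson law.
        ref = rng.poisson(20, size=(n_prompts, n_ref))
        api = rng.poisson(20, size=n_prompts)
        below = (ref < api[:, None]).sum(axis=1)
        ties = (ref == api[:, None]).sum(axis=1) + 1  # +1 counts the API sample itself
        ranks = (below + rng.uniform(size=n_prompts) * ties) / (n_ref + 1)
        if cramervonmises(ranks, "uniform").pvalue < alpha:
            rejections += 1
    return rejections / n_trials
```

If the uniformity premise holds, the returned rate should sit near `alpha`; a rate well above it would indicate the kind of miscalibration this section describes.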
Original abstract
As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a rank-based uniformity test to audit black-box LLM APIs by checking whether their outputs match those of a locally deployed authentic reference model. The approach is presented as query-efficient, resistant to detectable query patterns, and robust against adversarial providers that might reroute or mix responses. Evaluations across threat models (quantization, harmful fine-tuning, jailbreak prompts, and full model substitution) claim superior statistical power relative to prior methods under limited query budgets.
Significance. If the uniformity assumption under the null holds and the power advantages generalize beyond the reported scenarios, the method could provide a practical, low-overhead tool for users to detect unauthorized model substitutions in commercial LLM APIs. This addresses a timely security and reliability concern in black-box deployments. The focus on adversarial robustness and constrained-query regimes is a positive contribution, though the lack of explicit validation for sampling-parameter fidelity limits the strength of the claims.
major comments (2)
- The central claim rests on the assumption that, when the black-box API matches the local authentic model, the ranks of the chosen test statistic are uniformly distributed under the null (enabling reliable detection of distributional shifts). This assumption is load-bearing for type-I error control and the reported power advantages, yet the evaluations do not include ablations on mismatches in sampling parameters (temperature, top-p, seed) or unmodeled output dependencies that commonly arise in LLM generation. Without such checks, the superiority over prior methods under constrained budgets cannot be fully substantiated.
- The threat-model evaluations (quantization, fine-tuning, jailbreaks, substitution) are relevant but do not stress-test the uniformity property when the local reference and API differ only in decoding hyperparameters. Adding a controlled experiment that varies only these parameters while keeping the underlying model fixed would directly address whether the rank transformation remains uniform and whether the test retains its claimed robustness.
minor comments (2)
- Clarify the exact definition of the test statistic and the rank transformation procedure, including any dependence on prompt distribution or output length, to allow independent reproduction.
- The abstract and introduction would benefit from a concise statement of the precise null and alternative hypotheses in statistical terms.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight an important aspect of validating the core uniformity assumption under the null hypothesis. We address each major comment below and describe the changes incorporated into the revised version.
- Referee: The central claim rests on the assumption that, when the black-box API matches the local authentic model, the ranks of the chosen test statistic are uniformly distributed under the null (enabling reliable detection of distributional shifts). This assumption is load-bearing for type-I error control and the reported power advantages, yet the evaluations do not include ablations on mismatches in sampling parameters (temperature, top-p, seed) or unmodeled output dependencies that commonly arise in LLM generation. Without such checks, the superiority over prior methods under constrained budgets cannot be fully substantiated.
  Authors: We agree that explicit validation of the uniformity property under sampling-parameter mismatches is necessary to fully support the type-I error guarantees and power claims. In the revised manuscript we have added a dedicated ablation study that fixes the underlying model weights while systematically varying only the decoding hyperparameters (temperature, top-p, and random seed) between the local reference and the simulated black-box API. The new results show that, when the hyperparameters match, the rank statistics remain uniformly distributed (p-value histograms are flat and Kolmogorov-Smirnov tests do not reject uniformity). When the hyperparameters differ, the test correctly rejects the null with power that scales with query budget, consistent with the behavioral-equivalence goal of the audit. We have also added a brief discussion of potential unmodeled output dependencies in the limitations section.
  Revision: yes
- Referee: The threat-model evaluations (quantization, fine-tuning, jailbreaks, substitution) are relevant but do not stress-test the uniformity property when the local reference and API differ only in decoding hyperparameters. Adding a controlled experiment that varies only these parameters while keeping the underlying model fixed would directly address whether the rank transformation remains uniform and whether the test retains its claimed robustness.
  Authors: We concur that isolating the effect of decoding hyperparameters provides a direct stress test of the uniformity assumption. The controlled experiment described in our response to the preceding comment has been inserted as a new subsection in the experimental evaluation. It demonstrates that the rank-based test preserves its statistical calibration when hyperparameters are identical and retains high power to detect mismatches, thereby strengthening the robustness claims under limited query budgets.
  Revision: yes
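The controlled experiment discussed in these responses, with weights fixed and only the decoding temperature varied, can be mimicked with a toy model. Everything in the sketch below is an assumption made for illustration: the "model" is a random logit vector per prompt, the test statistic is the reference-model probability of the sampled token, and `audit_pvalue` is a hypothetical name. Because the toy reference distribution is known exactly, the rank is computed from exact probabilities rather than from reference samples.

```python
import numpy as np
from scipy.stats import cramervonmises


def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def audit_pvalue(api_temperature, ref_temperature=1.0, n_prompts=300, vocab=8, seed=0):
    """p-value of the rank uniformity test for a toy temperature ablation."""
    rng = np.random.default_rng(seed)
    logits = 2.0 * rng.normal(size=(n_prompts, vocab))  # same "weights" on both sides
    p_ref = softmax(logits, ref_temperature)
    p_api = softmax(logits, api_temperature)
    ranks = np.empty(n_prompts)
    for i in range(n_prompts):
        token = rng.choice(vocab, p=p_api[i])    # one API response per prompt
        s = p_ref[i, token]                      # statistic: reference probability of it
        below = p_ref[i, p_ref[i] < s].sum()     # mass strictly below s
        ties = p_ref[i, p_ref[i] == s].sum()     # mass exactly at s
        ranks[i] = below + rng.uniform() * ties  # randomized rank
    return cramervonmises(ranks, "uniform").pvalue
```

Calling `audit_pvalue(1.0)` simulates matched hyperparameters, while `audit_pvalue(0.3)` simulates an API that decodes at a lower temperature than the reference; in this toy setting the second call should yield a very small p-value.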
Circularity Check
No circularity: rank-based uniformity test is a self-contained statistical procedure with independent empirical evaluation.
The paper introduces a rank-based uniformity test as a new auditing method for black-box LLM APIs, with the null hypothesis that ranks of the test statistic are uniform when the API matches the local reference model. This assumption is stated explicitly and evaluated empirically across threat scenarios (quantization, fine-tuning, jailbreaks, substitution) rather than derived from fitted parameters or prior self-referential definitions. No equations reduce by construction to inputs, no load-bearing self-citations justify uniqueness theorems, and no ansatz or renaming of known results is presented as a derivation. The method's power and type-I error claims rest on standard statistical properties of rank transforms under the stated null, which are externally verifiable and not forced by the paper's own data fits. This is the common case of an honest, non-circular proposal of a statistical procedure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The rank statistic computed from responses of the authentic model follows a uniform distribution under the null hypothesis of behavioral equality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: rank statistic $r_{\mathrm{tgt}} := F_{\pi_{\mathrm{ref}}}(s_{\mathrm{tgt}}^{-}) + U \cdot P(f(y, x) = s_{\mathrm{tgt}})$, $U \sim \mathrm{Uniform}[0, 1]$; apply the Cramér–von Mises test to $\{r_i\}$.
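For readers who want to see why the randomized rank quoted above should be uniform under the null, a short derivation follows. It is a sketch under an assumption made here for simplicity and not taken from the paper: the statistic $S = f(Y, x)$ with $Y \sim \pi_{\mathrm{ref}}(\cdot \mid x)$ has countable support, with atoms $p_s = P(S = s)$ and left limit $F(s^-) = P(S < s)$.

```latex
% Randomized rank: R = F(S^-) + U p_S, with U ~ Uniform[0,1] independent of S.
% Conditional on S = s, R is uniform on the interval [F(s^-), F(s^-) + p_s].
\begin{align*}
P(R \le r)
  &= \sum_{s} p_s \, P\bigl(F(s^-) + U p_s \le r\bigr) \\
  &= \sum_{s} p_s \, \min\!\Bigl\{1, \max\Bigl\{0, \frac{r - F(s^-)}{p_s}\Bigr\}\Bigr\} \\
  &= \sum_{s} \bigl|\,[F(s^-),\, F(s^-) + p_s] \cap [0, r]\,\bigr| \\
  &= r, \qquad 0 \le r \le 1 .
\end{align*}
% The last step uses that the intervals [F(s^-), F(s^-) + p_s] tile [0, 1].
```

If the API's responses instead follow a different distribution, the atoms are visited with the wrong frequencies and the sum no longer collapses to $r$, which is the departure from uniformity that a goodness-of-fit test such as Cramér–von Mises is meant to pick up.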
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
  VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.