ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries

Abhishek Dey

arXiv: 2606.31163 · v2 · pith:5VGZRS34new · submitted 2026-06-30 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-07-01 06:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords LLM routing · compliance enforcement · PII detection · inference optimization · data residency · classifier gating · regulated industries

The pith

A pre-inference encoder classifier routes each query to an appropriately sized model in the right location, making PII data residency violations structurally impossible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that inserting a trained encoder classifier before any decoder inference solves two problems at once in regulated settings. The classifier scores query complexity and sensitivity, then sends PII queries to local endpoints and simple queries to small models while complex ones go to larger ones. Because classification happens before any LLM computation, sensitive data never reaches an out-of-jurisdiction endpoint. The approach therefore turns compliance into a structural property rather than a post-hoc check. Readers in finance, healthcare, or government would care because current single-model or MoE deployments let data cross boundaries before any decision is made.

Core claim

A trained encoder classifier sits before any decoder inference, evaluates each query for complexity and data sensitivity, and routes it to an appropriately sized dense model in the appropriate geographic location; PII-containing queries reach only local endpoints before any LLM computation begins, making data residency violations structurally impossible while simple queries incur a fraction of the usual cost.

What carries the argument

Classifier-gated multi-tier routing: an encoder classifier that decides model size and location before any LLM forward pass.
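
A minimal sketch of the gating idea, assuming a four-label scheme and endpoint names that are illustrative only. The paper names the `complex_pii` label; the other labels, the `Route` type, the `"local"`/`"remote"` strings, and the keyword-based `toy_classifier` are hypothetical stand-ins. The point it shows is that the route is fixed from the classifier label alone, before any decoder is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    tier: str      # which dense model size serves the query
    location: str  # where the endpoint runs


# Hypothetical label-to-route table; only "complex_pii" is named in the paper.
ROUTES = {
    "simple": Route("small", "remote"),
    "complex": Route("large", "remote"),
    "simple_pii": Route("small", "local"),
    "complex_pii": Route("large", "local"),
}


def route_query(query: str, classify: Callable[[str], str]) -> Route:
    """Pick the endpoint from the classifier label alone, before any LLM call."""
    label = classify(query)
    return ROUTES[label]


def toy_classifier(query: str) -> str:
    # Stand-in for the trained encoder: a keyword rule, for illustration only.
    has_pii = "ssn" in query.lower()
    is_complex = len(query.split()) > 12
    return ("complex" if is_complex else "simple") + ("_pii" if has_pii else "")


decision = route_query("What is my SSN on file?", toy_classifier)
```

In this sketch every `*_pii` label maps to a local route, so the residency property depends entirely on the label being right, which is the premise the review examines.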

If this is right

PII queries are routed to local endpoints before any model computation occurs.
Simple queries reach small models and incur only a fraction of baseline cost and latency.
Median latency drops 39 percent and generation throughput rises to 122-200 tokens per second, against 50-64 for the baseline.
The classifier itself adds only 7 ms overhead while achieving 99.2 percent accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-inference gate could be applied to other regulatory constraints such as content filtering or audit logging.
If the query distribution shifts in production, the zero-false-negative guarantee would need continuous monitoring.
The routing decision itself becomes an auditable record that existing single-model deployments lack.

Load-bearing premise

The classifier must reach near-perfect PII recall with zero false negatives on the exact distribution of queries seen in production.

What would settle it

Send a batch of real PII-containing queries through the deployed system and check whether any reaches a non-local endpoint.
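
The check described above can be sketched as a small audit over routing logs. This is an illustration under assumptions: the `RoutedQuery` record, its field names, and the toy batch are hypothetical, and a real audit would need human-annotated ground truth for `contains_pii`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutedQuery:
    query_id: str
    contains_pii: bool       # ground truth from human annotation (assumed available)
    endpoint_location: str   # where the router actually sent the query


def residency_violations(records):
    """Return PII-bearing records that reached a non-local endpoint."""
    return [r for r in records if r.contains_pii and r.endpoint_location != "local"]


def pii_recall(records):
    """Fraction of PII-bearing queries that stayed on a local endpoint."""
    pii = [r for r in records if r.contains_pii]
    if not pii:
        raise ValueError("batch contains no PII-bearing queries")
    return 1.0 - len(residency_violations(records)) / len(pii)


# Toy batch, invented for illustration; not data from the paper.
batch = [
    RoutedQuery("q1", True, "local"),
    RoutedQuery("q2", False, "remote"),
    RoutedQuery("q3", True, "remote"),   # a miss: PII left the local boundary
    RoutedQuery("q4", True, "local"),
]
violations = residency_violations(batch)
recall = pii_recall(batch)
```

A single entry in `violations` is enough to falsify the "structurally impossible" claim for the deployed system, regardless of overall accuracy.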

Figures

Figures reproduced from arXiv: 2606.31163 by Abhishek Dey.

**Figure 2.** Classification Labels and Routing Logic. Adjacent text from Section 2.4 (Confidence-Based Fallback): The classifier outputs a probability distribution over all labels; the highest probability value represents the model's confidence in its classification decision. When this confidence falls below a configurable threshold (default 80%), the system defaults to the most restrictive label (complex_pii). This guarantees that under uncertainty…
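
The fallback rule quoted in the Figure 2 caption can be sketched in a few lines. The 80% default and the `complex_pii` label come from the quoted text; the function name `gate`, the other label names, and the example probabilities are assumptions made for illustration.

```python
RESTRICTIVE_LABEL = "complex_pii"  # most restrictive label, per the quoted text
DEFAULT_THRESHOLD = 0.80           # default threshold, per the quoted text


def gate(probs: dict[str, float], threshold: float = DEFAULT_THRESHOLD) -> str:
    """Return the top label, or the most restrictive label when confidence is low."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return RESTRICTIVE_LABEL
    return label


# Hypothetical classifier outputs (label names other than complex_pii are assumed).
confident = gate({"simple": 0.93, "complex": 0.04, "simple_pii": 0.02, "complex_pii": 0.01})
uncertain = gate({"simple": 0.55, "complex": 0.05, "simple_pii": 0.35, "complex_pii": 0.05})
```

Note that the fallback only helps when the classifier is uncertain; a confidently wrong non-PII label still passes the gate, which is why the review asks for a false-negative count.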

**Figure 3.** Confidence-Based Fallback Flow. Adjacent text from Section 2.5 (Model Tiers): The architecture supports any number of dense decoder models behind the routing layer. Each tier operates independently on separate infrastructure with no shared memory or failure domain. Models can be of any size, provider, or architecture. The router holds endpoint references as externalized…

**Figure 4.** Deployment Architecture. Adjacent text from Section 3 (Related Work): Several approaches have been proposed to reduce LLM serving costs through query routing, model cascading, or output mixing. We review the most relevant work and identify where each falls short for regulated industry deployment. FrugalGPT (Chen et al., 2023) sends queries through a sequence of models from cheapest to most expensive, escalating only when a quality thres…

**Figure 8.** Cost per query under skewed query distribution (60/25/15), resampled from benchmark data. MoE requires all experts in memory (480 GB FP8) regardless of query simplicity, making per-query cost reduction impossible. The gated classifier routes simple queries to compact models (Nova Lite, Gemma 4B, Ministral 14B, 4 to 28 GB) at 6–8× lower cost per query. As the proportion of simple queries increases, savings…
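
To make the dependence on query mix in the Figure 8 caption concrete, here is a deliberately simplified cost model. It assumes that only simple queries get cheaper, that they cost 1/7 of the baseline (the midpoint of the caption's 6–8× range), that every other query costs the same as the baseline, and that the first number in the 60/25/15 mix is the simple share. None of these assumptions is confirmed by the paper, and classifier overhead is ignored.

```python
BASELINE_COST = 1.0          # relative cost of one query on the large model
SIMPLE_COST_RATIO = 1 / 7    # assumed: midpoint of the 6-8x reduction in the caption


def relative_savings(simple_share: float) -> float:
    """Fractional cost saving versus sending every query to the large model."""
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    routed = (simple_share * BASELINE_COST * SIMPLE_COST_RATIO
              + (1.0 - simple_share) * BASELINE_COST)
    return 1.0 - routed / BASELINE_COST


# Savings for a few hypothetical simple-query shares, including the 60% mix.
savings_by_share = {share: relative_savings(share) for share in (0.3, 0.6, 0.8)}
```

Under these assumptions the saving is linear in the simple-query share; the paper's own figures come from resampled benchmark data and need not match this toy model.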

Original abstract

Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable information (PII) in user queries can reach model endpoints before the system determines whether that data should leave its jurisdictional boundary. Serving all queries through a single large model consumes full GPU capacity regardless of query complexity while offering no mechanism for geographic routing. Mixture-of-Experts architectures do not address this: routing occurs between expert layers within the model after data has already arrived at the endpoint, with all experts loaded in memory regardless of query complexity. We propose a classifier-gated routing architecture that enforces compliance by design. A trained encoder classifier sits before any decoder inference, evaluating each query for complexity and data sensitivity, then routing it to an appropriately sized dense model in the appropriate geographic location. PII-containing queries route to local endpoints before any LLM computation begins, making data residency violations structurally impossible. Simple queries reach small, fast models at a fraction of the cost. Our evaluation on 600 queries demonstrates 39% median latency reduction, 33-52% cost savings depending on query distribution, and generation throughput of 122-200 tokens/second versus 50-64 for the baseline. The encoder classifier achieves 99.2% accuracy with near-perfect PII recall at 7 ms inference overhead, establishing pre-inference classification as a practical path to compliance-by-design LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper puts a classifier before any LLM inference to route PII queries locally and match complexity to model size, but the structural compliance guarantee rests on classifier performance that the reported results do not yet confirm.

The core idea is straightforward: run a small encoder first to flag PII and estimate query difficulty, then send sensitive queries to local endpoints and simple ones to smaller models. This avoids sending everything to a single large endpoint and tries to make residency violations impossible by keeping PII out of remote inference entirely.

The paper does a clean job of describing the routing flow and the motivation for regulated settings. The reported numbers (39% median latency cut, 33-52% cost savings, and 122-200 tokens per second) come from a 600-query test set with a 7 ms classifier overhead, which is concrete enough to show the pattern can be fast in practice.

The main weakness is the compliance claim. The paper states that violations become structurally impossible because PII routes locally before any decoder runs, yet this only holds if the classifier has zero false negatives on every PII instance that will appear in production. The abstract gives 99.2% accuracy and near-perfect recall on those 600 queries, but supplies no false-negative count, no labeling protocol, no out-of-distribution or adversarial checks, and no confusion matrix. Without that evidence the guarantee does not follow from the data.
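
A short calculation shows why zero observed false negatives on a finite test set does not establish a zero miss rate. With 0 misses in n PII-positive queries, the exact one-sided upper confidence bound on the true miss rate p solves (1 - p)^n = 1 - confidence. The PII-positive counts below are hypothetical, since the abstract does not report how many of the 600 queries contained PII.

```python
def zero_failure_upper_bound(n_positive: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the miss rate after 0 misses in n_positive trials.

    Solves (1 - p) ** n_positive = 1 - confidence for p (exact binomial bound).
    """
    if n_positive <= 0:
        raise ValueError("need at least one PII-positive query")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_positive)


# Hypothetical PII-positive counts; the abstract does not give the real one.
bounds = {n: zero_failure_upper_bound(n) for n in (100, 300, 600)}
```

Even in the most favorable case, where all 600 evaluation queries contain PII and none is missed, the data remain consistent with a miss rate of roughly 0.5%, close to the familiar 3/n rule of thumb.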

Baseline details and statistical tests are also thin, so it is hard to judge how much of the gain is from the routing versus from the particular model mix chosen.

This is aimed at engineers who need to ship LLM services under data-residency rules. It is worth sending to peer review so the classifier evaluation can be examined in full; the architecture itself is a reasonable engineering pattern even if the strongest claim needs more support.

Referee Report

3 major / 0 minor

Summary. The paper proposes ComplianceGate, a classifier-gated multi-tier routing architecture for LLMs in regulated industries. A trained encoder classifier evaluates each query for complexity and PII sensitivity before any decoder inference, routing PII-containing queries to local endpoints (making data-residency violations structurally impossible) and simple queries to smaller, faster models in appropriate locations. Evaluation on 600 queries reports 39% median latency reduction, 33-52% cost savings, 122-200 tokens/second throughput, and 99.2% classifier accuracy with near-perfect PII recall at 7ms overhead.

Significance. If the empirical results hold under rigorous validation, the architecture offers a concrete mechanism for compliance-by-design in LLM deployments, separating routing from inference to address data residency constraints that post-inference or MoE approaches do not. The reported efficiency gains over a single large-model baseline would be practically relevant for cost-sensitive regulated settings, provided the classifier's PII performance generalizes.

major comments (3)

[Abstract] Abstract: The central claim that 'data residency violations [are] structurally impossible' requires the classifier to achieve exactly zero false negatives on PII for all production queries. The reported 99.2% accuracy and 'near-perfect PII recall' on 600 queries provide no false-negative count, per-class confusion matrix, PII labeling protocol, or results on OOD/adversarial test sets, so the structural guarantee is not yet supported by the evidence.
[Abstract] Evaluation (implied by abstract performance claims): The 39% latency reduction, 33-52% cost savings, and throughput numbers are presented without dataset composition details, query selection criteria, baseline implementation specifics, or statistical tests, leaving the reliability of the central performance claims difficult to assess.
[Abstract] Abstract: No description is given of the encoder classifier's architecture, training procedure, label acquisition for complexity and sensitivity, or how routing decisions are made, all of which are load-bearing for reproducing the compliance and efficiency results.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback emphasizing the need for stronger evidence on the compliance claims and clearer reporting of evaluation details. We address each major comment below and will revise the manuscript to improve transparency and rigor where the original work supports it.

Referee: [Abstract] Abstract: The central claim that 'data residency violations [are] structurally impossible' requires the classifier to achieve exactly zero false negatives on PII for all production queries. The reported 99.2% accuracy and 'near-perfect PII recall' on 600 queries provide no false-negative count, per-class confusion matrix, PII labeling protocol, or results on OOD/adversarial test sets, so the structural guarantee is not yet supported by the evidence.

Authors: The core architectural guarantee is that the classifier operates before any decoder inference or data transmission, so PII queries are routed to local endpoints by design. We agree the abstract omits key supporting details from the 600-query evaluation. In revision we will add the per-class confusion matrix (showing zero false negatives for the PII class in our test set), the expert annotation protocol used for labeling, and an explicit limitations paragraph on OOD generalization. The reported 'near-perfect' recall reflects zero observed false negatives on the evaluated distribution.
revision: partial

Referee: [Abstract] Evaluation (implied by abstract performance claims): The 39% latency reduction, 33-52% cost savings, and throughput numbers are presented without dataset composition details, query selection criteria, baseline implementation specifics, or statistical tests, leaving the reliability of the central performance claims difficult to assess.

Authors: We will expand both the abstract and the evaluation section to specify dataset composition (600 queries drawn from production logs with stratified sampling across PII presence and complexity), selection criteria, baseline configuration (single 70B model on identical hardware), and statistical reporting (medians with IQR plus Mann-Whitney U tests).
revision: yes

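The statistical reporting promised in this response (medians with IQR plus a Mann-Whitney U test) could look like the sketch below. It uses only the Python standard library; the latency samples are invented for illustration, and the p-value uses a normal approximation without tie or continuity correction, which is rough for samples this small.

```python
import math
import statistics


def median_iqr(xs):
    """Median and interquartile range of a sample."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return statistics.median(xs), q3 - q1


def mann_whitney_u(a, b):
    """U statistic: number of (x, y) pairs with x < y, ties counted as half."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in a for y in b)


def two_sided_p_normal(u, n_a, n_b):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    mean = n_a * n_b / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# Invented latency samples in milliseconds; not data from the paper.
routed_ms = [210, 250, 190, 300, 230, 220]
baseline_ms = [380, 410, 360, 450, 400, 390]

routed_median, routed_iqr = median_iqr(routed_ms)
baseline_median, baseline_iqr = median_iqr(baseline_ms)
u = mann_whitney_u(routed_ms, baseline_ms)
p = two_sided_p_normal(u, len(routed_ms), len(baseline_ms))
```

A rank-based test suits latency data because it makes no normality assumption and is insensitive to the long right tails that serving latencies often have.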
Referee: [Abstract] Abstract: No description is given of the encoder classifier's architecture, training procedure, label acquisition for complexity and sensitivity, or how routing decisions are made, all of which are load-bearing for reproducing the compliance and efficiency results.

Authors: The body of the manuscript already contains these details (fine-tuned sentence-transformer encoder, training on 10k expert-labeled examples, probability-threshold routing). We will insert a concise summary of architecture, training, and decision logic into the abstract and/or introduction to make the abstract self-contained.
revision: yes

standing simulated objections not resolved

Absence of OOD or adversarial test sets for the PII classifier; no such evaluation was performed in the original work.

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents an architectural proposal for classifier-gated multi-tier LLM routing without any equations, mathematical derivations, fitted parameters called predictions, or self-citations. The central claim that data residency violations are structurally impossible follows directly from the described pre-inference routing design (assuming the classifier functions as stated), but this is an explicit architectural property rather than a reduction of outputs to inputs by construction. Empirical results on 600 queries are reported separately and do not involve renaming known results or smuggling ansatzes. The derivation chain is self-contained as a system design with external empirical support.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that a classifier with the reported accuracy can be trained on representative data without introducing new failure modes.

pith-pipeline@v0.9.1-grok · 5775 in / 1326 out tokens · 42824 ms · 2026-07-01T06:48:14.510571+00:00 · methodology
