ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries

Abhishek Dey

arxiv: 2606.31163 · v2 · pith:5VGZRS34new · submitted 2026-06-30 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-07-02 20:14 UTC · model grok-4.3

keywords: LLM routing · PII detection · compliance by design · inference optimization · data residency · classifier-gated architecture · regulated industries · multi-tier deployment

The pith

A pre-inference encoder classifier routes PII queries to local endpoints and simple queries to small models, making data residency violations impossible by design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that inserting a trained encoder classifier before any LLM decoder can evaluate each query for sensitivity and complexity, then direct it to the right model tier and location. This matters because regulated industries currently risk sending personal data to external endpoints before any compliance check occurs. If the classifier works as described, PII never leaves its jurisdiction and simple queries avoid full-size models. The reported results include 99.2 percent classification accuracy, near-perfect PII recall, 39 percent median latency drop, and 33-52 percent cost reduction on 600 test queries. The 7 ms overhead of the classifier is presented as small enough to keep the approach practical.

Core claim

A classifier-gated routing architecture places a trained encoder ahead of all decoder inference. The encoder labels each query for data sensitivity and complexity, then routes PII-containing queries exclusively to local endpoints and simple queries to smaller dense models. This ordering ensures no LLM computation begins until the routing decision is complete, so data residency rules are satisfied structurally rather than through post-hoc checks.
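The ordering described above (classify first, then pick an endpoint, and only then call a decoder) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: only the label `complex_pii` appears in the source, so the other labels, the endpoint names, the `Endpoint` type, and the keyword-based `toy_classifier` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Endpoint:
    name: str
    local: bool  # True if the endpoint sits inside the data's jurisdiction


# Hypothetical routing table. Only the label "complex_pii" is named in the
# source; the other labels and every endpoint name are assumptions.
ROUTES: Dict[str, Endpoint] = {
    "simple": Endpoint("small-dense-remote", local=False),
    "complex": Endpoint("large-dense-remote", local=False),
    "simple_pii": Endpoint("small-dense-local", local=True),
    "complex_pii": Endpoint("large-dense-local", local=True),
}


def validate_routes(routes: Dict[str, Endpoint]) -> None:
    """Fail at startup if any PII label points at a non-local endpoint."""
    for label, endpoint in routes.items():
        if label.endswith("_pii") and not endpoint.local:
            raise ValueError(f"PII label {label!r} maps to non-local {endpoint.name!r}")


def route(query: str, classify: Callable[[str], str],
          routes: Dict[str, Endpoint] = ROUTES) -> Endpoint:
    """Choose an endpoint from the classifier label alone; no decoder runs here."""
    label = classify(query)
    # Unknown labels fail closed to the most restrictive route.
    return routes.get(label, routes["complex_pii"])


def toy_classifier(query: str) -> str:
    """Keyword stand-in for the trained encoder; for demonstration only."""
    text = query.lower()
    has_pii = "ssn" in text or "@" in text
    base = "complex" if len(text.split()) > 12 else "simple"
    return f"{base}_pii" if has_pii else base


validate_routes(ROUTES)
```

The table-level check expresses the structural part of the claim: a PII label can never resolve to a remote endpoint. Whether a given query receives the right label is a separate, empirical question that the routing table cannot answer.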

What carries the argument

A trained encoder classifier that evaluates each query for complexity and data sensitivity before any decoder inference begins.

If this is right

PII data never reaches non-local model endpoints because routing occurs before any LLM computation.
Simple queries incur only the cost and latency of small dense models rather than full-size ones.
Median latency falls 39 percent and generation throughput rises to 122-200 tokens per second, versus 50-64 for the baseline.
Cost savings range from 33 to 52 percent depending on the mix of query types.
Compliance is enforced by the routing topology itself rather than by additional runtime verification.
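To see why the savings depend on the query mix, a blended-cost calculation can help. The sketch below is illustrative only: the 60/25/15 split and the "6–8× lower cost" figure come from the Figure 8 caption, but the tier names, the choice of 7× as a midpoint, and the assumption that the remaining tiers cost the same as the baseline are mine, not the paper's.

```python
from typing import Dict


def blended_cost(shares: Dict[str, float], relative_cost: Dict[str, float]) -> float:
    """Average cost per query, with each tier's cost expressed relative to the baseline."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("query shares must sum to 1")
    return sum(shares[tier] * relative_cost[tier] for tier in shares)


def savings(shares: Dict[str, float], relative_cost: Dict[str, float]) -> float:
    """Fractional saving against a baseline whose cost per query is 1.0."""
    return 1.0 - blended_cost(shares, relative_cost)


# Hypothetical inputs: tier names and costs are assumptions for illustration.
shares = {"simple": 0.60, "medium": 0.25, "complex": 0.15}
relative_cost = {"simple": 1 / 7, "medium": 1.0, "complex": 1.0}

print(f"blended cost: {blended_cost(shares, relative_cost):.3f}")
print(f"saving:       {savings(shares, relative_cost):.1%}")
```

Under these assumed inputs almost all of the saving comes from the simple tier, so a workload with fewer simple queries would save proportionally less. The numbers are not a reproduction of the paper's reported range.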

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-inference gate could be extended to additional constraints such as model capability matching or regulatory jurisdiction without changing the core ordering.
If classifier errors occur mainly on edge cases, organizations could add a small human-review queue for low-confidence classifications.
The approach separates compliance logic from model architecture, so it could be applied to existing dense or mixture-of-experts deployments.

Load-bearing premise

The encoder classifier can correctly identify PII and query complexity on unseen inputs with 99.2 percent accuracy and near-perfect recall.

What would settle it

A held-out set of queries containing previously unseen PII patterns that the classifier fails to flag, allowing data to reach a non-local endpoint.

Figures

Figures reproduced from arXiv: 2606.31163 by Abhishek Dey.

**Figure 2.** Classification Labels and Routing Logic.

2.4 Confidence-Based Fallback. The classifier outputs a probability distribution over all labels; the highest probability value represents the model's confidence in its classification decision. When this confidence falls below a configurable threshold (default 80%), the system defaults to the most restrictive label (complex_pii). This guarantees that under uncertainty…

**Figure 3.** Confidence-Based Fallback Flow.

2.5 Model Tiers. The architecture supports any number of dense decoder models behind the routing layer. Each tier operates independently on separate infrastructure with no shared memory or failure domain. Models can be of any size, provider, or architecture. The router holds endpoint references as externalized…

**Figure 4.** Deployment Architecture.

3. Related Work. Several approaches have been proposed to reduce LLM serving costs through query routing, model cascading, or output mixing. We review the most relevant work and identify where each fall short for regulated industry deployment. FrugalGPT (Chen et al., 2023) sends queries through a sequence of models from cheapest to most expensive, escalating only when a quality thres…

**Figure 8.** Cost per query under skewed query distribution (60/25/15), resampled from benchmark data. MoE requires all experts in memory (480 GB FP8) regardless of query simplicity, making per-query cost reduction impossible. The gated classifier routes simple queries to compact models (Nova Lite, Gemma 4B, Ministral 14B - 4 to 28 GB) at 6–8× lower cost per query. As the proportion of simple queries increases, savings…
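The confidence-based fallback quoted with Figure 2 (take the top label, but default to `complex_pii` when its probability falls below a threshold of 80%) can be sketched as below. The threshold default and the fallback label come from that excerpt; the function name `gate`, its return shape, and the example label names other than `complex_pii` are assumptions made for illustration.

```python
from typing import Dict, Tuple

FALLBACK_LABEL = "complex_pii"  # most restrictive label, per the quoted excerpt
DEFAULT_THRESHOLD = 0.80        # default threshold, per the quoted excerpt


def gate(probs: Dict[str, float],
         threshold: float = DEFAULT_THRESHOLD) -> Tuple[str, float, bool]:
    """Return (label, confidence, fell_back) for one classifier output.

    `probs` maps each label to its predicted probability. The top label is
    kept only if its probability is at least `threshold`; otherwise the
    most restrictive label is used instead.
    """
    if not probs:
        raise ValueError("empty probability distribution")
    label, confidence = max(probs.items(), key=lambda item: item[1])
    if confidence < threshold:
        return FALLBACK_LABEL, confidence, True
    return label, confidence, False


# A confident prediction is kept; an uncertain one is overridden.
print(gate({"simple": 0.95, "complex_pii": 0.05}))
print(gate({"simple": 0.60, "complex": 0.30, "complex_pii": 0.10}))
```

Note what this mechanism does and does not cover: it fails closed when the classifier is unsure, but a confidently wrong prediction (a PII query labeled non-PII with high probability) passes through unchanged.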

Original abstract

Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable information (PII) in user queries can reach model endpoints before the system determines whether that data should leave its jurisdictional boundary. Serving all queries through a single large model consumes full GPU capacity regardless of query complexity while offering no mechanism for geographic routing. Mixture-of-Experts architectures do not address this: routing occurs between expert layers within the model after data has already arrived at the endpoint, with all experts loaded in memory regardless of query complexity. We propose a classifier-gated routing architecture that enforces compliance by design. A trained encoder classifier sits before any decoder inference, evaluating each query for complexity and data sensitivity, then routing it to an appropriately sized dense model in the appropriate geographic location. PII-containing queries route to local endpoints before any LLM computation begins, making data residency violations structurally impossible. Simple queries reach small, fast models at a fraction of the cost. Our evaluation on 600 queries demonstrates 39% median latency reduction, 33-52% cost savings depending on query distribution, and generation throughput of 122-200 tokens/second versus 50-64 for the baseline. The encoder classifier achieves 99.2% accuracy with near-perfect PII recall at 7 ms inference overhead, establishing pre-inference classification as a practical path to compliance-by-design LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The routing idea is practical but the 'structurally impossible' compliance claim rests on an unverified classifier with thin evaluation support.

The paper's core proposal is a classifier that sits before any LLM inference, tags queries for PII and complexity, then routes them to the right model size and geographic location. That pre-inference gate is the main new piece; it combines PII detection with tiered routing in one system rather than handling compliance after data arrives or inside an MoE layer.

The reported numbers are the part that could matter in practice: 39% median latency drop, 33-52% cost savings, and the classifier adding only 7 ms while hitting 99.2% accuracy on 600 queries. Those figures give a concrete sense of the overhead and potential savings if the classifier holds up.

The soft spot is the central guarantee. The text states that PII queries route locally before any computation, making residency violations structurally impossible. That only follows if the classifier has zero false negatives on every input it will ever see. The 99.2% figure and near-perfect recall come from one test set of 600 queries with no reported details on query sources, labeling process, out-of-distribution cases, or adversarial PII. No runtime verification or fallback is described either. A single miss breaks the structural claim, and the evaluation does not rule that out.

The cost and latency results are harder to judge without baseline architecture details or statistical tests. The work is aimed at teams already running LLMs in finance or healthcare who need routing options. It deserves a referee to check whether the full methods section supplies the missing data and robustness checks; the idea is worth testing even if the strongest claim needs tempering.

Referee Report

2 major / 1 minor

Summary. The paper proposes ComplianceGate, a classifier-gated multi-tier LLM routing architecture for regulated industries. A trained encoder classifier evaluates each incoming query for PII sensitivity and complexity before any decoder inference begins, routing PII queries exclusively to local endpoints and simple queries to smaller, cheaper models in appropriate geographic locations. The central claims are that this makes data-residency violations structurally impossible and yields 39% median latency reduction, 33-52% cost savings, and 122-200 tokens/s throughput on an evaluation of 600 queries, with the classifier achieving 99.2% accuracy and near-perfect PII recall at 7 ms overhead.

Significance. If the empirical results hold under broader testing, the pre-inference classification approach would offer a practical compliance-by-design mechanism that current single-model or MoE deployments lack, while delivering measurable efficiency gains. The work highlights a concrete engineering pattern for separating compliance enforcement from model inference.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the manuscript reports aggregate metrics (99.2% accuracy, near-perfect PII recall, 39% latency reduction, etc.) on 600 queries but supplies no information on query sources, labeling process for PII, baseline system details, statistical significance testing, or error analysis. This leaves the central performance and compliance claims without verifiable support.
[Abstract] Abstract: the claim that routing PII-containing queries to local endpoints makes data residency violations 'structurally impossible' holds only if the classifier has zero false negatives on every possible input. The reported 99.2% accuracy and near-perfect recall on 600 queries do not establish this guarantee; OOD robustness, adversarial or encoded PII, and formal error bounds are not addressed, and no runtime verification or fallback is described.
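The gap between "no misses observed" and "no misses possible" can be made concrete with a standard zero-failure bound. If a classifier misses each PII query independently with probability p, the chance of seeing zero misses in n PII-positive test queries is (1 − p)^n, so the largest p still consistent with that outcome at confidence level c is 1 − (1 − c)^(1/n). The sketch below computes it. The number of PII-positive queries in the 600-query set is not reported, so every value of `n_pii` used here is an assumption, and the function names are hypothetical.

```python
def fn_rate_upper_bound(n_pii: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the false-negative rate after zero observed
    misses in `n_pii` independent PII-positive test queries."""
    if n_pii <= 0:
        raise ValueError("n_pii must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_pii)


def expected_misses(rate: float, pii_queries: int) -> float:
    """Expected number of missed PII queries at a given false-negative rate."""
    return rate * pii_queries


# Assumed test-set compositions; the paper's actual split is not reported.
for n in (100, 300, 600):
    bound = fn_rate_upper_bound(n)
    print(f"n_pii={n:4d}  95% upper bound on miss rate: {bound:.4%}")
```

Even in the most favorable reading, where all 600 test queries contained PII and none were missed, the 95% bound is roughly 0.5 percent (close to the familiar "rule of three" value of 3/n), which is far from the zero rate that a structural guarantee would require.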

minor comments (1)

[Introduction] The distinction between internal MoE routing and the proposed system-level routing could be stated more explicitly to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each of the major comments below and indicate the planned revisions.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the manuscript reports aggregate metrics (99.2% accuracy, near-perfect PII recall, 39% latency reduction, etc.) on 600 queries but supplies no information on query sources, labeling process for PII, baseline system details, statistical significance testing, or error analysis. This leaves the central performance and compliance claims without verifiable support.

Authors: We agree that the current manuscript does not provide sufficient details on the query sources, PII labeling process, baseline system, statistical significance testing, or error analysis. These elements will be added to the Evaluation section and summarized in the abstract in the revised version to ensure the performance and compliance claims are fully supported and verifiable.
revision: yes

Referee: [Abstract] Abstract: the claim that routing PII-containing queries to local endpoints makes data residency violations 'structurally impossible' holds only if the classifier has zero false negatives on every possible input. The reported 99.2% accuracy and near-perfect recall on 600 queries do not establish this guarantee; OOD robustness, adversarial or encoded PII, and formal error bounds are not addressed, and no runtime verification or fallback is described.

Authors: We concur that the strong claim in the abstract requires qualification, as the classifier's performance is evaluated on a finite set of 600 queries and does not guarantee zero false negatives across all inputs. We will revise the abstract to clarify that data residency violations are rendered impossible for queries correctly classified as containing PII. Additionally, we will include a new Limitations subsection addressing OOD robustness, adversarial and encoded PII, the lack of formal error bounds, and the current absence of runtime verification or fallback mechanisms, framing these as important areas for future research.
revision: yes

Circularity Check

0 steps flagged

No circularity: architecture proposal with empirical evaluation only

The paper describes a classifier-gated routing architecture and reports measured performance (99.2% accuracy, latency/cost reductions) on a 600-query evaluation set. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claim that PII routing makes residency violations 'structurally impossible' is an architectural assertion whose validity rests on classifier behavior, not on any reduction of results to inputs by construction. This is a standard empirical systems paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard supervised learning assumptions for the classifier and on empirical results from an evaluation whose data sources and methodology are not reported; no new physical entities or ungrounded mathematical axioms are introduced.

free parameters (1)

classifier decision thresholds for complexity and PII sensitivity
Routing logic depends on outputs of a trained classifier whose decision boundaries are fitted during training.

axioms (1)

domain assumption: A supervised classifier can be trained to achieve 99.2% accuracy and near-perfect PII recall on the target query distribution.
Invoked by the evaluation claims and the assertion that pre-inference classification establishes compliance-by-design.

pith-pipeline@v0.9.1-grok · 5775 in / 1451 out tokens · 34830 ms · 2026-07-02T20:14:10.598902+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 7 canonical work pages · 5 internal anchors

[1] Chen, L., Zaharia, M., and Zou, J. "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv:2305.05176.

[2] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. "RouteLLM: Learning to Route LLMs with Preference Data." arXiv:2406.18665.

[3] Madaan, A., Tandon, N., Chen, L., Anantha, R., Cheng, P., and Bansal, M. "AutoMix: Automatically Mixing Language Models." arXiv:2310.12963.

[4] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Ruhle, V., Lakshmanan, L., and Awadallah, A. H. "Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing." arXiv:2404.14618.

[5] Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., et al. "Mixtral of Experts." arXiv:2401.04088.

[6] Qwen Team. "Qwen3 Technical Report." arXiv:2505.09388.

[7] Touvron, H., Martin, L., Stone, K., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.

[8] 8 July 2026, Bengaluru, KA, India. Abhishek Dey. Appendix-1: System Architecture. The following diagrams illustrate the system architecture, request flow, classification logic, and deployment topology described in Section…
