Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6
Pith reviewed 2026-05-15 08:27 UTC · model grok-4.3
The pith
Claude Opus 4.6 fabricates answers more often for Global North AI queries than for Global South ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contrary to the initial hypothesis, Claude Opus 4.6 produces higher fabrication rates for Global North queries than for Global South counterparts. This pattern aligns with a partial-knowledge mechanism in which the model attempts answers more frequently for Global North contexts but commits to incorrect values. The fabrication profile creates an exploitable vulnerability in which an adversarial system could frame harmful actions in governance or public attitude terms to reduce detection probability.
What carries the argument
The AI Control Knowledge Framework (ACKF), a six-dimension thematic scheme operationalized with 17 verified indicators from the Global AI Dataset v2, paired with a five-category response classification that distinguishes verifiable fabrication from honest refusal.
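The five categories are named in the paper's contribution excerpt (reference graph entry [2] below): VF (verifiable fabrication), HF (correct response), HR (honest refusal), QH (qualitative hedging), and MF (misattribution). A minimal sketch of how a single response might be scored against a GAID v2 ground-truth value, with the decision rules assumed rather than drawn from the paper:

```python
from enum import Enum

class ResponseCategory(Enum):
    VF = "verifiable fabrication"   # committed to a value contradicting GAID v2
    HF = "correct response"         # committed to a value matching GAID v2
    HR = "honest refusal"           # declined to answer
    QH = "qualitative hedging"      # answered without a checkable value
    MF = "misattribution"           # right value, wrong country/metric/year

def classify(value, refused, hedged, truth, tolerance=0.05):
    """Assumed decision rules; the paper does not publish its coding logic."""
    if refused:
        return ResponseCategory.HR
    if hedged or value is None:
        return ResponseCategory.QH
    if abs(value - truth) <= tolerance * abs(truth):
        return ResponseCategory.HF
    # MF (a correct value attributed to the wrong country, metric, or year)
    # would require cross-checking neighbouring GAID v2 cells; collapsed to
    # VF in this sketch.
    return ResponseCategory.VF
```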
If this is right
- Adversarial AI systems could frame harmful actions using Global North governance or public attitude terms to lower the probability of detection by the monitor.
- AI control protocols that assume uniform monitor accuracy require geographic calibration to close the identified gaps.
- Monitor reliability varies by the regional familiarity of query content, affecting oversight in diverse deployment settings.
- Design of future control systems must account for partial knowledge as a distinct failure mode separate from outright refusal.
Where Pith is reading between the lines
- Similar partial-knowledge patterns may appear in other large language models deployed for oversight tasks.
- Control architectures could reduce risk by requiring explicit refusal on queries involving uncertain geographic facts rather than allowing attempted answers (a minimal gating sketch follows this list).
- Extending the indicator set beyond the current 17 might test whether the North-South disparity persists across additional governance domains.
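A minimal sketch of the refusal-gating idea above: a wrapper that forces honest refusal whenever a query touches a (country, metric) pair whose audited fabrication rate exceeds a threshold. The monitor interface, the audit table, and the threshold are all hypothetical, not from the paper:

```python
# Hypothetical refusal gate: consult a table of (country, metric) pairs whose
# audited VF rate exceeds a threshold, and force refusal instead of letting
# the monitor attempt an answer it is likely to fabricate.
REFUSAL = "I cannot verify this value; refusing rather than guessing."

def gated_monitor(country, metric, monitor_fn, audit_table, max_vf_rate=0.10):
    vf_rate = audit_table.get((country, metric), 1.0)  # unknown pair -> refuse
    if vf_rate > max_vf_rate:
        return REFUSAL  # explicit refusal on uncertain geographic facts
    return monitor_fn(country, metric)

# Usage, with an audit table built from a prior ACKF-style audit run:
audit_table = {("Norway", "compute_concentration"): 0.22,
               ("Kenya", "compute_concentration"): 0.04}
print(gated_monitor("Norway", "compute_concentration",
                    lambda c, m: "attempted answer", audit_table))
# -> refuses, because the audited VF rate (0.22) exceeds the 0.10 threshold
```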
Load-bearing premise
The 17 indicators drawn from GAID v2 and the five-category response classification scheme accurately measure the factual knowledge required for reliable AI control monitoring across all deployment contexts.
What would settle it
A re-audit that applies a different set of indicators or expands the country sample and finds equal or lower fabrication rates for Global North queries would undermine the central claim.
Original abstract
Artificial intelligence (AI) control protocols assume that trusted large language model (LLM) monitors reliably assess proposed actions across all deployment contexts. This paper tests that assumption in the geographic dimension. We audit Claude Opus 4.6, the monitor specified in Apart Research's AI Control Hackathon Track 3 benchmark, for systematic gaps in its factual knowledge of the global AI landscape. We develop the AI Control Knowledge Framework (ACKF), a six-dimension thematic scheme, and operationalise it with 17 verified indicators drawn from the Global AI Dataset v2 (GAID v2): 24,453 indicators across 227 countries published on Harvard Dataverse. A five-category response classification scheme distinguishes verifiable fabrication (VF) from honest refusal (HR); logistic regression with country-clustered standard errors combined with difference-in-differences (DiD) estimation quantifies geographic disparities in monitor accuracy across 2,820 country-metric-year observations. Contrary to our initial hypothesis, Claude Opus 4.6 produces higher fabrication rates for Global North queries than for Global South counterparts, a pattern consistent with a partial-knowledge mechanism in which the model attempts answers more frequently for Global North contexts but commits to incorrect values. This fabrication profile constitutes an exploitable vulnerability, where an adversarial AI system could frame harmful actions in governance or public attitude terms to reduce the probability of detection. This study provides the first cross-national, multi-domain audit of an AI control monitor's geographic knowledge gaps, with direct implications for the design of control protocols.
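The abstract names the estimator (logistic regression with country-clustered standard errors plus DiD) but not the specification; the referee's minor comments below flag the same gap. A hypothetical minimal version in Python with statsmodels, where the covariates, the variant-based DiD contrast, and the file name are assumptions rather than the paper's actual setup:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per country-metric-year observation (2,820 in
# the paper), with columns vf (1 if verifiable fabrication), north (1 for
# Global North), variant (query variant), metric, year, country.
df = pd.read_csv("ackf_observations.csv")  # assumed file, not from the paper

# Logistic regression with country-clustered standard errors; the north:variant
# interaction carries a DiD-style contrast between query framings.
model = smf.logit("vf ~ north * C(variant) + C(metric) + C(year)", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["country"]})
print(result.summary())
```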
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper audits Claude Opus 4.6 as an AI control monitor for geographic knowledge gaps using the AI Control Knowledge Framework (ACKF) operationalized via 17 indicators from GAID v2 (24,453 indicators across 227 countries). It applies logistic regression with country-clustered standard errors and difference-in-differences estimation to 2,820 country-metric-year observations, finding higher verifiable-fabrication rates on Global North queries than Global South ones—contrary to the initial hypothesis—and interprets this as evidence of a partial-knowledge mechanism that creates exploitable vulnerabilities in control protocols.
Significance. If the measurement and results hold, the work provides the first cross-national empirical audit of an AI control monitor's geographic blind spots, highlighting a concrete risk that adversarial systems could exploit by framing actions in Global North governance terms. The use of a public dataset (GAID v2 on Harvard Dataverse) and standard econometric methods supports reproducibility and falsifiability.
major comments (1)
- [Abstract] The central claim that higher VF rates on Global North queries reflect a partial-knowledge mechanism (and thus an exploitable vulnerability) depends on the 17 GAID v2 indicators and the five-category VF/HR/etc. classification correctly operationalizing the factual knowledge required for reliable AI control monitoring. No validation is reported, such as expert review of indicator relevance to control protocols, inter-rater reliability for the response classification, or robustness checks under alternative codings, which is load-bearing for interpreting the North–South reversal.
minor comments (2)
- [Abstract] Model specification details are absent (e.g., exact covariates in the logistic regression, the precise DiD setup, or how the 2,820 observations are constructed from the 24,453 indicators).
- [Abstract] The initial hypothesis is stated but not formalized; a brief statement of the expected direction and its theoretical basis would clarify the reversal finding.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the validation of our indicators and classification scheme. We agree this is a substantive point and outline revisions below to address it directly while preserving the integrity of our empirical approach.
Point-by-point responses
- Referee: The central claim that higher VF rates on Global North queries reflect a partial-knowledge mechanism (and thus an exploitable vulnerability) depends on the 17 GAID v2 indicators and the five-category VF/HR/etc. classification correctly operationalizing the factual knowledge required for reliable AI control monitoring. No validation is reported, such as expert review of indicator relevance to control protocols, inter-rater reliability for the response classification, or robustness checks under alternative codings, which is load-bearing for interpreting the North–South reversal.
Authors: We acknowledge that the original manuscript does not include formal validation steps for the 17 GAID v2 indicators or the five-category response classification. The indicators were selected from the publicly documented GAID v2 (Harvard Dataverse) for their alignment with the ACKF dimensions on AI governance, policy, and attitudes; the classification scheme distinguishes verifiable fabrication from honest refusal following conventions in LLM auditing. The North–South reversal emerges consistently across logistic models with clustered errors and DiD specifications. In the revised version we will add a dedicated 'Validation and Robustness' subsection in Methods that includes: (1) explicit mapping of each indicator to AI control protocol requirements (referencing the Apart Research benchmark), (2) inter-rater reliability results from double-coding a 10% subsample of responses (target Cohen's kappa > 0.80), and (3) robustness tables using alternative category collapses and indicator subsets. These additions will support the partial-knowledge interpretation without changing the reported estimates.
Revision: yes
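A minimal sketch of the inter-rater check the rebuttal promises, assuming scikit-learn for the kappa computation (neither the paper nor the rebuttal names tooling) and using illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two coders independently label the same 10% subsample of monitor responses
# with the five ACKF categories; labels below are illustrative only.
coder_a = ["VF", "HR", "HR", "QH", "HF", "HR", "VF", "MF", "HR", "HR"]
coder_b = ["VF", "HR", "QH", "QH", "HF", "HR", "VF", "MF", "HR", "HR"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")  # the rebuttal's target is > 0.80
```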
Circularity Check
No significant circularity; empirical audit against external dataset
full rationale
The paper performs a standard empirical audit: it draws 17 indicators from the external GAID v2 dataset (Harvard Dataverse), defines a response classification scheme, queries Claude Opus 4.6, and applies logistic regression plus DiD estimation to 2,820 observations. No equations, fitted parameters, or self-citations reduce the reported fabrication rates or the North-South disparity to a quantity defined by the same data or the authors' prior work. The central finding contradicts the authors' initial hypothesis, which is itself evidence that the analysis carries independent empirical content rather than a tautological reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The six-dimension AI Control Knowledge Framework and its 17 indicators validly capture the factual knowledge needed for reliable control monitoring.
- domain assumption: The five-category response classification scheme reliably distinguishes verifiable fabrication from honest refusal.
Reference graph
Works this paper leans on
- [1] Internal anchor, Introduction: "Artificial intelligence (AI) control protocols address the problem of maintaining human oversight over autonomous systems that may actively work to undermine such oversight [1, 2]. In a standard design, a trusted large language model (LLM) monitor sits between an untrusted AI agent and its environment; the monitor evaluates proposed act..."
- [2] Internal anchor, Contributions: "A five-category response classification scheme, distinguishing VF, HF (correct response), HR, QH (qualitative hedging), and MF (misattribution), enabling more detailed characterisation of geographic knowledge gaps than binary correct/incorrect scoring."
- [3] Internal anchor, Contributions: "The first cross-national, multi-domain audit of an AI control monitor's geographic knowledge gaps, covering 2,820 country-metric-year observations and using a logistic regression with country-clustered standard errors and DiD evaluation design. This paper provides the empirical foundation for such red-teaming: identifying where and under what conditions th..."
- [4] Internal anchor, Related Work: "Extensive empirical literature documents that LLM outputs indicate the distributional properties of their training corpora, systematically favouring the perspectives of countries most prominently represented in pre-training data [5]. Arora et al. [10] showed that pre-trained language models align less well with non-Western values than with We..."
- [5] Internal anchor, Methods (3.1 Model Under Audit): "This study evaluates Claude Opus 4.6 as the sole monitor for three reasons. First, AI control research is concerned with specific deployed systems. The main question we explore is whether a specific monitor has knowledge gaps that could be exploited in practice, not whether geographic blind spots exist somewhere among possible..."
- [6] Internal anchor, Results ("HR is the dominant category in all three conditions"): "Figure 1 shows the distribution of the five response categories across the three aforementioned query variants in a bar graph. HR is the dominant category in all three conditions. Under variant 1 (V1) (Direct), VF accounts for 9.7% of queries and HF for 5.2%, giving a value-response rate of 14.9% and a per-variant hallucination rate of 65.2% (273 V..."
- [7] Internal anchor, Discussion and Limitations: "Our evaluation of Claude Opus 4.6 across 8,460 ACKF queries yields four observations relevant to AI control protocol design. First, the 62.5% hallucination rate among value-bearing responses suggests a specific failure mode instead of general unreliability. The model's HR rate of 91.4% is consistent with appropriate calibration f..."
- [8] "Concrete Problems in AI Safety", anchored in the Conclusion: "This paper reports a cross-country, multi-domain audit of an AI control monitor's factual knowledge gaps. We evaluate Claude Opus 4.6 against verified values from GAID v2 across 8,460 structured queries, 2,820 country-metric-year observations, 17 ACKF indicators, and six evaluation years. Claude Opus 4.6 classifies 91.4% of ACKF queries as HR. Am..."
- [9] R. Greenblatt, B. Shlegeris, K. Sachan, and F. Roger, "AI control: Improving safety despite intentional subversion," in Proc. 41st Int. Conf. Mach. Learn. (ICML), PMLR vol. 235, pp. 16295–16336, 2024. arXiv:2312.06942
- [10] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic parrots: Can language models be too big?" in Proc. ACM FAccT, 2021, pp. 610–623.
- [11] R. Bommasani et al., "On the opportunities and risks of foundation models," arXiv:2108.07258, 2021.
- [12] R. Navigli, S. Conia, and B. Ross, "Biases in large language models: Origins, inventory, and discussion," ACM J. Data Inf. Qual., vol. 15, no. 2, 2023. DOI: 10.1145/3597307
- [13] Y. Y. Chiu, L. Jiang, B. Y. Lin, C. Y. Park, S. S. Li, S. Ravi, M. Bhatia, M. Antoniak, Y. Tsvetkov, V. Shwartz, and Y. Choi, "CulturalBench: A robust, diverse, and challenging cultural benchmark by human-AI cultural teaming," arXiv:2410.02677, 2024.
- [14] R. Manvi, S. Khanna, M. Burke, D. Lobell, and S. Ermon, "Large language models are geographically biased," in Proc. 41st Int. Conf. Mach. Learn. (ICML), PMLR vol. 235, pp. 34654–34669, 2024. arXiv:2402.02680
- [15] J. Hung, "Global AI bias audit for technical governance," arXiv:2602.13246, 2026. DOI: 10.48550/arXiv.2602.13246
- [16] J. Hung, "Benchmarking open-weight foundation models for global AI technical governance," IEEE IRAI 2026 (forthcoming).
- [17] A. Arora, L.-A. Kaffee, and I. Augenstein, "Probing pre-trained language models for cross-cultural differences in values," in Proc. C3NLP Workshop at EACL 2023, pp. 114–130. arXiv:2203.13722
- [18] Y. Cao, L. Zhou, S. Lee, L. Cabello, M. Chen, and D. Hershcovich, "Assessing cross-cultural alignment between ChatGPT and human societies," in Proc. C3NLP Workshop at EACL 2023, pp. 53–67. arXiv:2303.17466
- [19] C. Li, M. Chen, J. Wang, S. Sitaram, and X. Xie, "CultureLLM: Incorporating cultural differences into large language models," in Proc. NeurIPS 2024. arXiv:2402.10946
- [20] Z. Ji et al., "Survey of hallucination in natural language generation," ACM Comput. Surv., vol. 55, no. 12, Art. 248, 2023. arXiv:2202.03629
- [21] R. Greenblatt et al., "Alignment faking in large language models," arXiv:2412.14093, 2024; and J. Carlsmith, "Scheming AIs: Will AIs fake alignment during training in order to get power?" arXiv:2311.08379, 2023.